Algorithms for Molecular Biology Fall Semester, 1998
Lecture 3: December 27, 1998
Lecturer: Prof. Ron Shamir Scribe: Micky Frankel and Yossi Richter
3.1 Introduction - Sequence Alignment Heuristics
In the second lecture we presented dynamic programming algorithms to calculate the best
alignment of two strings. When we search a database of size $10^7$ for the closest
match to a query string of length 200-500, we cannot use these algorithms because they
require too much time. There are several approaches to overcome this problem:
1. Implementing the dynamic programming algorithms in hardware, thus executing them
much faster.
2. Using parallel hardware, the problem can be divided efficiently among a number of
processors, and the results can be integrated later.
3. Using different heuristics that work much faster than the original dynamic
programming algorithm.
We next present some of the most commonly used heuristics.
3.2 FASTA
The FASTA algorithm is a heuristic method for string comparison. It was developed by
Lipman and Pearson in 1985 [6] and further improved in 1988 [7].
FASTA compares a query string against a single text string. When searching the whole
database for matches to a given query, we compare the query using the FASTA algorithm
to every string in the database.
When looking for an alignment, we might expect to find a few segments in which there
is absolute identity between the two compared strings. The algorithm uses this
property and focuses on these identical regions.
© Shamir: Algorithms for Molecular Biology, Tel Aviv Univ., Fall '98
The stages in the FASTA algorithm are as follows:
1. We specify an integer parameter called ktup (short for k respective tuples), and we look
for ktup-length matching substrings of the two strings. The standard recommended
ktup values are six for DNA sequence matching and two for protein sequence matching.
The matching ktup-length substrings are referred to as hot spots. Consecutive hot spots
are located along the dynamic programming matrix diagonals. This stage can be done
efficiently by using a lookup table or a hash to store all the ktup-length substrings from
one string, and then searching the table with the ktup-length substrings from the other
string.
2. In this stage we wish to find the 10 best diagonal runs of hot spots in the matrix. A
diagonal run is a sequence of nearby hot spots on the same diagonal (not necessarily
adjacent along the diagonal, i.e., spaces between these hot spots are allowed). A run
need not contain all the hot spots on its diagonal, and a diagonal may contain more
than one of the 10 best runs we find.
In order to evaluate the diagonal runs, FASTA gives each hot spot a positive score,
and the space between consecutive hot spots in a run is given a negative score that
decreases with the increasing distance. The score of the diagonal run is the sum of the
hot spot scores and the interspot scores. FASTA finds the 10 highest scoring diagonal
runs under this evaluation scheme.
3. A diagonal run specifies a pair of aligned substrings. The alignment is composed of
matches (the hot spots) and mismatches (from the interspot regions), but it does not
contain any indels because it is derived from a single diagonal. We next evaluate
the runs using an amino acid (or nucleotide) substitution matrix, and pick the best
scoring run. The single best subalignment found in this stage is called init_1. Apart
from computing init_1, a filtration is performed and we discard the diagonal runs
achieving relatively low scores.
4. Until now we essentially did not allow any indels in the subalignments. We now try to
combine "good" diagonal runs from close diagonals, thus achieving a subalignment with
indels allowed. We take "good" subalignments from the previous stage (subalignments
whose score is above some specified cutoff) and attempt to combine them into a single
larger high-scoring alignment that allows some spaces. This can be done in the following
way:
We construct a directed weighted graph whose vertices are the subalignments found
in the previous stage, and the weight of each vertex is the score found in the previous
stage for the subalignment it represents. Next, we extend an edge from vertex u to
vertex v if the subalignment represented by v starts at a lower row than where the
subalignment represented by u ends. We give the edge a negative weight which depends
on the number of gaps that would be created by aligning according to subalignment u
followed by subalignment v. Essentially, FASTA then finds a maximum weight path in
this graph. The selected alignment specifies a single local alignment between the two
strings. The best alignment found in this stage is marked init_n. As in the previous
stage, we discard alignments with relatively low scores.
5. In this step FASTA computes an alternative local alignment score, in addition to init_n.
Recall that init_1 defines a diagonal segment in the dynamic programming matrix. We
consider a narrow diagonal band in the matrix, centered along this segment. We observe
that it is highly likely that the best alignment path between the init_1 substrings lies
within the subtable defined by the band. We assume this is the case and compute
the optimal local alignment in this band, using the ordinary dynamic programming
algorithm. Assuming that the best local alignment is indeed within the defined band,
the local alignment algorithm essentially merges diagonal runs found in the previous
stages to achieve a local alignment which may contain indels. The band width
depends on the ktup choice. The best local alignment computed in this stage is
called opt.
6. In the last stage, the database sequences are ranked according to init_n scores or opt
scores, and the full dynamic programming algorithm is used to align the query sequence
against each of the highest ranking result sequences.
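Stages 1 and 2 above can be illustrated with a minimal sketch. It assumes plain Python strings as sequences; the function names, the hot spot score of 5 and the linear interspot penalty are hypothetical choices made for illustration, not the values used by FASTA, and only the single best run per diagonal is scored.

```python
from collections import defaultdict


def find_hot_spots(query, text, ktup):
    """Return all (i, j) with query[i:i+ktup] == text[j:j+ktup]."""
    # Lookup table: each ktup-length substring of the query -> its start positions.
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    # Scan the text once and look every ktup-length substring up in the table.
    hot_spots = []
    for j in range(len(text) - ktup + 1):
        for i in table.get(text[j:j + ktup], ()):
            hot_spots.append((i, j))
    return hot_spots


def group_by_diagonal(hot_spots):
    """Map each diagonal (j - i) to the sorted query positions of its hot spots."""
    diagonals = defaultdict(list)
    for i, j in hot_spots:
        diagonals[j - i].append(i)
    return {d: sorted(starts) for d, starts in diagonals.items()}


def best_run_score(starts, ktup, hit_score=5, gap_penalty=1):
    """Score of the best run among the hot spots of one diagonal.

    hit_score and gap_penalty are illustrative assumptions, not FASTA's values.
    """
    best = current = 0
    prev_end = None
    for s in starts:
        gap = 0 if prev_end is None else max(0, s - prev_end)
        # Either extend the current run across the gap or start a new run here.
        current = max(current - gap_penalty * gap, 0) + hit_score
        best = max(best, current)
        prev_end = s + ktup
    return best
```

For example, `group_by_diagonal(find_hot_spots("ACGTACGT", "TTACGTAA", 4))` places two hot spots on diagonal 2 and two on diagonal -2. Because the table is built once per query, each text is scanned in time roughly linear in its length plus the number of hot spots.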
Although FASTA is a heuristic, and as such it is possible to show instances in which
the alignments found by the algorithm are not optimal, it is claimed (and supported by
experience) that the resulting alignment scores compare well to those of the optimal
alignment, while the FASTA algorithm is much faster than the ordinary dynamic
programming alignment algorithm.
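The banded computation of stage 5 can be sketched as follows. This is a minimal illustration, not FASTA's implementation: it assumes simple match, mismatch and linear gap scores (hypothetical defaults), and a band given by a central diagonal `center` (column minus row) and a `half_width`.

```python
def banded_local_score(s, t, center, half_width, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells (i, j) with |(j - i) - center| <= half_width."""
    neg_inf = float("-inf")
    best = 0
    prev = {}  # scores of row i - 1, keyed by column j (band cells only)
    for i in range(1, len(s) + 1):
        cur = {}
        lo = max(1, i + center - half_width)
        hi = min(len(t), i + center + half_width)
        for j in range(lo, hi + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(
                0,                               # local alignment may start here
                prev.get(j - 1, 0) + sub,        # diagonal move; table border counts as 0
                prev.get(j, neg_inf) + gap,      # gap in t; cells outside the band are unreachable
                cur.get(j - 1, neg_inf) + gap,   # gap in s
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

Only about `len(s) * (2 * half_width + 1)` cells are filled instead of the full table, which is where the saving over the ordinary algorithm comes from.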
3.3 BLAST - Basic Local Alignment Search Tool
The BLAST algorithm was developed by Altschul, Gish, Miller, Myers and Lipman in 1990 [1].
The motivation for the development of BLAST was the need to increase the speed of FASTA
by finding fewer and better hot spots during the algorithm. The idea was to integrate the
substitution matrix in the first stage of finding the hot spots.
BLAST concentrates on finding regions of high local similarity in alignments without
gaps, evaluated by an alphabet-weight scoring matrix. We next define some fundamental
objects concerning BLAST:
Given two strings $S_1$ and $S_2$, a segment pair is a pair of equal-length respective
substrings of $S_1$ and $S_2$, aligned without spaces.
A locally maximal segment pair is a segment pair whose alignment score (without spaces)
cannot be improved by extending it or shortening it.
A maximal segment pair (MSP) in $S_1$ and $S_2$ is a segment pair with the maximum score
over all segment pairs in $S_1$, $S_2$.
When comparing all the sequences in the database against the query, BLAST attempts
to find all the database sequences that, when paired with the query, contain an MSP above
some cutoff score S. We choose S such that it is unlikely to find a random sequence in the
database that achieves a score higher than S when compared with the query sequence.
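To make the MSP definition concrete, here is a small exhaustive sketch (it is not how BLAST searches). Since a segment pair is ungapped, it lies on one diagonal of the comparison matrix, so the MSP is the best contiguous stretch over all diagonals. The `score` argument is an assumed user-supplied function returning the substitution score of two characters.

```python
def maximal_segment_pair(s1, s2, score):
    """Return (best_score, start1, start2, length) over all ungapped segment pairs."""
    best = (0, 0, 0, 0)
    # Each diagonal d = j - i of the comparison matrix is scanned independently.
    for d in range(-(len(s1) - 1), len(s2)):
        i, j = (-d, 0) if d < 0 else (0, d)
        run, start = 0, (i, j)
        while i < len(s1) and j < len(s2):
            if run <= 0:
                # A non-positive prefix can never help; restart the segment here.
                run, start = 0, (i, j)
            run += score(s1[i], s2[j])
            if run > best[0]:
                best = (run, start[0], start[1], i - start[0] + 1)
            i += 1
            j += 1
    return best
```

This takes time proportional to the product of the two lengths, which is exactly the cost that the word-based stages described next are meant to avoid.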
The stages in the BLAST algorithm are as follows:
1. Given a length parameter w and a threshold parameter t, BLAST finds all the w-length
substrings (called words) of database sequences that align with words from the query
with an alignment score higher than t. Each such hot spot is called a hit in BLAST.
2. We extend each hit to find out if it is contained in a segment pair with score above S.
We may implement the first stage by constructing, for each w-length word in the query
sequence, all the w-length words whose similarity to it is at least t. We store these words in
a data structure which is later accessed while checking the database sequences.
It is usually recommended to set the parameter w to values of 3 to 5 for amino acids,
and 12 for nucleotides.
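The implementation of the first stage described above can be sketched as follows. The names are hypothetical, `score` is an assumed per-character substitution function, and the neighborhood is enumerated by brute force over all words of length w, which is feasible only for small w.

```python
from itertools import product


def neighborhood(word, alphabet, score, t):
    """All words of the same length whose ungapped score against `word` is at least t."""
    return ["".join(cand) for cand in product(alphabet, repeat=len(word))
            if sum(score(a, b) for a, b in zip(word, cand)) >= t]


def build_word_index(query, w, alphabet, score, t):
    """Map every neighborhood word to the query positions it was generated from."""
    index = {}
    for pos in range(len(query) - w + 1):
        for word in neighborhood(query[pos:pos + w], alphabet, score, t):
            index.setdefault(word, []).append(pos)
    return index


def find_hits(index, db_seq, w):
    """Return (query_pos, db_pos) for every database word found in the index."""
    return [(qpos, j)
            for j in range(len(db_seq) - w + 1)
            for qpos in index.get(db_seq[j:j + w], ())]
```

The index is built once per query and then reused for every database sequence, so the per-sequence work is a single scan with constant-time lookups.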
Although BLAST does not allow alignments with indels, it has been shown that with the
correct selection of values for the parameters used by the algorithm, it is possible to obtain all
the correct alignments while saving much of the computation time compared to the standard
dynamic programming method.
3.3.1 Improved BLAST
Altschul et al. suggested in 1997 [2] an improved BLAST algorithm that allows indels in the
alignment.
The stages of the algorithm are as follows:
1. When considering the dynamic programming matrix to align two strings, we search
along each diagonal for two w-length words such that the distance between them is
at most A and their score is at least T. Whenever we encounter such a pair of hits we
concatenate them.
2. In the second stage we want to allow local alignments with indels, similarly to the
FASTA algorithm. We allow two local alignments from different diagonals to merge
into a new local alignment composed of the first local alignment followed by some
indels and then the second local alignment. This local alignment is essentially a path
in the dynamic programming matrix, composed of two diagonal sections and a path
connecting them which may contain gaps. Unlike in FASTA, where we only allowed
the diagonal to have local shifts, restricted to a band, here we allow local alignments
from different diagonals to merge as long as the resulting alignment has a score above
some threshold. This method results in an alignment structure which is much less
regular.
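The first stage (pairs of hits on one diagonal) can be sketched as below. This is an illustration under assumptions: hits are given as (query position, database position) pairs that already passed the score threshold, only consecutive hits on a diagonal are compared, and the two words are required not to overlap, which is a choice made here rather than something stated in the text.

```python
from collections import defaultdict


def two_hit_pairs(hits, w, max_dist):
    """Return (diagonal, first_query_pos, second_query_pos) for nearby hit pairs."""
    by_diagonal = defaultdict(list)
    for qpos, dpos in hits:
        by_diagonal[dpos - qpos].append(qpos)
    pairs = []
    for diag, starts in by_diagonal.items():
        starts.sort()
        for a, b in zip(starts, starts[1:]):
            # Assumed rule: non-overlapping (b - a >= w) and at most max_dist apart.
            if w <= b - a <= max_dist:
                pairs.append((diag, a, b))
    return pairs
```

Requiring two hits before extending lets far fewer diagonals reach the more expensive extension step.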
3.3.2 PSI BLAST - Position Specific Iterated BLAST
The PSI BLAST is another improved version of the BLAST algorithm [2].
When aligning a group of amino acid sequences, i.e., writing them one below the other
(see section 3.5 for discussion of multiple alignment), the vector of characters in a certain
column (i.e. the same position in the aligned sequences) is called a profile. For a certain
profile we may compute the histogram of character types and obtain the variance between
them. When we align together amino acid sequences belonging to the same protein family,
we will find that some regions are very similar, with profiles showing little variance. These
regions, called conserved regions, define the structure and functionality typical to this family.
We would like the substitution matrices we use to take into account the statistical information
we have about how conserved the column is, in order to improve our alignment score.
In the first stage of the algorithm we perform ordinary BLAST while using a different
cost vector $V_i$ for each column i. Initially, each such vector $V_i$ is set to the row of the
substitution matrix corresponding to the i-th character in the query sequence. From the
high-scoring results we get, we build the profiles for each column. We continue to perform
BLAST iteratively while using as query the collection of profiles, i.e. we use a histogram
at each column rather than a simple string, and compare it against the database. This is
equivalent to updating the position dependent cost vectors according to the profile statistics.
After each iterative step we update the profiles according to the obtained result sequences.
We terminate the iterative loop when we no longer find new meaningful matches.
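One way to turn column histograms into position dependent cost vectors is sketched below. This is not the PSI BLAST procedure itself: the log-odds form, the uniform pseudocount and the function names are assumptions made for illustration, and gap characters are simply ignored.

```python
import math
from collections import Counter


def column_profiles(aligned):
    """One Counter of residue counts per alignment column; '.' and '-' are ignored."""
    return [Counter(ch for ch in col if ch not in ".-") for col in zip(*aligned)]


def position_scores(profiles, alphabet, background, pseudocount=1.0):
    """For each column, a dict residue -> log2(estimated frequency / background)."""
    vectors = []
    for counts in profiles:
        total = sum(counts.values()) + pseudocount * len(alphabet)
        vectors.append({a: math.log2(((counts[a] + pseudocount) / total) / background[a])
                        for a in alphabet})
    return vectors
```

A strongly conserved column yields a large positive score for its dominant residue and negative scores for the others, which is the position specific behaviour the text asks for.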
It should be noted that biologists regard PSI BLAST as not reliable.
3.4 Amino Acids Substitution Matrices
When we search for the best alignment between two protein sequences, the scoring (or
substitution) matrix we use can largely affect the results we get. Ideally, the scores should reflect
the biological phenomena that the alignment seeks to expose. For example, in the case of
sequence divergence due to evolutionary mutations, the values in the scoring matrix should
ideally be derived from empirical observation on ancestral sequences and their present-day
descendants.
Two examples of simple substitution matrices that are often used and do not reflect the
biological phenomena are:
1. The unit matrix:
$$M_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$
2. The genetic code matrix. $M_{ij}$ equals the minimal number of base substitutions needed
to convert a codon of amino acid i to a codon of amino acid j. We disregard here
the importance of chemical properties of the amino acids, such as their hydrophobicity,
charge or size, that evidently influence the chances for their substitution.
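Both matrices can be computed directly, as in the following sketch. It assumes the standard genetic code, written as the usual 64-letter string over the base order T, C, A, G (with `*` for stop codons); the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(codon))


def unit_score(a, b):
    """Unit matrix entry: 1 for identical amino acids, 0 otherwise."""
    return 1 if a == b else 0


def genetic_code_distance(a, b):
    """Minimal number of base substitutions turning a codon of a into a codon of b."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in CODONS[a] for c2 in CODONS[b])
```

For instance, methionine (ATG) and isoleucine (ATT, ATC, ATA) are one substitution apart, while phenylalanine (TTT, TTC) and lysine (AAA, AAG) require three.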
3.4.1 PAM units and PAM matrices
The methodology of PAM units and the first specific PAM matrices were developed by
Margaret Dayhoff et al. [8, 4].
PAM units
We use PAM units to measure the amount of evolutionary distance between two amino acid
sequences. Two strings $S_1$ and $S_2$ are said to be one PAM unit diverged if a series of accepted
point mutations (and no insertions or deletions) has converted $S_1$ to $S_2$ with an average of
one accepted point-mutation event per 100 amino acids. The term "accepted" here means a
mutation that was incorporated into the protein and passed to its progeny. Therefore, either
the mutation did not change the function of the protein or the change in the protein was
beneficial to the organism.
Note that two strings which are one PAM unit diverged do not necessarily differ in one
percent of their positions, as is often mistakenly thought, because a single position may
undergo more than a single mutation. The difference between the two notions grows as the
number of units does.
There are two main problems with the notion of the PAM units:
1. First, practically all the sequences we can obtain today are extracted from extant
organisms. We know almost no protein sequences where one is actually derived
from the other. The lack of ancestral protein sequences is handled by assuming that
amino acid mutations are reversible and equally likely in either direction. This
assumption, together with the additivity property of the PAM units derived from its definition,
implies that given two amino acid sequences $S_i$ and $S_j$ whose mutual ancestor is $S_{ij}$ we
have:
$$d(S_i, S_j) = d(S_i, S_{ij}) + d(S_{ij}, S_j)$$
where d(i, j) is the PAM distance between amino acid sequences i and j.
2. The second problem, which is more difficult to overcome, is that we disregard here
insertions and deletions which may occur during evolution, hence we cannot be sure
of the correct correspondence between sequence positions. In order to know the exact
correspondence one has to be able to identify the true historical gaps, or at least to
identify large intervals along the two sequences where the correspondence is correct.
This cannot always be done with certainty, especially when the two sequences are
distantly diverged.
PAM matrices
PAM matrices are amino acid substitution matrices that encode the expected evolutionary
change at the amino acid level. Each PAM matrix is designed to compare two sequences
which are a specific number of PAM units apart. For example, the PAM120 score matrix is
designed to compare sequences that are 120 PAM units apart: the score it gives a
pair of sequences is the (log of the) probability of such sequences evolving during 120 PAM
units of evolution. For any specific pair $(A_i, A_j)$ of amino acids the (i, j) entry in the PAM n
matrix reflects the frequency at which $A_i$ is expected to replace $A_j$ in two sequences that
are n PAM units diverged. These frequencies should be estimated by gathering statistics on
replaced amino acids.
Collecting statistics about amino acid substitutions in order to compute the PAM
matrices is relatively difficult for sequences that are distantly diverged, as mentioned in the
previous section. But for sequences that are highly similar, i.e., the PAM divergence
distance between them is small, finding the position correspondence is relatively easy since only
few insertions and deletions took place. Therefore, in the first stage statistics were collected
from aligned sequences that were believed to be approximately one PAM unit diverged, and
the PAM1 matrix could be computed based on this data, as follows: Let $M_{ij}$ denote the
observed frequency (= estimated probability) of amino acid $A_j$ mutating into amino acid $A_i$
during one PAM unit of evolutionary change. M is a 20 × 20 real matrix, with the values
in each matrix column adding up to 1. There is a significant variance between the values in
each column. For example, see figure 3.1, taken from [4].
       A      R      N      D      C
A   9867      2      9     10      3
R      1   9913      1      0      1
N      4      1   9822     36      0
D      6      0     42   9859      0
C      1      1      0      0   9973
Figure 3.1: The top left 5 × 5 corner of the PAM1 matrix. We write $10^4 M_{ij}$ for convenience.
Once M is known, the matrix $M^n$ gives the probabilities of any amino acid mutating to
any other during n PAM units. The (i, j) entry in the PAM n matrix is therefore:
$$\log \frac{f(j)\, M^n(i,j)}{f(i)\, f(j)} = \log \frac{M^n(i,j)}{f(i)}$$
where f(i) and f(j) are the observed frequencies of amino acids $A_i$ and $A_j$ respectively.
This approach assumes that the frequencies of the amino acids remain constant over time,
and that the mutational processes causing substitutions during an interval of one PAM unit
operate in the same manner for longer periods. We take the log value of the probability in
order to allow computing the total score of all substitutions using summation rather than
multiplication. The PAM matrix is usually organized by dividing the amino acids into groups
of relatively similar amino acids, with all group members located in consecutive columns
of the matrix.
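The computation of a PAM n matrix from a one-unit matrix can be sketched as follows. The sketch assumes that `m1[i][j]` is the probability that residue j is replaced by residue i in one unit (so that each column sums to 1), uses base-10 logarithms as an arbitrary choice, and runs on a made-up two-letter alphabet rather than real PAM1 data.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(m)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def pam_scores(m1, freqs, n):
    """Entry (i, j) is log10(M^n(i, j) / f(i)), following the formula in the text."""
    mn = mat_pow(m1, n)
    size = len(m1)
    return [[math.log10(mn[i][j] / freqs[i]) for j in range(size)]
            for i in range(size)]


# Hypothetical toy data: two residues, 1% change per unit, equal frequencies.
TOY_M1 = [[0.99, 0.01],
          [0.01, 0.99]]
TOY_FREQS = [0.5, 0.5]
```

With this toy matrix the diagonal of `mat_pow(TOY_M1, 100)` is roughly 0.57, so sequences that are 100 units apart are still expected to agree in more than half of their positions. This illustrates why PAM units must not be read as percent difference.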
3.4.2 BLOSUM - BLOcks SUbstitution Matrix
The BLOSUM matrix is another amino acid substitution matrix, first calculated by Henikoff
and Henikoff [5]. For its calculation only blocks of amino acid sequences with small change
between them are considered. These blocks are called conserved blocks (see figure 3.2). One
reason for this is that one needs to find a multiple alignment between all these sequences and
it is easier to construct such an alignment with more similar sequences. Another reason is
that the purpose of the matrix is to measure the probability of one amino acid to change into
another, and the change between distant sequences may also include insertions and deletions
of amino acids. Moreover, we are more interested in conservation of regions inside protein
families, where sequences are quite similar, and therefore we restrict our examination to
such regions.
AABCDA...BBC DA
DABCDA.A.BBC BB
BBBCDABA.BCC AA
AAACDAC.DCBC DB
CCBADAB.DBBD CC
AAACAA...BBC CC
Figure 3.2: Alignment of several sequences. The conserved blocks are marked.
The first stage of building the BLOSUM matrix is eliminating sequences which are
identical in more than x% of their amino acid sequence. This is done to avoid bias of the
result in favor of a certain protein. The elimination is done either by removing sequences
from the block, or by finding a cluster of similar sequences and replacing it by a new sequence
that represents the cluster. The matrix built from blocks with no more than x% similarity
is called BLOSUM-x (e.g. the matrix built using sequences with no more than 50% similarity
is called BLOSUM-50).
The second stage is counting the pairs of amino acids in each column of the multiple
alignment. For example in a column with the acids AABACA (as in the first column in the
block in figure 3.2), there are 6 AA pairs, 4 AB pairs, 4 AC, and one BC. The probability
$q_{i,j}$ for a pair of amino acids in the same column to be $A_i$ and $A_j$ is calculated, as well as
the probability $p_i$ of a certain amino acid to be $A_i$.
In the third stage the log-odds ratio is calculated as
$$s_{i,j} = \log_2 \frac{q_{i,j}}{p_i\, p_j}.$$
As the final result we consider the rounded $2 s_{i,j}$; this value is stored in the (i, j) entry
of the BLOSUM-x matrix.
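The second and third stages can be sketched as follows. One convention is assumed here so that the log-odds formula applies as written: $q_{i,j}$ is taken over ordered pairs, i.e. the count of an unordered pair of different residues is split evenly between (i, j) and (j, i), and $p_i$ is the corresponding marginal. The function names are hypothetical and no sequence clustering is performed.

```python
import math
from collections import Counter
from itertools import combinations


def count_pairs(column):
    """Count unordered residue pairs within one alignment column."""
    pairs = Counter()
    for a, b in combinations(column, 2):
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


def blosum_scores(columns):
    """Rounded 2 * log2(q_ij / (p_i * p_j)) for every residue pair that was observed."""
    pairs = Counter()
    for column in columns:
        pairs.update(count_pairs(column))
    total = sum(pairs.values())
    q = {}
    for (a, b), count in pairs.items():
        if a == b:
            q[(a, a)] = count / total
        else:
            # Split the unordered count between the two ordered pairs.
            q[(a, b)] = q[(b, a)] = count / (2 * total)
    p = Counter()
    for (a, _), value in q.items():
        p[a] += value
    return {pair: round(2 * math.log2(value / (p[pair[0]] * p[pair[1]])))
            for pair, value in q.items()}
```

Calling `count_pairs("AABACA")` reproduces the counts quoted above: 6 AA pairs, 4 AB, 4 AC and one BC.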
In contrast to the PAM matrices, more sequences are examined in the process of
computing the BLOSUM matrix. Moreover, the sequences are of a specific nature of resemblance,
and therefore the two sets of matrices differ.
Comparing the efficiency of two matrices is done by calculating the ratio between the
number of pairs of similar sequences discovered by a certain matrix but not discovered by
another one and the number of pairs missed by the first but found by the other. According
to this comparison BLOSUM-62 is found to be better than other BLOSUM-x matrices as
well as PAM-x matrices.
3.5 Multiple Alignment
Definition A multiple alignment of strings $S_1, S_2, \ldots, S_k$ is a series of strings with blanks
$S'_1, S'_2, \ldots, S'_k$ such that
1. $|S'_1| = |S'_2| = \cdots = |S'_k|$
2. $S'_j$ is an extension of $S_j$, obtained by insertion of blanks.
AC..BCDB
.CADB.D.
ACA.BCD.
Figure 3.3: A multiple alignment of ACBCBD, CADDB and ACABCD.
We are interested in finding a common alignment of several sequences, because this
multiple similarity suggests a common structure of the protein product, a common function
or a common evolutionary source. A multiple alignment carries more information than a
pairwise one, as a protein can be matched against a family of proteins instead of only against
another one.
The best multiple alignment of r sequences is calculated using an r-dimensional
hyper-cube D, defining $D(j_1, j_2, \ldots, j_r)$ to be the best score for aligning the prefixes of lengths
$j_1, j_2, \ldots, j_r$ of the sequences $x_1, x_2, \ldots, x_r$, respectively.
We define
$$D(0, 0, \ldots, 0) = 0$$
And we calculate
$$D(j_1, j_2, \ldots, j_r) = \min_{\epsilon \in \{0,1\}^r,\ \epsilon \neq 0} \left\{ D(j_1 - \epsilon_1, j_2 - \epsilon_2, \ldots, j_r - \epsilon_r) + \rho(\epsilon_1 x_{1,j_1}, \ldots, \epsilon_r x_{r,j_r}) \right\}$$
where $\rho$ is the cost function and $\epsilon_k x_{k,j_k}$ stands for the character $x_{k,j_k}$ if $\epsilon_k = 1$
and for a blank if $\epsilon_k = 0$. The size of the hyper-cube is $O(\prod_{j=1}^{r} n_j)$, where $n_j$ is the length
of $x_j$, and the computation of each entry considers $2^r - 1$ others.
If $n_1 = n_2 = \cdots = n_r = n$, the space complexity is $O(n^r)$ and the time complexity is
$O(2^r n^r)$.
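The recurrence can be sketched directly for small inputs. The sketch assumes a user-supplied `column_cost` function playing the role of the cost function, stores the hyper-cube as a dictionary keyed by index tuples, and uses a unit sum-of-pairs column cost as an illustrative choice.

```python
from itertools import product


def sum_of_pairs_cost(column):
    """Number of unequal pairs in a column; '-' is a blank and blank-blank costs 0."""
    return sum(a != b for i, a in enumerate(column) for b in column[i + 1:])


def multiple_alignment_cost(seqs, column_cost):
    """Minimal total cost of aligning all sequences, by filling the r-dimensional table."""
    r = len(seqs)
    moves = [e for e in product((0, 1), repeat=r) if any(e)]  # the 2^r - 1 nonzero vectors
    table = {(0,) * r: 0}
    # product() enumerates cells lexicographically, so predecessors are always ready.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if not any(cell):
            continue
        best = float("inf")
        for e in moves:
            prev = tuple(c - d for c, d in zip(cell, e))
            if min(prev) < 0:
                continue
            column = tuple(seqs[k][cell[k] - 1] if e[k] else "-" for k in range(r))
            best = min(best, table[prev] + column_cost(column))
        table[cell] = best
    return table[tuple(len(s) for s in seqs)]
```

For two sequences and this cost function the result is the ordinary edit distance; for r sequences of length n the loop visits on the order of $n^r$ cells with $2^r - 1$ moves each, matching the complexity stated above.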
There are several known useful possibilities for measuring the divergence of a set of
aligned strings, namely the total distance between them.
Distance from Consensus - The consensus of an alignment is a string of the most
common character in each column of the alignment. The total distance between the
strings is defined as the number of characters that differ from the consensus character
of their column.
Evolutionary Distance - The weight of the lightest evolutionary tree that can be
constructed from the sequences, with the weight of the tree defined as the number of
changes between pairs of sequences that correspond to two adjacent nodes in the tree,
summed over all such pairs.
Sum of Pairs - The sum of pairwise distances between all the pairs of sequences.
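Two of these measures are easy to compute for a given alignment, as the following sketch shows. It assumes that rows have equal length, that a blank (written `.` here) is treated like any other character, and that the pairwise distance is the number of differing columns; these are illustrative choices.

```python
from collections import Counter
from itertools import combinations


def consensus(alignment):
    """String of the most common character in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


def distance_from_consensus(alignment):
    """Number of characters that differ from the consensus character of their column."""
    cons = consensus(alignment)
    return sum(ch != c for row in alignment for ch, c in zip(row, cons))


def sum_of_pairs_distance(alignment):
    """Sum over all row pairs of the number of columns in which the two rows differ."""
    return sum(sum(a != b for a, b in zip(r1, r2))
               for r1, r2 in combinations(alignment, 2))
```

Applied to the three rows of figure 3.3, the consensus is `ACA.BCD.`, the distance from consensus is 5 and the sum of pairs distance is 10.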
Carrillo and Lipman [3] found a heuristic method for accelerating the search for the best
multiple alignment. The method is based on the property that if the strings are relatively
similar, the alignment path would be close to the main diagonal, therefore not all the values
in the multi-dimensional cube need to be calculated. We now detail this algorithm.
Assuming an upper bound on the cost of the best alignment, we will discard some alignments
that are a priori known to be more expensive than the bound on the cost.
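Before the derivation, the pruning test it leads to can be sketched in code. The sketch assumes a sum-of-pairs cost with unit mismatch and gap costs, and an externally supplied upper bound on the optimal total cost. For every pair of sequences (u, v) it keeps the cells (s, t) whose cheapest pairwise alignment matching position s of one sequence with position t of the other does not exceed the pair's budget, i.e. the upper bound minus the optimal pairwise costs of all other pairs. All function names are hypothetical.

```python
from itertools import combinations


def edit_table(x, y):
    """table[i][j] = minimal cost (unit mismatches and gaps) of aligning x[:i] with y[:j]."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i == 0 or j == 0:
                table[i][j] = i + j
            else:
                table[i][j] = min(table[i - 1][j - 1] + (x[i - 1] != y[j - 1]),
                                  table[i - 1][j] + 1,
                                  table[i][j - 1] + 1)
    return table


def cost_through(x, y):
    """Minimal pairwise cost when x[s-1] is aligned with y[t-1], for 1-based (s, t)."""
    fwd = edit_table(x, y)
    bwd = edit_table(x[::-1], y[::-1])  # suffix costs via the reversed strings
    n, m = len(x), len(y)
    return {(s, t): fwd[s - 1][t - 1] + (x[s - 1] != y[t - 1]) + bwd[n - s][m - t]
            for s in range(1, n + 1) for t in range(1, m + 1)}


def surviving_cells(seqs, upper_bound):
    """For each pair (u, v), the projected cells that the pruning test cannot discard."""
    pairs = list(combinations(range(len(seqs)), 2))
    optimum = {(u, v): edit_table(seqs[u], seqs[v])[-1][-1] for u, v in pairs}
    total = sum(optimum.values())
    survivors = {}
    for u, v in pairs:
        budget = upper_bound - (total - optimum[(u, v)])
        through = cost_through(seqs[u], seqs[v])
        survivors[(u, v)] = {cell for cell, c in through.items() if c <= budget}
    return survivors
```

A cell of the multi-dimensional cube then needs to be computed only if each of its pairwise projections survives. The tighter the upper bound, the fewer cells remain; with three identical sequences and a bound of 0 only the main diagonal is left.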
Let A be an alignment of strings $x_1, x_2, \ldots, x_r$. Denote by $A_{i,j}$ the pair of rows in A
containing only $x_i$ and $x_j$, and by $c(A_{i,j})$ the cost of this pairwise alignment. Denote by
c(A) the total cost of A, and suppose we define $c(A) = \sum_{i<j} c(A_{i,j})$. Let $A'$ be the optimal
alignment (the one with the minimal cost), and suppose we know that $c(A') \le c'$. Therefore,
$$c' \ge c(A') = \sum_{i<j} c(A'_{i,j}) = c(A'_{u,v}) + \sum_{i<j,\ (i,j) \ne (u,v)} c(A'_{i,j}) \ge c(A'_{u,v}) + \sum_{i<j,\ (i,j) \ne (u,v)} D(x_i, x_j)$$
where D(x, y) is the optimal score for aligning strings x and y. It follows that
$$c(A'_{u,v}) \le c' - \sum_{i<j,\ (i,j) \ne (u,v)} D(x_i, x_j).$$
$A'_{u,v}$ is a projection of $A'$ on the uv-plane. By calculating $D(x_i, x_j)$ for each i and j, we can
find
$$B(u,v) = c' - \sum_{i<j,\ (i,j) \ne (u,v)} D(x_i, x_j).$$
Now, consider a cell $(i_1, i_2, \ldots, i_u = s, \ldots, i_v = t, \ldots, i_r)$ whose projection to the uv-plane
is (s, t). If the best alignment $A'$ passes through this cell, then its projection $A'_{u,v}$ passes
through (s, t), and its cost $c(A'_{u,v})$ satisfies
$$best^{(u,v)}_{s,t} \le c(A'_{u,v}) \le B(u,v)$$
where $best^{(u,v)}_{s,t}$ is the optimal score for an alignment through (s, t) in the uv-plane, and
hence a lower bound on the cost of any alignment through (s, t). We can compute this
bound as:
$$best^{(u,v)}_{i,j} = D(x_{u,1} x_{u,2} \cdots x_{u,i-1},\ x_{v,1} x_{v,2} \cdots x_{v,j-1}) + d(x_{u,i}, x_{v,j}) + D(x_{u,i+1} \cdots x_{u,n_u},\ x_{v,j+1} \cdots x_{v,n_v})$$
where $d(\sigma_1, \sigma_2)$ is the cost of matching the characters $\sigma_1$ and $\sigma_2$.
Therefore if $best^{(u,v)}_{s,t} > B(u,v)$, then the best alignment $A'$ cannot pass through the cell
$(i_1, i_2, \ldots, i_u = s, \ldots, i_v = t, \ldots, i_r)$ for any $i_1, i_2, \ldots, i_{u-1}, i_{u+1}, \ldots, i_{v-1}, i_{v+1}, \ldots, i_r$, and
these cells can be discarded from the computation.
Bibliography
[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment
search tool. J Mol Biol, 215:403-10, 1990.
[2] S. F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs. Nucleic Acids Res., 25(17):3389-402, 1997.
[3] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM
J. Appl. Math, 48:1073-1082, 1988.
[4] M. Dayhoff and R. Schwartz. Matrices for detecting distant relationship. Atlas of Protein
Sequences, pages 353-358, 1979.
[5] S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Science USA, 89(22):10915-10919, November 1992.
[6] D. Lipman and W. Pearson. Rapid and sensitive protein similarity searches. Science,
227:1435-1441, 1985.
[7] D. Lipman and W. Pearson. Improved tools for biological sequence comparison. Proc.
Natl. Academy Science, 85:2444-2448, 1988.
[8] R. Schwartz, M. Dayhoff and B. Orcutt. A model of evolutionary change in proteins. Atlas
of Protein Sequence and Structure, 5:345-352, 1978.