Algorithms for Molecular Biology Fall Semester, 1998

Lecture 5: January 3, 1999

1

Lecturer: Ron Shamir Scribe: Itay Lotan and Ziv Modai

5.1 Approximation Algorithms for Multiple Sequence

Alignment

5.1.1 Problem De nition

Given a family S =(S ;:::;S )ofk sequences, such that the sequences are \similar" to

1 k

each other, wewould like to nd out the common characteristics of this family. Aligning

each pair of sequences from S separately, often do es not reveal this common information.

0 0 0

A multiple alignment of S is a new set of sequences S =(S ;:::;S ) such that:

1 k

0

 All the strings in S are of equal length. We denote this length by l .

0

 Each S was generated from S by inserting spaces.

i

i

When p erforming multiple alignment, as in the case of pairwise alignment, one wishes to

evaluate the quality of the alignmentby giving it a numeric score (see also lecture 3).

De nition 5.1 -Thesumofpairs (SP) score of a multiple alignment M is the sum of the

scores of pairwise global alignments inducedby M.

Let  (x; y ) b e our scoring function, i.e., the price of aligning the character x with the

character y ,for x; y 2  [fg.We assume that  (; )= 0,  (x; y )=  (y; x), and that

the triangle inequality  (x; y )   (x; z )+  (z; y) holds.

5.1.2 The Center Star Metho d for (SP) Alignment

In this section, we present an approximation algorithm for calculating the optimal multiple

alignment under the SP metric (see e.g. [3] [pp 348{350]). The algorithm achieves an

approximation ratio of two.

Given a multiple alignment M, let d(S ;S ) b e the score of the pairwise alignmentit

i j

P

induces on S ;S . Our target function is the SP value of M whichisd(M)  d(S ;S )

i j i j

i

(sum over all pairs of the score of the induced alignment).

1

Based in part on a scrib e by Sariel Har-Peled on Decemb er 18, 1995 and on the b o ok [3, pp 347{363] 1

c

2 Shamir: Algorithms for Molecular Biology Tel Aviv Univ., Fall '98

Problem 5.1 The SP alignment problem.

INPUT: A set of sequences S .

QUESTION: Compute a global multiple alignment M with minimum sum-of-pairs score.

We denote by D (S; Y ) the score of the optimal alignmentbetween sequences S and Y .

Figure 5.1: A generic center star for six strings, where the center string (S )is S .

c 3

De nition 5.2 We say that a tree T having the elements of S as its nodes, induces a

multiple alignment M(T ) over S , if for each edge (S; Y ) 2 E (T ) the value of the alignment

inducedbyM on S and Y is D (S; Y ) (optimal). In such a case we write M = M(T ).

P

Given S =(S ;:::;S ), let S 2S denote the elementof S , for which D (S ;S )is

1 k c i c

i6=c

minimal. We refer to S as the center of M .

c c

Approximation Algorithms for Multiple 3

De nition 5.3 We de ne the center star, T to be a star treeof k nodes, with the center

c

node labeled S and with each of the k 1 remaining nodes labeled by a distinct sequencein

c

SnfS g (see gure 5.1). The multiple alignment M of S is the multiple alignment induced

c c

by the center star, i.e., for each v 6= c, the alignment M induces an optimal pairwise

c

alignment between S and S .

c v

The Center Star Algorithm:

P

D (S ;S ) and let M = fS g. 1. Find S 2S minimizing

i t t t

i6=t

2. Add the sequences in SnfS g to M one by one so that the alignmentofevery newly

t

added sequence with S is optimal. Add spaces, when needed, to all pre-aligned se-

t

quences.

Running time analysis:

 

k

2

1. O (n ) for step 1.

2

P

k 1

2 2 0

2. O ((i  n)  n)=O (k  n ) for step 2. (Since the worst-case length of S after the

i=1 c

addition of i strings is (i +1)  n)

Lemma 5.2 For 1  i; j  k; i 6= j it holds that d(S ;S )  D (S ;S )+D (S ;S ).

i j i c c j

Proof: Since the triangle inequality holds for every single column of the alignmentby the

de nition of the scoring scheme, it also holds for entire strings by the de nition of d. There-

fore d(S ;S )  d(S ;S )+d(S ;S ). But the edges (S ;S ); (S ;S ) are edges of E (T ), thus

i j i c c j i c c j c

d(S ;S )= D (S ;S ). It follows that d(S ;S )  D (S ;S )+D (S ;S ).

i c i c i j i c c j

 

Let M denote the optimal alignmentofS . Let d (S ;S ) denote the value of alignment

i j



between S and S induced by M .

i j

Theorem 5.3

d(M ) 2(k 1)

c

 < 2



d(M ) k

.

Proof: By Lemma 5.2 it follows that

X X X

2d(M )= d(S ;S )  (D (S ;S )+D (S ;S )) = 2(k 1) D (S ;S ) (5.1)

c i j i c c j c i

i6=j i6=j i6=c

P

We de ne W to b e D (S ;S ) and we get:

c i

i6=c

2d(M )  2(k 1)W (5.2) c

c

4 Shamir: Algorithms for Molecular Biology Tel Aviv Univ., Fall '98

On the other hand, bythechoice of c it follows that:

X X X X X

 

2d(M )= d (S ;S )  D (S ;S )= D (S ;S )  W = kW: (5.3)

i j i j i j

i i

i6=j i6=j j 6=i

And nally,

2(k 1)W 2(k 1) d(M )

c

 = : (5.4)



d(M ) kW k

Theorem 5.3 implies that calculating the multiple alignment of the center star pro duces a

2(k 1)

times the value of the optimal multiple alignment with a value whichisatmostR =

k

k

4 3

alignment. For example R = , R = .

3 4

3 2

5.1.3 Multiple Alignment with Consensus

In this section we lo ok at an approximation algorithm for a multiple alignment that optimizes

a di erent score metrics - the consensus error. As b efore, we assume the existence of a

pairwise scoring scheme  satisfying the triangle inequality.



De nition 5.4 Given a set of strings S , the consensus error of a string S with respect to S

P

  

is E (S )= D (S; S ). Note that S need not bein S .

i

S 2S

i

Problem 5.4 Optimal Steiner string.

INPUT: A set of strings S

 

QUESTION: Find a string S which minimizes the consensus error E (S ) over al l possible

strings.



The Steiner string S attempts to capture the common characteristics of the set of strings

S and re ect them in a single string. We will present an approximation algorithm for the

optimal Steiner string problem with worst-case approximation ratio of 2.

Lemma 5.5 Let S have k strings, and assume that the scoring scheme  satis es the triangle



E (S )

2



inequality. Then there exists a string S 2S such that  2 < 2 (see e.g. [3] [pp



E (S ) k

349{351]).



Proof: For any S 2S:

X X

 

  

D (S; S )  [D (S; S )+D (S ;S )] = (5.5) E (S )=

i i



S 2S

S 6=S

i

i

X

    

  

(k 2)  D (S; S )+D (S; S )+ D (S ;S )= (k 2)  D (S; S )+E (S )

i



S 6=S i

5.2. MULTIPLE ALIGNMENT TO A 5



 

If wepick S 2S suchthatS is closest to S then:

X

  



E (S )= D (S ;S )  k  D (S; S ) (5.6)

i

S 2S

i

P

The center string S 2S minimizes D (S ;S ) and therefore its consensus error is

c c i

S 2S

i





smaller then the consensus error of the S (the string closest to S ). We get:

  

(k 2)  D (S ;S )+E (S ) (k 2)  D (S ;S ) 2 E (S )

c c c

  +1=2 (5.7)

  

E (S ) E (S ) k  D (S ;S ) k

c



The pro of ab ove uses the lemma 5.5 and the fact that E (S )  E (S )

c

It is worthwhile noting that Steiner string was de ned without alignment, and the only

requirement is the distance function, that satis es the triangle inequality.We will next start

discussing consensus strings that are alignment motivated.

5.1.4 Consensus Strings from Multiple Alignment

De nition 5.5 Given a multiple alignment M of a set of strings S ,theconsensus character

in column i of M is the character that minimizes the summed distance toitfrom al l the

characters in column i.Let d(i) denote that minimum sum in column i.

De nition 5.6 The consensus string S derivedfrom the alignment M is the concatenation

M

of the consensus characters for each column of M.

P

l

De nition 5.7 The alignment error of S equals d(i) where l is the number of char-

M

i=1

acters in S

M

De nition 5.8 The optimal consensus multiple alignment is a multiple alignment M of an

input set S whose consensus string S minimizes the alignment error. It can be shown that

M

the optimal consensus multiple alignment is equal to the optimal Steiner string, as de ned

in section 5.1.3.

We can use the center string (S ) for approximating the optimal multiple alignment with

c

2

an alignment error smaller than (2 ) times the optimal alignment error.

k

5.2 Multiple Alignmenttoa Phylogenetic Tree

De nition 5.9 Atree T with a distinct string label (from a set of string S ) assignedtoeach

leaf is cal ledaphylogenetic tree on S

c

6 Shamir: Algorithms for Molecular Biology Tel Aviv Univ., Fall '98

0

De nition 5.10 Given a phylogenetic tree T on S ,aphylogenetic alignment T for T is

an assignment of one string label to each internal node of T . Note that the strings assigned

to internal nodes need not be distinct and need not befrom the set S .

For example, when T is a star, cho osing the sequence to lab el its central no de is a phylogenetic

alignment.

The phylogenetic tree T is meant to represent the "established" evolutionary history of

a set of ob jects of interest, with the convention that each extant ob ject is represented at a

unique leaf of the tree. Eachedge(u; v ) represents some evolutionary history that transforms

the string at u (assuming u is the parentof v ) to the string at v .For convenience, when

denoting an edge by a pair of no des, we shall hereafter write the parent no de rst.

0

De nition 5.11 If strings S and S are assigned to the endpoints of an edge (i; j ), then the

0

edge distance of (i; j ) is de nedtobe D (S; S ).

De nition 5.12 Let M be a phylogenetic alignment for a phylogenetic tree T . The dis-

T

tance of M is given by the sum of al l the edge distances over al l the edges of T .

T

Problem 5.6 Phylogenetic alignment:

INPUT: AsetS =(S ;:::;S ) of strings and a phylogenetic tree T ,onS .

1 k

QUESTION: nd a phylogenetic alignment M with minimum distance.

T

We assume that the structure of the tree T is known to us. This usually happ ens in

practice when T was previously reconstructed using solid evolutionary data.

5.2.1 Lifted Alignment tree - a Heuristic for Phylogenetic Align-

ment

De nition 5.13 A phylogenetic alignment is cal led lifted alignment if for every internal

node v , the string assignedtov is also assigned to one of v 's children (See gure 5.2).

Trivial ly, in such a case, al l internal nodes are assigned labels from the set of leaf strings.

 L

Let T b e the optimal alignment for tree T . We will construct a lifted alignment T =

 

Lif t(T ), which is based on T , with only a limited damage to the alignment distance. Note



that this construction is only conceptual since we usually do not know T .

L  

. Initially, . We shall assign every v alabel S For eachnode v let its T lab el b e S

v v

L 

only the leaves are lab eled, and by de nition S = S for each leaf v . The lab eling pro cess

v v

successively traverses the internal no des in any order, provided that a no de is not visited

b efore any of its children. Up on visiting a no de, it is lifted i.e. lab eled by one of the lab els

of its children. Thus, the resulting phylogenetic alignment is lifted. (see gure 5.3)

Multiple AlignmenttoaPhylogenetic Tree 7

Figure 5.2: A phylogenetic tree with lifted alignment. Eachinternal no de is lab eled by one

of the strings lab eling its children.

c

8 Shamir: Algorithms for Molecular Biology Tel Aviv Univ., Fall '98

Pro cedure Lift(T : Tree)

b egin

while there exists an unlifted no de v , all of whose children have b een lifted, do :



Find a child j whose lab el S is the closest to S . Namely

j

v

For every child i of v :

 

D (S ;S )  D (S ;S )

j i

v v



Lab el S with S

j

v

end while

end

Figure 5.3: The lifting construction at no de v .Thenumb ers on the edges are the distances



from S to the lifted strings lab eling its children. On the left is the tree b efore lifting, and

v

on the right the result of the lift. After the lift one edge will have a distance of 0.

Theorem 5.7 (Jiang, Wang and Law ler, 1996 [5]) The distance of the phylogenetic align-

L 

ment T = Lif t(T ) is at most twice the distance of the optimal phylogenetic alignment



T .

Multiple AlignmenttoaPhylogenetic Tree 9

L

Proof: Let e =(v; w) b e an edge in T . Supp ose that in T , S is the lab el of v and S is the

j i

lab el of w .Ifi = j then D (S ;S ) = 0. Otherwise:

j i

  

D (S ;S )  D (S ;S )+D (S ;S )  2  D (S ;S ) (5.8)

i j j i i

v v v

The rst inequality is due to the triangle inequality, and the second follows from the lab eling

algorithm. For an edge e =(v; w), with S = S ,Let P b e the path in T from v to the leaf

w i e

lab eled S . Due to the triangle inequality

i

 

;S )  the total length of P in T (5.9) D (S

i e

v

L

Wesay that the edge e =(v; w)is blue in T if S 6= S . The distance of a lifted alignment

i j

L

T is equal to the sum of edge distances on all the blue edges in the tree.

For a blue edge e =(v; w), observe that the de nition of lifted alignment implies that along

the path P every no de except v is lab eled S , and no no de outside P is lab eled S . Hence, if

e i e i

0 0 0

0

e =(v ;w )isany other blue edge, then P and P have no edges in common. This de nes

e e

L 

a mapping from every blue edge e in T to a path P in T such that:

e

L 

 The distance in T of the edge e is at most twice the total distance in T of the edges

on P .(Follows from equations 5.8 and 5.9.

e

 L

 No edge in T is mapp ed to by more than one edge in T .

L

Therefore the total distance of T = the total distance on blue edges  2 the sum of all

 

total distances of P paths in T  2 the total distance of T . (see gure 5.4)

e

Wenow describ e how to nd the optimal lifted alignment using a

algorithm as listed b elow. But rst we de ne:

De nition 5.14 Let T be the subtreeofT rootedatnode v and S 2S.Let d(v; S ) denote

v

the distance of the best lifted alignment of T under the requirement that string S is assigned

v

to node v .

The algorithm will compute d(v; S ) for any S 2S working it's way from the leaves up using

the following recursion:

P

 If v is an internal no de with all its children b eing leaves, then d(v; S )= D (S; S )

w

(v;w)

where S is the lab el of w .

w

P

0 0 0 0 0

0

min [D (S; S )+d(v ;S )] where v is a child of v and S is a  Else, d(v; S )=

0

S 2S

(v;v )

0

lab el of one of the leaves of T .

v

 

k

Time analysis: We p erform a prepro cessing stage, in whichwe compute all the pairwise

2

2

distances b etween the k input strings. This takes O (N ) time, where N is the total length of

2

all the strings. The work at anyinternal no de is O (k ), and the overall work of the algorithm

2 3

is O (N + k )

c

10 Shamir: Algorithms for Molecular Biology Tel Aviv Univ., Fall '98



Figure 5.4: The lifted tree T . The dashed edges show the paths along which a leaf string

L



, while each of these has b een lifted to some internal no de. Solid edges are blue edges in T

L

dashed edges has distance 0. The path P for example, is the path b; d; S along which

4

(a;b)



the string lab eling b was lifted. Edge (a; b) has distance in T at most twice the distance of

L



path P in T .

(a;b)

5.3. COMMON MULTIPLE ALIGNMENTS METHODS 11

5.3 Common Multiple Alignments Metho ds

5.3.1 Aligning a String to a Pro le

Given a database of sequences, wewould like to partition it into families of \similar" se-

quences. For this purp ose wewould like to encompass our knowledge on the common prop-

erties of the sequences in a family pro le (formal de nition to follow). Constructing a pro le

of a family enables us to identify its memb ers and test whether or not a new sequence b e-

longs to the family. Moreover, searching the database with a pro le is more sensitive than

searching using a single sequence of the family: When searching with a single sequence, we

can only lo ok for the sequences in the database with the b est alignments with the given

sequence, while when using a pro le wemay test for memb ership to the family.

0

De nition 5.15 for an alignment S of length l ,apro le is a l j [fgj matrix, whose

columns areprobability vector denoting the frequencies of each symbol in the corresponding

alignment column.

Any alignmentbetween a sequence B and a pro le P (i.e. b oth have the same length)

P

m

can b e evaluated by  (p ;b ). Clearly, using dynamic programming, we can nd the

j j

j =1

b est alignment of a sequence against a pro le.

The key in pairwise alignment is scoring two p ositions x and y :  (x; y ). For a letter x

andacolumny of a pro le, let  (x; y ) b e the probabilityofx b eing in column y . The value

for x dep ends on the frequency of it's o ccurences in the column y .We also need to devise a

score for  (x; ). In order to nd whether a given sequence is a memb er of certain family,

we use a usual pairwise dynamic programming alignment to compare the given sequence to

the family pro le.

5.3.2 Iterative pairwise alignment

This approach uses pairwise alignment scores to iteratively add one additional string to a

growing multiple alignment. We start by aligning the two strings whose is the

minimum over all pairs of strings. Then we iteratively consider the string with the smallest

distance to any of the strings already in the multiple alignment.

More generally, the algorithm works as follows:

1. Align some pair.

2. while (not done)

(a) Pick an unaligned string which is "near" some aligned one(s).

(b) Align with the pro le of the previously aligned group

Resulting new spaces are inserted into all strings in the group.

c

12 Shamir: Algorithms for Molecular Biology Tel Aviv Univ., Fall '98

5.3.3 Progressive alignment

Feng-Do olittle 1987

In this algorithm (Feng and Do olittle, 1987 [1]), the key idea is that the pair of strings with

minimum distance is almost likely to having b een obtained from the pair of ob jects that had

most recently diverged, and that the pairwise alignment of these two sp eci c strings provides

the most "reliable" information that can b e extracted from the input strings. Therefore any

spaces (gaps) that app ear in the optimal pairwise alignment of those two strings should b e

preserved in the overall multiple alignment.

The algorithm is as follows:

 

k

1. Calculate the pairwise alignment scores, and convert then to distances.

2

2. Use an incremental clustering algorithm (Fitch and Margoliash, 1967 [2]) to construct

a tree from the distances.

3. Traverse the no des in their order of addition to the tree, rep eatedly align the child

no des (sequences or alignments).

Features of this heuristic:

 Highest scoring pairwise alignment determines the alignmenttotwo groups.

 "Once a gap, always a gap": replace gaps in alignments by a neutral character.

CLUSTALW

ClustalW is a software package for multiple alignment (implementing an algorithm of Thomp-

son, Higgins, Gibson 1994 [4]). The basic idea is the same as in the Feng-Do olittle algorithm:

 

k

1. Calculate the pairwise alignment scores, and convert to two distances.

2

2. Use clustering algorithm (Neighb or-Joining) to build a tree from the distances.

3. align sequence - sequence, sequence - pro le, pro le - pro le in decreasing similarity

order.

This algorithm makesuseofmany ad-ho c rules suchasweighting, di erent matrix scores and sp ecial gap scores.

Bibliography

[1] D. Feng and R. F. Do olittle. Progressive sequence alignment as a prerequisite to correct

phylogenetic trees. J. Mol. Evol., 60:351{360, 1987.

[2] W. M. Fitch and E. Margoliash. Construction of phylogenetic trees. Science, 155:279{284,

1967.

[3] D. Gus eld. Algorithms on Strings, Trees and Sequences.Cambridge University Press,

New York, 1997.

[4] D. G. Higgins J. D. Thompson and T. J. Gibson. Clustal w: improving the sensitivity

of progressivemultiple sequence alignment through sequence weighting, p osition-sp eci c

gap p enalties and weight matrix choice. Nucleic Acids Res, 22:4673{80, 1994.

[5] E. L. Lawler T. Jiang, L. Wang. Approximation algorithms for tree alignment with a

given phylogeny. Algorithmica, 16:302{315, 1996. 13