Multiple Alignment

Reference: Gusfield, Algorithms on Strings, Trees & Sequences, chapter 14

Some slides from:
• Jones, Pevzner, USC Intro to Algorithms http://www.bioalgorithms.info/
• S. Batzoglou, Stanford http://ai.stanford.edu/~serafim/CS262_2006/
• Geiger, Wexler, Technion http://www.cs.technion.ac.il/~cs236522/
• Ruzzo, Tompa, U. Washington CSE 590bi
• Poch, Strasbourg www.inra.fr/internet/Projets/agroBI/PHYLO/Poch.ppt
• A. Drummond, Auckland, NZ

Revised Nov 2015
CG © Ron Shamir, 09

Multiple Alignment vs. Pairwise Alignment

• Up until now we have only tried to align two sequences.
• What about more than two? And what for?
• A faint similarity between two sequences becomes significant if present in many
• Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal


• “Pairwise alignment whispers … multiple alignment shouts out loud” (Hubbard, Lesk, Tramontano, Nature Structural Biology 1996)

Multiple Alignment Definition

Input: Sequences S1, S2, …, Sk over the same alphabet
Output: Gapped sequences S’1, S’2, …, S’k of equal length

1. |S’1| = |S’2| = … = |S’k|

2. Removal of spaces from S’i gives Si for all i
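The two conditions can be checked mechanically. A minimal Python sketch, assuming rows are plain strings and a space is written as '-'; the name `is_valid_alignment` is illustrative and not from the slides:

```python
GAP = "-"  # assumed symbol for a space


def is_valid_alignment(seqs, rows):
    """Check conditions 1 and 2 of the multiple alignment definition."""
    if len(seqs) != len(rows):
        return False
    # Condition 1: all gapped rows have the same length.
    if len({len(r) for r in rows}) > 1:
        return False
    # Condition 2: removing the spaces from row i gives back sequence i.
    return all(r.replace(GAP, "") == s for s, r in zip(seqs, rows))


seqs = ["AGGTC", "GTTCG", "TGAAC"]
print(is_valid_alignment(seqs, ["AGGT-C-", "-G-TTCG", "TG-AAC-"]))  # True
```

A row set whose third row spells TGAAAC instead of TGAAC fails condition 2, even though all rows have equal length.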

Example

S1 = AGGTC
S2 = GTTCG
S3 = TGAAC

Possible alignment:

A G G T - C -    S’1
- G - T T C G    S’2
T G - A A C -    S’3

Possible alignment:

A G G T - C -    S’1
G T T - - C G    S’2
- T G A A - C    S’3

|S’1| = |S’2| = |S’3|

Example 1

Multiple sequence alignment of 7 neuroglobins using clustalx

Identify and represent families.

Example 2

Aggregation of deamidated human βB2-crystallin and incomplete rescue by α-crystallin chaperone. Michiel et al. Experimental Eye Research 2010

Identify and represent conserved motifs (conserved common biological function).

Protein Phylogenies – Example 3

Kinase domain

Deduce evolutionary history

Motivation again

• Common structure, function or origin may be only weakly reflected in sequence – multiple comparisons may highlight weak signals
• Major uses:
  – Identify and represent protein families
  – Identify and represent conserved seq. elements (e.g. domains)
  – Deduce evolutionary history

MSA: central role in biology

[Figure: MSA at the center of its application areas]
• Phylogenetic studies
• Comparative genomics
• Hierarchical function annotation: homologs, domains, motifs
• Gene identification, validation
• Structure comparison, modelling
• RNA sequence, structure, function
• Interaction networks
• Human genetics, SNPs
• Therapeutics, drug design, drug discovery
[Figure labels: DBD, insertion domain, LBD, binding sites / mutations]

Scoring alignments

• Given input seqs. S1, S2, …, Sk find a multiple alignment of optimal score
• Scores preview:
  – Sum of pairs
  – Consensus
  – Tree

Sum of Pairs score

Def: Induced pairwise alignment – the pairwise alignment that a multiple alignment induces on two of its sequences: keep their two rows and drop every column in which both rows have a space.

Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x’: ACGCGG-C     x’: AC-GCGG-C     y’: AC-GCGAG
y’: ACGC-GAG     z’: GCCGC-GAG     z’: GCCGCGAG

SOP(M) = Σ_{k<l} score of the pairwise alignment that M induces on Sk, Sl

Consider the following alignment:

AC-CDB-
-C-ADBD
A-BCDAD

Scoring scheme: match: 0; mismatch/indel: -1

SP score: -3 -5 -4 = -12
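The SP score of this example can be recomputed pair by pair. A short Python sketch, assuming a column with two spaces scores 0 (consistent with the -12 total); the function names are illustrative:

```python
from itertools import combinations


def pair_score(a, b, match=0, mismatch=-1, indel=-1):
    """Score of the pairwise alignment induced on two rows of an MSA."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column with two spaces: dropped from the induced alignment
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sp_score(rows, **scheme):
    """Sum of the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b, **scheme) for a, b in combinations(rows, 2))


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(rows))  # -12
```

The three pairs contribute -3 (rows 1,2), -4 (rows 1,3) and -5 (rows 2,3).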

Optimal MSA

Alignments = Paths

• Align 2 sequences: ATGC, AATC

A -- T G C

A A T -- C

Alignment Paths

x coordinate:  0   1   1   2   3   4
                 A   --  T   G   C
y coordinate:  0   1   2   3   3   4
                 A   A   T   --  C

• Resulting path in (x,y) space: (0,0) → (1,1) → (1,2) → (2,3) → (3,3) → (4,4)

Alignments = Paths

• Align 3 sequences: ATGC, AATC,ATGC

A -- T G C

A A T -- C

-- A T G C

Alignment Paths

x coordinate:  0   1   1   2   3   4
                 A   --  T   G   C
y coordinate:  0   1   2   3   3   4
                 A   A   T   --  C
z coordinate:  0   0   1   2   3   4
                 --  A   T   G   C

• Resulting path in (x,y,z) space: (0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)
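The correspondence between alignments and paths is easy to make concrete. A small Python sketch, assuming each row is a string with one '-' per space (the slides draw it as "--"); `alignment_path` is an illustrative name:

```python
def alignment_path(rows):
    """Return the lattice path of an alignment: one coordinate per sequence."""
    pos = [0] * len(rows)
    path = [tuple(pos)]
    for column in zip(*rows):
        for i, ch in enumerate(column):
            if ch != "-":
                pos[i] += 1  # a letter advances that sequence's coordinate
        path.append(tuple(pos))
    return path


print(alignment_path(["A-TGC", "AAT-C", "-ATGC"]))
# [(0, 0, 0), (1, 1, 0), (1, 2, 1), (2, 3, 2), (3, 3, 3), (4, 4, 4)]
```

The same function handles two rows and gives the (x,y) path of the pairwise example.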

2-D vs 3-D Alignment Grid

[Figure: 2-D edit graph with axes V and W, next to a 3-D edit graph]

Aligning Three Sequences

• Same strategy as aligning two sequences
• Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align
• For global alignments, go from source to sink

[Figure: 3-D Manhattan cube with the source and sink corners marked]

3-D cell versus 2-D Alignment Cell

In 2-D, 3 edges in each unit square

In 3-D, 7 edges in each unit cube

Architecture of 3-D Alignment Cell

Vertices of the unit cube ending at (i,j,k):
(i-1,j-1,k-1)   (i-1,j,k-1)   (i-1,j-1,k)   (i-1,j,k)
(i,j-1,k-1)     (i,j,k-1)     (i,j-1,k)     (i,j,k)

Edges entering (i,j,k):
• Cube diagonal: no indels
• Face diagonal: 1 indel
• Edge: 2 indels

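Multiple Alignment:

\[
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge: two indels}
\end{cases}
\]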

• δ(x, y, z ) is an entry in the 3-D scoring matrix
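A direct Python sketch of this recurrence. The column score δ is an assumption here: the sum of the three pairwise scores with match 0 and mismatch/indel -1 (two spaces score 0). Names such as `align3_score` are illustrative:

```python
from itertools import product


def delta(x, y, z):
    """Assumed column score: sum over the three pairs; equal symbols 0, otherwise -1."""
    def pair(a, b):
        return 0 if a == b else -1
    return pair(x, y) + pair(x, z) + pair(y, z)


def align3_score(v, w, u):
    """Optimal (max) score of a global alignment of three sequences."""
    n1, n2, n3 = len(v), len(w), len(u)
    neg = float("-inf")
    s = [[[neg] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    s[0][0][0] = 0
    # The 7 edges entering a cell: every non-zero 0/1 step vector.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            x = v[i - 1] if di else "-"
            y = w[j - 1] if dj else "-"
            z = u[k - 1] if dk else "-"
            s[i][j][k] = max(s[i][j][k], s[pi][pj][pk] + delta(x, y, z))
    return s[n1][n2][n3]


print(align3_score("ATGC", "AATC", "ATGC"))  # -4
```

The triple loop visits cells in lexicographic order, so every predecessor is final before it is read. A traceback would need back-pointers, which this sketch omits.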

Running Time

• For 3 sequences of length n, the run time is O(n^3)

• For k sequences, build a k-dimensional cube, with run time O(2^k n^k) [another 2^k factor for affine gaps]

• Impractical for most realistic cases

• NP-hard (Elias ’03 for general matrices)

Minimum cost – SOP

We use min cost instead of max score

Find alignment of minimal cost

Observe: opt SOP cost ≥ sum of optimal pairwise costs (each induced pairwise alignment costs at least the optimum for its pair)

Forward Dynamic Programming

• An alternative approach to DP. Useful for SOP alignment:
• D(v) – opt value of a path source ⋅⋅⋅ v
• p(w) – best-yet value of a path source ⋅⋅⋅ w
• When D(v) is computed, send its value forward on the arcs exiting from v:
  for each arc v→w: p(w) = min{p(w), D(v) + cost(v,w)}
• Once p(w) has been updated by all incoming edges, that value is optimal; set it as D(w)

Forward Dynamic Programming (2)

• Maintain a queue of nodes whose D is not set yet
• For the node w at the head of the queue: set D(w) ← p(w) and remove w
• ∀ out-neighbor x of w – update p(x); if x is not in the queue – add it at the end
  – Breaking ties lexicographically
• Same complexity as the regular (backwards) DP
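A sketch of forward DP on the pairwise edit graph, in Python. Two assumptions: unit costs (mismatch and indel cost 1, match 0), and a priority queue ordered lexicographically on the node in place of the slide's queue. Every predecessor of a grid node is lexicographically smaller, so the smallest pending node has already received all its incoming updates:

```python
import heapq


def forward_dp(s, t, sub=1, indel=1):
    """Min-cost global alignment of s and t, computed by pushing values forward."""
    n, m = len(s), len(t)
    p = {(0, 0): 0}   # best-yet values
    D = {}            # final values
    heap = [(0, 0)]   # nodes whose D is not set yet
    while heap:
        v = heapq.heappop(heap)
        D[v] = p[v]   # all arcs entering v were relaxed earlier
        i, j = v
        arcs = []
        if i < n:
            arcs.append(((i + 1, j), indel))
        if j < m:
            arcs.append(((i, j + 1), indel))
        if i < n and j < m:
            arcs.append(((i + 1, j + 1), 0 if s[i] == t[j] else sub))
        for w, cost in arcs:
            if w not in p:            # first time w is reached: enqueue it
                p[w] = D[v] + cost
                heapq.heappush(heap, w)
            else:
                p[w] = min(p[w], D[v] + cost)
    return D[(n, m)]


print(forward_dp("ATGC", "AATC"))  # 2
```

The heap adds a logarithmic factor over a plain queue; it is used here only because its correctness is easy to argue.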

Faster DP Algorithm for SOP alignment (Carrillo-Lipman 88)

• Idea: after computing D(v), with a little extra computation, we may already know that v will not be on any optimal solution.

• ∀ k,l, k<l: fkl(i,j) – cost of an optimal pairwise alignment of the suffixes of Sk and Sl that follow positions i and j
• z – cost of some known multiple alignment (an upper bound on the optimum)
• If D(i,j,k) + f12(i,j) + f13(i,k) + f23(j,k) > z: do not send D(i,j,k) forward
• Guarantees opt SOP alignment – no improved time bound, but often saves a lot in practice.

Approximation algorithms

Approximation Algorithms - assumption

We use min cost instead of max score

Find alignment of minimal cost

Assumption: the cost function δ is a distance function
• δ(x,x) = 0
• δ(x,y) = δ(y,x) ≥ 0
• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality)

D(S,T) - cost of minimum global alignment between S and T

The Center Star algorithm (Gusfield 1993)

Input: Γ - set of k strings S1, …, Sk.

1. Find the string S* ∈ Γ (center) that minimizes Σ_{S ∈ Γ\{S*}} D(S*, S)

2. Denote S1 = S* and the rest of the strings as S2, …, Sk

3. Iteratively add S2, …, Sk to the alignment as follows:

a. Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1

b. Optimally align Si to S’1 to produce S’i and S’’1 aligned

c. Adjust S’2, …, S’i-1 by adding spaces where spaces were added to S’’1

d. Replace S’1 by S’’1
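The steps above can be sketched in Python. Assumptions: unit costs (match 0, mismatch/indel 1, δ(-,-) = 0), spaces written as '-', and illustrative names (`align`, `center_star`). `align` returns edit operations so that spaces newly added to S’1 can be told apart from spaces it already had:

```python
def align(a, b):
    """Optimal global alignment of a (may already contain '-') with b.
    Returns (cost, ops): 'M' uses a letter of both, 'A' only of a, 'B' only of b."""
    def c(x, y):
        return 0 if x == y else 1
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + c(a[i - 1], "-")
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + c(a[i - 1], b[j - 1]),
                          D[i - 1][j] + c(a[i - 1], "-"),
                          D[i][j - 1] + 1)
    i, j, ops = n, m, []
    while i > 0 or j > 0:  # traceback
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + c(a[i - 1], b[j - 1]):
            ops.append("M"); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + c(a[i - 1], "-"):
            ops.append("A"); i -= 1
        else:
            ops.append("B"); j -= 1
    return D[n][m], ops[::-1]


def center_star(seqs):
    """Return (rows in input order, index of the center string)."""
    dist = [[align(s, t)[0] for t in seqs] for s in seqs]
    center = min(range(len(seqs)), key=lambda i: sum(dist[i]))   # step 1
    order = [center] + [i for i in range(len(seqs)) if i != center]
    rows = [seqs[center]]                 # rows[0] plays the role of S'1
    for idx in order[1:]:                 # step 3
        _, ops = align(rows[0], seqs[idx])
        new_rows, new_seq, col, pos = [[] for _ in rows], [], 0, 0
        for op in ops:
            if op == "B":                 # space added to S'1: older rows inherit it
                for r in new_rows:
                    r.append("-")
            else:                         # existing column is copied unchanged
                for r, old in zip(new_rows, rows):
                    r.append(old[col])
                col += 1
            if op == "A":
                new_seq.append("-")
            else:
                new_seq.append(seqs[idx][pos]); pos += 1
        rows = ["".join(r) for r in new_rows] + ["".join(new_seq)]
    result = [None] * len(seqs)
    for row, idx in zip(rows, order):
        result[idx] = row
    return result, center


rows, center = center_star(["AGAC", "ATGA", "ATGGA", "AGTGA"])
print(center)  # 1, i.e. ATGA
```

Ties in the traceback are broken arbitrarily, so the rows printed for a given input are one of possibly several outputs with the same guarantees.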

The Center Star algorithm (demonstration)

x: AGAC
y: ATGA   <- center
z: ATGGA
w: AGTGA

[Figure: star tree with center y and leaves x, z, w; the pairwise alignments of x, z, w with y are merged into one multiple alignment, with the earlier rows inheriting the gaps added to y]

The Center Star algorithm: Running time

• Choosing S1 – execute DP for all sequence-pairs - O(k^2 n^2)

• Adding Si to the alignment - execute DP for Si, S’1 - O(i·n^2)
  (In the i-th stage the length of S’1 can be up to i·n)

• Total complexity: Σ_{i=1}^{k-1} O(i·n^2) = O(k^2 n^2)

The Center Star algorithm: Approximation ratio

• M* - An optimal alignment
• M - The alignment produced by the center-star algorithm
• d(i,j) - The distance M induces on the pair Si, Sj
• SOP cost of M:

\[ v(M) = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d(i,j) = 2 \sum_{i<j} d(i,j) \]

• Recall D(S,T) – min cost of alignment between S and T

For all i: d(1,i) = D(S1, Si)
(we perform optimal alignment between S’1 and Si, and δ(-,-) = 0)

The Center Star algorithm: Approximation ratio (2)

Triangle inequality + symmetry:

\[ v(M) = \sum_{i=1}^{k} \sum_{j \ne i} d(i,j) \le \sum_{i=1}^{k} \sum_{j \ne i} \bigl( d(1,i) + d(1,j) \bigr) = 2(k-1) \sum_{l=2}^{k} d(1,l) = 2(k-1) \sum_{l=2}^{k} D(S_1, S_l) \]

Definition of S1:

\[ \forall i: \ \sum_{j=2}^{k} D(S_1, S_j) \le \sum_{\substack{j=1 \\ j \ne i}}^{k} D(S_i, S_j) \]

For the optimal alignment M*, with d*(i,j) the distance it induces on Si, Sj:

\[ v(M^*) = \sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j) \ge \sum_{i=1}^{k} \sum_{j \ne i} D(S_i, S_j) \ge \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1, S_j) = k \sum_{j=2}^{k} D(S_1, S_j) \]

\[ \frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2 \]

The Center Star algorithm: Theorem (Gusfield 93)

• We have proved: the center star algorithm is a polynomial algorithm that guarantees a solution at most twice the optimum of SOP alignment.

• “a 2-approximation”
• “an approximation ratio of 2”

Steiner String and Consensus MA

Consensus error & Steiner string - definitions

• Input: set of k strings Γ = {S1, …, Sk}.
• D(X,Y) – cost of an optimal alignment of X, Y.
• S – arbitrary sequence (unrelated to Γ)

• The consensus error of S relative to Γ: E(S) = Σi D(S, Si)

• S* is an optimal Steiner string for Γ if it minimizes E(S)
• Different objective function – linear number of terms
• No direct relation to multialign! (for now)

Optimal Steiner String: Approximation

Thm: Assume D satisfies the triangle inequality. Then ∃ S ∈ Γ that guarantees an approximation ratio of 2.

Pf: Pick S ∈ Γ.

\[ E(S) = \sum_{S_i \ne S} D(S, S_i) \le \sum_{S_i \ne S} \bigl( D(S, S^*) + D(S^*, S_i) \bigr) = (k-2)\, D(S, S^*) + D(S, S^*) + \sum_{S_i \ne S} D(S^*, S_i) = (k-2)\, D(S, S^*) + E(S^*) \]

Pick S ∈ Γ closest to S* (not constructively):

\[ E(S^*) = \sum_{S_i \in \Gamma} D(S^*, S_i) \ge k \cdot D(S, S^*) \]

\[ \frac{E(S)}{E(S^*)} \le \frac{k-2}{k} + 1 < 2 \]

Optimal Steiner String: Approximation

Resulting algorithm:

Pick Sc ∈ Γ that minimizes E(Sc) (Sc is the center string).

Approximation:

Sc gives a 2-approximation. The center string has a consensus error at most 2 times the consensus error of the optimal Steiner string.

Consensus String

• Given a multiple alignment, the consensus character of column i is the character that minimizes the summed distance to it from all the characters in column i of the MA.
• For example, given a scoring scheme with match = 0, mismatch or indel = 1, the consensus character is the most frequent character.
• The consensus string derived from an alignment is the concatenation of its consensus characters.

Consensus string versus Steiner string definitions

• d(S,T) - The distance that a given multiple alignment M induces on the pair S,T
• D(S,T) – min cost of alignment between S and T

• Optimal Steiner string S*: minimizes Σi D(S*, Si)
• Optimal consensus string Scon: minimizes Σi d(Scon, S’i)

Consensus multiple alignment

Scon: AC-GC-GAG

x:    AC-GCGG-C
y:    AC-GC-GAG
z:    GCCGA-GAG
u:    AC-T-GGCA
v:    -CAGT-GAG
w:    AC-GC-GAG

Alignment error(M) = Σi d(Scon, S’i)

The opt consensus MA: one with least alignment error
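For the scoring scheme match = 0, mismatch or indel = 1, both the consensus string and the alignment error of this example can be recomputed. A short Python sketch with illustrative function names; ties between equally frequent characters are broken by first occurrence in the column, which is an arbitrary choice:

```python
from collections import Counter


def consensus_string(rows):
    """Most frequent character of each column (a space counts as a character)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def alignment_error(rows, cons):
    """Sum over rows of the number of columns that differ from the consensus."""
    return sum(sum(1 for a, b in zip(row, cons) if a != b) for row in rows)


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGA-GAG",
        "AC-T-GGCA", "-CAGT-GAG", "AC-GC-GAG"]
cons = consensus_string(rows)
print(cons)                         # AC-GC-GAG
print(alignment_error(rows, cons))  # 14
```

Rows y and w coincide with the consensus; x, z and v each differ in 3 columns and u in 5.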

Consensus multiple alignment

Thm: optimal solution of consensus MA = Steiner string (up to spaces)
• Opt consensus string = Steiner string (up to spaces)
• Optimal alignment error = optimal consensus error

Pf: ex.

Gusfield theorems 14.7.2 and 14.7.3

A summary

Strings:
• Sc – center string
• S* – optimal Steiner string
• Scon – optimal consensus string

Alignments:
• Center star algorithm alignment
• Optimal SOP alignment
• Optimal consensus alignment

Relations:
• Consensus error of Sc, Σi D(Sc, Si): 2-approximation of the optimal consensus error Σi D(S*, Si)
• Optimal consensus error Σi D(S*, Si) == optimal (consensus) alignment error Σi d(Scon, S’i) (Pf: ex.)
• Alignment error of the center star alignment: Σi d(Scon, S’i) <= Σi d(Sc, S’i) = Σi D(Sc, Si), the consensus error of Sc
• SOP score of the center star alignment: 2-approximation of the optimal SOP score
• Alignment error of the center star alignment: 2-approximation of the optimal (consensus) alignment error

Theorem

• Assuming the triangle inequality, the multiple alignment created by the center star method has:
  - A sum-of-pairs (SOP) score that is never more than 2 times the SOP score of the optimal SOP alignment
  - An alignment error that is never more than 2 times the alignment error of the optimal consensus alignment.

“Tree alignment”, “phylogenetic alignment”

Scoring alignments

• Given input seqs. S1, S2, …, Sk find a multiple alignment of optimal score
• Scores preview:
  – Sum of pairs: optimal SOP alignment
  – Consensus: optimal consensus alignment
  – Tree: optimal phylogenetic/tree alignment

Phylogenetic/tree MA

• Input: Tree T, a string for each leaf
• Phylogenetic/tree alignment problem: assignment of a string to each internal node in T

[Figure: tree whose nodes are labeled with the strings CTGG, CTGG, CCGG, GTTG, CTTG, GTTG, GTTC]

• Score – sum of scores along edges: Σ_{(i,j) in T} D(Si, Sj)
• Goal: find tree alignment of optimal score

• Consensus = tree alignment where T is a star

Tree MA – complexity

• Consensus = tree Alignment where T is a star 59 CG © Ron Shamir, 09 Tree MA – complexity

• NP-hard
• Lifted alignment: efficient heuristic method. Poly time, 2-approximation assuming triangle inequality

• In recitation

Multiple Alignment: Greedy Heuristic

Profile Representation of MA

-  A  G  G  C  T  A  T  C  A  C  C  T  G
T  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  T  C  A  C  -  G  G
C  A  G  -  C  T  A  T  C  G  C  -  G  G

col:  1   2   3   4   5   6   7   8   9   10  11  12  13  14
A     0   1   0   0   0   0   1   0   0   .8  0   0   0   0
C     .6  0   0   0   1   0   0   .4  1   0   .6  .2  0   0
G     0   0   1   .2  0   0   0   0   0   .2  0   0   .4  1
T     .2  0   0   0   0   1   0   .6  0   0   0   0   .2  0
-     .2  0   0   .8  0   0   0   0   0   0   .4  .8  .4  0

• Alternatively, use log odds:
  • pi(a) = fraction of a’s in col i
  • p(a) = fraction of a’s overall
  • log pi(a)/p(a)

Aligning a sequence to a profile

• Key in pairwise alignment is scoring two letters x,y: σ(x,y)
• For a letter x and a column C in a profile, σ(x,C) = value of x in col. C
• Invent a score for σ(x,-)
• Run the DP alg for pairwise alignment

Aligning alignments

• Given two alignments, how can we align them?
• Hint: use DP on the corresponding profiles.

Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2:
w GGACGTACC--
v GGACCT-----

Profile-profile scoring

• Fix a position in the alignment
  – pi – prob(i in 1st profile); qi – in 2nd profile
• Expected score: Σ_{i,j} pi qj σ(i,j)
• Other scores in use:
  – Euclidean distance
  – Pearson correlation
  – KL-divergence (relative entropy)

Multiple Alignment: Greedy Heuristic

• Choose most similar pair of sequences and combine into a profile, thereby reducing alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat
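The greedy step relies on two building blocks: turning an alignment into a profile, and scoring two profile columns. A Python sketch of both, using the expected score Σ_{i,j} pi qj σ(i,j). The letter score `sigma` (match 1, mismatch or indel -1, two spaces 0) and all function names are assumptions for illustration:

```python
from collections import Counter


def profile(rows, alphabet="ACGT-"):
    """One dict per column: fraction of each symbol (space included) in that column."""
    n = len(rows)
    cols = []
    for col in zip(*rows):
        counts = Counter(col)
        cols.append({a: counts[a] / n for a in alphabet})
    return cols


def sigma(a, b):
    """Assumed letter score: match 1, mismatch/indel -1, two spaces 0."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def expected_score(p, q, score=sigma):
    """Expected score of aligning profile column p against profile column q."""
    return sum(p[a] * q[b] * score(a, b) for a in p for b in q)


rows = ["-AGGCTATCACCTG",
        "TAG-CTACCA---G",
        "CAG-CTACCA---G",
        "CAG-CTATCAC-GG",
        "CAG-CTATCGC-GG"]
prof = profile(rows)
print(prof[0])  # {'A': 0.0, 'C': 0.6, 'G': 0.0, 'T': 0.2, '-': 0.2}
```

With these pieces, aligning two profiles is the pairwise DP with `expected_score` in place of the letter score; a sequence is the special case of a profile whose columns are 0/1.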

k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG…

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

Progressive Alignment

• A variation of the greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.

Progressive alignment

Align sequences (pairwise) in some (greedy) order

Decisions

(1) Order of alignments
(2) Alignment of group to group
(3) Method of alignment, and scoring function

Guide tree

[Figure: two alternative guide trees, "this?" over leaves A to E and "or this?" over leaves A to F]

ClustalW (Thompson, Higgins, Gibson 94)

• Popular multiple alignment tool today
• Three-step process
  1.) Construct pairwise alignments
  2.) Build guide tree
  3.) Progressive alignment guided by the tree

Step 1: Pairwise Alignment

92 Shamir n Ro © GC Step 1: Pairwise Alignment

• Aligns each pair of sequences, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1    -
v2    .17   -
v3    .87   .28   -
v4    .59   .33   .62   -

(.17 means 17% identical)

Step 2: Guide Tree

93 Shamir n Ro © GC Step 2: Guide Tree

• Use the similarity matrix to create a guide tree by applying some clustering method*

• Guide tree roughly reflects evolutionary relations

• *ClustalW uses the neighbor-joining method (to be described later in the course)

Step 2: Guide Tree (cont’d)

94 Shamir n Ro © GC Step 2: Guide Tree (cont’d)

[Figure: guide tree that joins v1 and v3 first, then v4, then v2]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Using the guide tree, add in the most similar pair (seq-seq, seq-prof or prof-prof)
• Insert gaps as necessary
• Many ad-hoc rules: weighting, different matrices, special gap scores …
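The first bullet depends on the Step 1 similarity matrix. A minimal Python sketch of percent identity and of picking the most similar pair. It assumes the pairwise alignment is already given as two equal-length rows and divides by the alignment length, which is one possible reading of "sequence length"; this is not ClustalW's actual code:

```python
def percent_identity(a, b):
    """Fraction of columns of a pairwise alignment that hold two identical letters."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def most_similar_pair(sim):
    """sim maps a pair of names to its similarity; return the best pair."""
    return max(sim, key=sim.get)


# Similarity matrix from the Step 1 slide.
sim = {("v1", "v2"): .17, ("v1", "v3"): .87, ("v2", "v3"): .28,
       ("v1", "v4"): .59, ("v2", "v4"): .33, ("v3", "v4"): .62}
print(most_similar_pair(sim))              # ('v1', 'v3')
print(percent_identity("AC-GT", "ACGGT"))  # 0.8
```

The pair (v1, v3) is the one the guide tree joins first.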

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP------LPFQ
            . . : ** . :.. *:.* * . * **:

Dots and stars show how well-conserved a column is.

Multiple Alignment: History

1975 Sankoff: formulated multiple alignment problem and gave dynamic programming solution
1988 Carrillo-Lipman: branch and bound approach for MSA
1990 Feng-Doolittle: progressive alignment
1994 Thompson-Higgins-Gibson, ClustalW: most popular multiple alignment program (>40K citations!)
1998 Morgenstern et al., DIALIGN: segment-based multiple alignment
2000 Notredame-Higgins-Heringa, T-coffee: using the library of pairwise alignments
2002 MAFFT
2004 MUSCLE
2005 ProbCons
2011 Clustal Omega

Still a lot to be done!