
Introduction

Sequence Alignment

Motivation: assess similarity of sequences and learn about their evolutionary relationship.

Why do we want to know this?

Example: sequences ACCCGA, ACTA, TCCTA ⇒ align ⇒ alignment

    ACCCGA
    AC--TA
    TCC-TA

Homology: alignment is reasonable if the sequences are homologous.

Definition (Sequence Homology)
Two or more sequences are homologous iff they evolved from a common ancestor.

[Homology in anatomy]

[Figure: evolutionary relationship of the sequences ACCCGA, ACCGA, ACCTA, ACTA and TCCTA]

S.Will, 18.417, Fall 2011

Plan (and Some Preliminaries)

• First: study only pairwise alignment. Fix an alphabet $\Sigma$ such that $- \notin \Sigma$; $-$ is called the gap symbol. The elements of $\Sigma^*$ are called sequences. Fix two sequences $a, b \in \Sigma^*$.
• For pairwise sequence comparison: define edit distance, define alignment distance, show equivalence of the distances, define the alignment problem and an efficient algorithm; then gap penalties and local alignment.
• Later: extend pairwise alignment to multiple alignment.

Definition (Alphabet, words)
An alphabet $\Sigma$ is a finite set (of symbols/characters). $\Sigma^+$ denotes the set of non-empty words of $\Sigma$, i.e. $\Sigma^+ := \bigcup_{i>0} \Sigma^i$. A word $x \in \Sigma^n$ has length $n$, written $|x|$. $\Sigma^* := \Sigma^+ \cup \{\epsilon\}$, where $\epsilon$ denotes the empty word of length 0.

Levenshtein Distance

Definition
The Levenshtein distance between two words/sequences is the minimal number of substitutions, insertions and deletions needed to transform one into the other.

Example
ACCCGA and ACTA have distance (at most) 3: ACCCGA → ACCGA → ACCTA → ACTA

In biology, operations have different cost. (Why?)

Edit Distance: Operations

Definition (Edit Operations)
An edit operation is a pair $(x, y) \in (\Sigma \cup \{-\})^2$ with $(x, y) \neq (-, -)$. We call $(x, y)$
• substitution iff $x \neq -$ and $y \neq -$
• deletion iff $y = -$
• insertion iff $x = -$

For sequences $a, b$, write $a \to_{(x,y)} b$ iff $a$ is transformed to $b$ by the operation $(x, y)$. Furthermore, write $a \Rightarrow_S b$ iff $a$ is transformed to $b$ by a sequence of edit operations $S$.

Example
ACCCGA $\to_{(C,-)}$ ACCGA $\to_{(G,T)}$ ACCTA $\to_{(-,T)}$ ATCCTA
ACCCGA $\Rightarrow_{(C,-),(G,T),(-,T)}$ ATCCTA

Recall: $- \notin \Sigma$; $a, b$ are sequences in $\Sigma^*$.

Edit Distance: Cost and Problem Definition

Definition (Cost, Edit Distance)
Let $w : (\Sigma \cup \{-\})^2 \to \mathbb{R}$, such that $w(x, y)$ is the cost of an edit operation $(x, y)$. The cost of a sequence of edit operations $S = e_1, \dots, e_n$ is

$$\tilde{w}(S) = \sum_{i=1}^{n} w(e_i).$$

The edit distance of sequences $a$ and $b$ is

$$d_w(a, b) = \min\{\tilde{w}(S) \mid a \Rightarrow_S b\}.$$

Is the definition reasonable?

Definition (Metric)
A function $d : X^2 \to \mathbb{R}$ is called metric iff
1.) $d(x, y) = 0$ iff $x = y$
2.) $d(x, y) = d(y, x)$
3.) $d(x, y) \leq d(x, z) + d(z, y)$

Remarks:
1.) for metric $d$, $d(x, y) \geq 0$
2.) $d_w$ is metric iff $w(x, y) \geq 0$
3.) In the following, assume $d_w$ is metric.

Remarks
• Natural 'evolution-motivated' problem definition.
• Not obvious how to compute the edit distance efficiently ⇒ define alignment distance.

Alignment Distance

Definition (Alignment)
A pair of words $a^\diamond, b^\diamond \in (\Sigma \cup \{-\})^*$ is called an alignment of sequences $a$ and $b$ ($a^\diamond$ and $b^\diamond$ are called alignment strings), iff
1. $|a^\diamond| = |b^\diamond|$
2. for all $1 \leq i \leq |a^\diamond|$: $a^\diamond_i \neq -$ or $b^\diamond_i \neq -$
3. deleting all gap symbols $-$ from $a^\diamond$ yields $a$, and deleting all $-$ from $b^\diamond$ yields $b$

Example
a = ACGGAT, b = CCGCTT. Possible alignments are

    a⋄ = AC-GG-AT
    b⋄ = -CCGCT-T

or

    a⋄ = ACGG---AT
    b⋄ = --CCGCT-T

or ... (exponentially many)

Edit operations of the first alignment: (A,-), (-,C), (G,C), (-,T), (A,-)

Alignment Distance

Definition (Cost of Alignment, Alignment Distance)
The cost of the alignment $(a^\diamond, b^\diamond)$, given a cost function $w$ on edit operations, is

$$w(a^\diamond, b^\diamond) = \sum_{i=1}^{|a^\diamond|} w(a^\diamond_i, b^\diamond_i).$$

The alignment distance of $a$ and $b$ is

$$D_w(a, b) = \min\{w(a^\diamond, b^\diamond) \mid (a^\diamond, b^\diamond) \text{ is alignment of } a \text{ and } b\}.$$
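The alignment definition and its cost can be checked mechanically. The following is a minimal Python sketch, assuming unit cost (0 for a match, 1 otherwise); the names `is_alignment`, `alignment_cost` and `unit_cost` are illustrative and not part of the lecture.

```python
GAP = "-"


def unit_cost(x, y):
    """Assumed cost function: 0 for a match, 1 for any other edit operation."""
    return 0 if x == y else 1


def is_alignment(a_al, b_al, a, b):
    """Check the three conditions of the alignment definition."""
    same_length = len(a_al) == len(b_al)
    no_gap_column = all(x != GAP or y != GAP for x, y in zip(a_al, b_al))
    projects = a_al.replace(GAP, "") == a and b_al.replace(GAP, "") == b
    return same_length and no_gap_column and projects


def alignment_cost(a_al, b_al, w=unit_cost):
    """Sum the edit-operation costs over all alignment columns."""
    return sum(w(x, y) for x, y in zip(a_al, b_al))


if __name__ == "__main__":
    print(is_alignment("AC-GG-AT", "-CCGCT-T", "ACGGAT", "CCGCTT"))
    print(alignment_cost("AC-GG-AT", "-CCGCT-T"))
```

Under unit cost, the cost of an alignment is just the number of columns that are not matches, i.e. the number of edit operations the alignment encodes.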

Alignment Distance = Edit Distance

Theorem (Equivalence of Edit and Alignment Distance)
For metric $w$, $d_w(a, b) = D_w(a, b)$.

Recall:

Definition (Edit Distance)
The edit distance of $a$ and $b$ is $d_w(a, b) = \min\{\tilde{w}(S) \mid a \text{ transformed to } b \text{ by e.o.-sequence } S\}$.

Definition (Alignment Distance)
The alignment distance of $a$ and $b$ is $D_w(a, b) = \min\{w(a^\diamond, b^\diamond) \mid (a^\diamond, b^\diamond) \text{ is alignment of } a \text{ and } b\}$.

Remarks
• Proof idea:
  $d_w(a, b) \leq D_w(a, b)$: an alignment yields a sequence of edit ops.
  $D_w(a, b) \leq d_w(a, b)$: a sequence of edit ops yields an equal or better alignment (needs triangle inequality).
• Reduces edit distance to alignment distance.
• We will see: the alignment distance is computed efficiently by dynamic programming (using Bellman's Principle of Optimality).

Principle of Optimality and Dynamic Programming

Principle of Optimality: 'Optimal solutions consist of optimal partial solutions'

Example: Shortest Path

Idea of Dynamic Programming (DP):
• Solve partial problems first and materialize the results.
• (Recursively) solve larger problems based on smaller ones.

Remarks
• The principle is valid for the alignment distance problem.
• The Principle of Optimality enables the programming method DP.
• Dynamic programming is widely used in Computational Biology and you will meet it quite often in this class.

Alignment Matrix

Idea: choose alignment distances of prefixes $a_{1..i}$ and $b_{1..j}$ as partial solutions and define a matrix of these partial solutions.

Let $n := |a|$, $m := |b|$.

Definition (Alignment matrix)
The alignment matrix of $a$ and $b$ is the $(n+1) \times (m+1)$-matrix $D := (D_{ij})_{0 \leq i \leq n,\, 0 \leq j \leq m}$ defined by

$$D_{ij} := D_w(a_{1..i}, b_{1..j}) = \min\{w(a^\diamond, b^\diamond) \mid (a^\diamond, b^\diamond) \text{ is alignment of } a_{1..i} \text{ and } b_{1..j}\}.$$

Notational remarks
• $a_i$ is the $i$-th character of $a$
• $a_{x..y}$ is the sequence $a_x a_{x+1} \dots a_y$ (subsequence of $a$)
• by convention $a_{x..y} = \epsilon$ if $x > y$

Alignment Matrix Example

Example
• a = AT, b = AAGT
• $w(x, y) = 0$ iff $x = y$, 1 otherwise

           A  A  G  T
        0  1  2  3  4
    A   1  0  1  2  3
    T   2  1  1  2  2

Remark: The alignment matrix $D$ contains the alignment distance (= edit distance) of $a$ and $b$ in $D_{n,m}$.

Needleman-Wunsch Algorithm

Claim
For $(a^\diamond, b^\diamond)$ alignment of $a$ and $b$ with length $r = |a^\diamond|$,

$$w(a^\diamond, b^\diamond) = w(a^\diamond_{1..r-1}, b^\diamond_{1..r-1}) + w(a^\diamond_r, b^\diamond_r).$$

Theorem
For the alignment matrix $D$ of $a$ and $b$, it holds that
• $D_{0,0} = 0$
• for all $1 \leq i \leq n$: $D_{i,0} = \sum_{k=1}^{i} w(a_k, -) = D_{i-1,0} + w(a_i, -)$
• for all $1 \leq j \leq m$: $D_{0,j} = \sum_{k=1}^{j} w(-, b_k) = D_{0,j-1} + w(-, b_j)$
• $D_{ij} = \min \begin{cases} D_{i-1,j-1} + w(a_i, b_j) & \text{(match)} \\ D_{i-1,j} + w(a_i, -) & \text{(deletion)} \\ D_{i,j-1} + w(-, b_j) & \text{(insertion)} \end{cases}$

Remark: The theorem claims that each prefix alignment distance can be computed from a constant number of smaller ones.

Proof: Induction over $i + j$.

Needleman-Wunsch Algorithm (Pseudocode)

    D_{0,0} := 0
    for i := 1 to n do
        D_{i,0} := D_{i-1,0} + w(a_i, -)
    end for
    for j := 1 to m do
        D_{0,j} := D_{0,j-1} + w(-, b_j)
    end for
    for i := 1 to n do
        for j := 1 to m do
            D_{i,j} := min { D_{i-1,j-1} + w(a_i, b_j),
                             D_{i-1,j}   + w(a_i, -),
                             D_{i,j-1}   + w(-, b_j) }
        end for
    end for
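The pseudocode translates almost line by line into Python. This is a minimal sketch, assuming 0-based Python strings (so $a_i$ becomes `a[i - 1]`) and a cost function passed in as `w`; the names `needleman_wunsch` and `unit_cost` are chosen here for illustration.

```python
GAP = "-"


def unit_cost(x, y):
    """Assumed cost function: 0 for a match, 1 otherwise."""
    return 0 if x == y else 1


def needleman_wunsch(a, b, w=unit_cost):
    """Fill the (n+1) x (m+1) alignment matrix D for sequences a and b."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only deletions
        D[i][0] = D[i - 1][0] + w(a[i - 1], GAP)
    for j in range(1, m + 1):          # first row: only insertions
        D[0][j] = D[0][j - 1] + w(GAP, b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + w(a[i - 1], b[j - 1]),  # match
                D[i - 1][j] + w(a[i - 1], GAP),           # deletion
                D[i][j - 1] + w(GAP, b[j - 1]),           # insertion
            )
    return D


if __name__ == "__main__":
    for row in needleman_wunsch("AT", "AAGT"):
        print(row)
```

The alignment distance is the bottom-right entry `D[n][m]`.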

Back to Example

Example
• a = AT, b = AAGT
• $w(x, y) = 0$ iff $x = y$, 1 otherwise

           A  A  G  T
        0  1  2  3  4
    A   1  0  1  2  3
    T   2  1  1  2  2

Open: how to find the best alignment?

Traceback

$w(x, y) = 0$ iff $x = y$, 1 otherwise

           A  A  G  T
        0  1  2  3  4
    A   1  0  1  2  3
    T   2  1  1  2  2

Remarks
• Start in $(n, m)$. For every $(i, j)$ determine the optimal case.
• Not necessarily unique.
• The sequence of trace arrows lets us infer the best alignment.
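The traceback can be sketched as follows in Python. This is one possible realization, assuming integer costs (so equality tests on matrix entries are exact) and a fixed tie-breaking order match, deletion, insertion; since optimal alignments are not necessarily unique, another order may return a different optimal alignment. All function names are illustrative.

```python
GAP = "-"


def unit_cost(x, y):
    """Assumed cost function: 0 for a match, 1 otherwise."""
    return 0 if x == y else 1


def fill_matrix(a, b, w):
    """Needleman-Wunsch matrix fill, as in the pseudocode."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + w(a[i - 1], GAP)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + w(GAP, b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          D[i - 1][j] + w(a[i - 1], GAP),
                          D[i][j - 1] + w(GAP, b[j - 1]))
    return D


def traceback(D, a, b, w):
    """Start in (n, m) and, for every (i, j), follow one optimal case."""
    i, j = len(a), len(b)
    a_al, b_al = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            a_al.append(a[i - 1]); b_al.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + w(a[i - 1], GAP):
            a_al.append(a[i - 1]); b_al.append(GAP); i -= 1
        else:  # insertion is the only remaining optimal case
            a_al.append(GAP); b_al.append(b[j - 1]); j -= 1
    return "".join(reversed(a_al)), "".join(reversed(b_al))


def align(a, b, w=unit_cost):
    return traceback(fill_matrix(a, b, w), a, b, w)


if __name__ == "__main__":
    top, bottom = align("AT", "AAGT")
    print(top)
    print(bottom)
```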

Complexity

• compute one entry: three cases, i.e. constant time
• $nm$ entries ⇒ fill the matrix in $O(nm)$ time
• traceback: $O(n + m)$ time
• TOTAL: $O(n^2)$ time and space (assuming $m \leq n$)

Remarks
• Assuming $m \leq n$ is w.l.o.g. since we can exchange $a$ and $b$.
• Space complexity can be improved to $O(n)$ for computation of the distance (simple: "store only current and last row") and for the traceback (more involved; the Hirschberg algorithm uses "Divide and Conquer" for computing the trace).
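The "store only current and last row" remark can be illustrated with a short Python sketch. It computes only the distance, not the alignment; the function name and the unit cost function are assumptions made for this example.

```python
GAP = "-"


def unit_cost(x, y):
    """Assumed cost function: 0 for a match, 1 otherwise."""
    return 0 if x == y else 1


def distance_two_rows(a, b, w=unit_cost):
    """Alignment distance using O(|b|) space: keep only two matrix rows."""
    prev = [0] * (len(b) + 1)
    for j in range(1, len(b) + 1):
        prev[j] = prev[j - 1] + w(GAP, b[j - 1])
    for i in range(1, len(a) + 1):
        cur = [prev[0] + w(a[i - 1], GAP)] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j - 1] + w(a[i - 1], b[j - 1]),  # match
                         prev[j] + w(a[i - 1], GAP),           # deletion
                         cur[j - 1] + w(GAP, b[j - 1]))        # insertion
        prev = cur  # the current row becomes the last row
    return prev[-1]


if __name__ == "__main__":
    print(distance_two_rows("AT", "AAGT"))
```

The running time is still $O(nm)$; only the memory drops to one row of length $m + 1$ plus the row being filled.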

Plan

• We have seen how to compute the pairwise edit distance and the corresponding optimal alignment.
• Before going multiple, we will look at two further special topics for pairwise alignment:
  • more realistic, non-linear gap cost, and
  • similarity scores and local alignment.

Alignment Cost Revisited

Motivation:
• The alignments

      GA--T        G-A-T
      GAAGT  and   GAAGT

  have the same edit distance.
• The first one is biologically more reasonable: it is more likely that evolution introduces one large gap than two small ones.
• This means: gap cost should be non-linear, sub-additive!

Gap Penalty

Definition (Gap Penalty)
A gap penalty is a function $g : \mathbb{N} \to \mathbb{R}$ that is sub-additive, i.e.

$$g(k + l) \leq g(k) + g(l).$$

A gap in an alignment string $a^\diamond$ is a substring of $a^\diamond$ that consists of only gap symbols $-$ and is maximally extended. $\Delta_{a^\diamond}$ is the multi-set of gaps in $a^\diamond$.

The alignment cost with gap penalty $g$ of $(a^\diamond, b^\diamond)$ is

$$w_g(a^\diamond, b^\diamond) = \underbrace{\sum_{\substack{1 \leq r \leq |a^\diamond| \\ a^\diamond_r \neq -,\ b^\diamond_r \neq -}} w(a^\diamond_r, b^\diamond_r)}_{\text{cost of mismatches}} + \underbrace{\sum_{x \in \Delta_{a^\diamond} \uplus \Delta_{b^\diamond}} g(|x|)}_{\text{cost of gaps}}$$

Example:

    a⋄ = ATG---CGAC--GC
    b⋄ = -TGCGGCG-CTTTC

⇒ $\Delta_{a^\diamond} = \{\texttt{---}, \texttt{--}\}$, $\Delta_{b^\diamond} = \{\texttt{-}, \texttt{-}\}$

General sub-additive gap penalty

Theorem
Let $D$ be the alignment matrix of $a$ and $b$ with cost $w$ and gap penalty $g$, such that $D_{ij}$ is the minimal cost $w_g$ of an alignment of $a_{1..i}$ and $b_{1..j}$. Then:
• $D_{0,0} = 0$
• for all $1 \leq i \leq n$: $D_{i,0} = g(i)$
• for all $1 \leq j \leq m$: $D_{0,j} = g(j)$
• $D_{ij} = \min \begin{cases} D_{i-1,j-1} + w(a_i, b_j) & \text{(match)} \\ \min_{1 \leq k \leq i} D_{i-k,j} + g(k) & \text{(deletion of length } k\text{)} \\ \min_{1 \leq k \leq j} D_{i,j-k} + g(k) & \text{(insertion of length } k\text{)} \end{cases}$

Remarks
• Complexity: $O(n^3)$ time, $O(n^2)$ space
• pseudocode, correctness, traceback left as exercise
• much more realistic, but significantly more expensive than Needleman-Wunsch ⇒ can we improve it?
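The recursion of the theorem can be written down directly. The sketch below is a hedged illustration in Python: the gap penalty `g` is any sub-additive function supplied by the caller, and the example penalty $g(k) = 2 + k$ is a made-up choice, not a value from the lecture.

```python
def unit_cost(x, y):
    """Assumed substitution cost: 0 for a match, 1 for a mismatch."""
    return 0 if x == y else 1


def general_gap_distance(a, b, w, g):
    """Minimal alignment cost with a general gap penalty g, in O(n^3) time."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = g(i)
    for j in range(1, m + 1):
        D[0][j] = g(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = D[i - 1][j - 1] + w(a[i - 1], b[j - 1])   # match
            for k in range(1, i + 1):                        # deletion of length k
                best = min(best, D[i - k][j] + g(k))
            for k in range(1, j + 1):                        # insertion of length k
                best = min(best, D[i][j - k] + g(k))
            D[i][j] = best
    return D[n][m]


if __name__ == "__main__":
    # Hypothetical penalty g(k) = 2 + k: one gap of length 2 costs 4,
    # two gaps of length 1 cost 6.
    print(general_gap_distance("GAT", "GAAGT", unit_cost, lambda k: 2 + k))
```

With this penalty, the alignment GA--T / GAAGT (one gap of length 2) is preferred over G-A-T / GAAGT (two gaps of length 1), which is the behaviour motivated on the slide "Alignment Cost Revisited".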

Affine gap cost

Definition
A gap penalty is affine, iff there are real constants $\alpha$ and $\beta$, such that for all $k \in \mathbb{N}$: $g(k) = \alpha + \beta k$.

Remarks
• Affine gap penalties are almost as good as general ones: distinguishing gap opening cost ($\alpha$) and gap extension cost ($\beta$) is "biologically reasonable".
• The minimal alignment cost with affine gap penalty can be computed in $O(n^2)$ time! (Gotoh algorithm)

Gotoh algorithm: sketch only

In addition to the alignment matrix $D$, define two further matrices/states.
• $A_{i,j}$ := cost of the best alignment of $a_{1..i}$, $b_{1..j}$ that ends with a deletion (column $a_i$ over $-$).
• $B_{i,j}$ := cost of the best alignment of $a_{1..i}$, $b_{1..j}$ that ends with an insertion (column $-$ over $b_j$).

Recursions:

$$A_{i,j} = \min \begin{cases} A_{i-1,j} + \beta & \text{(deletion extension)} \\ D_{i-1,j} + g(1) & \text{(deletion opening)} \end{cases}$$

$$B_{i,j} = \min \begin{cases} B_{i,j-1} + \beta & \text{(insertion extension)} \\ D_{i,j-1} + g(1) & \text{(insertion opening)} \end{cases}$$

$$D_{i,j} = \min \begin{cases} D_{i-1,j-1} + w(a_i, b_j) & \text{(match)} \\ A_{i,j} & \text{(deletion closing)} \\ B_{i,j} & \text{(insertion closing)} \end{cases}$$

Remark: $O(n^2)$ time and space
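The three recursions can be sketched in Python as below. The initialization is an assumption of this sketch (the slide gives the recursions only): border cells are set to $g(i)$ and $g(j)$, and states that cannot occur are set to infinity. The values $\alpha = 2$, $\beta = 1$ in the usage line are hypothetical.

```python
import math


def unit_cost(x, y):
    """Assumed substitution cost: 0 for a match, 1 for a mismatch."""
    return 0 if x == y else 1


def gotoh_distance(a, b, w, alpha, beta):
    """Minimal alignment cost for the affine gap penalty g(k) = alpha + beta*k."""
    n, m = len(a), len(b)
    inf = math.inf
    g1 = alpha + beta                                  # g(1): open a new gap
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    A = [[inf] * (m + 1) for _ in range(n + 1)]        # ends with a deletion
    B = [[inf] * (m + 1) for _ in range(n + 1)]        # ends with an insertion
    D[0][0] = 0
    for i in range(1, n + 1):                          # assumed border: one long deletion
        A[i][0] = alpha + beta * i
        D[i][0] = A[i][0]
    for j in range(1, m + 1):                          # assumed border: one long insertion
        B[0][j] = alpha + beta * j
        D[0][j] = B[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = min(A[i - 1][j] + beta,          # deletion extension
                          D[i - 1][j] + g1)            # deletion opening
            B[i][j] = min(B[i][j - 1] + beta,          # insertion extension
                          D[i][j - 1] + g1)            # insertion opening
            D[i][j] = min(D[i - 1][j - 1] + w(a[i - 1], b[j - 1]),  # match
                          A[i][j],                     # deletion closing
                          B[i][j])                     # insertion closing
    return D[n][m]


if __name__ == "__main__":
    print(gotoh_distance("GAT", "GAAGT", unit_cost, alpha=2, beta=1))
```

Each cell needs only a constant number of lookups, which is where the $O(n^2)$ bound comes from; with $\alpha = 0$ the penalty is linear and the result coincides with the plain alignment distance.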

Similarity

Motivation:
• Defining similarity of 'building blocks' could be more natural, e.g. similarity of amino acids.
• Similarity is useful for local alignment.

Definition (Similarity)
The similarity of an alignment $(a^\diamond, b^\diamond)$ is

$$s(a^\diamond, b^\diamond) = \sum_{i=1}^{|a^\diamond|} s(a^\diamond_i, b^\diamond_i),$$

where $s : (\Sigma \cup \{-\})^2 \to \mathbb{R}$ is a similarity function, where for $x \in \Sigma$: $s(x, x) > 0$, $s(x, -) < 0$, $s(-, x) < 0$.

Observation. Instead of minimizing alignment cost, one can maximize similarity:

$$S_{ij} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) \\ S_{i-1,j} + s(a_i, -) \\ S_{i,j-1} + s(-, b_j) \end{cases}$$

Local Alignment Motivation

Local alignment asks for the best alignment of any two subsequences of $a$ and $b$.

Important Application: Search! (e.g. BLAST combines heuristics and local alignment)

In contrast, all previous methods compute "global alignments".

Example
a = AWGVIACAILAG
b = VIVTAIAVAG

Why is distance not useful?

Example
a)  XXXAAXXXX        b)  XXAAAAAXXXX
    YYAAYY               YYYAAAAAYY

Where is the stronger local motif? Only similarity can distinguish.

Local Alignment

Definition (Local Alignment Problem)
Let $s$ be a similarity on alignments.

$$S_{\text{global}}(a, b) := \max_{(a^\diamond, b^\diamond) \text{ alignment of } a \text{ and } b} s(a^\diamond, b^\diamond) \quad \text{(global similarity)}$$

$$S_{\text{local}}(a, b) := \max_{\substack{1 \leq i' < i \leq n \\ 1 \leq j' < j \leq m}} S_{\text{global}}(a_{i'..i}, b_{j'..j}) \quad \text{(local similarity)}$$

The local alignment problem is to compute $S_{\text{local}}(a, b)$.

Remarks
• That is, local alignment asks for the subsequences of $a$ and $b$ that have the best alignment.
• How would we define the local alignment matrix for DP?
• For example, why does "$H_{i,j} := S_{\text{local}}(a_{1..i}, b_{1..j})$" not work?

Local Alignment Matrix

Definition
The local alignment matrix $H$ of $a$ and $b$ is $(H_{i,j})_{0 \leq i \leq n,\, 0 \leq j \leq m}$ defined by

$$H_{i,j} := \max_{0 \leq i' \leq i,\ 0 \leq j' \leq j} S_{\text{global}}(a_{i'+1..i}, b_{j'+1..j}).$$

Remarks
• $S_{\text{local}}(a, b) = \max_{i,j} H_{i,j}$ (!)
• all entries $H_{i,j} \geq 0$, since $S_{\text{global}}(\epsilon, \epsilon) = 0$.
• $H_{i,j} = 0$ implies that no subsequences of $a$ and $b$ that end in $i$ and $j$, respectively, are similar.
• Allows case distinction / Principle of Optimality holds!

Local Alignment Algorithm: Case Distinction

Cases for $H_{i,j}$:
1.) the alignment ends with the column $a_i$ over $b_j$
2.) the alignment ends with the column $a_i$ over $-$
3.) the alignment ends with the column $-$ over $b_j$
4.) 0, since if each of the above cases is dissimilar (i.e. negative), there is still $(\epsilon, \epsilon)$.

Local Alignment Algorithm (Smith-Waterman Algorithm)

Theorem
For the local alignment matrix $H$ of $a$ and $b$,
• $H_{0,0} = 0$
• for all $1 \leq i \leq n$: $H_{i,0} = 0$
• for all $1 \leq j \leq m$: $H_{0,j} = 0$
• $H_{ij} = \max \begin{cases} 0 & \text{(empty alignment)} \\ H_{i-1,j-1} + s(a_i, b_j) \\ H_{i-1,j} + s(a_i, -) \\ H_{i,j-1} + s(-, b_j) \end{cases}$

Local Alignment Remarks

Remarks
• Complexity: $O(n^2)$ time and space; again, the space complexity can be improved.
• Requires that the similarity function is centered around zero, i.e. positive = similar, negative = dissimilar.
• Extension to affine gap cost works.
• Traceback?

Local Alignment Example

Example
• a = AAC, b = ACAA
• $s(x, y) = 2$ iff $x = y$, $-3$ otherwise

           A  C  A  A
        0  0  0  0  0
    A   0  2  0  2  2
    A   0  2  0  2  4
    C   0  0  4  1  1

Traceback: start at the maximum entry, trace back to the first 0 entry.
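The Smith-Waterman recursion can be sketched in a few lines of Python. This sketch only fills the local alignment matrix and reads off the local similarity; the function names are illustrative, and the similarity function is the one from the example (2 for a match, $-3$ otherwise, including gaps).

```python
GAP = "-"


def example_similarity(x, y):
    """Similarity from the example: 2 iff x == y, -3 otherwise."""
    return 2 if x == y else -3


def smith_waterman(a, b, s=example_similarity):
    """Fill the local alignment matrix H; borders stay 0."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,                                        # empty alignment
                H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                H[i - 1][j] + s(a[i - 1], GAP),
                H[i][j - 1] + s(GAP, b[j - 1]),
            )
    return H


def local_similarity(H):
    """S_local is the maximum entry of H."""
    return max(max(row) for row in H)


if __name__ == "__main__":
    H = smith_waterman("AAC", "ACAA")
    for row in H:
        print(row)
    print(local_similarity(H))
```

In the example the maximum entry occurs twice (for the local alignments AA/AA and AC/AC), so the traceback start is not unique.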

Substitution/Similarity Matrices

In practice: use similarity matrices learned from closely related sequences or multiple alignments
• PAM (Percent Accepted Mutations) for proteins
• BLOSUM (BLOcks of Amino Acid SUbstitution) for proteins
• RIBOSUM for RNA

Scores are (scaled) log odd scores:

$$\log \frac{\Pr[x, y \mid \text{Related}]}{\Pr[x, y \mid \text{Background}]}$$

For example, BLOSUM62:

[Figure: BLOSUM62 matrix]

Multiple Alignment

Definition
A multiple alignment $A$ of $K$ sequences $a^{(1)} \dots a^{(K)}$ is a $K \times N$-matrix $(A_{i,j})_{1 \leq i \leq K,\, 1 \leq j \leq N}$ ($N$ is the number of columns of $A$) where
1. each entry $A_{i,j} \in (\Sigma \cup \{-\})$
2. for each row $i$: deleting all gaps from $(A_{i,1} \dots A_{i,N})$ yields $a^{(i)}$
3. no column $j$ contains only gap symbols

Example:

    Sequences                       Alignment
    a(1) = ACCCGAG                      ACCCGA-G-
    a(2) = ACTACC       ⇒ align ⇒   A = AC--TAC-C
    a(3) = TCCTACGG                     TCC-TACGG

How to Score Multiple Alignments

As for pairwise alignment:
• Assume columns are scored independently.
• Score is the sum over alignment columns:

$$S(A) = \sum_{j=1}^{N} s(A_{1j}, \dots, A_{Kj})$$

Example

$$S(A) = s(\text{A},\text{A},\text{T}) + s(\text{C},\text{C},\text{C}) + s(\text{C},-,\text{C}) + s(\text{C},-,-) + \dots + s(-,\text{C},\text{G})$$

How do we know similarities? How to define $s(x, y, z)$? As log odds

$$s(x, y, z) = \log \frac{\Pr[x, y, z \mid \text{Related}]}{\Pr[x, y, z \mid \text{Background}]}\ ?$$

Problems? Can we learn similarities for triples, 4-tuples, ...?

Sum-Of-Pairs Score

Idea: approximate column scores by pairwise scores

$$s(x_1, \dots, x_K) = \sum_{1 \leq k < l \leq K} s(x_k, x_l)$$

Sum-of-pairs is the most commonly used scoring scheme for multiple alignments. (Extensible to gap penalties, in particular affine gap cost.)

Drawbacks?
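A small Python sketch of the sum-of-pairs score follows. The pairwise similarity `toy_similarity` is a made-up example (match 2, mismatch $-3$, residue against gap $-2$, gap against gap 0), chosen only to make the sketch runnable; in particular, scoring a gap-gap pair with 0 is an assumption.

```python
from itertools import combinations

GAP = "-"


def toy_similarity(x, y):
    """Hypothetical pairwise similarity, not taken from the lecture."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return -2
    return 2 if x == y else -3


def sum_of_pairs_column(column, s=toy_similarity):
    """Score one alignment column as the sum over all pairs k < l."""
    return sum(s(x, y) for x, y in combinations(column, 2))


def sum_of_pairs_score(rows, s=toy_similarity):
    """Score a multiple alignment (list of equal-length rows) column by column."""
    return sum(sum_of_pairs_column(col, s) for col in zip(*rows))


if __name__ == "__main__":
    alignment = ["ACCCGA-G-", "AC--TAC-C", "TCC-TACGG"]
    print(sum_of_pairs_score(alignment))
```

A column with $K$ entries contributes $K(K-1)/2$ pairwise terms, so scoring an alignment with $N$ columns takes $O(NK^2)$ time.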

Optimal Multiple Alignment

Idea: use dynamic programming

Example
For 3 sequences $a, b, c$, use a 3-dimensional matrix (after initialization:)

$$S_{i,j,k} = \max \begin{cases} S_{i-1,j-1,k-1} + s(a_i, b_j, c_k) \\ S_{i-1,j-1,k} + s(a_i, b_j, -) \\ S_{i-1,j,k-1} + s(a_i, -, c_k) \\ S_{i,j-1,k-1} + s(-, b_j, c_k) \\ S_{i-1,j,k} + s(a_i, -, -) \\ S_{i,j-1,k} + s(-, b_j, -) \\ S_{i,j,k-1} + s(-, -, c_k) \end{cases}$$

For $K$ sequences use a $K$-dimensional matrix. Complexity?

Heuristic Multiple Alignment: Progressive Alignment

Idea: compute optimal alignments only pairwise

Example: 4 sequences $a^{(1)}, a^{(2)}, a^{(3)}, a^{(4)}$
1. Determine how they are related ⇒ tree, e.g. $((a^{(1)}, a^{(2)}), (a^{(3)}, a^{(4)}))$. How can we get it?
2. Align the most closely related sequences first ⇒ (optimally) align $a^{(1)}$ and $a^{(2)}$ by DP.
3. Go on ⇒ (optimally) align $a^{(3)}$ and $a^{(4)}$ by DP.
4. Go on?! ⇒ (optimally) align the two alignments. How can we do that?
5. Done. We produced a multiple alignment of $a^{(1)}, a^{(2)}, a^{(3)}, a^{(4)}$.

Remarks:
- Optimality is not guaranteed. Why?
- The tree is known as guide tree.

Guide tree

The guide tree determines the order of pairwise alignments in the progressive alignment scheme. The order of the progressive alignment steps is crucial for quality!

Heuristics:
1. Compute pairwise distances between all input sequences
   • align all against all
   • in case, transform similarities to distances (e.g. Feng-Doolittle)
2. Cluster sequences by their distances, e.g. by
   • Unweighted Pair Group Method (UPGMA)
   • Neighbor Joining (NJ)

Aligning Alignments

Two (multiple) alignments $A$ and $B$ can be aligned by DP in the same way as two sequences.

Idea:
• An alignment is a sequence of alignment columns. Example:

      ACCCGA-G-
      AC--TAC-C    ≡    (A,A,T) (C,C,C) (C,-,C) ... (-,C,G)
      TCC-TACGG

• Assign similarity to two columns from $A$ resp. $B$, e.g. $s((-, \text{C}, \text{G}), (\text{G}, \text{C}))$, by sum-of-pairs.

We can use dynamic programming, which recurses over alignment scores of prefixes of alignments.

Consequences for the progressive alignment scheme:
• Optimization only local.
• Commits to local decisions. "Once a gap, always a gap"

Progressive Alignment: Example

IN: $a^{(1)}$ = ACCG, $a^{(2)}$ = TTGG, $a^{(3)}$ = TCG, $a^{(4)}$ = CTGG,

$$w(x, y) = \begin{cases} 0 & \text{iff } x = y \\ 2 & \text{iff } x = - \text{ or } y = - \\ 3 & \text{otherwise (for mismatch)} \end{cases}$$

• Compute all against all edit distances and cluster

[Figure: six pairwise DP matrices. Align ACCG and TTGG; Align ACCG and TCG; Align ACCG and CTGG; Align TTGG and TCG; Align TTGG and CTGG; Align TCG and CTGG]

Progressive Alignment: Example

IN: $a^{(1)}$ = ACCG, $a^{(2)}$ = TTGG, $a^{(3)}$ = TCG, $a^{(4)}$ = CTGG,
$w(x, y)$ = 0 iff $x = y$; 2 iff $x = -$ or $y = -$; 3 otherwise (for mismatch)

• Compute all against all edit distances and cluster

⇒ distance matrix

            a(1)  a(2)  a(3)  a(4)
    a(1)     0     9     5     8
    a(2)           0     5     3
    a(3)                 0     5
    a(4)                       0

⇒ Cluster (e.g. UPGMA): $a^{(2)}$ and $a^{(4)}$ are closest. Then: $a^{(1)}$ and $a^{(3)}$.

Progressive Alignment: Example

IN: $a^{(1)}$ = ACCG, $a^{(2)}$ = TTGG, $a^{(3)}$ = TCG, $a^{(4)}$ = CTGG,
$w(x, y)$ = 0 iff $x = y$; 2 iff $x = -$ or $y = -$; 3 otherwise (for mismatch)

• Compute all against all edit distances and cluster
  ⇒ guide tree $((a^{(2)}, a^{(4)}), (a^{(1)}, a^{(3)}))$
• Align $a^{(2)}$ and $a^{(4)}$:

      TTGG
      CTGG

• Align $a^{(1)}$ and $a^{(3)}$:

      ACCG
      -TCG

• Align the alignments! Align TTGG/CTGG and ACCG/-TCG:

               A   C   C   G
               -   T   C   G
           0   4  12  20  28
    T C    8  10  10  18  26
    T T   16  18  16  22  30
    G G   24  26  24  28  22
    G G   32  34  32  36  28

  Column costs (written with columns as two-letter words):
  • $w(\texttt{--}, \texttt{A-}) = w(-,\text{A}) + w(-,-) + w(-,\text{A}) + w(-,-) = 4$
  • $w(\texttt{TC}, \texttt{--}) = w(\text{T},-) + w(\text{T},-) + w(\text{C},-) + w(\text{C},-) = 8$
  • $w(\texttt{TC}, \texttt{A-}) = w(\text{T},\text{A}) + w(\text{T},-) + w(\text{C},\text{A}) + w(\text{C},-) = 10$
  • $w(\texttt{TT}, \texttt{CT}) = w(\text{T},\text{C}) + w(\text{T},\text{T}) + w(\text{T},\text{C}) + w(\text{T},\text{T}) = 6$
  • ...

after filling and traceback ⇒

      TTGG
      CTGG
      ACCG
      -TCG
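The step "align the alignments" can be sketched in Python as follows. The sketch treats each alignment as a sequence of columns and scores a pair of columns by summing $w$ over all cross pairs, as in the column costs above; it assumes $w(-,-) = 0$, which is what the case $x = y$ of the cost function gives. The function names are illustrative.

```python
GAP = "-"


def w(x, y):
    """Cost function of the example: 0 iff x == y, 2 for a gap, 3 for a mismatch."""
    if x == y:
        return 0          # also covers (gap, gap)
    if x == GAP or y == GAP:
        return 2
    return 3


def column_cost(col_a, col_b):
    """Sum of w over all cross pairs between two alignment columns."""
    return sum(w(x, y) for x in col_a for y in col_b)


def align_alignments(A, B):
    """DP matrix for aligning alignment A (rows) with alignment B (columns)."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    gap_a, gap_b = (GAP,) * len(A), (GAP,) * len(B)   # all-gap columns
    n, m = len(cols_a), len(cols_b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + column_cost(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + column_cost(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + column_cost(cols_a[i - 1], cols_b[j - 1]),
                D[i - 1][j] + column_cost(cols_a[i - 1], gap_b),
                D[i][j - 1] + column_cost(gap_a, cols_b[j - 1]),
            )
    return D


if __name__ == "__main__":
    for row in align_alignments(["TTGG", "CTGG"], ["ACCG", "-TCG"]):
        print(row)
```

Because gaps inside the two input alignments are never removed, this is exactly where "once a gap, always a gap" comes from.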

A Classical Approach: CLUSTAL W

• prototypical progressive alignment
• similarity score with affine gap cost
• neighbor joining for tree construction
• special 'tricks' for gap handling

Advanced Progressive Alignment in MUSCLE

1.) alignment draft and 2.) reestimation 3.) iterative refinement

Consistency-based scoring in T-Coffee

• Progressive alignment + Consistency heuristic
• Avoid mistakes when optimizing locally by modifying the scores: "Library extension"
• Modified scores reflect global consistency
• Details of consistency transformation: next slide
• Merges local and global alignments

Consistency-based scoring in T-Coffee

[Figure panels: Misalignment by standard procedure; Correct alignment after library extension; All-2-all alignments for weighting]

Consistency Transformation
• For each sequence triplet: strengthen compatible edges
• This moves global information into scores
• Consistency-based scores guide pairwise alignments towards (global) consistency

Alignment Profiles

Alignment:
    ACGG-
    ACCG-
    AC-G-
    TCCGG

Consensus: ACCG-

Profile:
    A:   0.75  0     0     0     0
    C:   0     1     0.5   0     0
    G:   0     0     0.25  1     0.25
    T:   0.25  0     0     0     0

A profile of a multiple alignment consists of character frequency vectors for each column.

Remarks
• The profile describes sequences of the alignment in a rigid way.
• Modeling insertions/deletions requires profile HMMs.
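Computing a profile from an alignment is a simple counting exercise. The Python sketch below assumes that frequencies are taken relative to the number of rows (so gaps lower the column sum below 1, as in the table above) and that the consensus takes the most frequent symbol per column, gaps included; both function names are illustrative.

```python
from collections import Counter


def profile(rows, alphabet="ACGT"):
    """Per-character frequency vectors, one entry per alignment column."""
    columns = list(zip(*rows))
    return {ch: [col.count(ch) / len(col) for col in columns] for ch in alphabet}


def consensus(rows):
    """Most frequent symbol (possibly the gap symbol) in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


if __name__ == "__main__":
    alignment = ["ACGG-", "ACCG-", "AC-G-", "TCCGG"]
    for ch, freqs in profile(alignment).items():
        print(ch, freqs)
    print(consensus(alignment))
```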

Hidden Markov Models (HMMs)

Example of a simple HMM (no details here)

[Figure: HMM with two hidden states S and R; transition probabilities 0.8, 0.2, 0.4, 0.6; emission probabilities for the observations T and B: 2/3, 1/3 and 1/6, 5/6]

[The frog climbs the ladder more likely when the sun shines. Assume that the weather is hidden, but we can observe the frog.]

• Idea: the probability of an observation depends on a hidden state, where there are specific probabilities to change states.
• Hidden Markov Models generate observation sequences (e.g. TBTTT) according to an (encoded) probability distribution.
• One can compute things like "most probable path given an observation sequence", ...

Profile HMMs

Alignment:
    ACGG-
    ACCG-
    AC-G-
    TCCGG

Consensus: ACCG-

Remarks
• Profile HMMs describe (probability distribution of) sequences in a multiple alignment.
• hidden states = insertion ($I_i$), match ($M_i$), deletion ($D_i$) in relation to the consensus
• (observation ≡ sequence), (state sequence ≡ alignment string)
• Profile HMMs are used to search for sequences that are similar to sequences of a given alignment (Pfam, HMMer).
• Profile HMMs can be used to construct multiple alignments.

We come back to HMMs when we discuss SCFGs.
