Bio2: Sequence Alignment Intro

Armstrong, 2005 BioInformatics 2

Sequence Alignment Intro

    ACCGGTATCCTAGGAC
    |||  |||| ||||||
    ACC--TATCTTAGGAC

• Way of comparing two sequences and assessing the similarity or difference between them

Pair-wise Sequence Alignment

• Can align DNA or protein sequences
• Matches/substitutions scored from a look-up matrix
• Insertions/deletions scored by some gap-penalty formula

How do we do it?

• Like everything else, there are several methods and choices of parameters
• The choice depends on the question being asked
  – What kind of alignment?
  – Which substitution matrix is appropriate?
  – What gap-penalty rules are appropriate?
  – Is a heuristic method good enough?

BLOSUM 62 Matrix

[Figure: the BLOSUM 62 substitution matrix]

Working Parameters

• For proteins, using the affine gap penalty rule and a substitution matrix:

    Query Length    Matrix       Gap (open/extend)
    <35             PAM-30       9,1
    35-50           PAM-70       10,1
    50-85           BLOSUM-80    10,1
    >85             BLOSUM-62    11,1

How do we do it?

• A Dynamic Programming algorithm is used to find the optimal scored alignment (and non-optimal scores)
  – MPSearch
• Heuristic approaches improve speed but sacrifice some accuracy
  – BLAST
  – FASTA

Alignment Types

• Global: used to compare two similar-sized sequences.
• Local: used to find similar subsequences.
• Ends Free: used to find joins/overlaps.

Global Alignment

• Two sequences of similar length
• Finds the best alignment of the two sequences
• Finds the score of that alignment
• Includes ALL bases from both sequences in the alignment and the score.
• Needleman-Wunsch algorithm

Needleman-Wunsch algorithm

• Gaps are inserted into, or at the ends of, each sequence.
• The sequence lengths (bases + gaps) are identical for each sequence
• Every base or gap in each sequence is aligned with a base or a gap in the other sequence

Needleman-Wunsch algorithm

• Consider 2 sequences S and T
• Sequence S has n elements
• Sequence T has m elements
• Gap penalty ?

How do we score gaps?

    ACCGGTATCC---GAC
    |||  ||||    |||
    ACC--TATCTTAGGAC

• Constant: Length-independent weight
• Affine: Open and Extend weights.
• Convex: Each additional gap contributes less
• Arbitrary: Some arbitrary function on length
  – Let's score each gap as -1 times length

Needleman-Wunsch algorithm

• Consider 2 sequences S and T
• Sequence S has n elements
• Sequence T has m elements
• Gap penalty -1 per base (arbitrary gap penalty)
• An alignment between base i in S and a gap in T is represented: (Si,-)
• The score for this is represented: σ(Si,-) = -1

Needleman-Wunsch algorithm

• Substitution/Match matrix for a simple alignment
• Several models based on probability….
• An alignment between base i in S and base j in T is represented: (Si,Tj)
• The score for this occurring is represented: σ(Si,Tj)
• Simple identity matrix (2 for match, -1 for mismatch)

          A   C   G   T
      A   2  -1  -1  -1
      C  -1   2  -1  -1
      G  -1  -1   2  -1
      T  -1  -1  -1   2

Needleman-Wunsch algorithm

• Set up an array V of size n+1 by m+1
• Row 0 and Column 0 represent the cost of adding gaps to either sequence at the start of the alignment
• Calculate the rest of the cells row by row by finding the optimal route from the surrounding cells that represent a gap or match/mismatch
  – This is easier to demonstrate than to explain
  – Let's start by trying out a simple example alignment:

    S = ACCGGTAT
    T = ACCTATC

Needleman-Wunsch algorithm

– Get lengths

    S = ACCGGTAT    Length of S = n = 8
    T = ACCTATC     Length of T = m = 7

  (lengths approx equal so OK for Global Alignment)

Create array n+1 by m+1 (i.e. 9 by 8)

Add on bases from each sequence, then represent scores for gaps in row/col 0

           A   C   C   G   G   T   A   T   (S)
       0  -1  -2  -3  -4  -5  -6  -7  -8
   A  -1
   C  -2
   C  -3
   T  -4
   A  -5
   T  -6
   C  -7
  (T)

For each cell consider the 'best' path

Three routes lead into the first empty cell (S1,T1):

• From (S1,T0) with σ(-,T1) = -1: running total (-1 + -1) = -2
• From (S0,T1) with σ(S1,-) = -1: running total (-1 + -1) = -2
• From (S0,T0) with σ(S1,T1) = 2: running total (0 + 2) = 2

Choose and record 'best' path

• The best route into (S1,T1) is the match from (S0,T0), so the cell scores 2 and a pointer back to (S0,T0) is recorded.

For the next cell (S2,T1):

• From (S2,T0) with σ(-,T1) = -1: running total (-2 + -1) = -3
• From (S1,T0) with σ(S2,T1) = -1: running total (-1 + -1) = -2
• From (S1,T1) with σ(S2,-) = -1: running total (2 + -1) = 1
• The best route is the gap from (S1,T1), so the cell scores 1.

Continue….

Fill in the remaining cells row by row in the same way.

Finally.

           A   C   C   G   G   T   A   T   (S)
       0  -1  -2  -3  -4  -5  -6  -7  -8
   A  -1   2   1   0  -1  -2  -3  -4  -5
   C  -2   1   4   3   2   1   0  -1  -2
   C  -3   0   3   6   5   4   3   2   1
   T  -4  -1   2   5   5   4   6   5   4
   A  -5  -2   1   4   4   4   5   8   7
   T  -6  -3   0   3   3   3   6   7  10
   C  -7  -4  -1   2   2   2   5   6   9   = Score
  (T)

We recreate the alignment by following the pointers back through the array to the origin. The alignment is built up from its right-hand end, one column per step:

            - (S)
            C (T)

           T- (S)
           TC (T)

          AT- (S)
          ATC (T)

         TAT- (S)
         TATC (T)

        GTAT- (S)
        -TATC (T)

       GGTAT- (S)
       --TATC (T)

      CGGTAT- (S)
      C--TATC (T)

     CCGGTAT- (S)
     CC--TATC (T)

    ACCGGTAT- (S)
    |||  |||
    ACC--TATC (T)

Checking the result

    ACCGGTAT- (S)
    |||  |||
    ACC--TATC (T)

• Our alignment considers ALL bases in each sequence
• 6 matches = 12 points, 3 gaps = -3 points
• Score = 9 confirmed.
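The worked example above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function name needleman_wunsch and its parameter names are assumptions introduced here, and the scoring (2 for a match, -1 for a mismatch, -1 per gap position) is taken from the slides. Rows of the array index T and columns index S, as in the grid drawn above. Ties between routes are broken in the order diagonal, up, left, which is an arbitrary choice.

```python
def needleman_wunsch(s, t, match=2, mismatch=-1, gap=-1):
    """Global alignment with a per-base gap penalty (illustrative sketch).

    Returns (score, v, aligned_s, aligned_t), where v[i][j] is the best score
    for aligning the first i bases of t with the first j bases of s.
    """
    n, m = len(s), len(t)
    v = [[0] * (n + 1) for _ in range(m + 1)]

    # Row 0 and column 0: cost of opening the alignment with gaps only.
    for j in range(1, n + 1):
        v[0][j] = j * gap
    for i in range(1, m + 1):
        v[i][0] = i * gap

    # Fill the remaining cells row by row from the three neighbouring cells.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[j - 1] == t[i - 1] else mismatch
            v[i][j] = max(
                v[i - 1][j - 1] + sub,  # diagonal: base of S against base of T
                v[i - 1][j] + gap,      # up: gap in S against base of T
                v[i][j - 1] + gap,      # left: base of S against gap in T
            )

    # Trace back from the bottom-right cell to the origin.
    out_s, out_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[j - 1] == t[i - 1] else mismatch
            if v[i][j] == v[i - 1][j - 1] + sub:
                out_s.append(s[j - 1])
                out_t.append(t[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and v[i][j] == v[i - 1][j] + gap:
            out_s.append("-")
            out_t.append(t[i - 1])
            i -= 1
        else:
            out_s.append(s[j - 1])
            out_t.append("-")
            j -= 1

    return v[m][n], v, "".join(reversed(out_s)), "".join(reversed(out_t))


score, v, aligned_s, aligned_t = needleman_wunsch("ACCGGTAT", "ACCTATC")
print(score)      # 9
print(aligned_s)  # ACCGGTAT-
print(aligned_t)  # ACC--TATC
```

Printing v row by row gives the full dynamic-programming array, which is a convenient way to check a hand-filled grid. The sketch recomputes the route during traceback instead of storing pointers; storing one pointer per cell during the fill, as the slides describe, gives the same alignment.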
