Bio2 Sequence Alignment Intro How Do We Do It? BLOSUM 62 Matrix

Bio2
Pair-wise Sequence Alignment
Armstrong, 2005 BioInformatics 2

Sequence Alignment Intro

    ACCGGTATCCTAGGAC
    |||  |||| ||||||
    ACC--TATCTTAGGAC

• A way of comparing two sequences and assessing the similarity or difference between them
• Can align DNA or protein sequences
• Matches/substitutions are scored from a look-up matrix
• Insertions/deletions are scored by some gap-penalty formula

How do we do it?

• Like everything else, there are several methods and choices of parameters
• The choice depends on the question being asked
  – What kind of alignment?
  – Which substitution matrix is appropriate?
  – What gap-penalty rules are appropriate?
  – Is a heuristic method good enough?

BLOSUM 62 Matrix

[Figure: BLOSUM 62 matrix]

Working Parameters

• For proteins, using the affine gap penalty rule and a substitution matrix:

    Query length   Matrix      Gap (open/extend)
    <35            PAM-30      9,1
    35-50          PAM-70      10,1
    50-85          BLOSUM-80   10,1
    >85            BLOSUM-62   11,1

How do we do it?

• A Dynamic Programming algorithm is used to find the optimal scored alignment (and non-optimal scores)
  – MPSearch
• Heuristic approaches improve speed but sacrifice some accuracy
  – BLAST
  – FASTA

Alignment Types

• Global: used to compare two similar-sized sequences.
• Local: used to find similar subsequences.
• Ends Free: used to find joins/overlaps.

Global Alignment

• Two sequences of similar length
• Finds the best alignment of the two sequences
• Finds the score of that alignment
• Includes ALL bases from both sequences in the alignment and the score
• Needleman-Wunsch algorithm

Needleman-Wunsch algorithm

• Gaps are inserted into, or at the ends of, each sequence
• The sequence length (bases + gaps) is identical for each sequence
• Every base or gap in each sequence is aligned with a base or a gap in the other sequence

Needleman-Wunsch algorithm

• Consider 2 sequences S and T
• Sequence S has n elements
• Sequence T has m elements
• Gap penalty?

How do we score gaps?

    ACCGGTATCC---GAC
    |||  ||||    |||
    ACC--TATCTTAGGAC

• Constant: length-independent weight
• Affine: open and extend weights
• Convex: each additional gap contributes less
• Arbitrary: some arbitrary function on length
  – Let's score each gap as -1 times length

Needleman-Wunsch algorithm

• Gap penalty -1 per base (arbitrary gap penalty)
• An alignment between base i in S and a gap in T is represented: (Si,-)
• The score for this is represented: σ(Si,-) = -1

Needleman-Wunsch algorithm

• Substitution/Match matrix for a simple alignment
• Several models based on probability….
• Simple identity matrix (2 for match, -1 for mismatch)
• An alignment between base i in S and base j in T is represented: (Si,Tj)
• The score for this occurring is represented: σ(Si,Tj)

         A   C   G   T
    A    2  -1  -1  -1
    C   -1   2  -1  -1
    G   -1  -1   2  -1
    T   -1  -1  -1   2

Needleman-Wunsch algorithm

• Set up an array V of size n+1 by m+1
• Row 0 and Column 0 represent the cost of adding gaps to either sequence at the start of the alignment
• Calculate the rest of the cells row by row by finding the optimal route from the surrounding cells that represent a gap or match/mismatch
  – This is easier to demonstrate than to explain

Needleman-Wunsch algorithm

– Let's start by trying out a simple example alignment:

    S = ACCGGTAT
    T = ACCTATC

– Get lengths

    Length of S = n = 8
    Length of T = m = 7
    (lengths approx equal so OK for Global Alignment)

Create array n+1 by m+1 (i.e. 9 by 8)

Add on bases from each sequence

             A   C   C   G   G   T   A   T  (S)

    A
    C
    C
    T
    A
    T
    C
    (T)

Represent scores for gaps in row/col 0

             A   C   C   G   G   T   A   T  (S)
         0  -1  -2  -3  -4  -5  -6  -7  -8
    A   -1
    C   -2
    C   -3
    T   -4
    A   -5
    T   -6
    C   -7
    (T)

For each cell consider the 'best' path

             A   C   C  ...  (S)
         0  -1  -2  -3
    A   -1
    (T)

• From (S1,T0), adding σ(-,T1) = -1: running total (-1 + -1) = -2
• From (S0,T1), adding σ(S1,-) = -1: running total (-1 + -1) = -2
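The set-up described so far (the simple 2/-1 scoring with a -1 gap penalty, the gap costs in row 0 and column 0, and the candidate routes into the first cell) can be sketched in a few lines of Python. This is only an illustrative sketch: the names `sigma`, `init_matrix`, `MATCH`, `MISMATCH` and `GAP` are assumptions, not part of the slides, and the layout (S across the columns, T down the rows) is chosen to match the grids drawn above.

```python
# Illustrative sketch; all names here are hypothetical, not from the slides.
MATCH, MISMATCH, GAP = 2, -1, -1


def sigma(a, b):
    """Score one alignment column; '-' stands for a gap."""
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def init_matrix(s, t):
    """Build the array with S across the columns and T down the rows,
    filling row 0 and column 0 with the cumulative gap costs."""
    rows, cols = len(t) + 1, len(s) + 1
    v = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        v[0][j] = v[0][j - 1] + sigma(s[j - 1], "-")
    for i in range(1, rows):
        v[i][0] = v[i - 1][0] + sigma("-", t[i - 1])
    return v


S, T = "ACCGGTAT", "ACCTATC"
V = init_matrix(S, T)

# Candidate running totals for the first interior cell (S1,T1).
from_above = V[0][1] + sigma("-", T[0])       # from (S1,T0): gap against T1
from_left = V[1][0] + sigma(S[0], "-")        # from (S0,T1): S1 against a gap
from_diagonal = V[0][0] + sigma(S[0], T[0])   # from (S0,T0): S1 against T1

print(V[0])                                   # row 0
print([row[0] for row in V])                  # column 0
print(from_above, from_left, from_diagonal)   # -2 -2 2
```

The highest of the three candidates is the value recorded in the cell, together with a pointer to the cell it came from.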
For each cell consider the 'best' path

• From (S0,T0), adding σ(S1,T1) = 2: running total (0 + 2) = 2

Choose and record 'best' path

• The three running totals for cell (S1,T1) are -2, -2 and 2, so record 2

             A   C   C  ...  (S)
         0  -1  -2  -3
    A   -1   2
    (T)

• For the next cell, (S2,T1):
  – From (S2,T0), adding σ(-,T1) = -1: running total (-2 + -1) = -3
  – From (S1,T0), adding σ(S2,T1) = -1: running total (-1 + -1) = -2
  – From (S1,T1), adding σ(S2,-) = -1: running total (2 + -1) = 1

             A   C   C  ...  (S)
         0  -1  -2  -3
    A   -1   2   1
    (T)

Continue….

• Fill in the remaining cells the same way, row by row

Finally.

             A   C   C   G   G   T   A   T  (S)
         0  -1  -2  -3  -4  -5  -6  -7  -8
    A   -1   2   1   0  -1  -2  -3  -4  -5
    C   -2   1   4   3   2   1   0  -1  -2
    C   -3   0   3   6   5   4   3   2   1
    T   -4  -1   2   5   5   4   6   5   4
    A   -5  -2   1   4   4   4   5   8   7
    T   -6  -3   0   3   3   3   6   7  10
    C   -7  -4  -1   2   2   2   5   6   9  = Score
    (T)

We recreate the alignment by following the pointers back through the array to the origin. The alignment grows from its right-hand end, one column per step:

    Step   (S)          (T)
    1              -            C
    2             T-           TC
    3            AT-          ATC
    4           TAT-         TATC
    5          GTAT-        -TATC
    6         GGTAT-       --TATC
    7        CGGTAT-      C--TATC
    8       CCGGTAT-     CC--TATC
    9      ACCGGTAT-    ACC--TATC

Checking the result

    ACCGGTAT- (S)
    |||  |||
    ACC--TATC (T)

• Our alignment considers ALL bases in each sequence
• 6 matches = 12 points, 3 gaps = -3 points
• Score = 9 confirmed.
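The whole procedure walked through above (fill the array, record a pointer for each cell, then trace back from the bottom-right corner to the origin) can be sketched as a short Python function. This is a minimal sketch under stated assumptions rather than a reference implementation: the function name `needleman_wunsch`, its parameter names, and the tie-breaking order (diagonal, then up, then left) are all assumptions, since the slides do not say how ties are resolved.

```python
def needleman_wunsch(s, t, match=2, mismatch=-1, gap=-1):
    """Global alignment of s and t with a per-base gap penalty.

    Returns (score, aligned_s, aligned_t). S runs across the columns
    and T down the rows, as in the grids above.
    """
    rows, cols = len(t) + 1, len(s) + 1
    v = [[0] * cols for _ in range(rows)]
    ptr = [[None] * cols for _ in range(rows)]

    # Row 0 and column 0: cost of leading gaps.
    for j in range(1, cols):
        v[0][j], ptr[0][j] = j * gap, "left"
    for i in range(1, rows):
        v[i][0], ptr[i][0] = i * gap, "up"

    # Fill the rest row by row, keeping the best of the three routes.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[j - 1] == t[i - 1] else mismatch
            candidates = [
                (v[i - 1][j - 1] + sub, "diag"),  # S[j] aligned with T[i]
                (v[i - 1][j] + gap, "up"),        # gap in S against T[i]
                (v[i][j - 1] + gap, "left"),      # S[j] against gap in T
            ]
            # Assumed tie-break: max() keeps the first best candidate.
            v[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the bottom-right corner to the origin.
    out_s, out_t = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            out_s.append(s[j - 1])
            out_t.append(t[i - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            out_s.append("-")
            out_t.append(t[i - 1])
            i -= 1
        else:
            out_s.append(s[j - 1])
            out_t.append("-")
            j -= 1
    return v[-1][-1], "".join(reversed(out_s)), "".join(reversed(out_t))


score, aligned_s, aligned_t = needleman_wunsch("ACCGGTAT", "ACCTATC")
print(score)      # 9
print(aligned_s)  # ACCGGTAT-
print(aligned_t)  # ACC--TATC
```

For the example pair the traceback path contains no ties, so the assumed tie-breaking order does not affect the alignment returned here; for other inputs a different order may return a different alignment with the same optimal score.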
