
Intro

Bio2

Pair-wise Sequence Alignment

    ACCGGTATCCTAGGAC
    |||  |||| ||||||
    ACC--TATCTTAGGAC

• A way of comparing two sequences and assessing the similarity or difference between them
• Can align DNA or protein sequences
• Matches/substitutions are scored from a look-up matrix
• Insertions/deletions are scored by some gap-penalty formula

Armstrong, 2005 BioInformatics 2

How do we do it?

• Like everything else, there are several methods and choices of parameters
• The choice depends on the question being asked
– What kind of alignment?
– Which substitution matrix is appropriate?
– What gap-penalty rules are appropriate?
– Is a heuristic method good enough?

BLOSUM 62 Matrix

[Figure: the BLOSUM 62 substitution matrix]

Working Parameters

• For proteins, using the affine gap penalty rule and a substitution matrix:

    Query Length   Matrix      Gap (open/extend)
    <35            PAM-30      9,1
    35-50          PAM-70      10,1
    50-85          BLOSUM-80   10,1
    >85            BLOSUM-62   11,1

How do we do it?

• A Dynamic Programming algorithm is used to find the optimal scored alignment (and non-optimal scores)
– MPSearch
• Heuristic approaches improve speed but sacrifice some accuracy
– BLAST
– FASTA

Alignment Types

• Global: used to compare two similar-sized sequences.
• Local: used to find similar subsequences.
• Ends Free: used to find joins/overlaps.

Global Alignment

• Two sequences of similar length
• Finds the best alignment of the two sequences
• Finds the score of that alignment
• Includes ALL bases from both sequences in the alignment and the score.
• Needleman-Wunsch algorithm

Needleman-Wunsch algorithm

• Gaps are inserted into, or at the ends of, each sequence.
• The sequence lengths (bases + gaps) are identical for each sequence
• Every base or gap in each sequence is aligned with a base or a gap in the other sequence

Needleman-Wunsch algorithm

• Consider 2 sequences S and T
• Sequence S has n elements
• Sequence T has m elements
• Gap penalty?

How do we score gaps?

    ACCGGTATCC---GAC
    |||  ||||    |||
    ACC--TATCTTAGGAC

• Constant: Length independent weight
• Affine: Open and Extend weights.
• Convex: Each additional gap contributes less
• Arbitrary: Some arbitrary function on length
– Let's score each gap as –1 times length

Needleman-Wunsch algorithm

• Consider 2 sequences S and T
• Sequence S has n elements
• Sequence T has m elements
• Gap penalty –1 per base (arbitrary gap penalty)
• An alignment between base i in S and a gap in T is represented: (Si,-)
• The score for this is represented: σ(Si,-) = -1
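The four gap-scoring rules listed above can be written as small functions of the gap length. This is only an illustrative sketch: the function names, the default weights, the convention that an affine gap of length L costs open + (L-1) x extend, and the logarithmic form of the convex rule are all assumptions, not taken from the slides.

```python
import math


def constant_gap(length, weight=1.0):
    """Constant: one fixed cost per gap, whatever its length."""
    return -weight if length > 0 else 0.0


def linear_gap(length, per_base=1.0):
    """The rule used in the worked example: -1 times the gap length."""
    return -per_base * length


def affine_gap(length, gap_open=10.0, gap_extend=1.0):
    """Affine: gap_open for the first position, gap_extend for each further one."""
    if length <= 0:
        return 0.0
    return -(gap_open + gap_extend * (length - 1))


def convex_gap(length, gap_open=10.0, scale=1.0):
    """Convex (assumed log form): each extra position costs less than the last."""
    if length <= 0:
        return 0.0
    return -(gap_open + scale * math.log(length))
```

With these defaults, a three-base gap scores -3 under the linear rule and -12 under the affine rule.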

Needleman-Wunsch algorithm

• Substitution/Match matrix for a simple alignment
• Several models based on probability….
• An alignment between base i in S and base j in T is represented: (Si,Tj)
• The score for this occurring is represented: σ(Si,Tj)

Needleman-Wunsch algorithm

• Substitution/Match matrix for a simple alignment
• Simple identity matrix (2 for match, -1 for mismatch)

         A   C   G   T
     A   2  -1  -1  -1
     C  -1   2  -1  -1
     G  -1  -1   2  -1
     T  -1  -1  -1   2

Needleman-Wunsch algorithm

• Set up an array V of size n+1 by m+1
• Row 0 and Column 0 represent the cost of adding gaps to either sequence at the start of the alignment
• Calculate the rest of the cells row by row by finding the optimal route from the surrounding cells that represent a gap or match/mismatch
– This is easier to demonstrate than to explain

Needleman-Wunsch algorithm

– let's start by trying out a simple example alignment:

    S = ACCGGTAT
    T = ACCTATC

Needleman-Wunsch algorithm – Get lengths

    S = ACCGGTAT
    T = ACCTATC

Length of S = n = 8
Length of T = m = 7
(lengths approx equal so OK for Global Alignment)

Create array n+1 by m+1 (i.e. 9 by 8)

Add on bases from each sequence

Represent scores for gaps in row/col 0

              A   C   C   G   G   T   A   T  (S)
          0  -1  -2  -3  -4  -5  -6  -7  -8
      A  -1
      C  -2
      C  -3
      T  -4
      A  -5
      T  -6
      C  -7
     (T)

For each cell consider the ‘best’ path

The first empty cell pairs S1 = A with T1 = A. Three routes lead into it:

• (S1,T0) & σ(-,T1) = -1: running total (-1 + -1) = -2
• (S0,T0) & σ(S1,T1) = 2: running total (0 + 2) = 2
• (S0,T1) & σ(S1,-) = -1: running total (-1 + -1) = -2

Choose and record ‘best’ path

              A   C   C   G   G   T   A   T  (S)
          0  -1  -2  -3
      A  -1   2
      C
     (T)

For the next cell along (S2 = C against T1 = A):

• (S2,T0) & σ(-,T1): running total (-2 + -1) = -3
• (S1,T0) & σ(S2,T1): running total (-1 + -1) = -2
• (S1,T1) & σ(S2,-): running total (2 + -1) = 1

Continue….

              A   C   C   G   G   T   A   T  (S)
          0  -1  -2  -3  -4  -5  -6  -7  -8
      A  -1   2   1   0  -1  -2  -3  -4  -5
      C  -2
      C  -3
      T  -4
      A  -5
      T  -6
      C  -7
     (T)

Continue….

Fill in the remaining rows one at a time, each cell taking the best of its three incoming routes.

Finally.

              A   C   C   G   G   T   A   T  (S)
          0  -1  -2  -3  -4  -5  -6  -7  -8
      A  -1   2   1   0  -1  -2  -3  -4  -5
      C  -2   1   4   3   2   1   0  -1  -2
      C  -3   0   3   6   5   4   3   2   1
      T  -4  -1   2   5   5   4   6   5   4
      A  -5  -2   1   4   4   4   5   8   7
      T  -6  -3   0   3   3   3   6   7  10
      C  -7  -4  -1   2   2   2   5   6   9  = Score
     (T)

We recreate the alignment by following the pointers back through the array to the origin.

Following the pointers back from the bottom-right cell, the alignment grows one column at a time from its right-hand end:

            - (S)           T- (S)          AT- (S)
            C (T)           TC (T)          ATC (T)

         TAT- (S)        GTAT- (S)       GGTAT- (S)
         TATC (T)        -TATC (T)       --TATC (T)

      CGGTAT- (S)     CCGGTAT- (S)    ACCGGTAT- (S)
      C--TATC (T)     CC--TATC (T)    ACC--TATC (T)

Checking the result

    ACCGGTAT- (S)
    |||  |||
    ACC--TATC (T)

• Our alignment considers ALL bases in each sequence
• 6 matches = 12 points, 3 gaps = -3 points
• Score = 9 confirmed.
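The worked example above can be reproduced with a short script. This is a minimal sketch rather than a reference implementation: the function name and argument names are made up for illustration, the scoring is the simple 2 / -1 / -1 scheme from the slides, and ties in the traceback are broken in favour of the diagonal move, which is an arbitrary choice.

```python
def needleman_wunsch(s, t, match=2, mismatch=-1, gap=-1):
    """Global alignment of s and t. Returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    # V[i][j] is the best score for aligning s[:i] with t[:j].
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap   # leading gaps against s
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap   # leading gaps against t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,   # match / mismatch
                          V[i - 1][j] + gap,       # s[i-1] against a gap
                          V[i][j - 1] + gap)       # t[j-1] against a gap

    # Trace back from the bottom-right cell to the origin.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sub:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))


score, top, bottom = needleman_wunsch("ACCGGTAT", "ACCTATC")
print(score)    # 9
print(top)      # ACCGGTAT-
print(bottom)   # ACC--TATC
```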

A bit more formally..

Base conditions:

\[ V(i,0) = \sum_{k=1}^{i} \sigma(S_k,-), \qquad V(0,j) = \sum_{k=1}^{j} \sigma(-,T_k) \]

so that V(0,0) = 0.

Recurrence relation: for 1 <= i <= n, 1 <= j <= m:

\[ V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(S_i,T_j) \\ V(i-1,j) + \sigma(S_i,-) \\ V(i,j-1) + \sigma(-,T_j) \end{cases} \]

Time Complexity

• Each cell is dependent on three others and the two relevant characters in each sequence
• Hence each cell takes a constant time
• (n+1) x (m+1) cells
• Complexity is therefore O(nm)

Space Complexity

• To calculate each row we need the current row and the row above only.
• Therefore to get the score, we need O(n+m) space
• However, if we need the pointers as well, this increases to O(nm) space
• This is a problem for very long sequences – think about the size of whole genomes

Global alignment in linear space

• Hirschberg 1977 applied a ‘divide and conquer’ algorithm to Global Alignment to solve the problem in linear space.
• Divide the problem into small manageable chunks
• The clever bit is finding the chunks
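The observation that only the current row and the row above are needed can be sketched as follows. The function name is hypothetical and the scoring is the simple 2 / -1 / -1 scheme from the worked example; only the score is returned, because the pointers needed for a traceback are exactly what this version throws away.

```python
def nw_score_two_rows(s, t, match=2, mismatch=-1, gap=-1):
    """Global alignment score only, keeping two rows of the array in memory."""
    if len(t) > len(s):
        s, t = t, s   # the scoring is symmetric, so keep the rows short
    prev = [j * gap for j in range(len(t) + 1)]        # row 0
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)                # column 0 of row i
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # diagonal
                          prev[j] + gap,       # from the row above
                          curr[j - 1] + gap)   # from the left
        prev = curr                            # the old row is no longer needed
    return prev[len(t)]


print(nw_score_two_rows("ACCGGTAT", "ACCTATC"))   # 9
```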

dividing...

[Figure: the array from (0,0) to (m,n), split at row n/2]

Hirschberg’s divide and conquer approach

• Compute matrix V(A,B), saving the values for the (n/2)th row - call this matrix F
• Compute matrix V(A^r,B^r) on the reversed sequences, saving the values for the (n/2)th row - call this matrix B
• Find column k so that the crossing point (n/2,k) satisfies:

\[ F(n/2,k) + B(n/2,m-k) = F(n,m) \]

• Now we have two much smaller problems: (0,0) -> (n/2,k) and (n,m) -> (n/2,m-k)
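The search for the crossing point can be sketched like this. It shows only the single split step, not the full recursive algorithm; the helper names are made up for illustration, the scoring is the 2 / -1 / -1 scheme from the worked example, and ties between equally good columns are resolved by taking the leftmost one.

```python
def last_row(s, t, match=2, mismatch=-1, gap=-1):
    """Final row of the global alignment array for s (rows) against t (columns)."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev


def hirschberg_split(s, t):
    """Return (mid, k, score): an optimal path crosses row mid at column k."""
    mid = len(s) // 2
    m = len(t)
    forward = last_row(s[:mid], t)                  # F(n/2, j)
    backward = last_row(s[mid:][::-1], t[::-1])     # B(n/2, j), reversed sequences
    totals = [forward[k] + backward[m - k] for k in range(m + 1)]
    k = max(range(m + 1), key=totals.__getitem__)   # leftmost best column
    return mid, k, totals[k]


print(hirschberg_split("ACCGGTAT", "ACCTATC"))   # (4, 3, 9)
```

For the example sequences the split says that ACCG is aligned against ACC and GTAT against TATC, and the two halves can then be solved independently in the same way.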

Complexity

• After applying Hirschberg’s divide and conquer approach we get the following:
– Time complexity O(mn)
– Space O(min(m,n))
• For the proofs, see D.S. Hirschberg. (1977) Algorithms for the longest common subsequence problem. J. A.C.M 24: 664-667

OK where are we?

• The Needleman-Wunsch algorithm finds the optimum alignment and the best score.
– NW is a dynamic programming algorithm
• Space complexity is a problem with NW
• Addressed by a divide and conquer algorithm
• What about local and ends-free alignments?

Smith-Waterman algorithm

• Between two sequences, find the best two subsequences and their score.
• We want to ignore badly matched sequence
• Use the same types of substitution matrix and gap penalties
• Use a modification of the previous dynamic programming approach.

Smith-Waterman algorithm

• If Si matches Tj then σ(Si,Tj) >= 0
• If they do not match or represent a gap then the score is <= 0
• Lowest allowable value of any cell is 0
• Find the cell with the highest value (i,j) and extend the alignment back to the first zero value
• The score of the alignment is the value in that cell
• A quick example is best...

min value of any cell is 0

Row 0 and column 0 are all zero, and the array is filled row by row as before, except that no cell is allowed to drop below 0.

              A   C   C   G   G   T   A   T  (S)
          0   0   0   0   0   0   0   0   0
      T   0   0   0   0   0   0   2   1   2
      T   0   0   0   0   0   0   2   1   3
      G   0   0   0   0   2   2   1   1   2
      T   0   0   0   0   1   1   4   3   3
      A   0   2   1   0   0   0   3   6   5
      T   0   1   1   0   0   0   2   5   8
      C   0   0   3   3   2   1   1   4   7
     (T)

Find biggest cell and map alignment from there

The biggest cell is the 8 in the final column; tracing back along the diagonal (8, 6, 4, 2) to the first zero gives:

    GTAT (S)
    ||||
    GTAT (T)

Smith-Waterman cont’d

• Complexity
– Time is O(nm) as in global alignments
– Space is O(nm) as in global alignments
– A mod of Hirschberg's algorithm allows O(n+m) space, as two rows need to be stored at a time instead of one as in the global alignment.
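The local alignment procedure described above differs from the global one in three places: the borders are zero, every cell is clamped at zero, and the traceback starts at the largest cell and stops at the first zero. A minimal sketch, with a hypothetical function name and the 2 / -1 / -1 scoring of the worked example:

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment of s and t. Returns (score, sub_s, sub_t)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                        # never go below zero
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)

    # Trace back from the biggest cell until the first zero.
    out_s, out_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))


print(smith_waterman("ACCGGTAT", "TTGTATC"))   # (8, 'GTAT', 'GTAT')
```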

A bit more formally..

Base conditions: ∀i,j. V(i,0) = 0, V(0,j) = 0

Recurrence relation: for 1 <= i <= n, 1 <= j <= m:

\[ V(i,j) = \max \begin{cases} 0 \\ V(i-1,j-1) + \sigma(S_i,T_j) \\ V(i-1,j) + \sigma(S_i,-) \\ V(i,j-1) + \sigma(-,T_j) \end{cases} \]

Compute i* and j* such that:

\[ V(i^*,j^*) = \max_{1 \le i \le n,\ 1 \le j \le m} V(i,j) \]

Ends-free alignment

• Find the overlap between two sequences such that the start of one is in the alignment and the end of the other is in the alignment.
• Essential to DNA sequencing strategies.
– Building genome fragments out of shorter sequencing data.
• Another variant of the Global Alignment Problem

Ends-free alignment

• Set the initial conditions to zero weight
– allow indels/gaps at the ends without penalty
• Fill the array/table using the same recursion model used in global/local alignment
• Find the best alignment that ends in the last row or column
– trace this back

min value row0 & col0 is 0

              G   T   T   A   C   T   G   T  (S)
          0   0   0   0   0   0   0   0   0
      C   0  -1  -1  -1  -1   2   1   0  -1
      T   0  -1   1   1   0   1   4   3   2
      G   0   2   1   0   0   0   3   6   5
      T   0   1   4   3   2   1   2   5   8
      A   0   0   3   3   5   4   3   4   7
      T   0  -1   2   5   4   4   6   5   6
      C   0  -1   1   4   4   6   5   5   5
     (T)

Find the best ‘end’ point in an end col or row

Here that is the 8 in the final column.

Trace the best route from there to the origin and end

    GTTACTGT--- (S)
        ||||
    ----CTGTATC (T)

A bit more formally..

Base conditions: ∀i,j. V(i,0) = 0, V(0,j) = 0

Recurrence relation: for 1 <= i <= n, 1 <= j <= m:

\[ V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(S_i,T_j) \\ V(i-1,j) + \sigma(S_i,-) \\ V(i,j-1) + \sigma(-,T_j) \end{cases} \]

Search for i* such that:

\[ V(i^*,m) = \max_{1 \le i \le n} V(i,m) \]

Search for j* such that:

\[ V(n,j^*) = \max_{1 \le j \le m} V(n,j) \]

Define alignment score:

\[ V(S,T) = \max \{ V(n,j^*),\ V(i^*,m) \} \]
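The ends-free variant can be sketched by zeroing the borders and then looking for the best cell in the final row or final column only. The function name is hypothetical, the scoring is the same 2 / -1 / -1 scheme, and only the score and the end cell are returned; the traceback would proceed from that cell exactly as in the global case, stopping when it reaches row 0 or column 0.

```python
def ends_free(s, t, match=2, mismatch=-1, gap=-1):
    """Ends-free alignment score. Returns (score, (i, j)) for the best end cell."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # leading gaps cost nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Same recurrence as global alignment: cells may go negative.
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
    # Trailing gaps cost nothing either, so the alignment may end anywhere
    # in the last row (all of s used) or the last column (all of t used).
    candidates = [(V[i][m], (i, m)) for i in range(n + 1)]
    candidates += [(V[n][j], (n, j)) for j in range(m + 1)]
    return max(candidates)


print(ends_free("GTTACTGT", "CTGTATC"))   # (8, (8, 4))
```

The end cell (8, 4) says that all eight bases of S and the first four bases of T (CTGT) are consumed, which matches the overlap shown in the slide.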

Summary so far...

• Dynamic programming algorithms can solve global, local and ends-free alignment
• They give the optimum score and alignment using the parameters given
• Divide and conquer approaches make the space complexity manageable for small-medium sized sequences

Dynamic Programming Issues

• For huge sequences, even linear space constraints are a problem.
• We used a very simple gap penalty
• The Affine Gap penalty is most commonly used.
– Cost to open a gap
– Cost to extend an open gap
• Need to track and evaluate the ‘gap’ state in the array

Tracking the gap state

• We can model the matches and gap insertions as a finite state machine:

[Figure: finite state machine for matches and gap insertions. Taken from Durbin, chapter 2.4]

Tracking the gap state

• Working along the alignment process...

[Figure: the state machine stepping along an alignment. Taken from Durbin, chapter 2.4]
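One common way to track the gap state in the array is to keep three arrays, one per state of the machine: match, gap in T, and gap in S. The sketch below is an assumption about how that might look, not the slides' own formulation: the names, the default weights, and the convention that a gap of length L costs gap_open + (L-1) x gap_extend (both given as negative numbers) are all illustrative. It returns the global score only.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score with an affine gap penalty, using three states."""
    n, m = len(s), len(t)

    def blank():
        return [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    M, X, Y = blank(), blank(), blank()
    # M: s[i-1] aligned to t[j-1]; X: s[i-1] against a gap; Y: t[j-1] against a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            X[i][j] = max(M[i - 1][j] + gap_open,     # open a new gap
                          X[i - 1][j] + gap_extend,   # extend the open gap
                          Y[i - 1][j] + gap_open)     # switch gap direction
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Setting gap_open and gap_extend both to -1 reduces this to the simple per-base penalty, so affine_global_score("ACCGGTAT", "ACCTATC", gap_open=-1, gap_extend=-1) gives the same score of 9 as the worked example.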

Real Life Sequence Alignment

• When searching multiple genomes, the sizes still get too big!
• Several approaches have been tried:
• Use huge parallel hardware:
– Distribute the problem over many CPUs
– Very expensive
• Implement in Hardware
– Cost of specialist boards is high
– Has been done for Smith-Waterman on SUN

Real Life Sequence Alignment

• Use a Heuristic Method
– Faster than ‘exact’ algorithms
– Give an approximate solution
– Software based therefore cheap
• Based on a number of assumptions:

Assumptions for Heuristic Approaches

• Even linear time complexity is a problem for large genomes
• Databases can often be pre-processed to a degree
• Substitutions more likely than gaps
• Homologous sequences contain a lot of substitutions without gaps which can be used to help find start points in alignments

Conclusions

• Dynamic programming algorithms are expensive but they give you the optimum alignment and exact score
• Choice of GAP penalty and substitution matrix are critically important
• Heuristic approaches are generally required for high throughput or very large alignments

Heuristic Methods

• FASTA
• BLAST
• Gapped BLAST
• PSI-BLAST

FASTA

Pearson and Lipman (1988) Improved tools for biological sequence comparison. PNAS 85: 2444-2448

• Compares a query string against a single text string (i.e. for sequence databases, lots of searches)
• Based on the assumption that a good local alignment is likely to have some exact matching subsequences
• The algorithm looks for these subsequences first.

Dot-plot alignment

• We can find good subsequences just by looking for diagonal runs of matched bases:
• Mark identical hits
• Find Diagonal Runs:

        a a g t c c c g t g
      a * *
      g     *         *   *
      g     *         *   *
      t       *         *
      c         * * *
      c         * * *
      g     *         *   *
      t       *         *
      t       *         *
      c         * * *
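Marking identical hits and picking out the diagonal runs can be sketched in a few lines. The function names and the min_len parameter are made up for illustration; the two sequences are the ones used in the dot-plot slides.

```python
def dot_plot(top, side):
    """One string per row of the plot; '*' marks an identical pair of bases."""
    return ["".join("*" if a == b else " " for b in top) for a in side]


def diagonal_runs(top, side, min_len=2):
    """Return (row, col, length) for each maximal diagonal run of matches."""
    runs = []
    for r in range(len(side)):
        for c in range(len(top)):
            if side[r] != top[c]:
                continue
            if r > 0 and c > 0 and side[r - 1] == top[c - 1]:
                continue   # this hit continues a run that started earlier
            length = 0
            while (r + length < len(side) and c + length < len(top)
                   and side[r + length] == top[c + length]):
                length += 1
            if length >= min_len:
                runs.append((r, c, length))
    return runs


top, side = "aagtcccgtg", "aggtccgttc"
for letter, row in zip(side, dot_plot(top, side)):
    print(letter, row)
print(max(diagonal_runs(top, side), key=lambda run: run[2]))   # (2, 2, 4)
```

The longest runs here have length 4, for example "gtcc" starting at row 2, column 2 (counting from 0).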

Dot-plot alignment

• We can find good subsequences just by looking for diagonal runs of matched bases:
• Compare to DP alignment:

[Figure: the same dot plot as on the previous slides]

FASTA Definitions

• ktup:
– (k respective tuples) – an integer value which specifies the word length used to find matching substrings
– Standard 4-6 for DNA
– Standard 1 or 2 for proteins
– Shorter is more sensitive but slower
– Target databases can be preprocessed into ktup sized chunks before queries are run.

FASTA Definitions

• hot spots:
– The matching ktup length substrings
– Consecutive hot-spots are located along the diagonal
– See dot-plot for example of 4 length hotspots
• diagonal run:
– A sequence of nearby hot-spots on the same diagonal
– i.e. spaces between hot-spots are allowed

FASTA Definitions

• init1:
– The best scoring run
• initn:
– The best local alignment
– Often close to the dynamic programming solution
– Combination of good diagonal runs and indels/gaps between them.

FASTA Process

1. Look for hot-spots:
• This stage can be done by using a look-up table or a hash.
• Pre-process the database and store the location of each possible ktup (AA = 20^2, DNA = 4^6)
• Move a ktup sized window along the query sequence and record the position of matching locations in the database.

FASTA Process

2. Find best diagonal runs:
• Each hot spot gets a positive score.
• Distance between hot spots is negative and length dependent
• The score of the diagonal run combines the hot-spot scores and the distance penalties
• FASTA finds and stores the 10 best diagonal runs
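The hot-spot stage can be sketched with an ordinary dictionary as the look-up table. This is an illustration of the idea only, not FASTA's actual code: the function names are hypothetical and a single short target string stands in for the pre-processed database. Hot spots that share the same (target position - query position) lie on the same diagonal, so counting them per diagonal is a first step towards finding diagonal runs.

```python
from collections import Counter, defaultdict


def build_ktup_index(target, ktup):
    """Pre-process the target: map every ktup-length word to its start positions."""
    index = defaultdict(list)
    for pos in range(len(target) - ktup + 1):
        index[target[pos:pos + ktup]].append(pos)
    return index


def find_hot_spots(query, index, ktup):
    """Slide a ktup window along the query; return (query_pos, target_pos) hits."""
    return [(q, p)
            for q in range(len(query) - ktup + 1)
            for p in index.get(query[q:q + ktup], [])]


def hot_spots_per_diagonal(hot_spots):
    """Count hot spots on each diagonal, identified by target_pos - query_pos."""
    return Counter(p - q for q, p in hot_spots)


index = build_ktup_index("aagtcccgtg", ktup=2)
spots = find_hot_spots("aggtccgttc", index, ktup=2)
print(len(spots))                                      # 10
print(hot_spots_per_diagonal(spots).most_common(2))    # [(1, 4), (0, 3)]
```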

FASTA Process

3. Compute init1 & filter:
• Diagonal runs specify a potential alignment
• Evaluate properly using a substitution matrix
• Define the best scoring run as init1
• Discard any much lower scoring runs

FASTA Process

4. Combine diagonal runs and compute initn:
• Take the ‘good alignments’ from previous stage
• Now allow gaps/indels
• Combine them into a single, better scoring alignment
– Construct a directed weighted graph
  • vertices are the runs
  • edge weights represent gap penalties
– Find the best path through the graph = initn

FASTA Process

5. Find the best local alignment
• Use the ‘alignments’ from the previous stage to define a narrow band through the search space
• Go through that band using a dynamic programming approach
• Size of the band is dependent on ktup value
• The best local alignment found in this stage is called opt

FASTA Process

6. Compare the alignments
• Take the opt or initn scores for each sequence in the database
• Rank according to score
• Use a full dynamic programming algorithm to align the query sequence with the highest ranking result sequences

FASTA Programs

• fasta3 scans a protein or DNA sequence library for similar sequences
• fastx3/fasty3 compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames
• tfastx3/tfasty3 compare a protein to a translated DNA data bank
• fasts3 compares linked peptides to a protein databank
• fastf3 compares mixed peptides to a protein databank

FASTA Summary

• The alignment produced is not always optimal
• The resulting scores usually compare very well with the dynamic programming solutions
• FASTA is much faster than ordinary dynamic programming algorithms

BLAST

Altschul, Gish, Miller, Myers and Lipman (1990) Basic local alignment search tool. J Mol Biol 215:403-410

• Developed on the ideas of FASTA
• Integrates the substitution matrix in the first stage of finding the hot spots
• Faster hot spot finding

BLAST definitions

• Given two strings S1 and S2
• A segment pair is a pair of equal-length substrings of S1 and S2 aligned without gaps
• A locally maximal segment is a segment whose alignment score (without gaps) cannot be improved by extending or shortening it.
• A maximum segment pair (MSP) in S1 and S2 is a segment pair with the maximum score over all segment pairs.

BLAST Process

• Parameters:
– w: word length (substrings)
– t: threshold for selecting interesting alignment scores

BLAST Process

• 1. Find all the w-length substrings from the database with an alignment score >t
– Each of these (similar to a hot spot in FASTA) is called a hit
– Does not have to be identical
– Scored using substitution matrix and score compared to the threshold t (which determines number found)
– Word size can therefore be longer without losing sensitivity: AA - 3-7 and DNA ~12

BLAST Process

• 2. Extend hits:
– extend each hit to a locally maximal segment
– extension of initial w size hit may increase or decrease the score
– terminate extension when a threshold is exceeded
– find the best ones (HSP)
• This first version of BLAST did not allow gaps….
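The hit-extension step can be sketched as an ungapped walk outwards from the initial word. Everything specific here is an assumption for illustration: the function name, the simple 2 / -1 scoring in place of a real substitution matrix, and in particular the stopping rule, which gives up once the running score has fallen more than `drop` below the best score seen so far (one plausible reading of "terminate extension when a threshold is exceeded").

```python
def extend_hit(query, target, q_pos, t_pos, w, match=2, mismatch=-1, drop=3):
    """Extend a w-length hit without gaps.

    Returns (best_score, q_start, q_end), with the segment being query[q_start:q_end].
    """
    def score(a, b):
        return match if a == b else mismatch

    # Score of the initial word itself.
    best = total = sum(score(query[q_pos + k], target[t_pos + k]) for k in range(w))

    # Extend to the right, remembering where the best score was reached.
    right, k = 0, w
    while q_pos + k < len(query) and t_pos + k < len(target):
        total += score(query[q_pos + k], target[t_pos + k])
        if total > best:
            best, right = total, k - w + 1
        elif best - total > drop:
            break
        k += 1

    # Extend to the left, starting again from the best score so far.
    total, left, k = best, 0, 1
    while q_pos - k >= 0 and t_pos - k >= 0:
        total += score(query[q_pos - k], target[t_pos - k])
        if total > best:
            best, left = total, k
        elif best - total > drop:
            break
        k += 1

    return best, q_pos - left, q_pos + w + right


print(extend_hit("CCGTATGG", "AAGTATCC", q_pos=3, t_pos=3, w=2))   # (8, 2, 6)
```

In this toy example the two-letter hit "TA" grows into the segment "GTAT" (query[2:6]) with score 8.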

(Improved) BLAST

Altschul, Madden, Schaffer, Zhang, Zhang, Miller & Lipman (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402

• Improved algorithms allowing gaps
– these have superseded the older version of BLAST
– two versions: Gapped and PSI BLAST

(Improved) BLAST Process

• Find words or hot-spots
– search each diagonal for two w length words such that score >=t
– future expansion is restricted to just these initial words
– we reduce the threshold t to allow more initial words to progress to the next stage

(Improved) BLAST Process

• Allow local alignments with gaps
– allow the words to merge by introducing gaps
– each new alignment comprises two words with a number of gaps
– unlike FASTA does not restrict the search to a narrow band
– as only two word hits are expanded this makes the new BLAST about 3x faster

PSI-BLAST

• Iterative version of BLAST for searching for protein domains
– Uses a dynamic substitution matrix
– Start with a normal BLAST search
– Take the results and use these to ‘tweak’ the matrix
– Re-run the BLAST search until no new matches occur
• Good for finding distantly related sequences but high frequency of false-positive hits

BLAST Programs

• blastp compares an amino acid query sequence against a protein sequence database.
• blastn compares a nucleotide query sequence against a nucleotide sequence database.
• blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
• tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. (SLOW)

Go try them out!

• Links to NCBI and EBI are on the course web site
• Some test sequences will be posted on the course web site

Alignment Heuristics

• Dynamic Programming is better but too slow
• FASTA and BLAST based on several assumptions about good alignments
– substitutions more likely than gaps
– good alignments have runs of identical matches
• FASTA good for DNA sequences but slower
• BLAST better for amino acid sequences and pretty good for DNA, fastest.