Sequence Alignment Heuristics

Some slides from: • Iosif Vaisman, GMU mason.gmu.edu/~mmasso/binf630alignment.ppt • Serafim Batzoglu, Stanford http://ai.stanford.edu/~serafim/ • Geoffrey J. Barton, Oxford “Protein Sequence Alignment and Database Scanning” http://www.compbio.dundee.ac.uk/ftp/preprints/review93/review93.pdf CG © Ron Shamir, 09 Why Heuristics ?

• Motivation: – Dynamic programming guarantees an optimal solution & is efficient, but – Not fast enough when searching a database of size ~1012, with a query of length 200-500bp • Solutions: – Implement on hardware. (COMPUGEN) – Parallel hardware. (MASSPAR) – Ad-hoc implementations using specific hardware. – Use faster heuristic algorithms. • Common Heuristics: FASTA, BLAST

CG © Ron Shamir, 09 GenBank Growth

CG © Ron Shamir, 09 www.ncbi.nlm.nih.gov/Genbank/benbanstats.html GenBank Growth

CG © Ron Shamir, 09 http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

October 15 2009: 108,560,236,506 bases

CG © Ron Shamir, 09 Key observations

• Even O(m+n) time would be problematic when db size is huge • Substitutions are much more likely than indels • Homologous sequences contain many matches • Numerous queries are run on the same db  Preprocessing of the db is desirable

CG © Ron Shamir, 09 Indexing-based local alignment

Dictionary: …… All words of length k (~10) query Alignment initiated between words of alignment score  T

…… Alignment: Ungapped extensions until score scan below statistical threshold DB

Output: All local alignments with score > statistical threshold query

CS262 Lecture 3, Win06, Batzoglou Detour: Banded Alignment

Assume we know that x and y are very similar

Assumption: # gaps(x, y) < k(N) ( say N>M )

xi Then, | implies | i – j | < k(N)

yj

We can align x and y more efficiently:

Time, Space: O(N  k(N)) << O(N2)

CS262 Lecture 2, Win06, Batzoglou Banded Alignment

Initialization:

x1 ………………………… xM F(i,0), F(0,j) undefined for i, j > k 1

Iteration:

For i = 1…M For j = max(1, i – k)…min(N, i+k)

F(i – 1, j – 1)+ s(xi, yj)

F(i, j) = max F(i, j – 1) – d, if j > i – k(N) ………………………… ………………………… y

F(i – 1, j) – d, if j < i + k(N) N y k(N) Termination: same

Easy to extend to the affine gap case

CS262 Lecture 2, Win06, Batzoglou Alignment Dot-Plot Matrix

a a g t c c c g t g a * * g * * * g * * t * * c * * * c * * * g * * * t * * t * *

CG © Ron Shamir,c 09 * * * Dot plots Example 1: close protein homologs (man and mouse)

CG © Ron Shamir, 09 www.bioinfo.rpi.edu/~zukerm/Bio-5495/ Example 2: remote protein homologs (man and bacilus)

CG © Ron Shamir, 09 Example 1: dot for 4+ matches in window of 5

CG © Ron Shamir, 09 Example 2: dot for 4+ matches in window of 5

CG © Ron Shamir, 09 FASTA : A Heuristic Method for Sequence Comparison • History: Lipman and Pearson in 1985, 1988 • Key idea: Good local alignment must have exact matching subsequences. • Algorithm Evaluation: – Resulting alignment scores well compared to the optimal alignment (shown experimentally) – Much faster than dynamic programming.

CG © Ron Shamir, 09 Disclaimer

• Highly popular software tools get numerous updates, revisions, versions, variants etc. • Implementation details differ considerably among versions. • It is hard to single out one ultimate version. • We present the basic ideas and details may vary.

CG © Ron Shamir, 09 a a g t c c t g a t t t g c c c a g g t * * * * * g * * * g * * * t * * * * “hot* spots” c * * * * * a * * * * * * a * * * * * * g * * * * a * * * * * * t * * * * * t * * * * * c * * * * * c * * * * * a * * * * * * t * * * * * c * * * * * a * * * * * * g * * * g * * * *

CG © Ron Shamir, 09 FASTA overview ktup = required min length of perfect match 1. Find hot spots = matches of length ktup 2. Find 10 best diagonal runs = almost consecutive hot spots on same diagonal. Best soln = init1 2.1 Find an optimal sub-alignment in each diagonal 3. Combine close sub-alignments. best soln = initn 4. Compute best DP solution in a band around initn. result = opt

CG © Ron Shamir, 09 FASTA – Step 1

Sequence B

Find hot spots: (runs of matches

of length ktup) Sequence A Sequence

CG © Ron Shamir, 09 FASTA – Step 2 Sequence B 2

Rescoring using a subs. matrix

high score

low score Sequence A Sequence The score of the highest scoring initial region is saved as the init1 score.

CG © Ron Shamir, 09 FASTA – Step 3 Sequence B 3 Joining threshold - eliminates disjointed

segments

Non-overlapping regions are joined. The score equals sum

Sequence A Sequence of the scores of the regions minus a gap penalty. The score of the highest scoring region, at the end of this step, is saved as the initn score.

CG © Ron Shamir, 09 FASTA – Step 4

Sequence B 4 Alignment optimization using dynamic programming

Sequence A Sequence The score for this alignment is the opt score.

CG © Ron Shamir, 09 A second viewpoint FASTA Algorithm (1) •ktup size: 4-6 for DNA,1-2 for 1. Look for hot spots : Common AA. subsequences of length ktup. Use lookup table/hash for efficiency a a gg t c c c g t g 2. Finding 10 best diagonal runs and init1: a * * For hits on the same diagonal, diff g ** * * between location in S,T is constant g ** * Scoring the diagonal runs - sum over t * * scores for hot spots and the inter- spots: Hot spots - positive score, c * * * Space - negative, decreases w distance c * * * Choose 10 highest diagonal runs. g ** * * 2.1 find best sub-alignment in each, t * * using a substitution matrix t * * Let init1 be the best scoring run. c * * *

CG © Ron Shamir, 09 FASTA Algorithm (2) 2. Find 10 best diagonal runs and init1 3. Allowing indels – combine close diagonal runs: Construct an alignment graph: •nodes =sub-alignments (SAs) • weight – alignment score (from 1) •Edges btw SAs that can fit together, •weight - negative, depends on the size of the corresponding gap Find a maximum weight path in it, initn

Alignment graph

CG © Ron Shamir, 09 FASTA Algorithm (3)

16/32 2. Find 10 best diagonal runs and init1 diagonals 3. combine close diagonal runs around init1 4. Compute opt (depending on ktup) Use DP to compute the best local alignment within a narrow band around init1. let opt be that alignment.

Rank the database sequences according to initn or opt scores

CG © Ron Shamir, 09 Fasta Algorithm principle

Pearson Lipman 88 CG © Ron Shamir, 09 FASTA Output

• The information on each hit includes: – General information and statistics – SW score, %identity and length of overlap

CG © Ron Shamir, 09 Statistical significance • Key question: how significant is the score x that was obtained? • Scores are not normally distributed • Solution 1: view the scores distribution over all database entries, see how far out x is.

CG © Ron Shamir, 09 Output of Fasta 2

CG © Ron Shamir, 09 ------Distribution of initial scores with ktup=2. | v initn init1 < 2 2 2:= 4 0 0: 6 4 4:==

8 18 18:======10 73 73:======12 326 326:======14 370 370:======16 1248 1248:======18 1354 1354:======20 2746 2746:======22 3151 3151:======24 6401 6401:======26 5110 5110:======28 5724 5763:======30 3887 4303:======32 2238 2682:======34 1401 1735:======36 863 1144:======38 533 690:======40 346 411:======42 481 250:======44 419 166:======46 346 110:======48 295 84:------++++++++ 50 203 47:------++++++++++++++++++++++++++ 52 184 46:------+++++++++++++++++++++++++++ 54 110 39:------++++++++++++++++++++++++++++++ 56 82 9:-----++++++++++++++++++++++++++++++++++++ 58 69 8:----+++++++++++++++++++++++++++++++ 60 71 3:--++++++++++++++++++++++++++++++++++ 62 75 1:-+++++++++++++++++++++++++++++++++++++ 64 36 1:-+++++++++++++++++ 66 31 0:++++++++++++++++ 68 17 2:-++++++++ 70 12 0:++++++ 72 12 0:++++++ 74 6 0:+++ 76 10 1:-++++ 78 28 0:++++++++++++++ 80 2 0:+ > 80 19 5:---+++++++ 13464008 residues in 38303 sequences statistics exclude scores greater than 73

http://bimas.dcrt.nih.gov/fastainfo/fastaexample.html mean initn score: 26.8 (7.79) mean init1 score: 26.0 (6.05) 5349 scores better than 33 saved, ktup: 2, variable pamfact joining threshold: 28 scan time: 0:00:30 CG------© Ron Shamir, 09 The best scores are: initn init1 opt sp|P33013|YEEC_ECOLI HYPOTHETICAL 38.9 KD PROTEIN IN S 1422 1422 1422 Statistical significance (2) • Key question: how significant is the score x that was obtained? • Solution 2:  - average score of random sequence;  - standard dev. • Z-score: z = (x- ) /  • Rule of thumb: z > 3 possibly significant, z>6 probably significant, z>10 significant • Issues: sensitivity vs selectivity. • Pertinence to biology is the bottom line

CG © Ron Shamir, 09 August 1997: NCBI Director David Lipman (far left) coaches Vice President Gore (seated) as he searches PubMed. NIH Director Harold Varmus (center) and NLM Director Donald Lindberg look on. CG © Ron Shamir, 09 Bill Pearson Bill Pearson received his Ph.D. in in 1977 from the California Institute of Technology. He then did a post- doctoral fellowships at the Caltech Marine Station in Corona del Mar, CA and at the Department of Molecular Biology and Genetics at Johns Hopkins. In 1983 he joined the Department of Biochemistry at the University of Virginia.

CG © Ron Shamir, 09 BLAST Basic Local Alignment Search Tool Altschul, Gish, Miller, Myers and Lipman 1990. • Motivation: Need to increase the speed of FASTA by finding fewer and better spots during the algorithm. • The Core of the Algorithm: Finding fewer and better hot spots, but not insisting on perfect matches in them. • Some statistical results on the significance of the results • Different versions for protein, DNA, …

CG © Ron Shamir, 09

CG © Ron Shamir, 09 BLAST – outline

• Compile a list of high scoring words with the query • Scan the database for hits • Extend hits

CG © Ron Shamir, 09 BLAST Algorithm 1 Query sequence of length L

Maximium of L-w+1 words (typically w = 3 for proteins)

For each word from the query sequence find the list of words with high score using a substitution matrix

Word list BLAST Algorithm 2 Database sequences

Word list

Exact matches of words from the word list to the database sequences BLAST Algorithm 3

Maximal Segment Pairs (MSPs)

For each exact word match, alignment is extended in both directions to find high score segments A second viewpoint BLAST - Basic Definitions

match +2, mismatch -1 • Given two sequences S1 and S2, a segment pair is a pair of equal length S =a g c t g g t t t a subsequences of S and S , 1 1 2 S =c t t g a t g g t a respectively, aligned without spaces. 2 • A locally maximal segment pair is a pair aligned without spaces (but S =a g c t g g t t t a possibly with mismatches) whose 1 alignment score cannot be improved S2=c t t g a t g g t a by extending it or shortening it.

• A maximal segment pair (MSP) in S1, S2 is a segment pair with the maximum score over all segment S1=a g c t g g t t t a S =c t t g a t g g t a pairs in S1, S2. 2

CG © Ron Shamir, 09 BLAST - The Algorithmhits

• Fix: word length w, thresholds t, C • Seek segment pairs of length w & score  t, – Compile for each w-long subseq  of the query, the list of all w-long words with similarity score  t to . – Scan the query with a shifting w-long window: find every exact occurrence of word in the list (linear in text length) • Extend each such pair, test if contained within segment pair of score  C, (local MSP) Typical w values: CG © Ron Shamir, 09 3-5 for amino acids, ~12 for nucleotides Sensitivity-Speed Tradeoff

long words short words X% (k = 15) (k = 7) Sensitivity  Speed 

Sens.

Speed

Kent WJ, Genome Research 2002 CS262 Lecture 3, Win06, Batzoglou Gene Myers, , Warren Gish

CG © Ron Shamir, 09 BLAST statistics • Theory of Karlin, Altschul, and Dembo on the distribution of the MSP of score at random • Define parameters K,  (depending on AA distribution) • Pr (finding a pair of score >S in comparing two random seqs of length m, n) = 1 – e-y where Y=Kmn e-s • Extreme value dist (or Gumbell dist) • Allow the calculated choice of smallest C

CG © Ron Shamir, 09 Sam Karlin, Steve Altschul, Amir Dembo

CG © Ron Shamir, 09 Improvement: Gapped BLAST Altschul et al. 97 • The original BLAST extends several HSPs and then attempts to combine them with gaps • The new version allows gapped extensions for the best segments passing the two hit condition • Approximately one in 50 targets sequences reaches the gapped extension step • Using DP on dynamically changing area (not a band)

CG © Ron Shamir, 09 The sensitivity of the two-hit and one-hit heuristics as a function of HSP score.

CG © Ron Shamir, 09 Gapped BLAST outline • Find two nearby hits: • Find two non-overlapping w-long words with: – score  t, each – on same diagonal – within distance  A • Perform ungapped extension • If score exceeds S, perform gapped extension • Apply DP on a changing region: stop

extension when score falls Xg below

CG © Ron bestShamir, 09 score attained so far • Figure 2. The BLAST comparison of broad bean leghemoglobin I (87) (SWISS-PROT accession no. P02232) and horse [beta]-globin (88) (SWISS-PROT accession no. P02062). The 15 hits with score at least 13 are indicated by plus signs. An additional 22 non-overlapping hits with score at least 11 are indicated by dots. Of these 37 hits, only the two indicated pairs are on the same diagonal and within distance 40 of one another. Thus the two-hit heuristic with T = 11 triggers two extensions, in place of the 15 extensions invoked by the one-hit heuristic with T = 13. Because this is just one example, the relative numbers of hits and extensions at the various settings of T correspond only roughly to the ratios found in a full database search. An ungapped extension of the leftward of the two hit pairs CG © Ronyields Shamir, an HSP09 with nominal score 45, or 23.6 bits, calculated using [lambda]u and Ku.

Figure 3. A gapped extension generated by BLAST for the comparison of broad bean leghemoglobin I (87) and horse [beta]-globin (88). (a) The region of the path graph explored when seeded by the alignment of alanine residues at respective positions 60 and 62. This seed derives from the HSP generated by the leftward of the two ungapped extensions illustrated in Figure 2. The Xg dropoff parameter is the nominal score 40, used in conjunction with BLOSUM-62 substitution scores and a cost of 10 + k for gaps of length k. (b) The path corresponding to the optimal local alignment generated, superimposed on the hits described in Figure 2. The original BLAST program, using the one-hit heuristic with T = 11, is able to locate three of the five HSPs included in this alignment, but only the first and last achieve a score sufficient to be reported. (c) The optimal local alignment, with nominal score 75 and normalized score 32.4 bits. In the context of a search of SWISS-PROT (26), release 34 (21 219 450 residues), using the leghemoglobin sequence (143 residues) as query, the E-value is 0.54 if no edge-effect correction (22) is invoked. The original BLAST program locates the first and last ungapped segments of this alignment. Using sum-statistics with no edge-effect correction, this combined result has an E-value of 31 (21,22). On the central lines of the alignment, identities are echoed and substitutions to which the BLOSUM-62 matrix (18) gives a positive score are indicated by a `+'

CG © Ron Shamir, 09 CG © Ron Shamir, 09 Figure 4. The path graph region explored by BLAST during a gapped extension for the comparison of broad bean leghemoglobin I and the E1B protein small T-antigen from human adenovirus type 4 (89) (SWISS-PROT accession no. P10406). The Xg dropoff parameter is the nominal score 40, used in conjunction with BLOSUM-62 substitution scores and 10 + k gap costs. The 22.7 bit HSP that triggers this extension, involving leghemoglobin residues 119-140 and adenovirus residues 101-122, is merely a random similarity, and not part of a larger and higher-scoring alignment. The gapped extension is seeded by the alignment of residues 124 and 106. The optimal alignment score through points in the path graph drops steadily as one moves beyond the triggering HSP, and the reverse extension terminates before the beginning of either protein is reached. A total of 2766 path graph cells are explored, with the reverse extension accounting for 2047 of these cells. CG © Ron Shamir, 09 Relative times spent by the original and gapped BLAST programs on various algorithmic stages

Overhead: Calculating whether hits Ungapped Gapped database scanning, qualify for ungapped extensions extensions output, etc. extension

Original 8 (8%) 92 (92%) BLAST Gapped 8 (24%) 12 (37%) 5 (15%) 8 (24%) BLAST

Speed: ~3 times faster than the original BLAST

CG © Ron Shamir, 09 Psi-BLAST team

Thomas Madden, David Lipman, Alex Schaeffer, CG ©Steve Ron Shamir, 09Altschul PSI - BLAST

1. Execute BLAST 2. Compile a PSSM – position specific score matrix from the resulting hits 3. Execute BLAST with the new profile 4. Iterate to convergence 5. Idea: converge on a family http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html

CG © Ron Shamir, 09