Heuristics

Some slides from:
• Iosif Vaisman, GMU mason.gmu.edu/~mmasso/binf630alignment.ppt
• Serafim Batzoglou, Stanford http://ai.stanford.edu/~serafim/
• Geoffrey J. Barton, Oxford “Sequence Alignment and Database Scanning” http://www.compbio.dundee.ac.uk/ftp/preprints/review93/review93.pdf

© CG 2015

Why Heuristics?

• Motivation:
  – Dynamic programming guarantees an optimal solution & is efficient, but
  – Not fast enough when searching a database of size ~10^12 with a query of length 200-500 bp

GenBank Growth

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Possible Solutions

• Solutions:
  – Implement on hardware. (COMPUGEN)
  – Parallel hardware. (MASSPAR)
  – Ad-hoc implementations using specific hardware.
  – Use faster heuristic algorithms.
    • Limit the number of allowed indels.
    • Look for “long” matching subsequences.
    • Use indexing/hashing.
• Common heuristics: FASTA, BLAST

CG © Ron Shamir, 09

Key observations

• Even O(m+n) time would be problematic when db size is huge
• Substitutions are much more likely than indels
• Homologous sequences contain many matches
• Numerous queries are run on the same db
⇒ Preprocessing of the db is desirable

Indexing-based local alignment

[Figure: words from the query are looked up in a dictionary and scanned against the DB]

Dictionary: all words of length k (~10). Alignment is initiated between words of alignment score ≥ T.

Alignment: ungapped extensions until the score falls below a statistical threshold.

Output: all local alignments with score > statistical threshold.
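The dictionary idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: it indexes exact k-letter words (the slide allows words scoring at least T), and the names `build_kmer_index` and `find_seeds` are hypothetical, not taken from any tool.

```python
from collections import defaultdict

def build_kmer_index(db_seqs, k):
    """Map every length-k word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(db_seqs):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def find_seeds(query, index, k):
    """Return (query offset, sequence id, db offset) for every exact k-word match."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((qpos, seq_id, dpos))
    return seeds

# Toy database, preprocessed once and reused for every query.
db = ["ACGTACGTGA", "TTTACGTTT"]
index = build_kmer_index(db, k=4)
seeds = find_seeds("GGACGTGG", index, k=4)
```

The index is built once per database, so the per-query cost depends on the query length and the number of seeds rather than on a full scan of every database sequence.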

CS262 Lecture 3, Win06, Batzoglou

Detour: Banded Alignment

Assume we know that x and y are very similar.

Assumption: # gaps(x, y) < k(N) (say N > M)

Then, $x_i$ aligned to $y_j$ implies $|i - j| < k(N)$

We can align x and y more efficiently:

Time, Space: $O(N \times k(N)) \ll O(N^2)$

CS262 Lecture 2, Win06, Batzoglou

Banded Alignment

[Figure: DP matrix over x1 … xM and y1 … yN with a band of width k(N) around the main diagonal]

Initialization:

F(i, 0), F(0, j) undefined for i, j > k

Iteration:

For i = 1…M
  For j = max(1, i – k)…min(N, i + k)

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i, j-1) - d, & \text{if } j > i - k(N) \\ F(i-1, j) - d, & \text{if } j < i + k(N) \end{cases}$$

Termination: same

Easy to extend to the affine gap case
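A minimal sketch of the banded recurrence above, for the linear gap case only. The scoring values (`match`, `mismatch`, `d`) and the function name are assumptions chosen for illustration; the sketch returns the score only and keeps two band rows in memory, so it does not perform the traceback that needs the full O(N × k) table.

```python
def banded_alignment_score(x, y, k, match=1, mismatch=-1, d=2):
    """Global alignment score restricted to cells with |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow to reach cell (M, N)")
    NEG = float("-inf")  # stands for 'undefined' cells outside the band
    # Row 0: only columns 0..k are inside the band.
    prev = {j: -d * j for j in range(0, min(N, k) + 1)}
    for i in range(1, M + 1):
        cur = {}
        for j in range(max(0, i - k), min(N, i + k) + 1):
            if j == 0:
                cur[j] = -d * i  # first column: i leading gaps
                continue
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = prev.get(j - 1, NEG) + s            # F(i-1, j-1) + s(xi, yj)
            best = max(best, cur.get(j - 1, NEG) - d)  # F(i, j-1) - d, inside band only
            best = max(best, prev.get(j, NEG) - d)     # F(i-1, j) - d, inside band only
            cur[j] = best
        prev = cur
    return prev[N]

score_same = banded_alignment_score("ACGT", "ACGT", k=1)
score_one_gap = banded_alignment_score("ACGT", "AGT", k=1)
```

Looking up a cell outside the band returns minus infinity, which plays the role of the two side conditions on j in the recurrence.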

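CS262 Lecture 2, Win06, Batzoglou

Alignment Dot-Plot Matrix

    a a g t c c c g t g
a   * * . . . . . . . .
g   . . * . . . . * . *
g   . . * . . . . * . *
t   . . . * . . . . * .
c   . . . . * * * . . .
c   . . . . * * * . . .
g   . . * . . . . * . *
t   . . . * . . . . * .
t   . . . * . . . . * .
c   . . . . * * * . . .

Dot plots

Example 1: close protein homologs (man and mouse)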

www.bioinfo.rpi.edu/~zukerm/Bio-5495/

Example 2: remote protein homologs (man and bacillus)

Example 1: dot for 4+ matches in window of 5

Example 2: dot for 4+ matches in window of 5
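The plain and the filtered dot plots can be sketched as follows. This is an illustrative sketch, not the program that produced the slides' figures; in particular, anchoring each window at its top-left cell is an assumption, and the function names are hypothetical.

```python
def dot_plot(a, b):
    """Cell (i, j) is True when a[i] == b[j]."""
    return [[ca == cb for cb in b] for ca in a]

def filtered_dot_plot(a, b, window=5, min_matches=4):
    """Mark (i, j) when the diagonal window starting there has >= min_matches identities."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + t] == b[j + t] for t in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots

plain = dot_plot("ag", "ga")
filtered = filtered_dot_plot("ACGTACGT", "ACGTTCGT")
```

Filtering by window removes most isolated chance matches, so the diagonals that correspond to real similarity stand out.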

FASTA: A Heuristic Method for Sequence Comparison

• History: Lipman and Pearson in 1985, 1988
• Key idea: Good local alignment must have exact matching subsequences.
• Algorithm evaluation:
  – Resulting alignment scores well compared to the optimal alignment (shown experimentally)
  – Much faster than dynamic programming.

Disclaimer

• Highly popular software tools get numerous updates, revisions, versions, variants, etc.
• Implementation details differ considerably among versions.
• It is hard to single out one ultimate version.
• We present the basic ideas; details may vary.

[Figure: dot plot of two DNA sequences; diagonal runs of matches are marked as “hot spots”]

FASTA overview

ktup = required min length of perfect match

1. Find hot spots = matches of length ktup
2. Find 10 best diagonal runs = almost consecutive hot spots on same diagonal. Best solution = init1
   2.1 Find an optimal sub-alignment in each diagonal
3. Combine close sub-alignments. Best solution = initn
4. Compute best DP solution in a band around initn. Result = opt

FASTA – Step 1

[Figure: dot plot, Sequence A vs. Sequence B]

Find hot spots (runs of matches of length ktup).
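Step 1 can be sketched with a lookup table on one sequence. The helper names are hypothetical, and grouping hot spots by diagonal (i - j) is shown only as the natural preparation for the diagonal runs of Step 2; actual FASTA versions differ in the details.

```python
from collections import defaultdict

def hot_spots(a, b, ktup):
    """All (i, j) with a[i:i+ktup] == b[j:j+ktup], found via a lookup table on b."""
    table = defaultdict(list)
    for j in range(len(b) - ktup + 1):
        table[b[j:j + ktup]].append(j)
    spots = []
    for i in range(len(a) - ktup + 1):
        for j in table.get(a[i:i + ktup], []):
            spots.append((i, j))
    return spots

def hot_spots_per_diagonal(spots):
    """Count hot spots on each diagonal, identified by i - j."""
    counts = defaultdict(int)
    for i, j in spots:
        counts[i - j] += 1
    return dict(counts)

spots = hot_spots("AAGTCC", "GTCCAA", ktup=2)
per_diagonal = hot_spots_per_diagonal(spots)
```

Diagonals with many hot spots are the candidates for the best diagonal runs.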

FASTA – Step 2

[Figure: dot plot, Sequence A vs. Sequence B, with diagonal regions marked from high score to low score]

Rescoring using a substitution matrix.

The score of the highest scoring initial region is saved as the init1 score.

FASTA – Step 3

[Figure: dot plot, Sequence A vs. Sequence B]

Joining threshold: eliminates disjointed segments.

Non-overlapping regions are joined. The score equals the sum of the scores of the regions minus a gap penalty. The score of the highest scoring region, at the end of this step, is saved as the initn score.

FASTA Algorithm (2)

2. Find 10 best diagonal runs and init1
3. Allowing indels: combine close diagonal runs.
   Construct an alignment graph:
   • Nodes = sub-alignments (SAs); weight = alignment score (from 1)
   • Edges between SAs that can fit together; weight is negative and depends on the size of the corresponding gap
   Find a maximum weight path in it: initn

Alignment graph
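The maximum weight path can be found by dynamic programming over the sub-alignments in sorted order, since the graph is acyclic. The sketch below is illustrative: the tuple layout, the function name `best_chain`, and the edge penalty `gap_open + gap_extend * shift` (with `shift` the diagonal offset between two sub-alignments) are assumptions, since the slides only say that the edge weight is negative and depends on the gap size.

```python
def best_chain(sub_alignments, gap_open=5, gap_extend=1):
    """Maximum-weight path in the alignment graph.

    Each sub-alignment is (i_start, j_start, length, score) on one diagonal.
    """
    nodes = sorted(sub_alignments)  # predecessors always sort before successors
    best = []                       # best[v] = weight of the best path ending at v
    for v, (vi, vj, vlen, vscore) in enumerate(nodes):
        best_v = vscore             # a path may start at v
        for u in range(v):
            ui, uj, ulen, _ = nodes[u]
            # u can precede v only if it ends before v starts in both sequences
            if ui + ulen <= vi and uj + ulen <= vj:
                shift = abs((vi - vj) - (ui - uj))
                edge = -(gap_open + gap_extend * shift)
                best_v = max(best_v, best[u] + edge + vscore)
        best.append(best_v)
    return max(best, default=0)

initn = best_chain([(0, 0, 10, 20), (12, 15, 8, 16), (30, 5, 5, 25)])
```

In this toy input the first two sub-alignments can be joined (20 + 16 minus a penalty of 8), while the third cannot follow either of them because it starts earlier in the second sequence.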

FASTA – Step 4

[Figure: dot plot, Sequence A vs. Sequence B]

Alignment optimization using dynamic programming.

The score for this alignment is the opt score.

FASTA Output

• The information on each hit includes:
  – General information and statistics
  – SW score, % identity and length of overlap

Statistical significance

• Key question: how significant is the score x that was obtained?
• Scores are not normally distributed
• Solution 1: view the distribution of scores over all database entries, see how far out x is.

Output of Fasta 2

Distribution of initial scores with ktup=2.

score   initn   init1
 < 2        2       2
   4        0       0
   6        4       4
   8       18      18
  10       73      73
  12      326     326
  14      370     370
  16     1248    1248
  18     1354    1354
  20     2746    2746
  22     3151    3151
  24     6401    6401
  26     5110    5110
  28     5724    5763
  30     3887    4303
  32     2238    2682
  34     1401    1735
  36      863    1144
  38      533     690
  40      346     411
  42      481     250
  44      419     166
  46      346     110
  48      295      84
  50      203      47
  52      184      46
  54      110      39
  56       82       9
  58       69       8
  60       71       3
  62       75       1
  64       36       1
  66       31       0
  68       17       2
  70       12       0
  72       12       0
  74        6       0
  76       10       1
  78       28       0
  80        2       0
> 80       19       5

13464008 residues in 38303 sequences
statistics exclude scores greater than 73

http://bimas.dcrt.nih.gov/fastainfo/fastaexample.html

mean initn score: 26.8 (7.79)
mean init1 score: 26.0 (6.05)
5349 scores better than 33 saved, ktup: 2, variable pamfact
joining threshold: 28
scan time: 0:00:30

The best scores are:                                    initn init1  opt
sp|P33013|YEEC_ECOLI HYPOTHETICAL 38.9 KD PROTEIN IN S   1422  1422 1422

Statistical significance (2)

• Key question: how significant is the score x that was obtained?
• Solution 2: μ = average score of random sequence; σ = standard dev.
• Z-score: z = (x - μ) / σ
• Rule of thumb: z > 3 possibly significant, z > 6 probably significant, z > 10 significant
• Issues: sensitivity vs selectivity.
• Pertinence to biology is the bottom line
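The Z-score and the rule of thumb can be written down directly. This is a small sketch with hypothetical function names; estimating μ and σ from a list of background scores with the population standard deviation is an assumption about how the random-sequence scores are summarized.

```python
from statistics import mean, pstdev

def z_score(x, background_scores):
    """z = (x - mu) / sigma, with mu and sigma estimated from background scores."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)
    return (x - mu) / sigma

def rule_of_thumb(z):
    """Verbal reading of a Z-score following the slide's thresholds."""
    if z > 10:
        return "significant"
    if z > 6:
        return "probably significant"
    if z > 3:
        return "possibly significant"
    return "not significant"

z = z_score(8, [1, 2, 3, 4, 5])
verdict = rule_of_thumb(z)
```

If the parenthesized value in the FASTA output above is read as the standard deviation of the init1 scores (an assumption), the top hit's init1 score of 1422 is (1422 - 26.0) / 6.05, more than 200 standard deviations above the mean.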

August 1997: NCBI Director David Lipman (far left) coaches Vice President Gore (seated) as he searches PubMed. NIH Director Harold Varmus (center) and NLM Director Donald Lindberg look on.

Bill Pearson

Bill Pearson received his Ph.D. in Biochemistry in 1977 from the California Institute of Technology. He then did postdoctoral fellowships at the Caltech Marine Station in Corona del Mar, CA and at the Department of Molecular Biology and Genetics at Johns Hopkins. In 1983 he joined the Department of Biochemistry at the University of Virginia.

BLAST: Basic Local Alignment Search Tool

Altschul, Gish, Miller, Myers and Lipman 1990.

• Motivation: need to increase the speed of FASTA by finding fewer and better hot spots during the algorithm.
• The core of the algorithm: finding fewer and better hot spots, but not insisting on perfect matches in them.
• Some statistical results on the significance of the results
• Different versions for protein, DNA, …

BLAST – outline

• Compile a list of high scoring words with the query
• Scan the database for hits
• Extend hits
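The three outline steps can be sketched end to end on a toy nucleotide alphabet. Everything here is a simplified illustration: the match/mismatch scoring stands in for a substitution matrix, the neighborhood is enumerated by brute force, the drop-off rule `x_drop` is an assumed stopping criterion for the ungapped extension, and all function names are hypothetical.

```python
from itertools import product

def score(a, b, match=2, mismatch=-1):
    """Ungapped score of two equal-length strings."""
    return sum(match if p == q else mismatch for p, q in zip(a, b))

def word_list(query, w, t, alphabet="acgt"):
    """Map each w-long word scoring >= t against some query word to its query offsets."""
    words = {}
    for qpos in range(len(query) - w + 1):
        qword = query[qpos:qpos + w]
        for cand in map("".join, product(alphabet, repeat=w)):
            if score(qword, cand) >= t:
                words.setdefault(cand, []).append(qpos)
    return words

def scan(db_seq, words, w):
    """Exact occurrences in the database sequence of any word from the list."""
    hits = []
    for dpos in range(len(db_seq) - w + 1):
        for qpos in words.get(db_seq[dpos:dpos + w], []):
            hits.append((qpos, dpos))
    return hits

def extend(query, db_seq, qpos, dpos, w, x_drop=3):
    """Ungapped extension in both directions; stop once the score is x_drop below the best."""
    best = cur = score(query[qpos:qpos + w], db_seq[dpos:dpos + w])
    i, j = qpos + w, dpos + w
    while i < len(query) and j < len(db_seq) and best - cur < x_drop:
        cur += score(query[i], db_seq[j])
        best = max(best, cur)
        i, j = i + 1, j + 1
    cur = best  # keep the best right-hand end, then grow leftward
    i, j = qpos - 1, dpos - 1
    while i >= 0 and j >= 0 and best - cur < x_drop:
        cur += score(query[i], db_seq[j])
        best = max(best, cur)
        i, j = i - 1, j - 1
    return best

query, db_seq, w = "ggacgtgg", "ttacgttt", 3
words = word_list(query, w, t=6)        # t = 6 admits exact words only
hits = scan(db_seq, words, w)
scores = [extend(query, db_seq, q, d, w) for q, d in hits]
```

Lowering `t` enlarges the word list, which is what allows hits that are not perfect matches.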

BLAST Algorithm 1

Query sequence of length L.

Maximum of L-w+1 words (typically w = 3 for proteins).

For each word from the query sequence, find the list of words with high score using a substitution matrix.

[Figure: query sequence and the resulting word list]

BLAST Algorithm 2

Exact matches of words from the word list to the database sequences.

[Figure: word list matched against the database sequences]

BLAST Algorithm 3

Maximal Segment Pairs (MSPs)

For each exact word match, alignment is extended in both directions to find high score segments.

A second viewpoint

BLAST - Basic Definitions

Scoring in the example: match +2, mismatch -1

S1 = a g c t g g t t t a
S2 = c t t g a t g g t a

• Given two sequences S1 and S2, a segment pair is a pair of equal length subsequences of S1 and S2, respectively, aligned without spaces.
• A locally maximal segment pair is a pair aligned without spaces (but possibly with mismatches) whose alignment score cannot be improved by extending it or shortening it.
• A maximal segment pair (MSP) in S1, S2 is a segment pair with the maximum score over all segment pairs in S1, S2.
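For short sequences the MSP can be computed exactly, which makes the definition concrete. The sketch below runs a maximum-sum-run scan along every diagonal; it is an illustration of the definition with a hypothetical function name, not how BLAST searches a database.

```python
def maximal_segment_pair(s1, s2, match=2, mismatch=-1):
    """Return (score, start1, start2, length) of a maximum-scoring ungapped segment pair."""
    best = (0, 0, 0, 0)
    for d in range(-(len(s2) - 1), len(s1)):   # diagonal d = i - j
        i, j = max(d, 0), max(-d, 0)
        run, run_i, run_j = 0, i, j
        while i < len(s1) and j < len(s2):
            if run <= 0:                       # a non-positive prefix never helps: restart here
                run, run_i, run_j = 0, i, j
            run += match if s1[i] == s2[j] else mismatch
            if run > best[0]:
                best = (run, run_i, run_j, i - run_i + 1)
            i, j = i + 1, j + 1
    return best

msp = maximal_segment_pair("agctggttta", "cttgatggta")
```

The work is proportional to the product of the two lengths, which is exactly the cost the word-based heuristic is designed to avoid.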

BLAST - The Algorithm

• Fix: word length w, thresholds t, C
• Seek segment pairs of length w & score ≥ t:
  – Compile, for each w-long subsequence of the query, the list of all w-long words with similarity score ≥ t to it.
  – Scan the database with a shifting w-long window: find every exact occurrence of a word in the list (linear in text length)
• Extend each such pair, test if contained within a segment pair of score ≥ C (local MSP)

Typical w values: 3-5 for amino acids, ~12 for nucleotides

Sensitivity-Speed Tradeoff

[Figure: sensitivity and speed as a function of word length, from long words (k = 15) to short words (k = 7); shorter words give higher sensitivity, longer words give higher speed]

Kent WJ, Genome Research 2002
CS262 Lecture 3, Win06, Batzoglou

Myers, Webb Miller, Warren Gish

BLAST statistics

• Theory of Karlin, Altschul, and Dembo on the distribution of the MSP score at random
• Define parameters K, λ (depending on AA distribution)
• Pr(finding a pair of score > S in comparing two random seqs of length m, n) = $1 - e^{-y}$, where $y = K m n e^{-\lambda S}$
• Extreme value distribution (or Gumbel distribution)
• Allows a calculated choice of the smallest C
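The formula above can be evaluated and inverted in a few lines. This is a sketch with hypothetical function names; the values of K and λ used in the example are placeholders for illustration, not parameters taken from the slides.

```python
import math

def msp_p_value(S, m, n, K, lam):
    """Pr(some segment pair scores > S) = 1 - exp(-y), with y = K*m*n*exp(-lam*S)."""
    y = K * m * n * math.exp(-lam * S)
    return -math.expm1(-y)   # numerically stable form of 1 - exp(-y)

def smallest_cutoff(m, n, K, lam, alpha):
    """Score C at which the p-value equals alpha (solve 1 - exp(-y) = alpha for S)."""
    y = -math.log1p(-alpha)
    return math.log(K * m * n / y) / lam

# Placeholder parameters, chosen only to exercise the functions.
K, lam = 0.13, 0.32
C = smallest_cutoff(m=250, n=1_000_000, K=K, lam=lam, alpha=0.01)
p = msp_p_value(C, m=250, n=1_000_000, K=K, lam=lam)
```

Because y grows with the product m·n, the same raw score becomes less significant as the database grows, so the cutoff C has to rise with it.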

Sam Karlin, Steve Altschul, Amir Dembo

Improvement: Gapped BLAST

Altschul et al. 97

• The original BLAST extends several HSPs and then attempts to combine them without gaps
• The new version allows gapped extensions for the best segments passing the two-hit condition
• Approximately one in 50 target sequences reaches the gapped extension step
• Using DP on a dynamically changing area (not a band)

The sensitivity of the two-hit and one-hit heuristics as a function of HSP score.

Gapped BLAST outline

• Find two nearby hits: two non-overlapping w-long words with:
  – score ≥ t, each
  – on same diagonal
  – within distance A
• Perform ungapped extension
• If score exceeds S, perform gapped extension
• Apply DP on a changing region: stop extension when score falls Xg below best score attained so far

Figure 2. The BLAST comparison of broad bean leghemoglobin I (87) (SWISS-PROT accession no. P02232) and horse [beta]-globin (88) (SWISS-PROT accession no. P02062). The 15 hits with score at least 13 are indicated by plus signs. An additional 22 non-overlapping hits with score at least 11 are indicated by dots. Of these 37 hits, only the two indicated pairs are on the same diagonal and within distance 40 of one another. Thus the two-hit heuristic with T = 11 triggers two extensions, in place of the 15 extensions invoked by the one-hit heuristic with T = 13. Because this is just one example, the relative numbers of hits and extensions at the various settings of T correspond only roughly to the ratios found in a full database search. An ungapped extension of the leftward of the two hit pairs yields an HSP with nominal score 45, or 23.6 bits, calculated using [lambda]u and Ku.
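The two-hit condition can be sketched as a grouping of hits by diagonal. This is an illustration only: measuring distance as the difference between the start positions of two hits, and pairing each hit with the closest earlier non-overlapping hit, are assumptions, and the function name is hypothetical.

```python
from collections import defaultdict

def two_hit_triggers(hits, w, A):
    """Pairs of non-overlapping word hits on the same diagonal within distance A.

    hits: iterable of (query_pos, db_pos) start positions of w-long word hits.
    """
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[d - q].append(q)
    triggers = []
    for diag, positions in by_diag.items():
        positions.sort()
        last = None
        for q in positions:
            if last is not None:
                gap = q - last
                if gap < w:
                    continue  # overlaps the previous hit: ignore it
                if gap <= A:
                    triggers.append(((last, last + diag), (q, q + diag)))
            last = q
    return triggers

triggers = two_hit_triggers([(0, 5), (10, 15), (3, 20), (60, 65)], w=3, A=40)
```

Only the triggered pairs go on to ungapped extension, which is why the two-hit heuristic can afford a lower word threshold than the one-hit version.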

Figure 3. A gapped extension generated by BLAST for the comparison of broad bean leghemoglobin I (87) and horse [beta]-globin (88). (a) The region of the path graph explored when seeded by the alignment of alanine residues at respective positions 60 and 62. This seed derives from the HSP generated by the leftward of the two ungapped extensions illustrated in Figure 2. The Xg dropoff parameter is the nominal score 40, used in conjunction with BLOSUM-62 substitution scores and a cost of 10 + k for gaps of length k. (b) The path corresponding to the optimal local alignment generated, superimposed on the hits described in Figure 2. The original BLAST program, using the one-hit heuristic with T = 11, is able to locate three of the five HSPs included in this alignment, but only the first and last achieve a score sufficient to be reported. (c) The optimal local alignment, with nominal score 75 and normalized score 32.4 bits. In the context of a search of SWISS-PROT (26), release 34 (21 219 450 residues), using the leghemoglobin sequence (143 residues) as query, the E-value is 0.54 if no edge-effect correction (22) is invoked. The original BLAST program locates the first and last ungapped segments of this alignment. Using sum-statistics with no edge-effect correction, this combined result has an E-value of 31 (21,22). On the central lines of the alignment, identities are echoed and substitutions to which the BLOSUM-62 matrix (18) gives a positive score are indicated by a `+'

Figure 4. The path graph region explored by BLAST during a gapped extension for the comparison of broad bean leghemoglobin I and the E1B protein small T-antigen from human adenovirus type 4 (89) (SWISS-PROT accession no. P10406). The Xg dropoff parameter is the nominal score 40, used in conjunction with BLOSUM-62 substitution scores and 10 + k gap costs. The 22.7 bit HSP that triggers this extension, involving leghemoglobin residues 119-140 and adenovirus residues 101-122, is merely a random similarity, and not part of a larger and higher-scoring alignment. The gapped extension is seeded by the alignment of residues 124 and 106. The optimal alignment score through points in the path graph drops steadily as one moves beyond the triggering HSP, and the reverse extension terminates before the beginning of either protein is reached. A total of 2766 path graph cells are explored, with the reverse extension accounting for 2047 of these cells.

Relative times spent by the original and gapped BLAST programs on various algorithmic stages

                 Overhead: database       Calculating whether hits     Ungapped      Gapped
                 scanning, output, etc.   qualify for ungapped ext.    extensions    extensions
Original BLAST   8 (8%)                   -                            92 (92%)      -
Gapped BLAST     8 (24%)                  12 (37%)                     5 (15%)       8 (24%)

Speed: ~3 times faster than the original BLAST

Psi-BLAST team

Thomas Madden, David Lipman, Alex Schaeffer, Steve Altschul

PSI - BLAST

1. Execute BLAST
2. Compile a PSSM (position specific score matrix) from the resulting hits
3. Execute BLAST with the new profile
4. Iterate to convergence
5. Idea: converge on a family

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html
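Step 2 of the list above can be illustrated with a bare-bones log-odds PSSM. This is a simplified sketch, not the PSI-BLAST procedure: it assumes gap-free hits of equal length, uses a flat pseudocount, applies no sequence weighting, and all names are hypothetical.

```python
import math
from collections import Counter

def build_pssm(aligned_hits, background, pseudocount=1.0):
    """Log-odds position-specific score matrix from equal-length, gap-free aligned hits."""
    alphabet = sorted(background)
    total = len(aligned_hits) + pseudocount * len(alphabet)
    pssm = []
    for col in range(len(aligned_hits[0])):
        counts = Counter(seq[col] for seq in aligned_hits)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm

def score_with_pssm(pssm, segment):
    """Score a segment against the profile, one column per position."""
    return sum(column[ch] for column, ch in zip(pssm, segment))

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = build_pssm(["ACG", "ACG", "ATG"], background)
```

Unlike a single substitution matrix, the profile scores the same letter differently at different positions, which is what lets the iteration converge on a family.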

Substitution Matrices

• Used to score alignments.
• Reflect evolution of sequences.

Unitary Matrix:

$M_{ij} = 1$ if $i = j$, $0$ otherwise

Genetic Code Matrix:

$M_{ij}$ = min no. of base changes needed to alter codon of i to codon of j.

Scoring Matrices

• Probability theory implies that more similar pairs of sequences will require different matrices than more divergent pairs.
• Several families of matrices were constructed, to be used according to the level of divergence:
  – Probabilistic/evolutionary (global) approach: PAM.
  – Functional approach (local): BLOSUM.
• Higher numbered PAM and lower numbered BLOSUM for more divergent sequences

PAM Matrices (Dayhoff et al., 78)

• PAM = Percent (or Point) Accepted Mutation
• Measuring unit of evolutionary distance of proteins.
• Substitution matrix for comparing proteins that distance apart.
• Protein sequences S1, S2 are at evolutionary distance of one PAM if S1 has converted to S2 with an average of one accepted mutation per 100 AAs:
  – PAM1 should be used for sequences whose evolutionary distance causes 1% difference (Percent Accepted Mutation) between them.
  – PAM2 should be used for sequences twice as distant.
  – …

PAM Matrices (2)

Generating PAM:

[Figure: example evolutionary tree with leaf sequences ABCD, AGCF, ADIJ, CBIJ, inferred ancestral sequences, and the exchanges marked on its edges]

• Start with aligned sequences, highly similar, with known evolutionary trees.
• Collect statistics on exchanges
• Compute matrix $M_{ij}$ = “prob.”(j changes to i in one unit)
• Now $M^k$ gives change probs. in k units.
• “Log odds”: $\log \frac{f(j)\, M^k(i,j)}{f(i)\, f(j)} = \log \frac{M^k(i,j)}{f(i)}$

Properties and caveats

• Markovian model: state at time n depends only on state at time n-1
• Same model for all AA positions
• Multiple mutations can - and will - occur at same point.
• We count only accepted (recorded) mutations.
• Assumes constant molecular clock.
• Ignores indels.
• k PAM difference ≠ k % difference !!!
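The last caveat can be seen numerically by raising a one-unit mutation matrix to the k-th power. The sketch below uses a toy model that is entirely an assumption (20 equally frequent letters, every substitution equally likely, 1% change per unit), so its numbers will not reproduce Dayhoff's; the base-10 logarithm in `log_odds` is likewise an assumed choice, and the function names are hypothetical.

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(M, k):
    """M^k by repeated squaring: change probabilities after k units."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    base = M
    while k:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result

def toy_pam1(n=20, rate=0.01):
    """Toy one-unit matrix: each letter changes with probability `rate`, uniformly."""
    off = rate / (n - 1)
    return [[1 - rate if i == j else off for j in range(n)] for i in range(n)]

def observed_percent_difference(Mk, freqs):
    """Expected % of positions that differ after k units."""
    same = sum(freqs[j] * Mk[j][j] for j in range(len(freqs)))
    return 100.0 * (1.0 - same)

def log_odds(Mk, freqs, i, j):
    """log(M^k(i, j) / f(i)), the score for aligning i with j."""
    return math.log10(Mk[i][j] / freqs[i])

freqs = [1 / 20] * 20
d1 = observed_percent_difference(mat_pow(toy_pam1(), 1), freqs)
d100 = observed_percent_difference(mat_pow(toy_pam1(), 100), freqs)
```

Even in this toy model, 100 units of change leave well under 100% observed difference, because repeated and reverse mutations at the same position are not visible.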

Observed % difference    Evolutionary distance in PAMs
         1                          1
         5                          5
        10                         11
        15                         17
        20                         23
        30                         38
        40                         56
        50                         80
        55                         94
        60                        112
        70                        159
        75                        195
        80                        246
        85                        328

Dayhoff’s Data

• 71 manually curated evolutionary trees (34 superfamilies)
• Sequences within a tree were <15% different
• 1,572 substitutions overall

Margaret Dayhoff (1925-1983)

A pioneer in the use of computers in chemistry and biology, beginning with her PhD thesis project in 1948. Her work was multi-disciplinary, and used her knowledge of chemistry, mathematics, biology and computer science to develop an entirely new field. She is credited today as one of the founders of the field of Bioinformatics. Dr. Dayhoff was the first woman in the field of Bioinformatics. She was also the first woman to hold office in the Biophysical Society, serving first as Secretary and later as President.

BLOSUM (Henikoff & Henikoff, 92)

• PAM: based on highly similar global alignments
• BLOSUM (BLOcks SUbstitution Matrix): based on short, gapless local alignments
  – Identify blocks: conserved segments in alignment of proteins from the same family.
  – Eliminate sequences that are >x% identical (by deletion/clustering)
  – Collect stats on pairs in each column
  – $q_{ij}$ = prob. of AA pair (Ai, Aj) in same column
  – $p_i$ = prob. of observing Ai
  – $e_{ij}$ = freq. of pair (Ai, Aj) assuming independence = $p_i^2$ if $i = j$, $2 p_i p_j$ if $i \ne j$
  – Odds matrix: $q_{ij} / e_{ij}$. Log odds: $s_{ij} = \log (q_{ij} / e_{ij})$
  – BLOSUM X matrix: $2 s_{ij}$ discretized
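The counting steps above can be sketched on a toy block. This is an illustration only: the clustering step is skipped, the logarithm is taken in base 2 (an assumption, since the slide writes only "log"), "discretized" is read as rounding to the nearest integer, and the function name is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_scores(blocks):
    """Rounded 2*log2(q_ij / e_ij) for every residue pair seen in the blocks."""
    pair_counts = Counter()
    for block in blocks:
        for column in zip(*block):                    # one alignment column at a time
            for a, b in combinations(column, 2):      # all pairs within the column
                pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: c / total for pair, c in pair_counts.items()}
    p = Counter()                                     # p_i = q_ii + sum over j != i of q_ij / 2
    for (a, b), qab in q.items():
        if a == b:
            p[a] += qab
        else:
            p[a] += qab / 2
            p[b] += qab / 2
    scores = {}
    for (a, b), qab in q.items():
        e = p[a] ** 2 if a == b else 2 * p[a] * p[b]  # expected frequency under independence
        scores[(a, b)] = round(2 * math.log2(qab / e))
    return scores

scores = blosum_scores([["AC", "AC", "AG"]])
```

A positive entry means the pair is seen together in a column more often than independence would predict; a real matrix needs far more blocks than this toy input for the signs to be meaningful.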

Blosum62

Comparing matrices

PAM vs BLOSUM in different algorithms

Steven & Jorja Henikoff

One recipe for selecting a matrix

• Compared sequences are related: 200 PAM or 250 PAM
• Database scanning: 120 PAM
• Local alignment search: 40 PAM, 120 PAM, 250 PAM
• Detection of related sequences using BLAST: BLOSUM 62

Low PAM: short segments, high similarity
High PAM: long segments, low similarity

THERE IS NO “ONE SIZE FITS ALL” MATRIX!