Lecture 3 Sequence Alignment Heuristics Substitution Matrices

Sequence Alignment Heuristics Some slides from: • Iosif Vaisman, GMU mason.gmu.edu/~mmasso/binf630alignment.ppt • Serafim Batzoglu, Stanford http://ai.stanford.edu/~serafim/ • Geoffrey J. Barton, Oxford “Protein Sequence Alignment and Database Scanning” http://www.compbio.dundee.ac.uk/ftp/preprints/review93/review93.pdf CG © Ron Shamir, 09 Why Heuristics ? • Motivation: – Dynamic programming guarantees an optimal solution & is efficient, but – Not fast enough when searching a database of size ~1012, with a query of length 200-500bp • Solutions: – Implement on hardware. (COMPUGEN) – Parallel hardware. (MASSPAR) – Ad-hoc implementations using specific hardware. – Use faster heuristic algorithms. • Common Heuristics: FASTA, BLAST CG © Ron Shamir, 09 GenBank Growth CG © Ron Shamir, 09 www.ncbi.nlm.nih.gov/Genbank/benbanstats.html GenBank Growth CG © Ron Shamir, 09 http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html October 15 2009: 108,560,236,506 bases CG © Ron Shamir, 09 Key observations • Even O(m+n) time would be problematic when db size is huge • Substitutions are much more likely than indels • Homologous sequences contain many matches • Numerous queries are run on the same db Preprocessing of the db is desirable CG © Ron Shamir, 09 Indexing-based local alignment Dictionary: …… All words of length k (~10) query Alignment initiated between words of alignment score T …… Alignment: Ungapped extensions until score scan below statistical threshold DB Output: All local alignments with score > statistical threshold query CS262 Lecture 3, Win06, Batzoglou Detour: Banded Alignment Assume we know that x and y are very similar Assumption: # gaps(x, y) < k(N) ( say N>M ) xi Then, | implies | i – j | < k(N) yj We can align x and y more efficiently: Time, Space: O(N k(N)) << O(N2) CS262 Lecture 2, Win06, Batzoglou Banded Alignment Initialization: x1 ………………………… xM F(i,0), F(0,j) undefined for i, j > k 1 Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ s(xi, yj) F(i, j) = max F(i, j – 1) – d, if j > i – k(N) ………………………… y F(i – 1, j) – d, if j < i + k(N) N y k(N) Termination: same Easy to extend to the affine gap case CS262 Lecture 2, Win06, Batzoglou Alignment Dot-Plot Matrix a a g t c c c g t g a * * g * * * g * * t * * c * * * c * * * g * * * t * * t * * CG © Ron Shamir,c 09 * * * Dot plots Example 1: close protein homologs (man and mouse) CG © Ron Shamir, 09 www.bioinfo.rpi.edu/~zukerm/Bio-5495/ Example 2: remote protein homologs (man and bacilus) CG © Ron Shamir, 09 Example 1: dot for 4+ matches in window of 5 CG © Ron Shamir, 09 Example 2: dot for 4+ matches in window of 5 CG © Ron Shamir, 09 FASTA : A Heuristic Method for Sequence Comparison • History: Lipman and Pearson in 1985, 1988 • Key idea: Good local alignment must have exact matching subsequences. • Algorithm Evaluation: – Resulting alignment scores well compared to the optimal alignment (shown experimentally) – Much faster than dynamic programming. CG © Ron Shamir, 09 Disclaimer • Highly popular software tools get numerous updates, revisions, versions, variants etc. • Implementation details differ considerably among versions. • It is hard to single out one ultimate version. • We present the basic ideas and details may vary. CG © Ron Shamir, 09 a a g t c c t g a t t t g c c c a g g t * * * * * g * * * g * * * t * * * * “hot* spots” c * * * * * a * * * * * * a * * * * * * g * * * * a * * * * * * t * * * * * t * * * * * c * * * * * c * * * * * a * * * * * * t * * * * * c * * * * * a * * * * * * g * * * g * * * * CG © Ron Shamir, 09 FASTA overview ktup = required min length of perfect match 1. Find hot spots = matches of length ktup 2. Find 10 best diagonal runs = almost consecutive hot spots on same diagonal. Best soln = init1 2.1 Find an optimal sub-alignment in each diagonal 3. Combine close sub-alignments. best soln = initn 4. Compute best DP solution in a band around initn. result = opt CG © Ron Shamir, 09 FASTA – Step 1 Sequence B Find hot spots: (runs of matches of length ktup) Sequence A Sequence CG © Ron Shamir, 09 FASTA – Step 2 Sequence B 2 Rescoring using a subs. matrix high score low score Sequence A Sequence The score of the highest scoring initial region is saved as the init1 score. CG © Ron Shamir, 09 FASTA – Step 3 Sequence B 3 Joining threshold - eliminates disjointed segments Non-overlapping regions are joined. The score equals sum Sequence A Sequence of the scores of the regions minus a gap penalty. The score of the highest scoring region, at the end of this step, is saved as the initn score. CG © Ron Shamir, 09 FASTA – Step 4 Sequence B 4 Alignment optimization using dynamic programming Sequence A Sequence The score for this alignment is the opt score. CG © Ron Shamir, 09 A second viewpoint FASTA Algorithm (1) •ktup size: 4-6 for DNA,1-2 for 1. Look for hot spots : Common AA. subsequences of length ktup. Use lookup table/hash for efficiency a a gg t c c c g t g 2. Finding 10 best diagonal runs and init1: a * * For hits on the same diagonal, diff g ** * * between location in S,T is constant g ** * Scoring the diagonal runs - sum over t * * scores for hot spots and the inter- spots: Hot spots - positive score, c * * * Space - negative, decreases w distance c * * * Choose 10 highest diagonal runs. g ** * * 2.1 find best sub-alignment in each, t * * using a substitution matrix t * * Let init1 be the best scoring run. c * * * CG © Ron Shamir, 09 FASTA Algorithm (2) 2. Find 10 best diagonal runs and init1 3. Allowing indels – combine close diagonal runs: Construct an alignment graph: •nodes =sub-alignments (SAs) • weight – alignment score (from 1) •Edges btw SAs that can fit together, •weight - negative, depends on the size of the corresponding gap Find a maximum weight path in it, initn Alignment graph CG © Ron Shamir, 09 FASTA Algorithm (3) 16/32 2. Find 10 best diagonal runs and init1 diagonals 3. combine close diagonal runs around init1 4. Compute opt (depending on ktup) Use DP to compute the best local alignment within a narrow band around init1. let opt be that alignment. Rank the database sequences according to initn or opt scores CG © Ron Shamir, 09 Fasta Algorithm principle Pearson Lipman 88 CG © Ron Shamir, 09 FASTA Output • The information on each hit includes: – General information and statistics – SW score, %identity and length of overlap CG © Ron Shamir, 09 Statistical significance • Key question: how significant is the score x that was obtained? • Scores are not normally distributed • Solution 1: view the scores distribution over all database entries, see how far out x is. CG © Ron Shamir, 09 Output of Fasta 2 CG © Ron Shamir, 09 -------------------------------------------------------------------------------- Distribution of initial scores with ktup=2. | v initn init1 < 2 2 2:= 4 0 0: 6 4 4:== 8 18 18:========= 10 73 73:===================================== 12 326 326:================================================== 14 370 370:================================================== 16 1248 1248:================================================== 18 1354 1354:================================================== 20 2746 2746:================================================== 22 3151 3151:================================================== 24 6401 6401:================================================== 26 5110 5110:================================================== 28 5724 5763:================================================== 30 3887 4303:================================================== 32 2238 2682:================================================== 34 1401 1735:================================================== 36 863 1144:================================================== 38 533 690:================================================== 40 346 411:================================================== 42 481 250:================================================== 44 419 166:================================================== 46 346 110:================================================== 48 295 84:------------------------------------------++++++++ 50 203 47:------------------------++++++++++++++++++++++++++ 52 184 46:-----------------------+++++++++++++++++++++++++++ 54 110 39:--------------------++++++++++++++++++++++++++++++ 56 82 9:-----++++++++++++++++++++++++++++++++++++ 58 69 8:----+++++++++++++++++++++++++++++++ 60 71 3:--++++++++++++++++++++++++++++++++++ 62 75 1:-+++++++++++++++++++++++++++++++++++++ 64 36 1:-+++++++++++++++++ 66 31 0:++++++++++++++++ 68 17 2:-++++++++ 70 12 0:++++++ 72 12 0:++++++ 74 6 0:+++ 76 10 1:-++++ 78 28 0:++++++++++++++ 80 2 0:+ > 80 19 5:---+++++++ 13464008 residues in 38303 sequences statistics exclude scores greater than 73 http://bimas.dcrt.nih.gov/fastainfo/fastaexample.html mean initn score: 26.8 (7.79) mean init1 score: 26.0 (6.05) 5349 scores better than 33 saved, ktup: 2, variable pamfact joining threshold: 28 scan time: 0:00:30 CG-------------------------------------------------------------------------------- © Ron Shamir, 09 The best scores are: initn init1 opt sp|P33013|YEEC_ECOLI HYPOTHETICAL 38.9 KD PROTEIN IN S 1422 1422 1422 Statistical significance (2) • Key question: how significant is the score x that was obtained? • Solution 2: - average score of random sequence; - standard dev. • Z-score: z = (x- ) / • Rule of thumb: z > 3 possibly significant, z>6 probably significant, z>10 significant • Issues: sensitivity vs selectivity. • Pertinence to biology is the bottom line CG © Ron Shamir, 09 August 1997: NCBI Director David Lipman (far left) coaches Vice President Gore (seated)

Lecture 3 Sequence Alignment Heuristics Substitution Matrices

(12) Patent Application Publication (10) Pub. No.: US 2003/0211987 A1 Labat Et Al

UNIVERSITY of CALIFORNIA, SAN DIEGO Use Solid K-Mers In

Developing Bioinformatics Computer Skills.Pdf

The Scientist :: Blast, Aug. 29, 2005 09/18/2005 04:42 PM

Msc THESIS Genetic Sequence Alignment on a Supercomputing Platform

Computation Resources for Molecular Biology: a Special Issue

Coding Sequences: a History of Sequence Comparison Algorithms As a Scientiªc Instrument

STUDY of the RELATIONSHIP BETWEEN Mus Musculus PROTEIN SEQUENCES and THEIR BIOLOGICAL FUNCTIONS a Thesis Presented to the Gradua

K-Mulus: a Database-Clustering Approach to Protein BLAST in the Cloud

In Its Most Basic Form a Sequence Alignment Is Simply Comparing Two Or More Sequences by Searching for Character Patterns and Other Similarities

Thesis by Submitted in Partial Fulfillment of the Requirements For

Scalable Parallel Algorithms for Genome Analysis by Evangelos Georganas a Dissertation Submitted in Partial Satisfaction Of