
Sequence Alignment Heuristics

Some slides from:
• Iosif Vaisman, GMU mason.gmu.edu/~mmasso/binf630alignment.ppt
• Serafim Batzoglou, Stanford http://ai.stanford.edu/~serafim/
• Geoffrey J. Barton, Oxford “Protein Sequence Alignment and Database Scanning” http://www.compbio.dundee.ac.uk/ftp/preprints/review93/review93.pdf

CG © Ron Shamir, 09

Why Heuristics?
• Motivation:
– Dynamic programming guarantees an optimal solution and is efficient, but
– it is not fast enough when searching a database of size ~$10^{12}$ with a query of length 200-500 bp.
• Solutions:
– Implement on hardware (COMPUGEN).
– Parallel hardware (MASSPAR).
– Ad-hoc implementations using specific hardware.
– Use faster heuristic algorithms.
• Common heuristics: FASTA, BLAST

GenBank Growth
[Figure: GenBank growth chart] www.ncbi.nlm.nih.gov/Genbank/benbanstats.html

GenBank Growth
[Figure: GenBank growth chart] http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
October 15 2009: 108,560,236,506 bases

Key observations
• Even $O(m+n)$ time would be problematic when the db size is huge
• Substitutions are much more likely than indels
• Homologous sequences contain many matches
• Numerous queries are run on the same db
Hence, preprocessing of the db is desirable.

Indexing-based local alignment
• Dictionary: all words of length k (~10)
• Alignment initiated between words of alignment score T
• Alignment: ungapped extensions until the score falls below a statistical threshold
• Output: all local alignments with score > statistical threshold
[Figure: query words matched against the DB during a scan]
CS262 Lecture 3, Win06, Batzoglou

Detour: Banded Alignment
Assume we know that x and y are very similar.
Assumption: # gaps(x, y) < k(N) (say N > M).
Then $x_i$ aligned to $y_j$ implies $|i - j| < k(N)$.
We can align x and y more efficiently:
Time, Space: $O(N \cdot k(N)) \ll O(N^2)$
CS262 Lecture 2, Win06, Batzoglou

Banded Alignment
Initialization: F(i, 0), F(0, j) undefined for i, j > k
Iteration:
For i = 1…M
For j = max(1, i − k)…min(N, i + k)
\[
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i, j-1) - d, & \text{if } j > i - k(N) \\
F(i-1, j) - d, & \text{if } j < i + k(N)
\end{cases}
\]
Termination: same
Easy to extend to the affine gap case.
[Figure: band of width k(N) around the main diagonal of the DP matrix, $x_1 \dots x_M$ against $y_1 \dots y_N$]

Alignment Dot-Plot Matrix
[Figure: dot-plot matrix of the sequences aagtcccgtg (columns) and aggtccgttc (rows); a * marks matching positions]

Dot plots
Example 1: close protein homologs (man and mouse) [Figure: dot plot] www.bioinfo.rpi.edu/~zukerm/Bio-5495/
Example 2: remote protein homologs (man and bacillus) [Figure: dot plot]
Example 1: dot for 4+ matches in window of 5 [Figure: dot plot]
Example 2: dot for 4+ matches in window of 5 [Figure: dot plot]

FASTA: A Heuristic Method for Sequence Comparison
• History: Lipman and Pearson in 1985, 1988
• Key idea: a good local alignment must have exact matching subsequences.
• Algorithm evaluation:
– The resulting alignment scores well compared to the optimal alignment (shown experimentally).
– Much faster than dynamic programming.

Disclaimer
• Highly popular software tools get numerous updates, revisions, versions, variants etc.
• Implementation details differ considerably among versions.
• It is hard to single out one ultimate version.
• We present the basic ideas; details may vary.

[Figure: dot plot of two DNA sequences with the “hot spots” marked]

FASTA overview
ktup = required min length of perfect match
1. Find hot spots = matches of length ktup
2. Find the 10 best diagonal runs = almost consecutive hot spots on the same diagonal. Best solution = init1
2.1 Find an optimal sub-alignment in each diagonal
3. Combine close sub-alignments. Best solution = initn
4. Compute the best DP solution in a band around initn. Result = opt

FASTA – Step 1
Find hot spots (runs of matches of length ktup).
[Figure: Sequence A against Sequence B]

FASTA – Step 2
Rescoring using a substitution matrix.
[Figure: Sequence A against Sequence B, regions marked high score / low score]
The score of the highest scoring initial region is saved as the init1 score.

FASTA – Step 3
Joining threshold: eliminates disjointed segments.
Non-overlapping regions are joined. The score equals the sum of the scores of the regions minus a gap penalty.
The score of the highest scoring region at the end of this step is saved as the initn score.
[Figure: Sequence A against Sequence B]

FASTA – Step 4
Alignment optimization using dynamic programming.
[Figure: Sequence A against Sequence B]
The score for this alignment is the opt score.

A second viewpoint

FASTA Algorithm (1)
1. Look for hot spots: common subsequences of length ktup.
• ktup size: 4-6 for DNA, 1-2 for AA.
• Use a lookup table/hash for efficiency.
2. Find the 10 best diagonal runs and init1:
• For hits on the same diagonal, the difference between the locations in S and T is constant.
• Scoring the diagonal runs: sum over the scores of the hot spots and the inter-spot spaces. Hot spots: positive score. Space: negative, decreases with distance.
• Choose the 10 highest scoring diagonal runs.
2.1 Find the best sub-alignment in each, using a substitution matrix.
Let init1 be the best scoring run.
[Figure: dot-plot matrix with diagonal runs]

FASTA Algorithm (2)
2. Find 10 best diagonal runs and init1
3. Allowing indels: combine close diagonal runs.
Construct an alignment graph:
• Nodes = sub-alignments (SAs); weight = alignment score (from 1)
• Edges between SAs that can fit together; weight is negative and depends on the size of the corresponding gap
Find a maximum weight path in it, initn.
[Figure: alignment graph]

FASTA Algorithm (3)
2. Find 10 best diagonal runs and init1
3. Combine close diagonal runs around init1
4. Compute opt: use DP to compute the best local alignment within a narrow band (16/32 diagonals, depending on ktup) around init1. Let opt be that alignment.
Rank the database sequences according to initn or opt scores.

[Figure: Fasta algorithm principle, Pearson Lipman 88]

FASTA Output
• The information on each hit includes:
– General information and statistics
– SW score, %identity and length of overlap

Statistical significance
• Key question: how significant is the score x that was obtained?
• Scores are not normally distributed
• Solution 1: view the distribution of scores over all database entries and see how far out x is.

Output of Fasta 2
Distribution of initial scores with ktup=2.
score initn init1
< 2 2 2:=
4 0 0:
6 4 4:==
8 18 18:=========
10 73 73:=====================================
12 326 326:==================================================
14 370 370:==================================================
16 1248 1248:==================================================
18 1354 1354:==================================================
20 2746 2746:==================================================
22 3151 3151:==================================================
24 6401 6401:==================================================
26 5110 5110:==================================================
28 5724 5763:==================================================
30 3887 4303:==================================================
32 2238 2682:==================================================
34 1401 1735:==================================================
36 863 1144:==================================================
38 533 690:==================================================
40 346 411:==================================================
42 481 250:==================================================
44 419 166:==================================================
46 346 110:==================================================
48 295 84:------------------------------------------++++++++
50 203 47:------------------------++++++++++++++++++++++++++
52 184 46:-----------------------+++++++++++++++++++++++++++
54 110 39:--------------------++++++++++++++++++++++++++++++
56 82 9:-----++++++++++++++++++++++++++++++++++++
58 69 8:----+++++++++++++++++++++++++++++++
60 71 3:--++++++++++++++++++++++++++++++++++
62 75 1:-+++++++++++++++++++++++++++++++++++++
64 36 1:-+++++++++++++++++
66 31 0:++++++++++++++++
68 17 2:-++++++++
70 12 0:++++++
72 12 0:++++++
74 6 0:+++
76 10 1:-++++
78 28 0:++++++++++++++
80 2 0:+
> 80 19 5:---+++++++
13464008 residues in 38303 sequences
statistics exclude scores greater than 73
mean initn score: 26.8 (7.79)
mean init1 score: 26.0 (6.05)
5349 scores better than 33 saved, ktup: 2, variable pamfact
joining threshold: 28 scan time: 0:00:30
The best scores are: initn init1 opt
sp|P33013|YEEC_ECOLI HYPOTHETICAL 38.9 KD PROTEIN IN S 1422 1422 1422
http://bimas.dcrt.nih.gov/fastainfo/fastaexample.html

Statistical significance (2)
• Key question: how significant is the score x that was obtained?
• Solution 2: $\mu$ = average score of random sequence; $\sigma$ = standard dev.
• Z-score: $z = (x - \mu) / \sigma$
• Rule of thumb: z > 3 possibly significant, z > 6 probably significant, z > 10 significant
• Issues: sensitivity vs selectivity.
• Pertinence to biology is the bottom line

August 1997: NCBI Director David Lipman (far left) coaches Vice President Gore (seated)
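
The banded alignment recurrence from the "Banded Alignment" slide can be sketched in Python as follows. This is a minimal illustration, not the slides' implementation: the function name `banded_global_score`, the simple match/mismatch scoring that stands in for s(x_i, y_j), and the dictionary used to store only the in-band cells are all assumptions made for this example. Cells outside the band are treated as minus infinity, which plays the role of "undefined" in the slide.

```python
NEG_INF = float("-inf")


def banded_global_score(x, y, k, match=1, mismatch=-1, d=2):
    """Global alignment score restricted to the band |i - j| <= k.

    Hypothetical helper: match/mismatch stand in for s(x_i, y_j),
    and d is the linear gap penalty from the recurrence.
    """
    M, N = len(x), len(y)
    if abs(N - M) > k:
        raise ValueError("band too narrow to reach cell (M, N)")

    # Only cells inside the band are stored, so space is O(N * k).
    F = {(0, 0): 0}
    for i in range(1, min(M, k) + 1):
        F[(i, 0)] = -d * i
    for j in range(1, min(N, k) + 1):
        F[(0, j)] = -d * j

    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[(i, j)] = max(
                F.get((i - 1, j - 1), NEG_INF) + s,  # diagonal move
                F.get((i, j - 1), NEG_INF) - d,      # in band only if j > i - k
                F.get((i - 1, j), NEG_INF) - d,      # in band only if j < i + k
            )
    return F[(M, N)]
```

With a band wide enough to cover the whole matrix, this reduces to ordinary global alignment, which gives a simple sanity check: `banded_global_score("ACGT", "AGT", 1)` and `banded_global_score("ACGT", "AGT", 4)` should agree.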
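
Steps 1 and 2 of the FASTA overview (hot spots found through a lookup table, then grouped by diagonal) can be sketched as below. This is an illustrative sketch under stated assumptions, not FASTA's actual code: the function names are hypothetical, and ranking diagonals by their number of hot spots is a crude stand-in for the real scoring of diagonal runs, which also penalizes the space between hot spots.

```python
from collections import defaultdict


def build_ktup_index(seq, ktup):
    """Lookup table: each word of length ktup -> list of start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def find_hot_spots(query, target, ktup):
    """Return all (i, j) such that query[i:i+ktup] == target[j:j+ktup]."""
    index = build_ktup_index(query, ktup)
    hot_spots = []
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            hot_spots.append((i, j))
    return hot_spots


def group_by_diagonal(hot_spots):
    """Hot spots on the same diagonal share the same offset i - j."""
    diagonals = defaultdict(list)
    for i, j in hot_spots:
        diagonals[i - j].append((i, j))
    return diagonals


def best_diagonals(diagonals, top=10):
    """Assumed simplification: rank diagonals by hot-spot count only."""
    return sorted(diagonals, key=lambda diag: len(diagonals[diag]), reverse=True)[:top]
```

For the two sequences of the dot-plot slide, `find_hot_spots("aagtcccgtg", "aggtccgttc", 2)` followed by `group_by_diagonal` collects the hot spots per diagonal. Building the index once per query and scanning the target keeps the work proportional to the sequence lengths plus the number of hits.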
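
The Z-score and the rule of thumb from the "Statistical significance (2)" slide can be sketched as follows. The function names are hypothetical, and two things are assumptions of this example rather than statements from the slides: the label returned for z at or below 3, and the use of the population standard deviation when estimating sigma from a sample of random-sequence scores.

```python
from statistics import mean, pstdev


def z_score(x, mu, sigma):
    """z = (x - mu) / sigma, with mu and sigma taken from random-sequence scores."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


def z_score_from_sample(x, random_scores):
    """Estimate mu and sigma from a list of scores of random sequences."""
    return z_score(x, mean(random_scores), pstdev(random_scores))


def rule_of_thumb(z):
    """Thresholds from the slide; the label for z <= 3 is an assumption."""
    if z > 10:
        return "significant"
    if z > 6:
        return "probably significant"
    if z > 3:
        return "possibly significant"
    return "below the rule-of-thumb range"
```

As a usage example, taking the init1 figures from the FASTA output shown earlier and assuming the number in parentheses is the standard deviation, `z_score(1422, 26.0, 6.05)` is far above 10, so `rule_of_thumb` labels the top hit as significant.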