Lecture 3 Sequence Alignment Heuristics Substitution Matrices
Sequence Alignment Heuristics
Some slides from:
• Iosif Vaisman, GMU mason.gmu.edu/~mmasso/binf630alignment.ppt
• Serafim Batzoglou, Stanford http://ai.stanford.edu/~serafim/
• Geoffrey J. Barton, Oxford, “Protein Sequence Alignment and Database Scanning” http://www.compbio.dundee.ac.uk/ftp/preprints/review93/review93.pdf
© CG 2015

Why Heuristics?
• Motivation:
– Dynamic programming guarantees an optimal solution and is efficient, but
– it is not fast enough when searching a database of size ~10^12 with a query of length 200-500 bp

GenBank Growth
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Possible Solutions
• Solutions:
– Implement on hardware. (COMPUGEN)
– Parallel hardware. (MASSPAR)
– Ad-hoc implementations using specific hardware.
– Use faster heuristic algorithms.
   • Limit the number of allowed indels.
   • Look for “long” matching subsequences.
   • Use indexing/hashing.
• Common Heuristics: FASTA, BLAST
CG © Ron Shamir, 09

Key observations
• Even O(m+n) time would be problematic when db size is huge
• Substitutions are much more likely than indels
• Homologous sequences contain many matches
• Numerous queries are run on the same db
Hence, preprocessing of the db is desirable.

Indexing-based local alignment
Dictionary: All words of length k (~10)
Alignment initiated between words of alignment score at least T
Alignment: Ungapped extensions until the score falls below the statistical threshold
Output: All local alignments with score > statistical threshold
CS262 Lecture 3, Win06, Batzoglou

Detour: Banded Alignment
Assume we know that x and y are very similar.
Assumption: # gaps(x, y) < k(N) (say N > M)
Then x_i aligned with y_j implies |i – j| < k(N)
We can align x and y more efficiently:
Time, Space: O(N × k(N)) << O(N^2)
CS262 Lecture 2, Win06, Batzoglou

Banded Alignment
Initialization: F(i,0), F(0,j) undefined for i, j > k
Iteration:
For i = 1…M
   For j = max(1, i – k)…min(N, i + k)

$$
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i, j-1) - d, & \text{if } j > i - k(N) \\
F(i-1, j) - d, & \text{if } j < i + k(N)
\end{cases}
$$

Termination: same
Easy to extend to the affine gap case

Alignment Dot-Plot Matrix
    a a g t c c c g t g
  a * *
  g     *         *   *
  g     *         *   *
  t       *         *
  c         * * *
  c         * * *
  g     *         *   *
  t       *         *
  t       *         *
  c         * * *

Dot plots
Example 1: close protein homologs (man and mouse) www.bioinfo.rpi.edu/~zukerm/Bio-5495/
Example 2: remote protein homologs (man and bacillus)
Example 1: dot for 4+ matches in window of 5
Example 2: dot for 4+ matches in window of 5

FASTA: A Heuristic Method for Sequence Comparison
• History: Lipman and Pearson in 1985, 1988
• Key idea: Good local alignment must have exact matching subsequences.
• Algorithm Evaluation:
– Resulting alignment scores well compared to the optimal alignment (shown experimentally)
– Much faster than dynamic programming.

Disclaimer
• Highly popular software tools get numerous updates, revisions, versions, variants etc.
• Implementation details differ considerably among versions.
• It is hard to single out one ultimate version.
• We present the basic ideas; details may vary.

[Figure: dot-plot matrix of two sequences with “hot spots” marked]

FASTA overview
ktup = required min length of perfect match
1. Find hot spots = matches of length ktup
2. Find 10 best diagonal runs = almost consecutive hot spots on same diagonal. Best soln = init1
2.1 Find an optimal sub-alignment in each diagonal
3. Combine close sub-alignments. Best soln = initn
4. Compute best DP solution in a band around initn. Result = opt

FASTA – Step 1
Find hot spots: (runs of matches of length ktup)
[Figure: Sequence A vs. Sequence B]

FASTA – Step 2
Rescoring using a substitution matrix
[Figure: Sequence A vs. Sequence B, regions labeled high score and low score]
The score of the highest scoring initial region is saved as the init1 score.

FASTA – Step 3
Joining threshold - eliminates disjointed segments
[Figure: Sequence A vs. Sequence B]
Non-overlapping regions are joined. The score equals the sum of the scores of the regions minus a gap penalty. The score of the highest scoring region, at the end of this step, is saved as the initn score.

FASTA Algorithm (2)
2. Find 10 best diagonal runs and init1
3. Allowing indels – combine close diagonal runs:
Construct an alignment graph:
• Nodes = sub-alignments (SAs); weight = alignment score (from 1)
• Edges between SAs that can fit together; weight is negative and depends on the size of the corresponding gap
Find a maximum weight path in it, initn
[Figure: alignment graph]

FASTA – Step 4
Alignment optimization using dynamic programming
[Figure: Sequence A vs. Sequence B]
The score for this alignment is the opt score.

FASTA Output
• The information on each hit includes:
– General information and statistics
– SW score, %identity and length of overlap

Statistical significance
• Key question: how significant is the score x that was obtained?
• Scores are not normally distributed
• Solution 1: view the score distribution over all database entries, and see how far out x is.

Output of Fasta 2
Distribution of initial scores with ktup=2.
score initn init1
 < 2     2     2:=
   4     0     0:
   6     4     4:==
   8    18    18:=========
  10    73    73:=====================================
  12   326   326:==================================================
  14   370   370:==================================================
  16  1248  1248:==================================================
  18  1354  1354:==================================================
  20  2746  2746:==================================================
  22  3151  3151:==================================================
  24  6401  6401:==================================================
  26  5110  5110:==================================================
  28  5724  5763:==================================================
  30  3887  4303:==================================================
  32  2238  2682:==================================================
  34  1401  1735:==================================================
  36   863  1144:==================================================
  38   533   690:==================================================
  40   346   411:==================================================
  42   481   250:==================================================
  44   419   166:==================================================
  46   346   110:==================================================
  48   295    84:------------------------------------------++++++++
  50   203    47:------------------------++++++++++++++++++++++++++
  52   184    46:-----------------------+++++++++++++++++++++++++++
  54   110    39:--------------------++++++++++++++++++++++++++++++
  56    82     9:-----++++++++++++++++++++++++++++++++++++
  58    69     8:----+++++++++++++++++++++++++++++++
  60    71     3:--++++++++++++++++++++++++++++++++++
  62    75     1:-+++++++++++++++++++++++++++++++++++++
  64    36     1:-+++++++++++++++++
  66    31     0:++++++++++++++++
  68    17     2:-++++++++
  70    12     0:++++++
  72    12     0:++++++
  74     6     0:+++
  76    10     1:-++++
  78    28     0:++++++++++++++
  80     2     0:+
> 80    19     5:---+++++++

13464008 residues in 38303 sequences
statistics exclude scores greater than 73
mean initn score: 26.8 (7.79)
mean init1 score: 26.0 (6.05)
5349 scores better than 33 saved, ktup: 2, variable pamfact
joining threshold: 28
scan time: 0:00:30

The best scores are:                                    initn init1 opt
sp|P33013|YEEC_ECOLI HYPOTHETICAL 38.9 KD PROTEIN IN S   1422  1422 1422

http://bimas.dcrt.nih.gov/fastainfo/fastaexample.html

Statistical significance (2)
• Key question: how significant is the score x that was obtained?
• Solution 2:
– $\mu$: average score of random sequence
– $\sigma$: standard deviation
• Z-score: $z = (x - \mu) / \sigma$
• Rule of thumb: z > 3 possibly significant, z > 6 probably significant, z > 10 significant
• Issues: sensitivity vs selectivity.
• Pertinence to biology is the bottom line

August 1997: NCBI Director David Lipman (far left) coaches Vice President Gore (seated) as he searches PubMed. NIH Director Harold Varmus (center) and NLM Director Donald Lindberg look on.

Bill Pearson
Bill Pearson received his Ph.D. in Biochemistry in 1977 from the California Institute of Technology. He then did postdoctoral fellowships at the Caltech Marine Station in Corona del Mar, CA and at the Department of Molecular Biology and Genetics at Johns Hopkins. In 1983 he joined the Department of Biochemistry at the University of Virginia.

BLAST: Basic Local Alignment Search Tool
Altschul, Gish, Miller, Myers and Lipman 1990.
• Motivation: Need to increase the speed of FASTA by finding fewer and better spots during the algorithm.
• The Core of the Algorithm: Finding fewer and better hot spots, but not insisting on perfect matches in them.
• Some statistical results on the significance of the results
• Different versions for protein, DNA, …

BLAST – outline
• Compile a list of high scoring words with the query
• Scan the database for hits
• Extend hits

BLAST Algorithm 1
Query sequence of length L
Maximum of L - w + 1 words (typically w = 3 for proteins)
For each word from the query sequence find the list of words with high score using a substitution matrix
Word list

BLAST Algorithm 2
Database sequences
Word list
Exact matches of words
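
The word-list and scanning steps above can be illustrated with a minimal Python sketch. It is not BLAST's actual implementation. The four-letter alphabet, the +2/-1 score (a stand-in for a real substitution matrix), the threshold parameter, and all function names are assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # toy alphabet (assumption); proteins would use 20 letters


def score(a, b):
    """Toy substitution score (assumption): +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def query_words(query, w):
    """All L - w + 1 overlapping words of length w, paired with their query offsets."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def neighborhood(word, threshold):
    """Every word of the same length whose score against `word` is >= threshold."""
    result = []
    for cand in product(ALPHABET, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, cand))
        if s >= threshold:
            result.append("".join(cand))
    return result


def build_word_list(query, w, threshold):
    """Map each high-scoring word to the query offsets it was derived from."""
    table = {}
    for i, word in query_words(query, w):
        for nb in neighborhood(word, threshold):
            table.setdefault(nb, []).append(i)
    return table


def scan(db_seq, table, w):
    """Return (query_offset, db_offset) pairs where a listed word occurs exactly in db_seq."""
    hits = []
    for j in range(len(db_seq) - w + 1):
        for i in table.get(db_seq[j:j + w], ()):
            hits.append((i, j))
    return hits
```

For example, with query ACGT, w = 3 and threshold 3, `scan("TTACGA", build_word_list("ACGT", 3, 3), 3)` returns `[(0, 2), (1, 3)]`. The first hit is an exact occurrence of the query word ACG. The second comes from CGA, a one-mismatch neighbor of the query word CGT. Raising the threshold to 6 keeps only exact query words and leaves the single hit `(0, 2)`. Enumerating every candidate word is affordable only for small w, which fits the small word sizes quoted above.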