09/01/2020

From Smith-Waterman to BLAST

Jeremy Buhler (in absentia)
Wilson Leung 07/2015

Key limitations of the Smith-Waterman local alignment algorithm

• Quadratic in time and space complexity
• Report only one optimal alignment
  – Usually want all interesting alignments
  – Example: map an mRNA against a genome

[Figure: a query sequence aligned against a genome that contains paralogous features]
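
To make the quadratic cost concrete, the following is a minimal Python sketch of Smith-Waterman scoring. The function name `smith_waterman_score`, the linear gap penalty, and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; they are not taken from the slides. The sketch fills one DP cell per pair of positions and reports only the single best local score, which mirrors the two limitations listed above.

```python
def smith_waterman_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return (best_local_score, cells_filled) for sequences s and t.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(s) + 1, len(t) + 1
    # Full (len(s)+1) x (len(t)+1) matrix: quadratic space.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    cells = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment
                H[i - 1][j - 1] + sub,  # match or mismatch
                H[i - 1][j] + gap,      # gap in t
                H[i][j - 1] + gap,      # gap in s
            )
            best = max(best, H[i][j])
            cells += 1                  # one unit of work per cell: quadratic time
    return best, cells


if __name__ == "__main__":
    print(smith_waterman_score("ttacgtt", "ggacggg"))
```

The cell count returned alongside the score equals the product of the two sequence lengths, which is the cost that the filtering strategies in the later slides try to avoid.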

BLAST alignment strategy: generate and filter

[Figure: query and database → initial candidates → filtering → refined candidates → Smith-Waterman → final alignments]

• Goal: minimize the need for calculating Smith-Waterman alignments

Smith-Waterman implementations

• Bill Pearson’s ssearch
  – https://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml
• Water (EBI / EMBOSS)
  – https://www.ebi.ac.uk/Tools/psa/emboss_water/
• DeCypherSW from TimeLogic
  – http://www.timelogic.com/catalog/775

Challenges with the BLAST alignment strategy

1. Identify candidate patterns
2. Find the best alignment “near” a candidate

Identify candidate patterns

• High-scoring alignment between two sequences will contain some consecutive matches
• Treat k-mer (word) matches as candidates

(k = 4)
…atacatcactacgatccacat atcc-a…
…agacatgacat --tgcaatcccaatcc …


Locate k-mer matches (k=3)

Position:  1 2 3 4 5 6 7
Query:     a c g t a c g
Database:  a c g c g t a

3-mers in query

3-mer   Position(s)
acg     1, 5
cgt     2
gta     3
tac     4

3-mer matches in database

k-mer match   Database position(s)   Query position(s)
acg           1                      1, 5
cgt           4                      2
gta           5                      3

Use a hash table to more efficiently store k-mers

• A table of $4^k$ entries is required to store all possible k-mers of a DNA query sequence
• BLAST uses a hash table to store k-mers
  – Space requirement proportional to the query size
• Reduces the time required to the sum of the lengths of the two sequences
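
The lookup described above can be sketched in a few lines of Python. This is an illustration only, not BLAST's actual implementation: the function names `build_kmer_index` and `find_kmer_matches` are hypothetical, and a Python `dict` stands in for the hash table. The query is indexed once, then the database is scanned once, so the work is roughly proportional to the sum of the two lengths.

```python
from collections import defaultdict


def build_kmer_index(query, k):
    """Map each k-mer in the query to its 1-based start positions."""
    index = defaultdict(list)
    for pos in range(len(query) - k + 1):
        index[query[pos:pos + k]].append(pos + 1)
    return index


def find_kmer_matches(query, database, k):
    """Return (k-mer, database position, query positions) for every match.

    Positions are 1-based to match the slide's example.
    """
    index = build_kmer_index(query, k)
    matches = []
    for db_pos in range(len(database) - k + 1):
        word = database[db_pos:db_pos + k]
        if word in index:  # hash lookup, no scan of the query
            matches.append((word, db_pos + 1, index[word]))
    return matches


if __name__ == "__main__":
    for match in find_kmer_matches("acgtacg", "acgcgta", 3):
        print(match)
```

Run on the slide's example (query `acgtacg`, database `acgcgta`, k = 3), the sketch reports the same three matches as the table: `acg`, `cgt`, and `gta`.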

k-mer size affects the sensitivity and specificity of the search

• How “good” are the candidate matches?
• Trade off between sensitivity (true positives) and specificity (true negatives)
  § k = 1 (high sensitivity)
  § k = entire sequence (high specificity)

Other “Build a Table” abstractions

• Search multiple queries against a database
  – BLAT: index the database
• More space-efficient index structures
  – Suffix array
  – Burrows-Wheeler transform
  – FM-index
• Used by second-generation sequence aligners (e.g., BWA, Bowtie)

Li H and Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics. 2010 Sep;11(5):473-83.
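
As a rough illustration of one of the index structures listed above, here is a naive suffix array in Python. The function names are hypothetical, and the construction simply sorts all suffixes, which is far slower than the methods real aligners use; the point is only to show how a sorted index of the database supports binary search for a word.

```python
def build_suffix_array(text):
    """Return start positions (0-based) of all suffixes in sorted order.

    Naive construction for illustration; not suitable for genome-scale input.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted 0-based positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-letter prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-letter prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    database = "acgcgta"
    sa = build_suffix_array(database)
    print(find_occurrences(database, sa, "cg"))
```

Because the index is built over the database rather than the query, it can be reused across many queries, which is the motivation given on the slide.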

Quantifying specificity

• Given DNA sequences S and T
  – i.i.d. random with equal base frequencies
• Probability of 1 bp match: $1/4$
• Probability of k-mer match: $(1/4)^k = 4^{-k}$
• Expected number of k-mer matches: $\approx |S|\,|T| / 4^k$
• Search 1kb pattern against a 1Gb database: $\approx 10^3 \cdot 10^9 / 4^k = 10^{12} / 4^k$

Quantifying sensitivity

• Require at least one k-mer match to detect an alignment between S and T
• Sequences with lower percent identity have fewer k-mer matches
• How large a value of k is likely to detect most alignments?
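
A short Python calculation can put numbers on the specificity question. It assumes the model stated on the slide (i.i.d. bases with equal frequencies), so that any given pair of k-mers matches with probability 4^-k and the expected number of chance matches is about |S|·|T| / 4^k. The function name `expected_random_matches` is hypothetical, and the approximation ignores the small correction for k-mers that would run off the ends of the sequences.

```python
def expected_random_matches(len_s, len_t, k):
    """Approximate expected number of chance k-mer matches between two
    random DNA sequences with equal base frequencies."""
    return len_s * len_t / 4 ** k


if __name__ == "__main__":
    # 1 kb query against a 1 Gb database, for a few word sizes.
    for k in (8, 11, 15):
        print(k, expected_random_matches(10**3, 10**9, k))
```

For the 1 kb against 1 Gb search, this gives on the order of 10^5 chance matches at k = 11, and each increase of k by one cuts the expectation by a factor of four.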


Word length versus probability of occurrence
Target length (L) = 100

[Figure: probability of occurrence (L=100), from 0.1 to 1.0, plotted against word length (k), from 6 to 16, with curves for 80% identity and 67% identity; k=11 is marked]

Adjust k-mer size based on the level of sequence similarity

• BLAT (k=15)
  – Find highly similar sequences
• blastn (k=11)
  – Find most medium to high similarity alignments
  – Most candidates are false positives
• RepeatMasker (k=8)
  – Find highly diverged repeat copies
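
The relationship between word length and sensitivity can be explored with a small Python sketch. It is not the method used to draw the plot; it assumes a simplified model in which an ungapped alignment of length L has independent columns, each matching with probability equal to the percent identity. Under that assumption, the chance of at least one exact k-mer match is the chance of a run of k consecutive matches, which a short dynamic program computes exactly. The function name `prob_kmer_hit` is hypothetical.

```python
def prob_kmer_hit(length, k, identity):
    """Probability that an ungapped alignment of the given length contains
    at least one run of k consecutive matches, assuming each column matches
    independently with probability `identity`."""
    # state[r] = probability that the current run of matches has length r
    # and no run of length k has occurred yet.
    state = [0.0] * k
    state[0] = 1.0
    hit = 0.0
    for _ in range(length):
        new_state = [0.0] * k
        for run, prob in enumerate(state):
            new_state[0] += prob * (1.0 - identity)  # mismatch resets the run
            if run + 1 == k:
                hit += prob * identity               # run reaches k: word hit
            else:
                new_state[run + 1] += prob * identity
        state = new_state
    return hit


if __name__ == "__main__":
    for k in (8, 11, 15):
        print(k, prob_kmer_hit(100, k, 0.80), prob_kmer_hit(100, k, 0.67))
```

Under this model the probability falls as k grows and as identity drops, which is the trade-off the tool defaults above reflect.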

Use more sensitive parameters to identify the initial transcribed

• Program Selection:
  – From megablast to blastn
• Word Size:
  – From 11 to 7
• Match/Mismatch Scores:
  – From +2/-3 to +1/-1
• Gap Costs:
  – Existence: from 5 to 2
  – Extension: from 2 to 1

Word match for sequences

• Use shorter k-mer:
  – blastp (k=3)
• Allow approximate matches using similarity:
  – Keep all word matches with score ≥ T (neighborhood)
• Reduce number of spurious candidates:
  – Require two word matches along the same diagonal (two-hit algorithm)
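
The two-hit idea can be sketched as follows. This is a simplified illustration rather than BLAST's implementation: the function name `two_hit_candidates`, the `(query_pos, db_pos)` hit format, and the `window` parameter are assumptions made for the example. Two word hits lie on the same diagonal when the difference between database and query positions is the same, and a candidate is kept only when two non-overlapping hits on one diagonal fall within the window.

```python
from collections import defaultdict
from itertools import combinations


def two_hit_candidates(hits, k, window):
    """Return sorted (diagonal, first_query_pos, second_query_pos) triples.

    hits   : iterable of (query_pos, db_pos) word-hit start positions
    k      : word length, used to reject overlapping hits
    window : maximum distance between the two hits (illustrative parameter)
    """
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)

    candidates = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        # Check every pair on the diagonal; fine for a sketch.
        for first, second in combinations(positions, 2):
            if k <= second - first <= window:
                candidates.append((diagonal, first, second))
    return sorted(candidates)


if __name__ == "__main__":
    word_hits = [(1, 11), (8, 18), (3, 40), (30, 40)]
    print(two_hit_candidates(word_hits, k=3, window=10))
```

In the example, only the hits at query positions 1 and 8 share a diagonal and sit close enough together, so the isolated hits are discarded before any extension is attempted.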

Use dynamic programming (DP) to filter candidates

• Search the region surrounding each candidate
• Define the size and shape of the search region
• Report multiple high-scoring alignments
  – Align a multi-exon mRNA against a genome
  – Report alignments to all

The “shadowing problem”

• A “good” alignment might be omitted because of a better alignment within the search region

[Figure: two query-versus-genome panels, “Tandem duplication” and “Missing exon” (a multi-exon gene with a region of no similarity), each showing alignments A and Aʹ]

Lower-scoring A is shadowed by higher-scoring Aʹ


Solution: pin the alignment

• Candidate match is centered on S[i], T[j]
• Compute optimal alignments that pass through (i, j)
  – Half-anchor alignments

[Figure: DP matrix with Sequence S and Sequence T as axes; Ab ends at (i, j) and Af starts from (i, j)]

Af = Best alignment that starts from (i, j)
Ab = Best alignment that ends at (i, j)
A = Best alignment (combine Af and Ab)

Define the size of the two search regions

• One option: bound the search regions by the ends of the two sequences
  – Best case: half of the entire DP matrix
  – Worst case: cost as much as not filtering

[Figure: DP matrix with Ab and Af on either side of (i, j); DP fill region ≥ half of matrix]

BLAST often chains multiple alignment blocks into a single alignment

[Figure: tblastn of CaMKII-PA (query) against the D. mojavensis genome (subject)]

The “chaining problem”

• Opposite problem to shadowing
  – Connect multiple features into a single alignment

[Figure: alignment of Sequence 1 against Sequence 2 in which Feature A (+40) and Feature B (+40) are joined across an intervening “Junk!” region (-39)]

Mills LJ and Pearson WR. Adjusting scoring matrices to correct overextended alignments. Bioinformatics. 2013 Dec 1;29(23):3007-13.

Ignore alignments that are “not promising”

• Ignore alignments with very large gaps
  – Usually have poor score
  – Can identify second feature from its own candidates
• Limit search region to the diagonal surrounding the candidate
  – The bandwidth (b) parameter controls the width of the diagonal

Use banded alignments to reduce the search space

[Figure: DP matrix with a band of width 2b+1 around the diagonal through (i, j)]

• Number of DP entries to compute is proportional to the length of the shorter sequence (times b)
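
A banded alignment can be sketched by restricting the Smith-Waterman recurrence to cells near one diagonal. The Python below is an illustration under assumed parameters, not BLAST's code: the function name `banded_local_score`, the linear gap penalty, and the scoring values are hypothetical. Cells outside the band are never computed, so at most 2b+1 cells are filled per row.

```python
def banded_local_score(s, t, diag, b, match=1, mismatch=-1, gap=-2):
    """Local alignment score using only cells (i, j) with |(j - i) - diag| <= b.

    Returns (best_score, cells_filled). Scoring values are illustrative.
    """
    prev_row = {}  # scores for row i-1, keyed by column j (in-band cells only)
    best = 0
    cells = 0
    for i in range(1, len(s) + 1):
        cur_row = {}
        lo = max(1, i + diag - b)
        hi = min(len(t), i + diag + b)
        for j in range(lo, hi + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # A missing neighbor is a matrix border or an out-of-band cell;
            # scoring it as 0 means a local alignment can start there but
            # cannot pass through cells outside the band.
            score = max(
                0,
                prev_row.get(j - 1, 0) + sub,
                prev_row.get(j, 0) + gap,
                cur_row.get(j - 1, 0) + gap,
            )
            cur_row[j] = score
            best = max(best, score)
            cells += 1
        prev_row = cur_row
    return best, cells


if __name__ == "__main__":
    print(banded_local_score("acgtacgt", "acgtacgt", diag=0, b=1))
```

With a bandwidth large enough to cover the whole matrix the result equals the unrestricted local score, while a small b keeps the number of cells close to the sequence length times 2b+1.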


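Use X-drop to further reduce the search space

• Terminate the extension when the score falls more than x below the best score σ found so far

[Figure: extension Af from the candidate; the best score $M_{i^*,j^*} = \sigma$ is reached at (i*, j*), and at a later cell (u, v) the score is $M_{u,v} < \sigma - x$]

• If the total score of Af is ≥ σ, the score of the piece beyond (u, v) must be > x

BLAST X-drop strategy

[Figure: cumulative score plotted against length of extension; the extension stops once the score has dropped by X, and the final alignment is trimmed back to the position with the highest score. The plot also asks: minimum score?]

Korf, I., Yandell, M., and Bedell, J. (2003). BLAST. O’Reilly Media, Inc.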
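
The X-drop rule is easiest to see in an ungapped extension. The following Python sketch extends to the right along one diagonal from a seed position; the function name `xdrop_extend`, the 0-based coordinates, and the match/mismatch scores are assumptions for illustration, and real BLAST also extends to the left and supports gapped extension. The extension stops once the running score falls more than x below the best score seen, and the result is trimmed back to the position of that best score.

```python
def xdrop_extend(s, t, i, j, x, match=1, mismatch=-1):
    """Ungapped rightward extension from s[i], t[j] (0-based) with X-drop.

    Returns (best_score, best_length): the highest cumulative score reached
    and the extension length at which it was reached (the trim point).
    """
    score = 0
    best_score = 0
    best_length = 0
    n = 0
    while i + n < len(s) and j + n < len(t):
        score += match if s[i + n] == t[j + n] else mismatch
        n += 1
        if score > best_score:
            best_score, best_length = score, n
        elif best_score - score > x:
            break  # dropped more than x below the best score: give up
    return best_score, best_length


if __name__ == "__main__":
    s = "aaaaccccaaaaa"
    t = "aaaaggggaaaaa"
    print(xdrop_extend(s, t, 0, 0, x=2))
    print(xdrop_extend(s, t, 0, 0, x=10))
```

In the example, a small x stops the extension inside the mismatched block and keeps only the first feature, whereas a large x lets the extension cross the block and join the second feature, which is the chaining behavior that X-drop is meant to limit.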

Summary

• BLAST uses a generate and filter strategy
  – Generate candidate matches
  – Filter using dynamic programming (DP)
• Mitigates problems with shadowing and chaining
• Minimizes the amount of time spent on DP
  – Banded alignment
  – X-drop

Questions?
