02.01.15
Exact pa ern matching
• S = s1 s2 … sn (text) |S| = n (length) Text Algorithms • P = p1p2..pm (pa ern) |P| = m
• Σ - alphabet | Σ| = c Jaak Vilo 2014 fall • Does S contain P? – Does S = S' P S" fo some strings S' ja S"?
– Usually m << n and n can be (very) large Jaak Vilo MTAT.03.190 Text Algorithms 1
Find occurrences in text Algorithms
P One-pa ern Mul -pa ern • Brute force • Aho Corasick S • Knuth-Morris-Pra • Commentz-Walter • Karp-Rabin • Shi -OR, Shi -AND • Boyer-Moore Indexing • • Factor searches Trie (and suffix trie) • Suffix tree
Anima ons Brute force: BAB in text?
• h p://www-igm.univ-mlv.fr/~lecroq/string/ ABACABABBABBBA
• EXACT STRING MATCHING ALGORITHMS BAB Anima on in Java • Chris an Charras - Thierry Lecroq Laboratoire d'Informa que de Rouen Université de Rouen Faculté des Sciences et des Techniques 76821 Mont-Saint-Aignan Cedex FRANCE • e-mails: {Chris an.Charras, Thierry.Lecroq} @laposte.net
1 02.01.15
Brute Force Brute force
attempt 1: gcatcgcagagagtatacagtacg P Algorithm Naive GCAg.... i i+j-1 attempt 2: Input: Text S[1..n] and gcatcgcagagagtatacagtacg S pa ern P[1..m] g......
attempt 3: Output: All posi ons i, where gcatcgcagagagtatacagtacg g...... j P occurs in S attempt 4: gcatcgcagagagtatacagtacg Iden fy the first mismatch! g......
for( i=1 ; i <= n-m+1 ; i++ ) attempt 5: gcatcgcagagagtatacagtacg for ( j=1 ; j <= m ; j++ ) g......
if( S[i+j-1] != P[j] ) break ; attempt 6: gcatcgcagagagtatacagtacg Ques on: GCAGAGAG if ( j > m ) print i ; attempt 7: gcatcGCAGAGAGtatacagtacg
§Problems of this method? L g...... §Ideas to improve the search? J
Brute force or NaiveSearch C code
int bf_2( char* pat, char* text , int n ) /* n = textlen */ 1 func on NaiveSearch(string s[1..n], string sub[1..m]) { int m, i, j ; 2 for i from 1 to n-m+1 int count = 0 ; m = strlen(pat); 3 for j from 1 to m for ( i=0 ; i + m <= n ; i++) { 4 if s[i+j-1] ≠ sub[j] for( j=0; j < m && pat[j] == text[i+j] ; j++) ; 5 jump to next itera on of outer loop if( j == m ) 6 return i count++ ; } 7 return not found return(count); }
C code Main problem of Naive int bf_1( char* pat, char* text ) { P int m ; i i+j-1 int count = 0 ; char *tp; S
m = strlen(pat); j tp=text ; S
for( ; *tp ; tp++ ) { j if( strncmp( pat, tp, m ) == 0 ) { count++ ; } } • For the next possible loca on of P, check
return( count ); again the same posi ons of S }
2 02.01.15
D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in strings. Goals Knuth-Morris-Pra SIAM Journal on Computing 6:323-350, 1977.
• Make sure that no comparisons “wasted” • Make sure only a constant nr of comparisons/ x opera ons is made for each posi on in S ≠ y – Move (only) from le to right in S
– How? • A er such a mismatch we already know – A er a test of S[i] <> P[j] what do we now ? exactly the values of green area in S !
D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in strings. Knuth-Morris-Pra SIAM Journal on Computing 6:323-350, 1977. Automaton for ABCABD
• Make sure that no comparisons “wasted” NOT A
A B C A B D prefix x 1 2 3 4 5 6 7 ≠ prefix y ≠ p z
• P – longest suffix of any prefix that is also a prefix of a pa ern • Example: ABCABD ABCABD
Automaton for ABCABD KMP matching
NOT A Input: Text S[1..n] and pa ern P[1..m] Output: First occurrence of P in S (if exists) A B C A B D 1 2 3 4 5 6 7 i=1; j=1; initfail(P) // Prepare fail links repeat if j==0 or S[i] == P[j] then i++ , j++ // advance in text and in pattern Fail: 0 1 1 1 2 3 1 else j = fail[j] // use fail link until j>m or i>n if j>m then report match at i-m Pattern: A B C A B D
1 2 3 4 5 6
3 02.01.15
Ini aliza on of fail links Ini aliza on of fail links i Algorithm: KMP_Ini ail ABCABD i=1, j=0 , fail[1]= 0 j Input: Pa ern P[1..m] repeat Output: fail[] for pa ern P if j==0 or P[i] == P[j] then i++ , j++ , fail[i] = j Fail: 0 else j = fail[j] i=1, j=0 , fail[1]= 0 until i>=m repeat 0 1 if j==0 or P[i] == P[j] 0 1 1 1 then i++ , j++ , fail[i] = j else j = fail[j] ABCABD until i>=m
0 1 1 1 2
Time complexity of KMP matching? Analysis of me complexity
Input: Text S[1..n] and pa ern P[1..m] • At every cycle either i and j increase by 1 Output: First occurrence of P in S (if exists) • Or j decreases (j=fail[j]) i=1; j=1; initfail(P) // Prepare fail links repeat • i can increase n (or m) mes if j==0 or S[i] == P[j] • Q: How o en can j decrease? then i++ , j++ // advance in text and in pattern else j = fail[j] // use fail link – A: not more than nr of increases of i until j>m or i>n if j>m then report match at i-m • Amor sed analysis: O(n) , preprocess O(m)
Karp-Rabin Karp-Rabin R.Karp and M. Rabin: Efficient randomized pattern-matching algorithms. R.Karp and M. Rabin: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (1987), 249-260. IBM Journal of Research and Development 31 (1987), 249-260.
• Compare in O(1) a hash of P and S[i..i+m-1] • Compare in O(1) a hash of P and S[i..i+m-1]
i .. (i+m-1) i .. (i+m-1) i .. (i+m-1)
h(T[i.. i+m-1]) h(T[i+1..i+m])
h(P) h(P)
1..m 1..m • Goal: O(n). • Goal: O(n). • f( h(T[i.. i+m-1]) -> h(T[i+1..i+m]) ) = O(1) • f( h(T[i.. i+m-1]) -> h(T[i+1..i+m]) ) = O(1)
4 02.01.15
Hash Let’s use numbers
• “Remove” the effect of T[i] and “Introduce” • T= 57125677 the effect of T[i+m] – in O(1) • P= 125 (and for simplicity, h=125)
• Use base |Σ| arithme cs and treat charcters • H(T[1])=571 as numbers • H(T[2])= (571-5*100)*10 + 2 = 712
• In case of hash match – check all m posi ons • H(T[3]) = (H(T[2]) – ord(T[1])*10m)*10 + T[3+m-1] • Hash collisions => Worst case O(nm)
hash
• c – size of alphabet • hash(w[0 .. m-1])=(w[0]*2m-1+ w[1]*2m-2+···+ w[m-1]*20) mod q
• HSi = H( S[i..i+m-1] )
• H( S[i+1 .. i+m ] ) = ( HSi – ord(S[i])* cm-1 ) * c + ord( S[i+m] )
• Modulo arithme c – to fit value in a word!
Karp-Rabin More ways to ensure O( n ) ?
Input: Text S[1..n] and pattern P[1..m] Output: Occurrences of P in S 1. c=20; /* Size of the alphabet, say nr. of aminoacids */ 2. q = 33554393 /* q is a prime */ 3. cm = cm-1 mod q 4. hp = 0 ; hs = 0 5. for i = 1 .. m do hp = ( hp*c + ord(p[i]) ) mod q // H(P) 6. for i = 1 .. m do hs = ( hp*c + ord(s[i]) ) mod q // H(S[1..m]) 7. if hp == hs and P == S[1..m] report match at position
8. for i=2 .. n-m+1 9. hs = ( (hs - ord(s[i-1])*cm) * c + ord(s[i+m-1]) mod q 10. if hp == hs and P == S[i..i+m-1] 11. report match at position i
5 02.01.15
Shi -AND / Shi -OR Bit-opera ons
• Ricardo Baeza-Yates , Gaston H. Gonnet • Maintain a set of all prefixes that have so far A new approach to text searching had a perfect match Communica ons of the ACM October 1992, Volume 35 Issue 10 [ACM Digital Library:h p://doi.acm.org/10.1145/135239.135243] [DOI] • On the next character in text update all previous pointers to a new set • PDF
• Bit vector: for every possible character
State: which prefixes match? Move to next:
0 1
0 0
1
Track posi ons of prefix matches Vectors for every char in Σ
• P=aste
0 1 0 1 0 1 a s t e b c d .. z 1 0 1 0 1 1 Shift left << 1 0 0 0 0 ... Mask on char T[i] 1 0 0 0 1 1 Bitwise AND 0 1 0 0 0 ...
1 0 0 0 1 1 0 0 1 0 0 ... 0 0 0 1 0 ...
6 02.01.15
• T= lasteaed • T= lasteaed
l a s t e a e d l a s t e a e d 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
• T= lasteaed • T= lasteaed
l a s t e a e d l a s t e a e d 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0
Summary
Algorithm Worst case Ave. Case Preprocess
Brute force O(mn) O(n * (1+1/|Σ|+..)
Knuth-Morris-Pra O(n) O(n) O(m)
Rabin-Karp O(mn) O(n) O(m)
Boyer-Moore O(n/m) ?
BM Horspool
Factor search
Shi -OR O(n) O(n) O( m|Σ| )
7 02.01.15
Find occurrences in text
• R. Boyer, S.Moore: A fast string searching P algorithm. CACM 20 (1977), 762-772 [PDF] • http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf S
• Have we missed anything?
43 44
Find occurrences in text
P
S ABCDE BBCDE
• What have we learned if we test for a poten al match from the end?
45 46
Bad character heuris cs Find occurrences in text maximal shi on S[i]
P P S[i]
S A S A X B B
S X X First x in pattern (from end)
delta1( S[i] ) – |m| if pattern does not contain S[i] patlen-j max j so that P[j] == S[i]
47 48
8 02.01.15
Good suffix heuris cs
void bmInitocc() { P
char a; int j; S A µ for(a=0; a 49 50 Boyer-Moore algorithm Input: Text S[1..n] and pattern P[1..m] • h p://www.i . -flensburg.de/lang/algorithmen/pa ern/bmen.htm Output: Occurrences of P in S • h p://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Ar cles/Exact/Boyer-Moore-original-p762-boyer.pdf preprocess_BM() // delta1 and delta2 i=m • Anima on: h p://www-igm.univ-mlv.fr/~lecroq/string/ while i <= n for( j=m; j>0 and P[j]==S[i-m+j]; j-- ) ; if j==0 report match at position i-m+1 i = i+ max( delta1[ S[i] ], delta2[ j ] ) 51 52 Simplifica ons of BM Algorithm BMH (Boyer-Moore-Horspool) • There are many variants of Boyer-Moore, and • RN Horspool - Prac cal Fast Searching in Strings So ware - Prac ce and Experience, 10(6):501-506 1980 many scien fic papers. Input: Text S[1..n] and pattern P[1..m] • On average the me complexity is sublinear Output: occurrences of P in S 1. for a in Σ do delta[a] = m • Algorithm speed can be improved and yet 2. for j=1..m-1 do delta[P[j]] = m-j simplify the code. 3. i=m 4. while i <= n • It is useful to use the last character heuris cs 5. if S[i] == P[m] 6. j = m-1 (Horspool (1980), Baeza-Yates(1989), Hume 7. while ( j>0 and P[j]==S[i-m+j] ) j = j-1 ; and Sunday(1991)). 8. if j==0 report match at i-m+1 9. i = i + delta[ S[i] ] 53 54 9 02.01.15 String Matching: Horspool algorithm Algorithm Boyer-Moore-Horspool-Hume-Sunday (BMHHS) • How the comparison is made? • Use delta in a ght loop Text : • If match (delta==0) then check and apply original delta d Pattern : From right to left: suffix search Input: Text S[1..n] and pattern P[1..m] Output: occurrences of P in S 1. for a in Σ do delta[a] = m • Which is the next position of the window? 2. for j=1..m-1 do delta[P[j]] = m-j 3. d = delta[ P[ m ] ]; // memorize d on P[m] Text : a 4. delta[ P[ m ] ] = 0; // ensure delta on match of last char is 0 5. for ( i=m ; i<= n ; i = i+d ) Pattern : 6. repeat // skip loop It depends of where appears the last letter of the text, say it ‘a’, in the pattern: 7. t=delta[ S[i] ] ; i = i + t 8. until t==0 a a a 9. for( j=m-1 ; j> 0 and P[j]==S[i-m+j] ; j = j-1 ) ; 10. if j==0 report match at i-m+1 a a a a a a BMHHS requires that the text is padded by P: S[n+1]..S[n+m] = P (in order for the algorithm to finish correctly – at least one occurrence!). Then it is necessary a preprocess that determines the length of the shift. 56 • Daniel M. Sunday: A very fast substring search algorithm [PDF] Communica ons of the ACM August 1990, Volume 33 Issue 8 • Loop unrolling: • Avoid too many loops (each loop requires tests) by just repea ng code within the loop. • Line 7 in previous algorithm can be replaced by: 7. i += delta[ S[i] ]; i += delta[ S[i] ]; i += (t = delta[ S[i] ]) ; 57 58 Forward-Fast-Search: Another Fast Variant of the Boyer-Moore String Matching Algorithm Factor based approach • The Prague Stringology Conference '03 • Op mal average-case algorithms • Domenico Cantone and Simone Faro • Abstract: We present a varia on of the Fast-Search string matching – Assuming independent characters, same probability algorithm, a recent member of the large family of Boyer-Moore-like algorithms, and we compare it with some of the most effec ve string matching algorithms, such as Horspool, Quick Search, Tuned Boyer- Moore, Reverse Factor, Berry-Ravindran, and Fast-Search itself. All algorithms are compared in terms of run- me efficiency, number of text • Factor – a substring of a pa ern character inspec ons, and number of character comparisons. It turns out that our new proposed variant, though not linear, achieves very good – Any substring results especially in the case of very short pa erns or small alphabets. – (how many?) • h p://cs.felk.cvut.cz/psc/event/2003/p2.html • PS.gz (local copy) 59 60 10 02.01.15 Factor based approach Factor searches • Op mal average-case algorithms P – Assuming independent characters, same probability u S X Do not compare characters, but find the longest match to any subregion of the pattern. 61 62 Examples Backward DAWG Matching BDM Suffix automaton recognises all factors (and suffixes) in O(n) • Backward DAWG Matching (BDM) – Crochemore et al 1994 • Backward Nondeterminis c DAWG Matching (BNDM) – Navarro, Raffinot 2000 • Backward Oracle Matching (BOM) – Allauzen, Crochermore, Raffinot 2001 Do not compare characters, but find the longest match to any 63 subregion of the pattern. 64 BNDM – simulate using bitparallelism BNDM matches an NDA Bits – show where the factors have occurred so far NDA on the suffixes of ‘announce’ 65 66 11 02.01.15 Determinis c version of the same BNDM – Backward Non-Deterministic DAWG Matching Backward Factor Oracle BOM - Backward Oracle matching 67 68 String Matching of one pa ern Mul ple pa erns S CTACTACTACGTCTATACTGATCGTAGC TACTACGGTATGACTAA 1. Prefix search {P} 2. Suffix search 3. Factor search Why? Algorithms • Mul ple pa erns • Aho-Corasick (search for mul ple words) • Highlight mul ple different search words on the page – Generaliza on of Knuth-Morris-Pra • Virus detec on – filter for virus signatures • • Spam filters Commentz-Walter • Scanner in compiler needs to search for mul ple keywords – Generaliza on of Boyer-Moore & AC • Filter out stop words or disallowed words • Wu and Manber • Intrusion detec on so ware – improvement over C-W • Next-genera on sequencing produces huge amounts (many millions) of short reads (20-100 bp) that need to be mapped to genome! • Addi onal methods, tricks and techniques • … 12 02.01.15 Aho-Corasick (AC) • Alfred V. Aho and Margaret J. Corasick (Bell Labs, Murray Hill, NJ) • Efficient string matching. An aid to bibliographic search. Generaliza on of KMP for many pa erns Communica ons of the ACM, Volume 18 , Issue 6, p333-340 (June 1975) • Text S like before. • ACM:DOI PDF • ABSTRACT This paper describes a simple, efficient algorithm to locate all • Set of pa erns P = { P1 , .., Pk } occurrences of any of a finite number of keywords in a string of text. The algorithm consists of construc ng a finite state pa ern matching machine • Total length |P| = m = Σ i=1..k mi from the keywords and then using the pa ern matching machine to process the text string in a single pass. Construc on of the pa ern • Problem: find all occurrences of any of the matching machine takes me propor onal to the sum of the lengths of the keywords. The number of state transi ons made by the pa ern matching Pi ∈ P from S machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10. References: Idea PATRICIA trie • D. R. Morrison, "PATRICIA: Prac cal Algorithm To Retrieve Informa on 1. Create an automaton from all pa erns Coded In Alphanumeric", Journal of the ACM 15 (1968) 514-534. • Abstract PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving informa on in a large file, which is 2. Match the automaton economical of index space and of reindexing me. It does not require rearrangement of text or index as new material is added. It requires a minimum restric on of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves informa on in response to keys furnished by the user with a quan ty of computa on which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several varia ons as FORTRAN • Use the PATRICIA trie for crea ng the main structure programs for the CDC-3600, u lizing disk file storage of text. It has been of the automaton applied to several large informa on-retrieval problems and will be applied to others. • ACM:DOI PDF • Word trie - a good data structure to represent a set of words (e.g. a dic onary). • trie (data structure) • Defini on: A tree for storing strings in which there is one node for every common preffix. The strings are stored in extra leaf nodes. • See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree. • Note: The name comes from retrieval and is pronounced, "tree." • To test for a word p, only O(|p|) me is used no ma er how many words are in the dic onary... 13 02.01.15 Trie for P={he, she, his, hers} Trie for P={he, she, his, hers} 0 h s 0 0 h h s 1 3 i e h 1 1 3 e e h 2 6 4 r s e 2 2 4 e 8 5 s 7 5 9 How to search for words like he, sheila, hi. Do these occur in the trie? Aho-Corasick 0 1. Create an automaton MP for a set of strings P. h s 2. Finite state machine: read a character from text, 1 3 and change the state of the automaton based on i the state transi ons... e h 3. Main links: goto[j,c] - read a character c from text 2 6 4 and go from a state j to state goto[j,c]. r s e 4. If there are no goto[j,c] links on character c from 8 state j, use fail[j]. 7 5 s 5. Report the output. Report all words that have been found in state j. 9 AC Automaton (vs KMP) AC - matching NOT { h, s } Input: Text S[1..n] and an AC automaton M for pa ern set P goto[1,i] = 6. ; Output: Occurrences of pa erns from P in S (last posi on) fail[7] = 3, 0 1. state = 0 fail[8] = 0 , h s fail[5]=2. 2. for i = 1..n do 1 3 e i h 3. while (goto[state,S[i]]== ∅ ) and (fail[state]!=state) do 2 Output table 4 4. state = fail[state] r state output[j] e 5. state = goto[state,S[i]] 2 he 6 5 she, he 8 5 s s 6. if ( output[state] not empty ) 7 his 9 hers 7. then report matches output[state] at posi on i 9 7 14 02.01.15 Algorithm Aho-Corasick preprocessing I (TRIE) Input: P = { P , ... , P } Preprocessing II for AC (FAIL) 1 k Output: goto[] and par al output[] Assume: output(s) is empty when a state s is created; queue = ∅ goto[s,a] is not defined. for a ∈ Σ do if goto[0,a] ≠ 0 then procedure enter(a1, ... , am) /* Pi = a1, ... , am */ begin begin 10. news = 0 enqueue( queue, goto[0,a] ) fail[ goto[0,a] ] = 0 1. s=0; j=1; 11. for i=1 to k do enter( Pi ) 12. for a ∈ Σ do 2. while goto[s,aj] ≠ ∅ do // follow exis ng path 3. s = goto[s,a ]; while queue ≠ ∅ j ∅ 4. j = j+1; 13. if goto[0,a] = then goto[0,a] = 0 ; r = take( queue ) end 5. for p=j to m do // add new path (states) for a ∈ Σ do 6. news = news+1 ; if goto[r,a] ≠ ∅ then s = goto[ r, a ] 7. goto[s,ap] = news; enqueue( queue, s ) // breadth first search 8. s = news; state = fail[r] 9. output[s] = a1, ... , am while goto[state,a] = ∅ do state = fail[state] end fail[s] = goto[state,a] output[s] = output[s] + output[ fail[s] ] Correctness AC matching me complexity • Theorem For matching the MP on text S, |S|=n, • Let string t "point" from ini al state to state j. less than 2n transi ons within M are made. • Proof Compare to KMP. • Must show that fail[j] points to longest suffix • There is at most n goto steps. that is also a prefix of some word in P. • Cannot be more than n Fail-steps. • In total -- there can be less than 2n transi ons in • Look at the ar cle... M. Individual node (goto) AC thoughts • Full table • Scales for many strings simultaneously. • For very many pa erns – search me (of grep) improves(??) – See Wu-Manber ar cle • List • When k grows, then more fail[] transi ons are made (why?) • But always less than n. • If all goto[j,a] are indexed in an array, then the size is • Binary search tree(?) |MP|*|Σ|, and the running me of AC is O(n). • When k and c are big, one can use lists or trees for storing transi on func ons. • Some other index? • Then, O(n log(min(k,c)) ). 15 02.01.15 Advanced AC Problems of AC? • Precalculate the next state transi on correctly for every possible character in alphabet • Need to rebuild on adding / removing pa erns • Can be good for short pa erns • Details of branching on each node(?) C-W descrip on Commentz-Walter • Generaliza on of Boyer-Moore for mul ple • Aho and Corasick [AC75] presented a linear- me sequence search algorithm for this problem, based on an automata • Beate Commentz-Walter approach. This algorithm serves as the basis for the A String Matching Algorithm Fast on the Average UNIX tool fgrep. A linear- me algorithm is op mal in Proceedings of the 6th Colloquium, on Automata, the worst case, but as the regular string-searching Languages and Programming. Lecture Notes In algorithm by Boyer and Moore [BM77] Computer Science; Vol. 71, 1979. pp. 118 - 132, demonstrated, it is possible to actually skip a large Springer-Verlag por on of the text while searching, leading to faster than linear algorithms in the average case. • h p://www. -albsig.de/win/personen/professoren.php?RID=36 • You can download here my algorithm StringMatchingFastOnTheAverage (PDF, ~17,2 MB) or here StringMatchingFastOnTheAverage (extended abstract) (PDF, ~3 MB) Commentz-Walter [CW79] Idea of C-W • Commentz-Walter [CW79] presented an algorithm for the • Build a backward trie of all keywords mul -pa ern matching problem that combines the Boyer- Moore technique with the Aho-Corasick algorithm. The Commentz-Walter algorithm is substan ally faster than the • Match from the end un l mismatch... Aho-Corasick algorithm in prac ce. Hume [Hu91] designed a tool called gre based on this algorithm, and version 2.0 of fgrep by the GNU project [Ha93] is using it. • Determine the shi based on the combina on • Baeza-Yates [Ba89] also gave an algorithm that combines the of heuris cs Boyer-Moore-Horspool algorithm [Ho80] (which is a slight varia on of the classical Boyer-Moore algorithm) with the Aho-Corasick algorithm. 16 02.01.15 Horspool for many pa erns Horspool for many pa erns Search for ATGTATG,TATG,ATAAT,ATGTG Search for ATGTATG,TATG,ATAAT,ATGTG T G T A A 1. Build the trie of the inverted patterns T G G T A T G T A T A A T A A A 1 T A C 4 (lmin) G G T A The text ACATGCTATGTGACA… G 2 T A A T A T 1 A 2. lmin=4 A 1 3. Table of shifts C 4 (lmin) G 2 T 1 4. Start the search Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Horspool for many pa erns Horspool for many pa erns Search for ATGTATG,TATG,ATAAT,ATGTG Search for ATGTATG,TATG,ATAAT,ATGTG T G T A T G T A A A T T G G T A G G T A T A A T A A T A A 1 T A A 1 A C 4 (lmin) A C 4 (lmin) The text ACATGCTATGTGACA… G 2 The text ACATGCTATGTGACA… G 2 T 1 T 1 Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Horspool for many pa erns Horspool for many pa erns Search for ATGTATG,TATG,ATAAT,ATGTG Search for ATGTATG,TATG,ATAAT,ATGTG T G T A T G T A A A T T G G T A G G T A T A A T A A T A A 1 T A A 1 A C 4 (lmin) A C 4 (lmin) The text ACATGCTATGTGACA… G 2 The text ACATGCTATGTGACA… G 2 T 1 T 1 Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) 17 02.01.15 Horspool for many pa erns Horspool for many pa erns Search for ATGTATG,TATG,ATAAT,ATGTG Search for ATGTATG,TATG,ATAAT,ATGTG T G T A T G T A A A T T G G T A G G T A T A A T A A T A A 1 T A A 1 A C 4 (lmin) A C 4 (lmin) The text ACATGCTATGTGACA… G 2 The text ACATGCTATGTGACA… G 2 T 1 T 1 Short Shifts! … … Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) What are the possible limita ons Wu-Manber for C-W? • Wu S., and U. Manber, "A Fast Algorithm for Mul -Pa ern Searching," Technical Report TR-94-17, Department of Computer Science, University • Many pa erns, small alphabet – minimal skips of Arizona (May 1993). • Citeseer: h p://citeseer.ist.psu.edu/wu94fast.html [Postscript] • We present a different approach that also uses the ideas of Boyer and Moore. Our algorithm is quite simple, and the main engine of it is given • What can be done differently? later in the paper. An earlier version of this algorithm was part of the second version of agrep [WM92a, WM92b], although the algorithm has not been discussed in [WM92b] and only briefly in [WM92a]. The current version is used in glimpse [MW94]. The design of the algorithm concentrates on typical searches rather than on worst-case behavior. This allows us to make some engineering decisions that we believe are crucial to making the algorithm significantly faster than other algorithms in prac ce. Key idea • Main problem with Boyer-Moore and many • Instead of looking at characters from the text one by pa erns is that, the more there are pa erns, one, we consider them in blocks of size B. the shorter become the possible shi s... • logc2M; in prac ce, we use either B = 2 or B = 3. • The SHIFT table plays the same role as in the regular Boyer-Moore algorithm, except that it determines • Wu and Manber: check several characters the shi based on the last B characters rather than simultaneously, i.e. increase the alphabet. just one character. 18 02.01.15 Horspool to Wu-Manber Wu-Manber algorithm How do we can increase the length of the shifts? Search for ATGTATG,TATG,ATAAT,ATGTG T G T A A T With a table shift of l-mers with the patterns G G T A ATGTATG,TATG,ATAAT,ATGTG T A A T A AA 1 1 símbol A AT 1 2 símbols GT 1 A 1 into the text: ACATGCTATGTGACATAATA TA 2 C 4 (lmin) AA 1 G 2 AC 3 (LMIN-L+1) TG 2 T 1 AG 3 AT 1 AA 1 CA 3 AT 1 CC 3 GT 1 CG 3 TA 2 … TG 2 … Experimental length: log|Σ| 2*lmin*r Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Backward Oracle • Set Backwards oracle SBDM, SBOM • Pages 68-72 String matching of many pa erns String matching of many pa erns | Σ| | Σ| 8 (5 patterns) 8 (5 patterns) Wu-Manber Wu-Manber 4 4 SBOM SBOM 2 Lmin 2 Lmin 5 10 15 20 25 30 35 40 45 5 10 15 20 25 30 35 40 45 8 8 Wu-Manber Wu-Manber (10 patterns) (10 patterns) 4 4 SBOM SBOM 2 2 5 10 15 20 25 30 35 40 45 5 10 15 20 25 30 35 40 45 (1000 patterns) 8 Wu-Manber 8 SBOM (100 patterns) (100 patterns) 4 4 SBOM 2 2 Slides courtesy of: 5Xavier Messeguer 10 Peypoch (http:// 15www.lsi.upc.es /~alggen 20 ) 25 30 35 40 45 Slides courtesy of: 5Xavier Messeguer 10 Peypoch (http:// 15www.lsi.upc.es /~alggen 20 ) 25 30 35 40 45 19 02.01.15 5 strings 10 strings 100 strings 1000 strings Factor Oracle Factor Oracle: safe shi 20 02.01.15 Factor Oracle: Factor oracle Shift to match prefix of P2? Construc on of factor Oracle Factor oracle • Allauzen, C., Crochemore, M., and Raffinot, M. 1999. Factor Oracle: A New Structure for Pa ern Matching. In Proceedings of the 26th Conference on Current Trends in theory and Prac ce of informa cs on theory and Prac ce of informa cs (November 27 - December 04, 1999). J. Pavelka, G. Tel, and M. Bartosek, Eds. Lecture Notes In Computer Science, vol. 1725. Springer- Verlag, London, 295-310. • h p://portal.acm.org/cita on.cfm? id=647009.712672&coll=GUIDE&dl=GUIDE&CFID=31549541&CFTOKEN=6 1811641# • h p://www-igm.univ-mlv.fr/~allauzen/work/sofsem.ps 21 02.01.15 So far Mul ple Shi -AND • Generalised KMP -> AhoCorasick • P={P1, P2, P3, P4}. Generalize Shi -AND • Generalised Horspool -> CommentzWalter, WuManber • Bits = P4 P3 P2 P1 • BDM, BOM -> Set Backward Oracle Matching… • Start = 1 1 1 1 • Other generalisa ons? • Match= 1 1 1 1 Problem • Given P and S – find all exact or approximate Text Algorithms (6EAP) occurrences of P in S Full text indexing • You are allowed to preprocess S (and P, of course) Jaak Vilo 2010 fall • Goal: to speed up the searches Jaak Vilo MTAT.03.190 Text Algorithms 129 E.g. Dic onary problem Sorted array and binary search • Does P belong to a dic onary D={d1,…,dn} 13 – Build a binary search tree of D 1 – B-Tree of D – Hashing global info – Sor ng + Binary search head search stop • he Build a keyword trie: search in O(|P|) header informal hers – Assuming alphabet has up to a constant size c index happy his show – See Aho-Corasick algorithm, Trie construc on 22 02.01.15 Sorted array and binary search Trie for D={ he, hers, his, she} 0 1 13 h s 1 3 i e h global info search head 2 6 4 he stop r s header informal e hers index happy his show 8 5 s 7 O( |P| log n ) 9 O( |P| ) S != set of words • S of length n Trie(babaab) a b • How to index? babaab b a abaab a a b baab b • Index from every posi on of a text aab a a a ab b b a b • Prefix of every possible suffix is important b Suffix tree Literature on suffix trees • • Defini on: A compact representa on of a trie corresponding to the h p://en.wikipedia.org/wiki/Suffix_tree suffixes of a given string where all nodes with one child are merged with • Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer their parents. Science and Computa onal Biology. Hardcover - 534 pages 1st edi on • Defini on ( suffix tree ). A suffix tree T for a string S (with n = |S|) is a (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. (pages: rooted, labeled tree with a leaf for each non-empty suffix of S. 89--208) Furthermore, a suffix tree sa sfies the following proper es: • E. Ukkonen. On-line construc on of suffix trees. Algorithmica, 14:249-60, • Each internal node, other than the root, has at least two children; 1995. h p://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751 • • Each edge leaving a par cular node is labeled with a non-empty substring Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. "Construc ng Suffix Tree of S of which the first symbol is unique among all first symbols of the edge for Gigabyte Sequences with Megabyte Memory," IEEE Transac ons on labels of the edges leaving this par cular node; Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January, 2005. h p://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3 • For any leaf in the tree, the concatena on of the edge labels on the path • CPM ar cles archive: h p://www.cs.ucr.edu/~stelo/cpm/ from the root to this leaf exactly spells out a non-empty suffix of s. • Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and • Mark Nelson. Fast String Searching With Suffix Trees Dr. Computa onal Biology. Hardcover - 534 pages 1st edi on (January 15, 1997). Dobb's Journal, August, 1996. Cambridge Univ Pr (Short); ISBN: 0521585198. h p://www.dogma.net/markn/ar cles/suffixt/suffixt.htm 23 02.01.15 • STXXL : Standard Template Library for Extra Large Data Sets. http://stackoverflow.com/questions/9452701/ • h p://stxxl.sourceforge.net/ ukkonens-suffix-tree-algorithm-in-plain- english • ANSI C implementa on of a Suffix Tree • h p://yeda.cs.technion.ac.il/~yona/suffix_tree/ These are old links – check for newer… The suffix tree Tree(T) of T Suffix trie and suffix tree abaab • data structure suffix tree, Tree(T), is baab compacted trie that represents all the suffixes aab of string T ab b • linear size: |Tree(T)| = O(|T|) Trie(abaab) Tree(abaab) • can be constructed in linear me O(|T|) a b a baab b • has myriad virtues (A. Apostolico) a a ab a baab • is well-known: 366 000 Google hits b a a b b E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt141 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt142 Algorithms for combinatorial string Suffix tree and suffix array matching? techniques for pa ern analysis in • deep beauty? +- strings • shallow beauty? + • applica ons? ++ • intensive algorithmic miniatures Esko Ukkonen • sources of new problems: Univ Helsinki text processing, DNA, music,… Erice School 30 Oct 2005 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt143 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt144 24 02.01.15 ttttttttttttttgagacggagtctcgctctg tcgcccaggctggagtgcagtggcggg High-throughput genome-scale atctcggctcactgcaagctccgcctcc sequence analysis and mapping cgggttcacgccattctcctgcctcagcc using compressed data tcccaagtagctgggactacaggcgcc Veli Mäkinen structures cgccactacgcccggctaattttttgtattt Department of Computer Science ttagtagagacggggtttcaccgttttagc University of Helsinki cgggatggtctcgatctcctgacctcgtg atccgcccgcctcggcctcccaaagtgc tgggattacaggcgt E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt146 Solu on: backtracking with suffix Analysis of a string of symbols tree • T = h a t t i v a t t i ’text’ • P = a t t ’pa ern’ • Find the occurrences of P in T: h a t t i v a t t i • Pa ern synthesis: #(t) = 4 #(a ) = 2 #(t****t) = 2 ...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG..... E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt147 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 148 Pa ern finding & synthesis problems • T = t1t2 … tn, P = p1 p 2 … pn , strings of symbols in finite alphabet 1. Suffix tree • Indexing problem: Preprocess T (build an index structure) such that the occurrences of different 2. Suffix array pa erns P can be found fast – sta c text, any given pa ern P 3. Some applications • Pa ern synthesis problem: Learn from T new pa erns that occur surprisingly o en 4. Finding motifs • What is a pa ern? Exact substring, approximate substring, with generalized symbols, with gaps, … 149 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt150 25 02.01.15 The suffix tree Tree(T) of T Suffix trie and suffix tree abaab • data structure suffix tree, Tree(T), is baab compacted trie that represents all the suffixes aab of string T ab b • linear size: |Tree(T)| = O(|T|) Trie(abaab) Tree(abaab) • can be constructed in linear me O(|T|) a b a baab b • has myriad virtues (A. Apostolico) a a ab a baab • is well-known: 366 000 Google hits b a a b b E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt151 152 Trie(T) can be large Tree(T) is of linear size • |Trie(T)| = O(|T|2) • only the internal branching nodes and the • bad example: T = anbn leaves represented explicitly • Trie(T) can be seen as a DFA: language • edges labeled by substrings of T accepted = the suffixes of T • v = node(α) if the path from root to v spells α • minimize the DFA => directed cyclic word • one-to-one correspondence of leaves and graph (’DAWG’) suffixes • |T| leaves, hence < |T| internal nodes • |Tree(T)| = O(|T| + size(edge labels)) 153 154 Tree(ha va ) Tree(ha va ) hattivatti hattivatti vatti vatti attivatti attivatti t i vatti t i vatti ttivatti ttivatti ti i ti i tivatti atti i tivatti atti i vatti vatti ivatti hattivatti ivatti hattivatti ivatti ivatti vatti ti vatti ti vatti vatti vatti vatti atti vatti tti atti vatti tti tti atti tti atti tivatti tivatti ti hattivatti ttivatti ti hattivatti ttivatti attivatti attivatti i i substring labels of edges represented as hattivatti pairs of pointers 155 156 26 02.01.15 Tree(ha va ) Tree(T) is full text index hattivatti vatti Tree(T) attivatti i 3,3 6 ttivatti 4,5 10 P tivatti 2,5 i vatti ivatti hattivatti P occurs in T at locations 8, 31, … vatti 9 5 6,10 vatti atti 6,10 8 tti 7 31 8 4 ti 1 3 2 P occurs in T ó P is a prefix of some suffix of T i ó Path for P exists in Tree(T) hattivatti All occurrences of P in time O(|P| + #occ) 157 158 Find a from Tree(ha va ) Linear me construc on of Tree(T) hattivatti vatti attivatti hattivatti t vatti ttivatti attivatti i ti ttivatti tivatti atti i vatti ivatti hattivatti tivatti ivatti vatti ti ivatti Weiner McCreight vatti vatti vatti (1973), (1976) atti vatti tti atti tti atti ’algorithm 7 tivatti tti ti hattivatti ttivatti of the attivatti ti year’ i 2 ’on-line’ algorithm i 159 (Ukkonen 1992) 160 On-line construc on of Trie(T) Trie(abaab) Trie(a) Trie(ab) Trie(aba) • T = t1t2 … tn$ abaa a a b a b baa • Pi = t1t2 … ti i:th prefix of T b b aa a εa • on-line idea: update Trie(Pi) to Trie(Pi+1) a ε • => very simple construc on chain of links connects the end points of current suffixes 161 162 27 02.01.15 Trie(abaab) Trie(abaab) Trie(abaa) Trie(abaa) a a b a b a b a a b a b a b b b b b b b a a a a a a a a a a a a a a Add next symbol = b 163 164 Trie(abaab) Trie(abaab) Trie(abaa) a a b a b a b a a b a b a b b b b b b b a a a a a a a a a a a a a a Trie(abaab) a b b Add next symbol = b a a a b a From here on b-arc already exists a b b 165 166 What happens in Trie(Pi) => Trie(Pi+1) ? What happens in Trie(Pi) => Trie(Pi+1) ? Before • me: O(size of Trie(T)) a From here on the ai-arc i exists already => stop • suffix links: updating here slink(node(aα)) = node(α) After ai ai ai ai ai New suffix links New nodes 167 168 28 02.01.15 On-line procedure for suffix trie Suffix trees on-line 1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix • ’compacted version’ of the on-line trie links slink(v) := root and slink(root) := root construc on: simulate the construc on on the 2. for i := 2 to n do begin linear size tree instead of the trie => me O(| 3. v := leaf of Trie(t …t ) for string t …t (i.e., the deepest leaf) i-1 1 i-1 1 i-1 T|) 4. v := vi-1; v´ := 0 5. while node v has no outgoing arc for ti do begin • all trie nodes are conceptually s ll needed => 6. Create a new node v´´ and an arc son(v,ti) = v´´ implicit and real nodes 7. if v´ ≠ 0 then slink(v) := v´´ 8. v := slink(v); v´ := v´´ end 9. for the node v´´ such that v´´= son(v,ti) do if v´´ = v´ then slink(v’) := root else slink(v´) := v´´ 169 170 Implicit and real nodes Implicit node • Pair (v,α) is an implicit node in Tree(T) if v is a node of Tree and α is a (proper) prefix of the label of some arc from v. If α is the empty … α´ string then (v, α) is a ’real’ node (= v). • v Let v = node(α´) in Tree(T). Then implicit node … (v, α) represents node(α´α) of Trie(T) α (v, α) 171 172 Suffix links and open arcs Big picture root suffix link path traversed: total work O(n) α … aα … slink(v) v … … label [i,*] instead of [i,j] if … … … w is a leaf and j is the scanned position of T w new arcs and nodes created: 173 total work O(size(Tree(T)) 174 29 02.01.15 Suffix-tree on-line: main procedure On-line procedure for suffix tree Create Tree(t1); slink(root) := root (v, α) := (root, ε) /* (v, α) is the start node */ Input: string T = t t … t $ 1 2 n for i := 2 to n+1 do Output: Tree(T) v´ := 0 Notation: son(v,α) = w iff there is an arc from v to w with label α while there is no arc from v with label prefix αti do son(v,ε) = v if α ≠ ε then /* divide the arc w = son(v, αη) into two */ son(v, α) := v´´; son(v´´,ti) := v´´´; son(v´´,η) := w else Function Canonize(v, α): son(v,t ) := v´´´; v´´ := v while son(v, α´) ≠ 0 where α = α´ α´´, | α´| > 0 do i if v´ ≠ 0 then slink(v´) := v´´ v := son(v, α´); α := α´´ v´ := v´´; v := slink(v); (v, α) := Canonize(v, α) return (v, α) if v´ ≠ 0 then slink(v´) := v (v, α) := Canonize(v, αti) /* (v, α) = start node of the next round */ 175 176 The actual me and space • |Tree(T)| is about 20|T| in prac ce • brute-force construc on is O(|T|log|T|) for random http://stackoverflow.com/questions/9452701/ strings as the average depth of internal nodes is ukkonens-suffix-tree-algorithm-in-plain- O(log|T|) english • difference between linear and brute-force construc ons not necessarily large (Giegerich & Kurtz) • truncated suffix trees: k symbols long prefix of each suffix represented (Na et al. 2003) • alphabet independent linear me (Farach 1997) 178 abcabxabcd abc 30 02.01.15 Applica ons of Suffix Trees Back to backtracking • Dan Gusfield: Algorithms on Strings, Trees, and Sequences: ACA, 1 mismatch Computer Science and Computa onal Biology. Hardcover - A T 534 pages 1st edi on (January 15, 1997). Cambridge Univ Pr C (Short); ISBN: 0521585198. - book A T C A AT T A T C C C • APL1: Exact String Matching Search for P from text S. Solu on T T T 1: build STree(S) - one achieves the same O(n+m) as Knuth- 4 2 1 5 6 3 Morris-Pra , for example! • Search from the suffix tree is O(|P|) Same idea can be used to many other forms of approximate search, like • APL2: Exact set matching Search for a set of pa erns P Smith-Waterman, position-restricted scoring matrices, regular expression search, etc. ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 186 31 02.01.15 Applica ons of Suffix Trees Applica ons of Suffix Trees • APL3: substring problem for a database of pa erns • APL4: Longest common substring of two strings Given a set of strings S=S1, ... , Sn --- a database Find • Find the longest common substring of S and T. all Si that have P as a substring • Overall there are poten ally O( n2 ) such substrings, • Generalized suffix tree contains all suffixes of all Si if n is the length of a shorter of S and T • Query in me O(|P|), and can iden fy the LONGEST • Donald Knuth once (1970) conjectured that linear- common prefix of P in all Si me algorithm is impossible. • Solu on: construct the STree(S+T) and find the node deepest in the tree that has suffixes from both S and T in subtree leaves. • Ex: S= superiorcalifornialives T= sealiver have both a substring xxxxxx. Simple analysis task: LCSS Suffix link • Let LCSSA(A,B) denote the longest common substring two sequences A and B. E.g.: X – LCSS(AGATCTATCT,CGCCTCTATG)=TCTAT. • A good solu on is to build suffix tree for the aX shorter sequence and make a descending suffix link suffix walk with the other sequence. ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 189 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 190 Another common tool: Generalized Descending suffix walk suffix tree suffix tree of A Read B left-to-right, A node info: always going down in the subtree size 47813871 tree when possible. C sequence count 87 If the next symbol of B does C not match any edge label on current position, take suffix link, and try again. (Suffix link in the root to itself emits a symbol). The node v encountered with largest string depth v is the solution. ACCTTA....ACCT#CACATT..CAT#TGTCGT...GTA#TCACCACC...C$ ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 191 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 192 32 02.01.15 Generalized suffix tree applica on Case study con nued suffix tree of genome motif? TAC...... T A node info: C subtree size 4398 5 blue blue sequences 12/15 1 red C red sequences 2/62 ..... ...ACC..#...ACC...#...ACC...ACC..ACC..#..ACC..ACC...#...ACC...#... genome ...#....#...#...#...ACC...#...#...#...#...#...#..#..ACC..ACC...#...... #... regions with ChIP-seq matches ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 193 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 194 Proper es of suffix tree... in Proper es of suffix tree prac ce • Suffix tree has n leaves and at most n-1 • Huge overhead due to pointer structure: internal nodes, where n is the total length of – Standard implementa on of suffix tree for human all sequences indexed. genome requires over 200 GB memory! • Each node requires constant number of – A careful implementa on (using log n -bit fields integers (pointers to first child, sibling, parent, for each value and array layout for the tree) s ll requires over 40 GB. text range of incoming edge, sta s cs – counters, etc.). Human genome itself takes less than 1 GB using 2- bits per bp. • Can be constructed in linear me. ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 195 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 196 Applica ons of Suffix Trees Applica ons of Suffix Trees • APL4: Longest common substring of two strings • APL5: Recognizing DNA contamina on Related • Find the longest common substring of S and T. to DNA sequencing, search for longest strings • Overall there are poten ally O( n2 ) such substrings, (longer than threshold) that are present in the if n is the length of a shorter of S and T DB of sequences of other genomes. • Donald Knuth once (1970) conjectured that linear- • APL6: Common substrings of more than two me algorithm is impossible. strings Generaliza on of APL4, can be done in • Solu on: construct the STree(S+T) and find the node linear (in total length of all strings) me deepest in the tree that has suffixes from both S and T in subtree leaves. • Ex: S= superiorcalifornialives T= sealiver have both a substring alive. 33 02.01.15 Applica ons of Suffix Trees Applica ons of Suffix Trees • APL7: Building a directed graph for exact • APL8: A reverse role for suffix trees, and major matching: Suffix graph - directed acyclic word space reduc on Index the pa ern, not tree... graph (DAWG), a smallest finite state • Matching sta s cs. automaton recognizing all suffixes of a string • APL10: All-pairs suffix-prefix matching For all S. This automaton can recognize membership, pairs Si, Sj, find the longest matching suffix- but not tell which suffix was matched. prefix pair. Used in shortest common • Construc on: merge isomorfic subtrees. superstring genera on (e.g. DNA sequence • Isomorfic in Suffix Tree when exists suffix link assembly), EST alignmentm etc. path, and subtrees have equal nr. of leaves. Applica ons of Suffix Trees Applica ons of Suffix Trees • APL11: Finding all maximal repe ve • APL14: Suffix trees in genome-scale projects structures in linear me • APL15: A Boyer-Moore approach to exact set • APL12: Circular string lineariza on e.g. circular matching chemical molecules in the database, one • APL16: Ziv-Lempel data compression wants to lienarize them in a canonical way... • APL17: Minimum length encoding of DNA • APL13: Suffix arrays - more space reduc on will touch that separately Applica ons of Suffix Trees Applica ons of Suffix Trees • Addi onal applica ons Mostly exercises... • APL: Exact matching with wild cards • Extra feature: CONSTANT me lowest common ancestor retrieval (LCA) Andmestruktuur mis võimaldab leida konstantse ajaga alumist ühist • APL: The k-mismatch problem vanemat (see vastab pikimale ühisele prefixile!) on võimalik koostada lineaarse ajaga. • Approximate palindromes and repeats • APL: Longest common extension: a bridge to inexact matching • APL: Finding all maximal palindromes in linear me • Faster methods for tandem repeats Palindrome reads from central posi on the same to le and right. E.g.: kirik, saippuakivikauppias. • A linear- me solu on to the mul ple common • Build the suffix tree of S and inverted S (aabcbad => aabcbad#dabcbaa ) substring problem and using the LCA one can ask for any posi on pair (i, 2i-1), the longest common prefix in constant me. • And many-many more ... • The whole problem can be solved in O(n). 34 02.01.15 Suffixes - sorted hattivatti ε attivatti atti 1. Suffix tree ttivatti attivatti tivatti hattivatti 2. Suffix array ivatti i vatti ivatti 3. Some applications atti ti tti tivatti ti tti 4. Finding motifs i ttivatti ε vatti • Sort all suffixes. Allows to perform binary search! 205 206 Suffix array: example Suffix array construc on: sort! 1 hattivatti ε 11 11 hattivatti 1 11 11 2 attivatti atti 7 7 attivatti 2 7 7 3 ttivatti attivatti 2 2 ttivatti 3 2 2 4 tivatti hattivatti 1 1 tivatti 4 1 1 5 ivatti i 10 10 ivatti 5 10 10 6 vatti ivatti 5 5 vatti 6 5 5 7 atti ti 9 9 atti 7 9 9 8 tti tivatti 4 4 tti 8 4 4 9 ti tti 8 8 ti 9 8 8 10 i ttivatti 3 3 i 10 3 3 11 ε vatti 6 ε 11 6 • suffix array = lexicographic order of the suffixes • suffix array = lexicographic order of the suffixes 207 208 Suffix array Reducing space: suffix array • suffix array SA(T) = an array giving the =[2,2] A T =[3,3] lexicographic order of the suffixes of T C • space requirement: 5|T| =[5,6] A C T =[3,6] T A A A T =[6,6] =[4,6] T C C =[2,6] C • prac oners like suffix arrays (simplicity, T T T space efficiency) suffix 4 2 1 5 6 3 array • theore cians like suffix trees (explicit structure) C A T A C T 1 2 3 4 5 6 209 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 210 35 02.01.15 Suffix array Pa ern search from suffix array hattivatti ε 11 • attivatti atti 7 Many algorithms on suffix tree can be att simulated using suffix array... ttivatti attivatti 2 binary search tivatti hattivatti 1 – ... and couple of addi onal arrays... ivatti i 10 – ... forming so-called enhanced suffix array... vatti ivatti 5 – ... leading to the similar space requirement as atti ti 9 careful implementa on of suffix tree tti tivatti 4 ti tti 8 • Not a sa sfactory solu on to the space issue. i ttivatti 3 ε vatti 6 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 211 212 What we learn today? Recent suffix array construc ons • We learn that it is possible to replace suffix • Manber&Myers (1990): O(|T|log|T|) trees with compressed suffix trees that take • linear me via suffix tree 8.8 GB for the human genome. • January / June 2003: direct linear me • We learn that backtracking can be done using construc on of suffix array compressed suffix arrays requiring only 2.1 GB - Kim, Sim, Park, Park (CPM03) for the human genome. - Kärkkäinen & Sanders (ICALP03) • We learn that discovering interes ng mo f - Ko & Aluru (CPM03) seeds from the human genome takes 40 hours and requires 9.3 GB space. ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 213 214 Kärkkäinen-Sanders algorithm Nota on 1. Construct the suffix array of the suffixes • string T = T[0,n) = t0t1 … tn-1 starting at positions i mod 3 ≠ 0. This is • done by reduction to the suffix array suffix Si = T[i,0) = titi+1 … tn-1 construction of a string of two thirds the • for C \subset [0,n]: SC = {Si | i in C} length, which is solved recursively. 2. Construct the suffix array of the remaining suffixes using the result of the first step. • suffix array SA[0,n] of T is a permuta on of 3. Merge the two suffix arrays into one. [0,n] sa sfying SSA[0] < SSA[1] < … < SSA[n] 215 216 36 02.01.15 Running example Step 0: Construct a sample 0 1 2 3 4 5 6 7 8 9 10 11 • for k = 0,1,2 • T[0,n) = y a b b a d a b b a d o 0 0 … Bk = {i є [0,n] | i mod 3 = k} • C = B1 U B2 sample posi ons • SA = (12,1,6,4,9,3,8,2,7,5,10,11,0) • SC sample suffixes • Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11} 217 218 Step 1: Sort sample suffixes Step 1 (cont.) • for k = 1,2, construct • once the sample suffixes are sorted, assign a rank to Rk = [t t t ] [t t t ]… [t t t ] k k+1 k+2 k+3 k+4 k+5 maxBk maxBk+1 maxBk+2 each: rank(S ) = the rank of S in S ; rank(S ) = i i C n+1 rank(S ) = 0 R = R1 ^ R2 concatena on of R1 and R2 n+2 • Example: Suffixes of R correspond to SC: suffix [titi+1ti+2]… corresponds to Si ; R = [abb][ada][bba][do0][bba][dab][bad][o00] correspondence is order preserving. R´ = (1,2,4,6,4,5,3,7) Sort the suffixes of R: radix sort the characters and rename with SAR´ = (8,0,1,6,4,2,5,3,7) ranks to obtain R´. If all characters different, their order directly rank(S ) - 1 4 - 2 6 - 5 3 – 7 8 – 0 0 gives the order of suffixes. Otherwise, sort the suffixes of R´ using i Kärkkäinen-Sanders. Note: |R´| = 2n/3. 219 220 Step 2: Sort nonsample suffixes Step 3: Merge • merge the two sorted sets of suffixes using a standard • for each non-sample Si є SB0 (note that comparison-based merging: rank(Si+1) is always defined for i є B0): • to compare Si є SC with Sj є SB0, dis nguish two cases: Si ≤ Sj ↔ (ti,rank(Si+1)) ≤ (tj,rank(Sj+1)) • i є B1: S ≤ S ↔ (t ,rank(S )) ≤ (t ,rank(S )) • radix sort the pairs (ti,rank(Si+1)). i j i i+1 j j+1 • i є B2: Si ≤ Sj ↔ (ti,ti+1,rank(Si+2)) ≤ (tj,tj+1,rank(Sj+2)) • Example: S12 < S6 < S9 < S3 < S0 because (0,0) < • note that the ranks are defined in all cases! (a,5) < (a,7) < (b,2) < (y,1) • S1 < S6 as (a,4) < (a,5) and S3 < S8 as (b,a,6) < (b,a,7) 221 222 37 02.01.15 Running me O(n) Implementa on • excluding the recursive call, everything can be • about 50 lines of C++ done in linear me • code available e.g. via Juha Kärkkäinen’s home • the recursion is on a string of length 2n/3 page • thus the me is given by recurrence T(n) = T(2n/3) + O(n) • hence T(n) = O(n) 223 224 LCP table • Longest Common Prefix of successive • Oxford English Disc onary h p://www.oed.com/ • Example - Word of the Day , Fourth elements of suffix array: h p://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/L7_SuffixTrees/wotd_fourth.html h p://www.oed.com/cgi/display/wotd • LCP[i] = length of the longest common prefix • PAT index - by Gaston Gonnet (ta on samu Maple tarkvara üks loojatest ning hiljem molekulaarbioloogia tarkvarapake väljatöötajaid) of suffixes SSA[i] and SSA[i+1] • PAT index is essen ally a suffix array. To save space, indexed only from • build inverse array SA-1 from SA in linear me first character of every word • XML-tagging (or SGML, at that me!) also indexed • then LCP table from SA-1 in linear me (Kasai • To mark certain fields of XML, the bit vectors were used. • Main concern - improve the speed of search on the CD - minimize random et al, CPM2001) accesses. • For slow medium even 15-20 accesses is too slow... • G. H. Gonnet, R. A. Baeza-Yates, and T. Snider, Lexicographical indices for text: Inverted files vs. PAT trees, Technical Report OED-91-01, Centre for 225 the New OED, University of Waterloo, 1991. Suffix tree vs suffix array 1. Suffix tree • suffix tree ó suffix array + LCP table 2. Suffix array 3. Some applications 4. Finding motifs 227 228 38 02.01.15 Substring mo fs of string T Coun ng the substring mo fs • string T = t1 … tn in alphabet A. • internal nodes of Tree(T) ↔ repea ng • Problem: what are the frequently occurring substrings of T (ungapped) substrings of T? Longest substring • number of leaves of the subtree of a node for that occurs at least q mes? string P = number of occurrences of P in T • Thm: Suffix tree Tree(T) gives complete occurrence counts of all substring mo fs of T in O(n) me (although T may have O(n2) substrings!) 229 230 Substring mo fs of ha va Finding repeats in DNA vatti t i vatti • human chromosome 3 4 2 ti i • the first 48 999 930 bases atti i hattivatti 2 2 vatti • 31 min cpu me (8 processors, 4 GB) ivatti 2 ti vatti vatti 9 vatti tti • Human genome: 3x10 bases atti tivatti • Tree(HumanGenome) feasible hattivatti ttivatti attivatti Counts for the O(n) maximal motifs shown 231 232 Longest repeat? Ten occurrences? Occurrences at: 28395980, 28401554r Length: 2559 ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggat ctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcc ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaat gctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaa caagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagt catgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcat agtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacat agagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccg acgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaat cccgcctcggcctcccaaagtgctgggattacaggcgt caccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgc cattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttctttt gagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagt aggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggct tttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatg taagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaagga Length: 277 agggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatc agatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagt atagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttc caattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatga Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, gcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtatttt 47398125 attctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgag actttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctct In the reversed complement at: 17858493, 41463059, 42431718, tttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgt 42580925 gccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaa tttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtac gtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattt tattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatga gttagg 233 234 39 02.01.15 Using suffix trees: approximate Using suffix trees: plagiarism matching • find longest common substring of strings X and Y • edit distance: inser ons, dele ons, changes • build Tree(X$Y) and find the deepest node which has a leaf poin ng to X and another • STOCKHOLM vs TUKHOLMA poin ng to Y 235 236 String distance/similarity func ons Approximate string matching • A: S T O C K H O L M STOCKHOLM vs TUKHOLMA B: STOCKHOLM_ T U K H O L M A _TU_ • KHOLMA minimum number of ’muta on’ steps: a -> b a -> є є -> b … => 2 deletions, 1 insertion, 1 change • dID(A,B) = 5 dLevenshtein(A,B)= 4 • muta on costs => probabilis c modeling • evalua on by dynamic programming => alignment 237 238 Dynamic programming di,j = min(if ai=bj then di-1,j-1 else ∞, di,j = min(if ai=bj then di-1,j-1 else ∞, di-1,j + 1, di-1,j + 1, di,j-1 + 1) di,j-1 + 1) = distance between i-prefix of A and j-prefix of B A\B s t o c k h o l m (substitution excluded) 0 1 2 3 4 5 6 7 8 9 B t 1 2 1 2 3 4 5 6 7 8 mxn table d bj u 2 3 2 3 4 5 6 7 8 9 k 3 4 3 4 5 4 5 6 7 8 h 4 5 4 5 6 5 4 5 6 7 d d i-1,j-1 i-1,j o 5 6 5 4 5 6 5 4 5 6 A +1 a d d l 6 7 6 5 6 7 6 5 4 5 i i,j-1 +1 i,j m 7 8 7 6 7 8 7 6 5 4 a 8 9 8 7 8 9 8 7 6 5 dm,n 239 optimal alignment by trace-back 240 dID(A,B) 40 02.01.15 Search problem Index for approximate searching? • find approximate occurrences of pa ern P in • dynamic programming: P x Tree(T) with backtracking text T: substrings P’ of T such that d(P,P’) small • dyn progr with small modifica on: O(mn) Tree(T) • lots of (prac cal) improvement tricks P T P’ P 241 242 41