Text Algorithms
Jaak Vilo 2015 fall
Jaak Vilo, MTAT.03.190 Text Algorithms

Topics
• Exact matching of one pattern (string)
• Exact matching of multiple patterns
• Suffix trie and tree indexes
  – Applications
• Suffix arrays
• Inverted index
• Approximate matching

Exact pattern matching
• S = s1 s2 … sn (text), |S| = n (length)
• P = p1 p2 … pm (pattern), |P| = m
• Σ – alphabet, |Σ| = c
• Does S contain P?
  – Does S = S' P S" for some strings S' and S"?
  – Usually m << n, and n can be (very) large
Find occurrences in text
P
S

Algorithms

One-pattern:
• Brute force
• Knuth-Morris-Pratt
• Karp-Rabin
• Shift-OR, Shift-AND
• Boyer-Moore
• Factor searches

Multi-pattern:
• Aho-Corasick
• Commentz-Walter

Indexing:
• Trie (and suffix trie)
• Suffix tree

Animations
• http://www-igm.univ-mlv.fr/~lecroq/string/
• EXACT STRING MATCHING ALGORITHMS – Animation in Java. Christian Charras, Thierry Lecroq, Laboratoire d'Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, FRANCE
• e-mails: {Christian.Charras, Thierry.Lecroq} @laposte.net

Brute force: BAB in text?
ABACABABBABBBA
BAB

Brute Force
P    i .. i+j-1
S
          j
Identify the first mismatch!

Question:
§ Problems of this method?
§ Ideas to improve the search?

Brute force
Algorithm Naive
Input: Text S[1..n] and pattern P[1..m]
Output: All positions i where P occurs in S

for( i=1 ; i <= n-m+1 ; i++ )
    for( j=1 ; j <= m ; j++ )
        if( S[i+j-1] != P[j] ) break ;
    if ( j > m ) print i ;

Searching for GCAGAGAG:

attempt 1: gcatcgcagagagtatacagtacg
           GCAg....
attempt 2: g.......
attempt 3: g.......
attempt 4: g.......
attempt 5: g.......
attempt 6: gcatcGCAGAGAGtatacagtacg   (match)
attempt 7: g.......

Brute force or NaiveSearch
1 function NaiveSearch(string s[1..n], string sub[1..m])
2     for i from 1 to n-m+1
3         for j from 1 to m
4             if s[i+j-1] ≠ sub[j]
5                 jump to next iteration of outer loop
6         return i
7     return not found

C code

int bf_2( char* pat, char* text, int n )   /* n = textlen */
{
    int m, i, j ;
    int count = 0 ;
    m = strlen(pat);
    for( i=0 ; i + m <= n ; i++ ) {
        for( j=0 ; j < m && pat[j] == text[i+j] ; j++ ) ;
        if( j == m ) count++ ;
    }
    return(count);
}

C code

int bf_1( char* pat, char* text )
{
    int m ;
    int count = 0 ;
    char *tp ;
    m = strlen(pat);
    for( tp=text ; *tp ; tp++ ) {
        if( strncmp( pat, tp, m ) == 0 ) { count++ ; }
    }
    return( count );
}

Main problem of Naive
P    i .. i+j-1
S
          j
• For the next possible location of P, check again the same positions of S

Goals
• Make sure only a constant number of comparisons/operations is made for each position in S
  – Move (only) from left to right in S
  – How?
  – After a test S[i] <> P[j], what do we know?

Knuth-Morris-Pratt
D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in Strings. SIAM Journal on Computing 6:323-350, 1977.
• Make sure that no comparisons are “wasted”

x ≠ y
• After such a mismatch we already know exactly the values of the green area in S!

• Make sure that no comparisons are “wasted”

prefix x ≠ prefix y ≠ z
• For each prefix of P: the longest suffix of that prefix that is also a prefix of the pattern
• Example: ABCABD

Automaton for ABCABD
NOT A
     A    B    C    A    B    D
1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7

Fail:     0 1 1 1 2 3 1
Pattern:    A B C A B D
            1 2 3 4 5 6

KMP matching
Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if it exists)

i=1; j=1; initfail(P)          // prepare fail links
repeat
    if j==0 or S[i] == P[j]
        then i++ , j++         // advance in text and in pattern
        else j = fail[j]       // use fail link
until j>m or i>n
if j>m then report match at i-m

Initialization of fail links
Algorithm: KMP_Initfail
Input: Pattern P[1..m]
Output: fail[] for pattern P

i=1, j=0, fail[1]=0
repeat
    if j==0 or P[i] == P[j]
        then i++ , j++ , fail[i] = j
        else j = fail[j]
until i>=m

Initialization of fail links, step by step for P = ABCABD:

Fail: 0
Fail: 0 1
Fail: 0 1 1 1
Fail: 0 1 1 1 2
Fail: 0 1 1 1 2 3

ABCABD

Time complexity of KMP matching?
Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if it exists)

i=1; j=1; initfail(P)          // prepare fail links
repeat
    if j==0 or S[i] == P[j]
        then i++ , j++         // advance in text and in pattern
        else j = fail[j]       // use fail link
until j>m or i>n
if j>m then report match at i-m

Analysis of time complexity
• At every cycle either i and j increase by 1,
• or j decreases (j = fail[j])
• i can increase n (or m) times
• Q: How often can j decrease?
  – A: no more often than i increases
• Amortised analysis: O(n), preprocessing O(m)

Karp-Rabin
R. Karp and M. Rabin: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (1987), 249-260.
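To make the preceding KMP discussion concrete, here is a compact C sketch of the matcher (a 0-based prefix-function variant of the fail-link scheme above; an illustration, not the lecture's exact code, and all names are ours):

```c
#include <string.h>

/* KMP, 0-based prefix-function variant of the fail-link scheme above.
   pi[i] = length of the longest proper border of P[0..i].
   Returns the number of occurrences of P in S. */
int kmp_count(const char *S, const char *P)
{
    int n = strlen(S), m = strlen(P);
    if (m == 0 || m > n) return 0;
    int pi[m];
    int k = 0, count = 0;
    pi[0] = 0;
    for (int i = 1; i < m; i++) {          /* preprocess P: O(m) */
        while (k > 0 && P[i] != P[k]) k = pi[k-1];
        if (P[i] == P[k]) k++;
        pi[i] = k;
    }
    k = 0;
    for (int i = 0; i < n; i++) {          /* scan S: amortised O(n) */
        while (k > 0 && S[i] != P[k]) k = pi[k-1];
        if (S[i] == P[k]) k++;
        if (k == m) { count++; k = pi[k-1]; }
    }
    return count;
}
```

The two loops mirror the amortised argument above: k can only decrease through pi, and no more often than it increases.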
• Compare in O(1) a hash of P with a hash of the current text window S[i..i+m-1]:

    h(T[i..i+m-1])   vs   h(P[1..m])

• Goal: O(n) overall.
• So the update f: h(T[i..i+m-1]) -> h(T[i+1..i+m]) must take O(1):
  on the next step the window slides one position right and h(T[i+1..i+m]) is compared with h(P).

Hash
• “Remove” the effect of T[i] and “introduce” the effect of T[i+m] – in O(1)
• Use base-|Σ| arithmetic and treat characters as numbers
• In case of a hash match – check all m positions
• Hash collisions => worst case O(nm)

Let’s use numbers
• T = 57125677
• P = 125 (and for simplicity, h = 125)
• H(T[1]) = 571
• H(T[2]) = (571 - 5*100)*10 + 2 = 712
• H(T[3]) = (H(T[2]) - ord(T[2])*10^(m-1))*10 + T[3+m-1] = (712 - 7*100)*10 + 5 = 125
• In general: H(T[i+1]) = (H(T[i]) - ord(T[i])*10^(m-1))*10 + T[i+m]

hash
• c – size of the alphabet
• HS_i = H( S[i..i+m-1] )
• H( S[i+1..i+m] ) = ( HS_i - ord(S[i])*c^(m-1) ) * c + ord( S[i+m] )
• Modular arithmetic – to fit the value into a machine word!
• hash(w[0..m-1]) = ( w[0]*2^(m-1) + w[1]*2^(m-2) + ··· + w[m-1]*2^0 ) mod q

Karp-Rabin
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
1.  c = 20          /* size of the alphabet, e.g. nr. of amino acids */
2.  q = 33554393    /* q is a prime */
3.  cm = c^(m-1) mod q
4.  hp = 0 ; hs = 0
5.  for i = 1..m do hp = ( hp*c + ord(P[i]) ) mod q    // H(P)
6.  for i = 1..m do hs = ( hs*c + ord(S[i]) ) mod q    // H(S[1..m])
7.  if hp == hs and P == S[1..m] report match at position 1
8.  for i = 2..n-m+1
9.      hs = ( (hs - ord(S[i-1])*cm) * c + ord(S[i+m-1]) ) mod q
10.     if hp == hs and P == S[i..i+m-1]
11.         report match at position i

More ways to ensure O(n)?

Shift-AND / Shift-OR
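The Karp-Rabin pseudocode above can be turned into the following C sketch (the prime q is the slide's; the base c = 256 over raw bytes is our simplification, and all names are ours):

```c
#include <string.h>

/* Karp-Rabin with a rolling hash mod a prime, following the
   pseudocode above. Every hash hit is verified with memcmp. */
int rk_count(const char *S, const char *P)
{
    long long c = 256, q = 33554393;                     /* base and prime */
    int n = strlen(S), m = strlen(P);
    if (m == 0 || m > n) return 0;
    long long cm = 1, hp = 0, hs = 0;
    for (int i = 0; i < m - 1; i++) cm = (cm * c) % q;   /* c^(m-1) mod q */
    for (int i = 0; i < m; i++) {
        hp = (hp * c + (unsigned char)P[i]) % q;         /* H(P)        */
        hs = (hs * c + (unsigned char)S[i]) % q;         /* H(S[1..m])  */
    }
    int count = 0;
    for (int i = 0; ; i++) {
        if (hp == hs && memcmp(S + i, P, m) == 0)        /* verify on hash hit */
            count++;
        if (i + m >= n) break;
        long long t = (hs - (unsigned char)S[i] * cm) % q;  /* remove S[i]   */
        hs = ((t + q) * c + (unsigned char)S[i + m]) % q;   /* add S[i+m]    */
    }
    return count;
}
```

The `+ q` before the final reduction keeps the intermediate value non-negative, since C's `%` can return a negative remainder.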
• Ricardo Baeza-Yates, Gaston H. Gonnet: A new approach to text searching. Communications of the ACM, October 1992, Volume 35, Issue 10. [ACM Digital Library: http://doi.acm.org/10.1145/135239.135243] [DOI] [PDF]

Bit-operations
• Maintain a set of all prefixes that have so far had a perfect match
• On the next character of the text, update all previous pointers to a new set
• Bit vector: one for every possible character

State: which prefixes match?
0 1 0 0 1

Move to next: track positions of prefix matches
0 1 0 1 0 1
1 0 1 0 1 1    <- shift left <<
1 0 0 0 1 1    <- mask on char T[i], bitwise AND

Vectors for every char in Σ
• P = aste

Character vectors: B[a] = 1 0 0 0,  B[s] = 0 1 0 0,  B[t] = 0 0 1 0,  B[e] = 0 0 0 1,
all other characters (b, c, d, …, z) = 0 0 0 0.

• T = lasteaed

          l a s t e a e d
  a:      0 1 0 0 0 1 0 0
  as:     0 0 1 0 0 0 0 0
  ast:    0 0 0 1 0 0 0 0
  aste:   0 0 0 0 1 0 0 0    <- full match ends at position 5
http://www-igm.univ-mlv.fr/~lecroq/string/node6.html
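The bit-vector scheme illustrated above (one mask per character, shift left, AND) fits in a few lines of C; a sketch for patterns up to 64 characters, with names of our choosing:

```c
#include <string.h>

/* Shift-AND (Baeza-Yates & Gonnet): bit j of state D is 1 iff the
   pattern prefix P[0..j] matches the text ending at the current
   position. Handles patterns of up to 64 characters here; a sketch. */
int shift_and_count(const char *S, const char *P)
{
    int n = strlen(S), m = strlen(P);
    if (m == 0 || m > 64) return 0;
    unsigned long long mask[256] = {0}, D = 0;
    for (int j = 0; j < m; j++)
        mask[(unsigned char)P[j]] |= 1ULL << j;    /* one bit vector per character */
    int count = 0;
    for (int i = 0; i < n; i++) {
        D = ((D << 1) | 1ULL) & mask[(unsigned char)S[i]];
        if (D & (1ULL << (m - 1))) count++;        /* top bit set: full match */
    }
    return count;
}
```

Each text character costs one shift, one OR and one AND, regardless of the pattern, which is exactly the O(n) behaviour claimed in the summary below.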
[A]  1 1 0 1 0 1 0 1

Summary
Algorithm            Worst case   Avg. case               Preprocess
Brute force          O(mn)        O(n·(1 + 1/|Σ| + ..))
Knuth-Morris-Pratt   O(n)         O(n)                    O(m)
Rabin-Karp           O(mn)        O(n)                    O(m)
Boyer-Moore                       O(n/m)                  ?
BM Horspool
Factor search
Shift-OR             O(n)         O(n)                    O(m|Σ|)

• R. Boyer, S. Moore: A fast string searching algorithm. CACM 20 (1977), 762-772 [PDF]
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
Find occurrences in text
P
S
• Have we missed anything?

Find occurrences in text
P
S    ABCDE
     BBCDE
• What have we learned if we test for a potential match from the end?

Find occurrences in text
P
S    A    B
Bad character heuristics: maximal shift on S[i]

P                 S[i]
S     A     X     B
S           X
            X     first occurrence of X in the pattern (from the end)

delta1( S[i] ):
  – m, if the pattern does not contain S[i]
  – m-j for the largest j such that P[j] == S[i]

void bmInitocc()
{
    char a; int j;
    for( a=0 ; a < alphabetsize ; a++ ) occ[a] = -1;
    for( j=0 ; j < m ; j++ ) { a = p[j]; occ[a] = j; }
}

Good suffix heuristics

P
S     A     µ     B
1. S        µ          µ
2. S        µ’         µ

delta2( S[i] ) – the minimal shift such that the matched region is fully covered,
or such that a suffix of the match is also a prefix of P.

Boyer-Moore algorithm

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

preprocess_BM()   // delta1 and delta2
i=m
while i <= n
    for( j=m ; j>0 and P[j]==S[i-m+j] ; j-- ) ;
    if j==0 report match at position i-m+1
    i = i + max( delta1[ S[i] ], delta2[ j ] )

• http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
• Animation: http://www-igm.univ-mlv.fr/~lecroq/string/

Simplifications of BM
• There are many variants of Boyer-Moore, and many scientific papers.
• On average the time complexity is sublinear.
• Algorithm speed can be improved while simplifying the code.
• It is useful to use the last-character heuristics (Horspool (1980), Baeza-Yates (1989), Hume and Sunday (1991)).

Algorithm BMH (Boyer-Moore-Horspool)
• R. N. Horspool – Practical Fast Searching in Strings. Software – Practice and Experience, 10(6):501-506, 1980

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
1. for a in Σ do delta[a] = m
2. for j=1..m-1 do delta[ P[j] ] = m-j
3. i=m
4. while i <= n
5.     if S[i] == P[m]
6.         j = m-1
7.         while ( j>0 and P[j]==S[i-m+j] ) j = j-1 ;
8.         if j==0 report match at i-m+1
9.     i = i + delta[ S[i] ]

String matching: Horspool algorithm
• How is the comparison made?
  From right to left: suffix search.
• What is the next position of the window?
  It depends on where the last letter of the text window, say ‘a’, appears in the pattern.
  Hence a preprocessing step determines the length of the shift.
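The BMH pseudocode above translates almost line-for-line into C (0-based indexing; a sketch that counts rather than prints matches):

```c
#include <string.h>

/* Boyer-Moore-Horspool. delta[a] = distance from the last occurrence
   of a in P[0..m-2] to the end of the pattern; m for characters that
   do not occur in the pattern. Returns the number of occurrences. */
int bmh_count(const char *S, const char *P)
{
    int n = strlen(S), m = strlen(P);
    if (m == 0 || m > n) return 0;
    int delta[256];
    for (int a = 0; a < 256; a++) delta[a] = m;
    for (int j = 0; j < m - 1; j++) delta[(unsigned char)P[j]] = m - 1 - j;
    int count = 0, i = m - 1;              /* i: text index aligned with P[m-1] */
    while (i < n) {
        int j = m - 1, k = i;
        while (j >= 0 && P[j] == S[k]) { j--; k--; }
        if (j < 0) count++;
        i += delta[(unsigned char)S[i]];   /* shift on the window's last character */
    }
    return count;
}
```

Note that the shift is always taken on S[i], the character under the last pattern position, whether or not the window matched.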
Algorithm Boyer-Moore-Horspool-Hume-Sunday (BMHHS)
• Use delta in a tight loop
• If match (delta==0), then check, and apply the original delta d

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
1.  for a in Σ do delta[a] = m
2.  for j=1..m-1 do delta[ P[j] ] = m-j
3.  d = delta[ P[m] ];     // memorize d on P[m]
4.  delta[ P[m] ] = 0;     // ensure delta on match of last char is 0
5.  for ( i=m ; i <= n ; i = i+d )
6.      repeat             // skip loop
7.          t = delta[ S[i] ] ; i = i + t
8.      until t==0
9.      for( j=m-1 ; j>0 and P[j]==S[i-m+j] ; j = j-1 ) ;
10.     if j==0 report match at i-m+1

BMHHS requires that the text is padded by P: S[n+1]..S[n+m] = P
(so that the algorithm finishes correctly – there is at least one occurrence!).

• Daniel M. Sunday: A very fast substring search algorithm [PDF]. Communications of the ACM, August 1990, Volume 33, Issue 8
• Loop unrolling: avoid too many loop iterations (each loop requires tests) by repeating code within the loop.
• Line 7 in the previous algorithm can be replaced by:
  7.  i += delta[ S[i] ]; i += delta[ S[i] ]; i += (t = delta[ S[i] ]) ;

Forward-Fast-Search: Another Fast Variant of the Boyer-Moore String Matching Algorithm
• The Prague Stringology Conference '03
• Domenico Cantone and Simone Faro
• Abstract: We present a variation of the Fast-Search string matching algorithm, a recent member of the large family of Boyer-Moore-like algorithms, and we compare it with some of the most effective string matching algorithms, such as Horspool, Quick Search, Tuned Boyer-Moore, Reverse Factor, Berry-Ravindran, and Fast-Search itself. All algorithms are compared in terms of run-time efficiency, number of text character inspections, and number of character comparisons. It turns out that our new proposed variant, though not linear, achieves very good results especially in the case of very short patterns or small alphabets.
• http://cs.felk.cvut.cz/psc/event/2003/p2.html
• PS.gz (local copy)

Factor based approach
• Optimal average-case algorithms
  – Assuming independent characters with equal probability
• Factor – a substring of a pattern
  – Any substring
  – (how many?)

Factor searches
P        u
S        X
Do not compare characters, but find the longest match to any subregion of the pattern.

Examples
• Backward DAWG Matching (BDM) – Crochemore et al 1994
• Backward Nondeterministic DAWG Matching (BNDM) – Navarro, Raffinot 2000
• Backward Oracle Matching (BOM) – Allauzen, Crochemore, Raffinot 2001

Backward DAWG Matching (BDM)
A suffix automaton recognises all factors (and suffixes) in O(n).

BNDM – simulate using bit-parallelism: bits show where the factors have occurred so far.
BNDM matches an NDA (an NDA on the suffixes of ‘announce’).
The deterministic version of the same: the Backward Factor Oracle.
BNDM – Backward Non-Deterministic DAWG Matching; BOM – Backward Oracle Matching.

String matching of one pattern
CTACTACTACGTCTATACTGATCGTAGC
TACTACGGTATGACTAA
1. Prefix search
2. Suffix search
3. Factor search

Multiple patterns: S, {P}

Why?
• Multiple patterns
• Highlight multiple different search words on a page
• Virus detection – filter for virus signatures
• Spam filters
• A scanner in a compiler needs to search for multiple keywords
• Filter out stop words or disallowed words
• Intrusion detection software
• Next-generation sequencing produces huge amounts (many millions) of short reads (20-100 bp) that need to be mapped to a genome!
• …

Algorithms
• Aho-Corasick (search for multiple words)
  – Generalization of Knuth-Morris-Pratt
• Commentz-Walter
  – Generalization of Boyer-Moore & AC
• Wu and Manber – improvement over C-W
• Additional methods, tricks and techniques

Aho-Corasick (AC)
• Alfred V. Aho and Margaret J.
Corasick (Bell Labs, Murray Hill, NJ): Efficient string matching. An aid to bibliographic search. Communications of the ACM, Volume 18, Issue 6, p. 333-340 (June 1975) • ACM:DOI PDF
• ABSTRACT: This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.

References:
• Generalization of KMP for many patterns
• Text S as before
• Set of patterns P = { P1, ..., Pk }
• Total length |P| = m = Σ i=1..k mi
• Problem: find all occurrences of any of the Pi ∈ P in S

Idea
1. Create an automaton from all patterns
2. Match the automaton
• Use the PATRICIA trie for creating the main structure of the automaton

PATRICIA trie
• D. R. Morrison, "PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric", Journal of the ACM 15 (1968) 514-534.
• Abstract: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It does not require rearrangement of text or index as new material is added. It requires a minimum restriction of format of text and of keys; it is extremely flexible in the variety of keys it will respond to.
It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others. • ACM:DOI PDF

• A word trie is a good data structure to represent a set of words (e.g. a dictionary).
• trie (data structure)
• Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
• See also: digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
• Note: The name comes from retrieval and is pronounced "tree."
• To test for a word p, only O(|p|) time is used, no matter how many words are in the dictionary.

Trie for P={he, she, his, hers}
(figure: partial trie with the words he and she)
Trie for P={he, she, his, hers}
(figure: full trie, states 0-9: h-e (2), h-i-s (7), h-e-r-s (9), s-h-e (5))

How to search for words like he, sheila, hi? Do these occur in the trie?

Aho-Corasick
1. Create an automaton M_P for a set of strings P.
2. Finite state machine: read a character from the text, and change the state of the automaton based on the state transitions...
3. Main links: goto[j,c] – read a character c from the text and go from state j to state goto[j,c].
4. If there is no goto[j,c] link on character c from state j, use fail[j].
5. Report the output: report all words that have been found in state j.

AC Automaton (vs KMP)
(figure: the trie above extended with fail links; a NOT {h, s} self-loop at the root;
 goto[1,i] = 6; fail[7] = 3; fail[8] = 0; fail[5] = 2)
Output table
state   output[j]
2       he
5       she, he
7       his
9       hers

AC – matching
Input: Text S[1..n] and an AC automaton M for pattern set P
Output: Occurrences of patterns from P in S (last position)
1. state = 0
2. for i = 1..n do
3.     while (goto[state,S[i]] == ∅) and (fail[state] != state) do
4.         state = fail[state]
5.     state = goto[state,S[i]]
6.     if ( output[state] not empty )
7.         then report matches output[state] at position i

Algorithm Aho-Corasick preprocessing I (TRIE)
Input: P = { P1, ..., Pk }
Output: goto[] and partial output[]
Assume: output(s) is empty when a state s is created; goto[s,a] is not defined.

begin
10.  news = 0
11.  for i=1 to k do enter( Pi )
12.  for a ∈ Σ do
13.      if goto[0,a] = ∅ then goto[0,a] = 0
end

procedure enter(a1, ..., am)    /* Pi = a1 ... am */
begin
1.  s=0; j=1;
2.  while goto[s,aj] ≠ ∅ do     // follow existing path
3.      s = goto[s,aj];
4.      j = j+1;
5.  for p=j to m do             // add new path (states)
6.      news = news+1;
7.      goto[s,ap] = news;
8.      s = news;
9.  output[s] = a1, ..., am
end

Preprocessing II for AC (FAIL)
queue = ∅
for a ∈ Σ do
    if goto[0,a] ≠ 0 then
        enqueue( queue, goto[0,a] )
        fail[ goto[0,a] ] = 0
while queue ≠ ∅
    r = take( queue )
    for a ∈ Σ do
        if goto[r,a] ≠ ∅ then
            s = goto[ r, a ]
            enqueue( queue, s )          // breadth-first search
            state = fail[r]
            while goto[state,a] = ∅ do state = fail[state]
            fail[s] = goto[state,a]
            output[s] = output[s] + output[ fail[s] ]

Correctness
• Let string t "point" from the initial state to state j.
• Must show that fail[j] points to the longest suffix that is also a prefix of some word in P.
• See the article...

AC matching time complexity
• Theorem: For matching M_P on text S, |S| = n, fewer than 2n transitions within M are made.
• Proof: Compare to KMP.
  – There are at most n goto steps.
  – There cannot be more than n fail steps.
  – In total, fewer than 2n transitions in M.

Individual node (goto)
• Full table
• List
• Binary search tree(?)
• Some other index?
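As a concrete companion to the goto/fail/output pseudocode above, here is a compact C sketch using the "full table" node representation, combined with the idea (mentioned below as "Advanced AC") of precalculating every transition during the breadth-first pass. Capacities and names are ours; this is an illustration, not the paper's code:

```c
#include <string.h>

/* Aho-Corasick with full goto tables. go[s][a] = next state,
   fail_[s] = fail link, out[s] = number of patterns ending at s
   (including those reached via fail links). Fixed capacity. */
#define MAXS 1000
static int go[MAXS][256], fail_[MAXS], out[MAXS], nstates;

void ac_build(const char **pats, int k)
{
    memset(go, -1, sizeof go);
    memset(out, 0, sizeof out);
    nstates = 1;                             /* state 0 = root */
    for (int p = 0; p < k; p++) {            /* preprocessing I: build the trie */
        int s = 0;
        for (const char *c = pats[p]; *c; c++) {
            unsigned char a = (unsigned char)*c;
            if (go[s][a] < 0) go[s][a] = nstates++;
            s = go[s][a];
        }
        out[s]++;
    }
    int queue[MAXS], head = 0, tail = 0;
    for (int a = 0; a < 256; a++) {          /* depth-1 states fail to the root */
        if (go[0][a] < 0) go[0][a] = 0;
        else { fail_[go[0][a]] = 0; queue[tail++] = go[0][a]; }
    }
    while (head < tail) {                    /* preprocessing II: BFS fail links */
        int r = queue[head++];
        for (int a = 0; a < 256; a++) {
            int s = go[r][a];
            if (s < 0) { go[r][a] = go[fail_[r]][a]; continue; }  /* precompute */
            fail_[s] = go[fail_[r]][a];
            out[s] += out[fail_[s]];         /* inherit outputs via fail link */
            queue[tail++] = s;
        }
    }
}

int ac_count(const char *text)               /* total pattern occurrences */
{
    int s = 0, count = 0;
    for ( ; *text; text++) {
        s = go[s][(unsigned char)*text];     /* exactly one transition per char */
        count += out[s];
    }
    return count;
}
```

Because every transition is precomputed, matching makes exactly one table lookup per text character, i.e. n transitions instead of the up-to-2n goto/fail steps of the basic automaton.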
AC thoughts
• Scales to many strings simultaneously.
• For very many patterns, search time (of grep) improves(??)
  – See the Wu-Manber article
• When k grows, more fail[] transitions are made (why?)
  – But always fewer than n.
• If all goto[j,a] are indexed in an array, then the size is |M_P|·|Σ|, and the running time of AC is O(n).
• When k and c are big, one can use lists or trees for storing the transition functions.
  – Then, O( n log(min(k,c)) ).

Advanced AC
• Precalculate the next state transition correctly for every possible character of the alphabet
• Can be good for short patterns

Problems of AC?
• Need to rebuild on adding/removing patterns
• Details of branching in each node(?)

Commentz-Walter
• Generalization of Boyer-Moore for multiple-pattern search
• Beate Commentz-Walter: A String Matching Algorithm Fast on the Average. Proceedings of the 6th Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, Vol. 71, 1979, pp. 118-132, Springer-Verlag
• http://www.fh-albsig.de/win/personen/professoren.php?RID=36
• You can download here my algorithm StringMatchingFastOnTheAverage (PDF, ~17.2 MB) or here StringMatchingFastOnTheAverage (extended abstract) (PDF, ~3 MB)

C-W description
• Aho and Corasick [AC75] presented a linear-time algorithm for this problem, based on an automata approach. This algorithm serves as the basis for the UNIX tool fgrep. A linear-time algorithm is optimal in the worst case, but as the regular string-searching algorithm by Boyer and Moore [BM77] demonstrated, it is possible to actually skip a large portion of the text while searching, leading to faster than linear algorithms in the average case.
• Commentz-Walter [CW79] presented an algorithm for the multi-pattern matching problem that combines the Boyer-Moore technique with the Aho-Corasick algorithm. The Commentz-Walter algorithm is substantially faster than the Aho-Corasick algorithm in practice.
Hume [Hu91] designed a tool called gre based on this algorithm, and version 2.0 of fgrep by the GNU project [Ha93] uses it.
• Baeza-Yates [Ba89] also gave an algorithm that combines the Boyer-Moore-Horspool algorithm [Ho80] (which is a slight variation of the classical Boyer-Moore algorithm) with the Aho-Corasick algorithm.

Idea of C-W
• Build a backward trie of all keywords
• Match from the end until a mismatch...
• Determine the shift based on a combination of heuristics

Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Build the trie of the inverted patterns
2. lmin = 4
3. Table of shifts:   A 1   C 4 (lmin)   G 2   T 1
4. Start the search on the text ACATGCTATGTGACA…
   (the search window slides over the text, shifting by the table value of the
   window's last character – note the short shifts!)
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

What are the possible limitations of C-W?
• Many patterns, small alphabet – minimal skips
• What can be done differently?

Wu-Manber
• Wu S., and U. Manber, "A Fast Algorithm for Multi-Pattern Searching," Technical Report TR-94-17, Department of Computer Science, University of Arizona (May 1993).
• Citeseer: http://citeseer.ist.psu.edu/wu94fast.html [Postscript]
• We present a different approach that also uses the ideas of Boyer and Moore. Our algorithm is quite simple, and the main engine of it is given later in the paper. An earlier version of this algorithm was part of the second version of agrep [WM92a, WM92b], although the algorithm has not been discussed in [WM92b] and only briefly in [WM92a]. The current version is used in glimpse [MW94]. The design of the algorithm concentrates on typical searches rather than on worst-case behavior. This allows us to make some engineering decisions that we believe are crucial to making the algorithm significantly faster than other algorithms in practice.

Key idea
• The main problem with Boyer-Moore and many patterns: the more patterns there are, the shorter the possible shifts become...
• Wu and Manber: check several characters simultaneously, i.e. increase the alphabet.
• Instead of looking at characters from the text one by one, consider them in blocks of size B.
• B ≈ log_c 2M; in practice, use either B = 2 or B = 3.
• The SHIFT table plays the same role as in the regular Boyer-Moore algorithm, except that it determines the shift based on the last B characters rather than just one character.

Horspool to Wu-Manber
How can we increase the length of the shifts?
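For reference, the single-character shift table used in the Horspool frames above (whose short shifts motivate this question) can be computed as follows. This is a hedged C sketch of our own: the shift of a character is its minimal distance from a pattern end within the last lmin characters of any pattern (final character excluded), and lmin otherwise:

```c
#include <string.h>

/* Horspool-style shift table for a set of patterns.
   shift[a] = min over patterns of the distance of character a from the
   pattern end, within the last lmin characters (last char excluded);
   characters never seen there get the maximal shift lmin. */
void multi_shift(const char **pats, int k, int shift[256])
{
    int lmin = strlen(pats[0]);
    for (int p = 1; p < k; p++) {            /* lmin = shortest pattern length */
        int m = strlen(pats[p]);
        if (m < lmin) lmin = m;
    }
    for (int a = 0; a < 256; a++) shift[a] = lmin;
    for (int p = 0; p < k; p++) {
        int m = strlen(pats[p]);
        for (int d = 1; d < lmin; d++) {     /* d = distance from pattern end */
            unsigned char a = (unsigned char)pats[p][m - 1 - d];
            if (d < shift[a]) shift[a] = d;
        }
    }
}
```

On the slide's pattern set {ATGTATG, TATG, ATAAT, ATGTG} this reproduces the table A 1, C 4, G 2, T 1 — with only four DNA letters almost every shift is 1 or 2, which is exactly the limitation Wu-Manber's block idea addresses.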
With a table of shifts on l-mers. For the patterns ATGTATG, TATG, ATAAT, ATGTG:

1 symbol             2 symbols
A 1                  AA 1
C 4 (lmin)           AC 3 (lmin-l+1)
G 2                  AG 3
T 1                  AT 1
                     CA 3
                     CC 3
                     CG 3
                     GT 1
                     TA 2
                     TG 2
                     …
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACATAATA,
using the 2-symbol shifts AA 1, AT 1, GT 1, TA 2, TG 2, …
Experimental block length: log_|Σ| (2·lmin·r)

Backward Oracle
• Set Backward Oracle: SBDM, SBOM
• Pages 68-72

String matching of many patterns
(figure: running time vs. lmin for |Σ| = 8, comparing Wu-Manber and SBOM
 for 5, 10, 100 and 1000 patterns)

Factor Oracle
• Factor Oracle: safe shift
• Factor Oracle: shift to match a prefix of P2?
• Construction of the factor Oracle
• Allauzen, C., Crochemore, M., and Raffinot, M. 1999. Factor Oracle: A New Structure for Pattern Matching. In Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics (November 27 - December 04, 1999). J. Pavelka, G. Tel, and M. Bartosek, Eds. Lecture Notes in Computer Science, vol. 1725. Springer-Verlag, London, 295-310.
• http://portal.acm.org/citation.cfm?id=647009.712672&coll=GUIDE&dl=GUIDE&CFID=31549541&CFTOKEN=61811641#
• http://www-igm.univ-mlv.fr/~allauzen/work/sofsem.ps

So far
• Generalised KMP -> Aho-Corasick
• Generalised Horspool -> Commentz-Walter, Wu-Manber
• BDM, BOM -> Set Backward Oracle Matching…
• Other generalisations?

Multiple Shift-AND
• P = {P1, P2, P3, P4}. Generalize Shift-AND.
• Bits  = P4 P3 P2 P1
• Start = 1 1 1 1
• Match = 1 1 1 1

Text Algorithms (6EAP): Full text indexing
Jaak Vilo 2010 fall
Jaak Vilo, MTAT.03.190 Text Algorithms

Problem
• Given P and S – find all exact or approximate occurrences of P in S
• You are allowed to preprocess S (and P, of course)
• Goal: to speed up the searches

E.g. Dictionary problem
• Does P belong to a dictionary D = {d1,…,dn}?
  – Build a binary search tree of D
  – B-tree of D
  – Hashing
  – Sorting + binary search
• Build a keyword trie: search in O(|P|)
  – Assuming the alphabet has up to a constant size c
  – See Aho-Corasick algorithm, trie construction

Sorted array and binary search
(figure: a sorted array of 13 words – global, happy, he, head, header, hers, his,
 index, info, informal, search, show, stop – searched by bisection)
Search time: O( |P| log n )

Trie for D={ he, hers, his, she }
(figure: trie with states 0-9)
Search time: O( |P| )

S != set of words
• S of length n
• How to index?
• Index from every position of the text
• A prefix of every possible suffix is important

Trie(babaab)
(figure: trie of the suffixes babaab, abaab, baab, aab, ab, b)

Suffix tree
• Definition: A compact representation of a trie corresponding to the suffixes of a given string, where all nodes with one child are merged with their parents.
• Definition (suffix tree). A suffix tree T for a string S (with n = |S|) is a rooted, labeled tree with a leaf for each non-empty suffix of S.
Furthermore, a suffix tree satisfies the following properties:
• Each internal node, other than the root, has at least two children;
• Each edge leaving a particular node is labeled with a non-empty substring of S, of which the first symbol is unique among all first symbols of the edge labels of the edges leaving this particular node;
• For any leaf in the tree, the concatenation of the edge labels on the path from the root to this leaf exactly spells out a non-empty suffix of S.
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr; ISBN: 0521585198.

Literature on suffix trees
• http://en.wikipedia.org/wiki/Suffix_tree
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr; ISBN: 0521585198. (pages 89-208)
• E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-60, 1995. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751
• Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. "Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January 2005. http://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3
• CPM articles archive: http://www.cs.ucr.edu/~stelo/cpm/
• Mark Nelson. Fast String Searching With Suffix Trees. Dr. Dobb's Journal, August 1996. http://www.dogma.net/markn/articles/suffixt/suffixt.htm
• STXXL: Standard Template Library for Extra Large Data Sets.
• h p://stxxl.sourceforge.net/ • ANSI C implementa on of a Suffix Tree • h p://yeda.cs.technion.ac.il/~yona/suffix_tree/ These are old links – check for newer… http://stackoverflow.com/questions/9452701/ ukkonens-suffix-tree-algorithm-in-plain- english The suffix tree Tree(T) of T • data structure suffix tree, Tree(T), is compacted trie that represents all the suffixes of string T • linear size: |Tree(T)| = O(|T|) • can be constructed in linear me O(|T|) • has myriad virtues (A. Apostolico) • is well-known: 366 000 Google hits E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt145 Suffix trie and suffix tree abaab baab aab ab b Trie(abaab) Tree(abaab) a b a baab b a a ab a baab b a a b b E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt146 Suffix tree and suffix array techniques for pa ern analysis in strings Esko Ukkonen Univ Helsinki Erice School 30 Oct 2005 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt147 Algorithms for combinatorial string matching? • deep beauty? +- • shallow beauty? + • applica ons? ++ • intensive algorithmic miniatures • sources of new problems: text processing, DNA, music,… E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt148 High-throughput genome-scale sequence analysis and mapping using compressed data Veli Mäkinen Department of Computer Science structures University of Helsinki ttttttttttttttgagacggagtctcgctctg tcgcccaggctggagtgcagtggcggg atctcggctcactgcaagctccgcctcc cgggttcacgccattctcctgcctcagcc tcccaagtagctgggactacaggcgcc cgccactacgcccggctaattttttgtattt ttagtagagacggggtttcaccgttttagc cgggatggtctcgatctcctgacctcgtg atccgcccgcctcggcctcccaaagtgc tgggattacaggcgt E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt150 Analysis of a string of symbols • T = h a t t i v a t t i ’text’ • P = a t t ’pa ern’ • Find the occurrences of P in T: h a t t i v a t t i • Pa ern synthesis: #(t) = 4 #(a ) = 2 #(t****t) = 2 E. 
Solution: backtracking with suffix tree
...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG.....
ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Pattern finding & synthesis problems
• T = t1t2 … tn, P = p1p2 … pm, strings of symbols in a finite alphabet
• Indexing problem: preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast - static text, any given pattern P
• Pattern synthesis problem: learn from T new patterns that occur surprisingly often
• What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, …

1. Suffix tree  2. Suffix array  3. Some applications  4. Finding motifs

The suffix tree Tree(T) of T
• the data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
• linear size: |Tree(T)| = O(|T|)
• can be constructed in linear time O(|T|)
• has myriad virtues (A. Apostolico)
• is well known: 366,000 Google hits
Suffix trie and suffix tree
[figure: Trie(abaab) and Tree(abaab)]

Trie(T) can be large
• |Trie(T)| = O(|T|²)
• bad example: T = aⁿbⁿ
• Trie(T) can be seen as a DFA: the language accepted = the suffixes of T
• minimize the DFA => directed acyclic word graph ('DAWG')

Tree(T) is of linear size
• only the internal branching nodes and the leaves are represented explicitly
• edges are labeled by substrings of T
• v = node(α) if the path from the root to v spells α
• one-to-one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes
• |Tree(T)| = O(|T| + size(edge labels))

Tree(hattivatti)
[figure: the suffix tree of hattivatti; the substring labels of the edges are represented as pairs of pointers into the text]

Tree(T) is a full text index
• P occurs in T ⇔ P is a prefix of some suffix of T ⇔ a path for P exists in Tree(T)
• all occurrences of P in time O(|P| + #occ)
[figure: searching P in Tree(T); P occurs in T at locations 8, 31, …]

Find atti from Tree(hattivatti)
[figure: the search path for atti ends above the leaves for occurrences 7 and 2]

Linear time construction of Tree(T)
• Weiner (1973), 'algorithm of the year'; McCreight (1976)
• 'on-line' algorithm (Ukkonen 1992)

On-line construction of Trie(T)
• T = t1t2 …
tn$
• Pi = t1t2 … ti, the i-th prefix of T
• on-line idea: update Trie(Pi) to Trie(Pi+1)
• => a very simple construction

[figure: Trie(a), Trie(ab), Trie(aba), Trie(abaa), Trie(abaab); a chain of suffix links connects the end points of the current suffixes; when adding the next symbol b, updating stops at the point from which a b-arc already exists]

What happens in Trie(Pi) => Trie(Pi+1)?
[figure: a new node and a new suffix link are added for each suffix that did not yet have a ti+1-arc; updating stops at the first node from which the ti+1-arc already exists]
• time: O(size of Trie(T))
• suffix links: slink(node(aα)) = node(α)

On-line procedure for suffix trie
1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root
2. for i := 2 to n do begin
3.   vi-1 := the leaf of Trie(t1…ti-1) for string t1…ti-1 (i.e., the deepest leaf)
4.   v := vi-1; v´ := 0
5.   while node v has no outgoing arc for ti do begin
6.     create a new node v´´ and an arc son(v, ti) = v´´
7.     if v´ ≠ 0 then slink(v´) := v´´
8.     v := slink(v); v´ := v´´ end
9.   for the node v´´ such that v´´ = son(v, ti) do
       if v´´ = v´ then slink(v´) := root else slink(v´) := v´´

Suffix trees on-line
• a 'compacted version' of the on-line trie construction: simulate the construction on the linear-size tree instead of the trie => time O(|T|)
• all trie nodes are conceptually still needed => implicit and real nodes

Implicit and real nodes
• A pair (v, α) is an implicit node in Tree(T) if v is a node of Tree(T) and α is a (proper) prefix of the label of some arc from v. If α is the empty string then (v, α) is a 'real' node (= v).
• Let v = node(α´) in Tree(T).
Then the implicit node (v, α) represents node(α´α) of Trie(T).

[figure: an implicit node (v, α) part-way along an arc leaving v]

Suffix links and open arcs
[figure: slink(v) = node(α) for v = node(aα); an arc to a leaf w is labeled [i, *] ('open') instead of [i, j], where * stands for the scanned position of T]

Big picture
[figure: the suffix link path traversed costs total work O(n); the new arcs and nodes created cost total work O(size(Tree(T)))]

On-line procedure for suffix tree
Input: string T = t1t2 … tn$
Output: Tree(T)
Notation: son(v, α) = w iff there is an arc from v to w with label α; son(v, ε) = v
Function Canonize(v, α):
  while son(v, α´) ≠ 0 where α = α´α´´, |α´| > 0 do
    v := son(v, α´); α := α´´
  return (v, α)

Suffix tree on-line: main procedure
Create Tree(t1); slink(root) := root
(v, α) := (root, ε) /* (v, α) is the start node */
for i := 2 to n+1 do
  v´ := 0
  while there is no arc from v with label prefix αti do
    if α ≠ ε then /* divide the arc w = son(v, αη) into two */
      son(v, α) := v´´; son(v´´, ti) := v´´´; son(v´´, η) := w
    else
      son(v, ti) := v´´´; v´´ := v
    if v´ ≠ 0 then slink(v´) := v´´
    v´ := v´´; v := slink(v); (v, α) := Canonize(v, α)
  if v´ ≠ 0 then slink(v´) := v
  (v, α) := Canonize(v, αti) /* (v, α) = start node of the next round */

http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english

The actual time and space
• |Tree(T)| is about 20|T| in practice
• brute-force construction is O(|T| log |T|) for random strings, as the average depth of the internal nodes is O(log |T|)
• the difference between the linear and brute-force constructions is not necessarily large (Giegerich & Kurtz)
• truncated suffix trees: a k-symbol-long prefix of each suffix is represented (Na et al. 2003)
• alphabet-independent linear time (Farach 1997)

[figure: inserting abc into the tree of abcabxabcd]

Applications of Suffix Trees
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr; ISBN: 0521585198.
- the book on which this application list is based

• APL1: Exact string matching
Search for P in text S. Solution: build STree(S) - one achieves the same O(n+m) as Knuth-Morris-Pratt, for example! Search from the suffix tree is O(|P|).
• APL2: Exact set matching
Search for a set of patterns P.

Back to backtracking: ACA, 1 mismatch
[figure: backtracking in a suffix tree when searching ACA with at most 1 mismatch]
The same idea can be used for many other forms of approximate search, like Smith-Waterman, position-restricted scoring matrices, regular expression search, etc.
ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Applications of Suffix Trees
• APL3: Substring problem for a database of patterns
Given a set of strings S = S1, …, Sn (a database), find all Si that have P as a substring. A generalized suffix tree contains all suffixes of all Si. Query in time O(|P|), and it can identify the LONGEST common prefix of P in all Si.

Applications of Suffix Trees
• APL4: Longest common substring of two strings
Find the longest common substring of S and T. Overall there are potentially O(n²) such substrings, if n is the length of the shorter of S and T. Donald Knuth once (1970) conjectured that a linear-time algorithm is impossible. Solution: construct STree(S+T) and find the deepest node in the tree that has suffixes from both S and T among its subtree leaves. Ex: S = superiorcalifornialives, T = sealiver both have a substring xxxxxx.

Simple analysis task: LCSS
• Let LCSS(A,B) denote the longest common substring of two sequences A and B. E.g.: LCSS(AGATCTATCT, CGCCTCTATG) = TCTAT.
• A good solution is to build the suffix tree of the shorter sequence and make a descending suffix walk with the other sequence.

Suffix link
[figure: suffix link from node aX to node X]

Descending suffix walk
[figure: suffix tree of A]
Read B left-to-right, always going down in the tree when possible.
If the next symbol of B does not match any edge label at the current position, take the suffix link and try again. (The suffix link from the root to itself consumes one symbol.) The node v encountered with the largest string depth is the solution.

Another common tool: generalized suffix tree
[figure: a generalized suffix tree of several sequences, ACCTTA....ACCT#CACATT..CAT#TGTCGT...GTA#TCACCACC...C$; a node stores e.g. subtree size 47813871 and sequence count 87]

Generalized suffix tree application
[figure: a node storing subtree size 4398 and per-class counts, e.g. blue sequences 12/15, red sequences 2/62, for occurrences of ACC]

Case study continued
[figure: suffix tree of the genome; a motif candidate TAC... occurs in 5 blue and 1 red genome regions with ChIP-seq matches]

Properties of suffix tree
• A suffix tree has n leaves and at most n-1 internal nodes, where n is the total length of all sequences indexed.
• Each node requires a constant number of integers (pointers to first child, sibling, parent, the text range of the incoming edge, statistics counters, etc.).
• It can be constructed in linear time.

Properties of suffix tree... in practice
• Huge overhead due to the pointer structure:
- A standard implementation of a suffix tree for the human genome requires over 200 GB of memory!
- A careful implementation (using log n-bit fields for each value and an array layout for the tree) still requires over 40 GB.
- The human genome itself takes less than 1 GB using 2 bits per bp.

Applications of Suffix Trees
• APL4: Longest common substring of two strings
Find the longest common substring of S and T.
• Overall there are potentially O(n²) such substrings, if n is the length of the shorter of S and T.
• Donald Knuth once (1970) conjectured that a linear-time algorithm is impossible.
• Solution: construct STree(S+T) and find the deepest node in the tree that has suffixes from both S and T among its subtree leaves.
• Ex: S = superiorcalifornialives, T = sealiver both have the substring alive.

Applications of Suffix Trees
• APL5: Recognizing DNA contamination
Related to DNA sequencing: search for the longest strings (longer than a threshold) that are present in a database of sequences of other genomes.
• APL6: Common substrings of more than two strings
A generalization of APL4; can be done in time linear in the total length of all the strings.

Applications of Suffix Trees
• APL7: Building a directed graph for exact matching
Suffix graph - directed acyclic word graph (DAWG), the smallest finite state automaton recognizing all suffixes of a string S. This automaton can recognize membership, but not tell which suffix was matched.
• Construction: merge isomorphic subtrees.
• Subtrees are isomorphic in the suffix tree when a suffix link path exists and the subtrees have an equal number of leaves.

Applications of Suffix Trees
• APL8: A reverse role for suffix trees, and major space reduction
Index the pattern, not the text... Matching statistics.
• APL10: All-pairs suffix-prefix matching
For all pairs Si, Sj, find the longest matching suffix-prefix pair. Used in shortest common superstring generation (e.g. DNA sequence assembly), EST alignment, etc.

Applications of Suffix Trees
• APL11: Finding all maximal repetitive structures in linear time
• APL12: Circular string linearization
E.g. for circular chemical molecules in a database, one wants to linearize them in a canonical way...
• APL13: Suffix arrays - more space reduction; we will touch on that separately

Applications of Suffix Trees
• APL14: Suffix trees in genome-scale projects
• APL15: A Boyer-Moore approach to exact set matching
• APL16: Ziv-Lempel data compression
• APL17: Minimum length encoding of DNA

Applications of Suffix Trees
• Additional applications - mostly exercises...
• Extra feature: CONSTANT-time lowest common ancestor (LCA) retrieval
A data structure that allows the lowest common ancestor to be found in constant time (it corresponds to the longest common prefix!) can be built in linear time.
• APL: Longest common extension - a bridge to inexact matching
• APL: Finding all maximal palindromes in linear time
A palindrome reads the same to the left and to the right from its central position, e.g. kirik, saippuakivikauppias.
• Build the suffix tree of S concatenated with the reverse of S (aabcbad => aabcbad#dabcbaa) and, using LCA queries, one can ask for any position pair (i, 2i-1) the longest common prefix in constant time. The whole problem can be solved in O(n).

Applications of Suffix Trees
• APL: Exact matching with wild cards
• APL: The k-mismatch problem
• Approximate palindromes and repeats
• Faster methods for tandem repeats
• A linear-time solution to the multiple common substring problem
• And many, many more...

1. Suffix tree  2. Suffix array  3. Some applications  4. Finding motifs

Suffixes - sorted
• Sort all the suffixes of hattivatti: ε, atti, attivatti, hattivatti, i, ivatti, ti, tivatti, tti, ttivatti, vatti. Sorting allows binary search!

Suffix array: example
• suffix array = the lexicographic order of the suffixes
[table: the suffixes of hattivatti in text order (1 hattivatti, 2 attivatti, …, 11 ε) and in lexicographic order, giving SA = (11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6)]

Suffix array construction: sort!
[table: the suffixes of hattivatti numbered 1..11 in text order and sorted into the suffix array SA = (11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6)]
• suffix array = the lexicographic order of the suffixes

Suffix array
• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
• space requirement: 5|T|
• practitioners like suffix arrays (simplicity, space efficiency)
• theoreticians like suffix trees (explicit structure)

Reducing space: suffix array
[figure: a suffix tree of the text C A T A C T (positions 1..6) with edge labels replaced by text ranges such as [2,2], [3,6], [6,6], next to the suffix array 4 2 1 5 6 3]

Suffix array
• Many algorithms on the suffix tree can be simulated using the suffix array...
- ... plus a couple of additional arrays...
- ... forming the so-called enhanced suffix array...
- ... leading to a space requirement similar to a careful implementation of the suffix tree.
• Not a satisfactory solution to the space issue.

Pattern search from the suffix array
[figure: binary search for att among the sorted suffixes of hattivatti]

What do we learn today?
• We learn that it is possible to replace suffix trees with compressed suffix trees that take 8.8 GB for the human genome.
• We learn that backtracking can be done using compressed suffix arrays requiring only 2.1 GB for the human genome.
• We learn that discovering interesting motif seeds from the human genome takes 40 hours and requires 9.3 GB of space.

Recent suffix array constructions
• Manber & Myers (1990): O(|T| log |T|)
• linear time via the suffix tree
• January / June 2003: direct linear-time construction of the suffix array
- Kim, Sim, Park, Park (CPM03)
- Kärkkäinen & Sanders (ICALP03)
- Ko & Aluru (CPM03)

Kärkkäinen-Sanders algorithm
1.
Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
2. Construct the suffix array of the remaining suffixes using the result of the first step.
3. Merge the two suffix arrays into one.

Notation
• string T = T[0,n) = t0t1 … tn-1
• suffix Si = T[i,n) = titi+1 … tn-1
• for C ⊂ [0,n]: SC = {Si | i ∈ C}
• the suffix array SA[0,n] of T is a permutation of [0,n] satisfying SSA[0] < SSA[1] < … < SSA[n]

Running example
• T[0,n) = y a b b a d a b b a d o (positions 0-11), padded with 0 0 …
• SA = (12, 1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0)

Step 0: Construct a sample
• for k = 0, 1, 2: Bk = {i ∈ [0,n] | i mod 3 = k}
• C = B1 ∪ B2, the sample positions
• SC, the sample suffixes
• Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}

Step 1: Sort the sample suffixes
• for k = 1, 2, construct Rk = [tktk+1tk+2][tk+3tk+4tk+5]…[tmaxBk tmaxBk+1 tmaxBk+2]
• R = R1 ^ R2, the concatenation of R1 and R2
• The suffixes of R correspond to SC: the suffix [titi+1ti+2]… corresponds to Si; the correspondence is order preserving.
• Sort the suffixes of R: radix sort the characters and rename them with ranks to obtain R´. If all the characters are different, their order directly gives the order of the suffixes. Otherwise, sort the suffixes of R´ recursively using Kärkkäinen-Sanders. Note: |R´| = 2n/3.

Step 1 (cont.)
• once the sample suffixes are sorted, assign a rank to each: rank(Si) = the rank of Si in SC; rank(Sn+1) = rank(Sn+2) = 0
• Example: R = [abb][ada][bba][do0][bba][dab][bad][o00], R´ = (1,2,4,6,4,5,3,7), SAR´ = (8,0,1,6,4,2,5,3,7), rank(Si): - 1 4 - 2 6 - 5 3 - 7 8 - 0 0

Step 2: Sort the nonsample suffixes
• for each nonsample Si ∈ SB0 (note that rank(Si+1) is always defined for i ∈ B0): Si ≤ Sj ↔ (ti, rank(Si+1)) ≤ (tj, rank(Sj+1))
• radix sort the pairs (ti, rank(Si+1)).
• Example: S12 < S6 < S9 < S3 < S0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)

Step 3: Merge
• merge the two sorted sets of suffixes using a standard comparison-based merge
• to compare Si ∈ SC with Sj ∈ SB0, distinguish two cases:
- i ∈ B1: Si ≤ Sj ↔ (ti, rank(Si+1)) ≤ (tj, rank(Sj+1))
- i ∈ B2: Si ≤ Sj ↔ (ti, ti+1, rank(Si+2)) ≤ (tj, tj+1, rank(Sj+2))
• note that the ranks are defined in all cases!
• S1 < S6 as (a,4) < (a,5), and S3 < S8 as (b,a,6) < (b,a,7)

Running time O(n)
• excluding the recursive call, everything can be done in linear time
• the recursion is on a string of length 2n/3
• thus the time is given by the recurrence T(n) = T(2n/3) + O(n)
• hence T(n) = O(n)

Implementation
• about 50 lines of C++
• code available e.g. via Juha Kärkkäinen's home page

LCP table
• Longest Common Prefix of successive elements of the suffix array:
• LCP[i] = the length of the longest common prefix of the suffixes SSA[i] and SSA[i+1]
• build the inverse array SA⁻¹ from SA in linear time
• then the LCP table from SA⁻¹ in linear time (Kasai et al., CPM 2001)

• Oxford English Dictionary: http://www.oed.com/
• Example - Word of the Day, Fourth: http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/L7_SuffixTrees/wotd_fourth.html, http://www.oed.com/cgi/display/wotd
• PAT index - by Gaston Gonnet (also one of the creators of the Maple software, and later a developer of molecular biology software packages)
• The PAT index is essentially a suffix array. To save space, only the first character of every word was indexed.
• XML tagging (or SGML, at that time!) was also indexed.
• To mark certain fields of the XML, bit vectors were used.
• Main concern: improve the speed of search on the CD - minimize random accesses. For a slow medium even 15-20 accesses is too slow...
• G. H. Gonnet, R. A. Baeza-Yates, and T. Snider, Lexicographical indices for text: Inverted files vs. PAT trees, Technical Report OED-91-01, Centre for the New OED, University of Waterloo, 1991.
Suffix tree vs suffix array
• suffix tree ⇔ suffix array + LCP table

1. Suffix tree  2. Suffix array  3. Some applications  4. Finding motifs

Substring motifs of string T
• string T = t1 … tn over alphabet A
• Problem: what are the frequently occurring (ungapped) substrings of T? The longest substring that occurs at least q times?
• Thm: the suffix tree Tree(T) gives complete occurrence counts of all substring motifs of T in O(n) time (although T may have O(n²) substrings!)

Counting the substring motifs
• internal nodes of Tree(T) ↔ repeating substrings of T
• the number of leaves in the subtree of the node for string P = the number of occurrences of P in T

Substring motifs of hattivatti
[figure: Tree(hattivatti) with occurrence counts at the internal nodes; counts for the O(n) maximal motifs shown]

Finding repeats in DNA
• human chromosome 3, the first 48,999,930 bases
• 31 min CPU time (8 processors, 4 GB)
• Human genome: 3×10⁹ bases
• Tree(HumanGenome) feasible

Longest repeat?
Occurrences at: 28395980, 28401554r Length: 2559 ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaat gctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaa catgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcat agtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacat acgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaat caccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgc cattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttctttt gagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagt aggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggct tttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatg taagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaagga agggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatc agatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagt atagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttc caattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatga gcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtatttt attctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgag 
actttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctct
tttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgt
gccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaa
tttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtac
gtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattt
tattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatga
gttagg

Ten occurrences?
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggat
ctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcc
caagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagt
agagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccg
cccgcctcggcctcccaaagtgctgggattacaggcgt
Length: 277
Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125
In the reversed complement at: 17858493, 41463059, 42431718, 42580925

Using suffix trees: plagiarism
• find the longest common substring of strings X and Y
• build Tree(X$Y) and find the deepest node which has a leaf pointing into X and another pointing into Y

Using suffix trees: approximate matching
• edit distance: insertions, deletions, changes
• STOCKHOLM vs TUKHOLMA

String distance/similarity functions
STOCKHOLM vs TUKHOLMA:
STOCKHOLM_
_TU_KHOLMA
=> 2 deletions, 1 insertion, 1 change

Approximate string matching
• A: S T O C K H O L M   B: T U K H O L M A
• minimum number of 'mutation' steps: a -> b, a -> ε, ε -> b, …
• dID(A,B) = 5, dLevenshtein(A,B) = 4
• mutation costs => probabilistic modeling
• evaluation by dynamic programming => alignment

Dynamic programming
di,j = min( if ai = bj then di-1,j-1 else ∞, di-1,j + 1, di,j-1 + 1 )
= the distance between the i-prefix of A and the j-prefix of B (substitution excluded)
[figure: the m×n table d over B (columns bj) and A (rows ai); cell di,j is computed from di-1,j-1 (match), di-1,j + 1 (deletion) and di,j-1 + 1 (insertion); dm,n holds the final distance]

di,j = min( if ai = bj then di-1,j-1 else ∞, di-1,j + 1, di,j-1 + 1 )

The dID table for A = tukholma (rows) against B = stockholm (columns); the optimal alignment is recovered by trace-back from the bottom-right cell, which gives dID(A,B) = 5:

A\B     s  t  o  c  k  h  o  l  m
     0  1  2  3  4  5  6  7  8  9
  t  1  2  1  2  3  4  5  6  7  8
  u  2  3  2  3  4  5  6  7  8  9
  k  3  4  3  4  5  4  5  6  7  8
  h  4  5  4  5  6  5  4  5  6  7
  o  5  6  5  4  5  6  5  4  5  6
  l  6  7  6  5  6  7  6  5  4  5
  m  7  8  7  6  7  8  7  6  5  4
  a  8  9  8  7  8  9  8  7  6  5

Search problem
• find approximate occurrences of pattern P in text T: substrings P´ of T such that d(P, P´) is small
• dynamic programming with a small modification: O(mn)
• lots of (practical) improvement tricks

Index for approximate searching?
• dynamic programming: P × Tree(T) with backtracking