Text Algorithms

Jaak Vilo 2015 fall

Jaak Vilo MTAT.03.190 Text Algorithms

Topics

• Exact matching of one pattern (string)
• Exact matching of multiple patterns
• Suffix trie and tree indexes – Applications
• Suffix arrays
• Inverted index
• Approximate matching

Exact pattern matching

• S = s1 s2 … sn (text) |S| = n (length)

• P = p1p2…pm (pattern) |P| = m

• Σ – alphabet, |Σ| = c

• Does S contain P? – Does S = S' P S" for some strings S' and S"?

– Usually m << n and n can be (very) large

Find occurrences in text

P

S

Algorithms

One-pattern:
• Brute force
• Knuth-Morris-Pratt
• Karp-Rabin
• Shift-OR, Shift-AND
• Boyer-Moore
• Factor searches

Multi-pattern:
• Aho-Corasick
• Commentz-Walter

Indexing:
• Trie (and suffix trie)
• Suffix tree

Animations

• http://www-igm.univ-mlv.fr/~lecroq/string/

• EXACT STRING MATCHING ALGORITHMS – Animation in Java • Christian Charras, Thierry Lecroq, Laboratoire d'Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, FRANCE • e-mails: {Christian.Charras, Thierry.Lecroq} @laposte.net

Brute force: BAB in text?

ABACABABBABBBA

BAB

Brute Force

[Figure: pattern P aligned under text S at positions i .. i+j-1; index j runs over P]

Identify the first mismatch!

Question:
• Problems of this method?
• Ideas to improve the search?

Brute force

Algorithm Naive
Input: Text S[1..n] and pattern P[1..m]
Output: All positions i where P occurs in S

for( i=1 ; i <= n-m+1 ; i++ )
    for( j=1 ; j <= m ; j++ )
        if( S[i+j-1] != P[j] ) break ;
    if( j > m ) print i ;

Example: searching for P = GCAGAGAG in S = gcatcgcagagagtatacagtacg
attempt 1 (i=1): GCAg....  – mismatch at j=4
attempts 2-5 (i=2..5): g.......  – mismatch at j=1
attempt 6 (i=6): GCAGAGAG  – full match: gcatcGCAGAGAGtatacagtacg
attempt 7 onwards: continue scanning with g.......

Brute force or NaiveSearch

function NaiveSearch(string s[1..n], string sub[1..m])
    for i from 1 to n-m+1
        for j from 1 to m
            if s[i+j-1] ≠ sub[j]
                jump to next iteration of outer loop
        return i
    return not found

C code

int bf_2( char* pat, char* text, int n )   /* n = textlen */
{
    int m, i, j ;
    int count = 0 ;
    m = strlen(pat);
    for( i=0 ; i + m <= n ; i++ ) {
        for( j=0 ; j < m && pat[j] == text[i+j] ; j++ ) ;
        if( j == m ) count++ ;
    }
    return( count );
}

C code

int bf_1( char* pat, char* text )
{
    int m ;
    int count = 0 ;
    char *tp;
    m = strlen(pat);
    tp = text ;
    for( ; *tp ; tp++ ) {
        if( strncmp( pat, tp, m ) == 0 ) { count++ ; }
    }
    return( count );
}

Main problem of Naive

[Figure: after a mismatch at S[i+j-1], the naive algorithm shifts P by one and re-checks the same positions of S]

• For the next possible location of P, the same positions of S are checked again

Goals

• Make sure only a constant number of comparisons/operations is made for each position in S
  – Move (only) from left to right in S
  – How?
  – After a test S[i] <> P[j], what do we know?

D. Knuth, J. Morris, V. Pratt: Fast pattern matching in strings. SIAM Journal on Computing 6:323-350, 1977.

Knuth-Morris-Pratt

• Make sure that no comparisons are “wasted”

[Figure: a mismatch x ≠ y between P and S]

• After such a mismatch we already know exactly the contents of the already-matched (green) area of S!

D. Knuth, J. Morris, V. Pratt: Fast pattern matching in strings. SIAM Journal on Computing 6:323-350, 1977.

• Make sure that no comparisons are “wasted”

[Figure: after the mismatch, P is shifted so that a prefix of P aligns with a matching suffix of the already-read text]

• For every prefix of P: the longest suffix of that prefix that is also a prefix of the pattern
• Example: ABCABD

Automaton for ABCABD

[Automaton for ABCABD: states 1 2 3 4 5 6 7; state i+1 is reached from state i by reading P[i] (A B C A B D); state 1 loops on every character that is NOT A]

Position: 1 2 3 4 5 6 7
Pattern:  A B C A B D
Fail:     0 1 1 1 2 3 1

KMP matching

Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if exists)

i=1; j=1;
initfail(P)                 // Prepare fail links
repeat
    if j==0 or S[i] == P[j]
        then i++, j++       // advance in text and in pattern
        else j = fail[j]    // use fail link
until j>m or i>n
if j>m then report match at i-m

Initialization of fail links

Algorithm: KMP_InitFail
Input: Pattern P[1..m]
Output: fail[] for pattern P

i=1, j=0, fail[1]=0
repeat
    if j==0 or P[i] == P[j]
        then i++, j++, fail[i] = j
        else j = fail[j]
until i>=m

Initialization of fail links – example for P = ABCABD; the fail table grows step by step:
0
0 1
0 1 1 1
0 1 1 1 2
…

Time complexity of KMP matching?

Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if exists)

i=1; j=1;
initfail(P)                 // Prepare fail links
repeat
    if j==0 or S[i] == P[j]
        then i++, j++       // advance in text and in pattern
        else j = fail[j]    // use fail link
until j>m or i>n
if j>m then report match at i-m

Analysis of time complexity

• At every cycle either i and j both increase by 1
• or j decreases (j = fail[j])

• i can increase n (or m) times
• Q: How often can j decrease?
  – A: not more often than i increases

• Amortised analysis: matching O(n), preprocessing O(m)
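A minimal C sketch of KMP as described above (my own code, not from the slides): it uses the equivalent 0-based "border" convention for the fail table and reports 0-based match start positions; the function name kmp_search is mine.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Report all 0-based positions where pat occurs in text (KMP). */
void kmp_search(const char *pat, const char *text)
{
    int m = strlen(pat), n = strlen(text);
    int *fail = malloc((m + 1) * sizeof(int));
    int i, j;

    /* fail[j] = length of the longest proper prefix of pat[0..j-1]
       that is also a suffix of it (0-based border array) */
    fail[0] = -1;
    for (i = 0, j = -1; i < m; ) {
        while (j >= 0 && pat[i] != pat[j]) j = fail[j];
        i++; j++;
        fail[i] = j;
    }

    for (i = 0, j = 0; i < n; i++) {
        while (j >= 0 && text[i] != pat[j]) j = fail[j];
        j++;
        if (j == m) {                       /* full match ends at position i */
            printf("match at %d\n", i - m + 1);
            j = fail[j];                    /* continue searching for more matches */
        }
    }
    free(fail);
}

int main(void) { kmp_search("ABCABD", "XABCABCABDY"); return 0; }

The amortised O(n) bound from the slide is visible in the code: j only decreases inside the while loops, and each decrease is paid for by an earlier increase of i.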

Karp-Rabin

R. Karp and M. Rabin: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (1987), 249-260.

• Compare in O(1) a hash of P (over positions 1..m) and a hash of the text window S[i..i+m-1]

[Figure: h(P) is compared against h(T[i..i+m-1]); when the window slides by one position, h(T[i+1..i+m]) is derived from h(T[i..i+m-1])]

• Goal: O(n)
• The update f( h(T[i..i+m-1]) -> h(T[i+1..i+m]) ) must take O(1)

Hash

• “Remove” the effect of T[i] and “introduce” the effect of T[i+m] – in O(1)

• Use base-|Σ| arithmetic and treat characters as numbers

• In case of a hash match – verify all m positions
• Hash collisions => worst case O(nm)

Let's use numbers

• T = 57125677
• P = 125 (and for simplicity, h(P) = 125)

• H(T[1]) = 571
• H(T[2]) = (571 - 5*100)*10 + 2 = 712

• In general: H(T[i+1]) = ( H(T[i]) - T[i]*10^(m-1) )*10 + T[i+m]

hash

• c – size of the alphabet

• HSi = H( S[i..i+m-1] )

• H( S[i+1..i+m] ) = ( HSi - ord(S[i]) * c^(m-1) ) * c + ord( S[i+m] )

• Modulo arithmetic – to fit the value into a machine word!
• hash(w[0..m-1]) = ( w[0]*2^(m-1) + w[1]*2^(m-2) + ··· + w[m-1]*2^0 ) mod q

Karp-Rabin

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

1.  c = 20            /* size of the alphabet, say nr. of amino acids */
2.  q = 33554393      /* q is a prime */
3.  cm = c^(m-1) mod q
4.  hp = 0 ; hs = 0
5.  for i = 1 .. m do hp = ( hp*c + ord(P[i]) ) mod q    // H(P)
6.  for i = 1 .. m do hs = ( hs*c + ord(S[i]) ) mod q    // H(S[1..m])
7.  if hp == hs and P == S[1..m] then report match at position 1
8.  for i = 2 .. n-m+1
9.      hs = ( (hs - ord(S[i-1])*cm) * c + ord(S[i+m-1]) ) mod q
10.     if hp == hs and P == S[i..i+m-1]
11.         then report match at position i
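A small C sketch of the rolling-hash search above (my own code, not the slides' implementation); it uses base 256 with the same prime modulus and re-verifies every hash hit with memcmp, so collisions only cost time, never correctness.

#include <stdio.h>
#include <string.h>

/* Karp-Rabin: report all 0-based occurrences of pat in text. */
void rk_search(const char *pat, const char *text)
{
    int m = strlen(pat), n = strlen(text), i;
    const long long q = 33554393, c = 256;      /* prime modulus, byte alphabet */
    long long cm = 1, hp = 0, hs = 0;

    if (m == 0 || m > n) return;
    for (i = 0; i < m - 1; i++) cm = (cm * c) % q;          /* c^(m-1) mod q */
    for (i = 0; i < m; i++) {
        hp = (hp * c + (unsigned char)pat[i]) % q;
        hs = (hs * c + (unsigned char)text[i]) % q;
    }
    for (i = 0; i + m <= n; i++) {
        if (hp == hs && memcmp(pat, text + i, m) == 0)      /* verify hash hits */
            printf("match at %d\n", i);
        if (i + m < n)                                      /* roll the window */
            hs = ((hs - (unsigned char)text[i] * cm % q + q) % q * c
                  + (unsigned char)text[i + m]) % q;
    }
}

int main(void) { rk_search("125", "57125677"); return 0; }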

More ways to ensure O(n)?

Shift-AND / Shift-OR

• Ricardo Baeza-Yates, Gaston H. Gonnet: A new approach to text searching. Communications of the ACM, October 1992, Volume 35, Issue 10. [ACM Digital Library: http://doi.acm.org/10.1145/135239.135243] [DOI] • PDF

Bit-operations

• Maintain a set of all prefixes of P that have matched perfectly so far
• On the next text character, update all of these prefix matches to a new set
• Bit vector: one mask for every possible character

State: which prefixes match?

[Figure: the state is a bit vector marking the positions where prefixes of P currently match]

Move to next: track positions of prefix matches

[Figure: the state vector is shifted left (<<) and then ANDed bitwise with the mask of the text character T[i]]

Vectors for every char in Σ

• P = aste

Character masks (one bit per pattern position):
        a s t e   b c d .. z
pos 1:  1 0 0 0   0 ...
pos 2:  0 1 0 0   0 ...
pos 3:  0 0 1 0   0 ...
pos 4:  0 0 0 1   0 ...

• T = lasteaed

[Figure: scanning T character by character; after each character the state vector marks which prefixes of "aste" currently end there – after reading l a s t e, the bit for the full pattern is set]

http://www-igm.univ-mlv.fr/~lecroq/string/node6.html
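A compact C sketch of the Shift-AND idea just described (my own illustration, not the slides' code): B[c] marks the positions of character c in P; the state D is shifted left, OR-ed with 1 to start a new prefix, and AND-ed with the mask of the current text character.

#include <stdio.h>
#include <string.h>

/* Shift-AND: report 0-based end positions of matches of pat in text.
   Assumes m <= 63 so the state fits into one machine word. */
void shift_and(const char *pat, const char *text)
{
    unsigned long long B[256] = {0}, D = 0;
    int m = strlen(pat), n = strlen(text), i;

    for (i = 0; i < m; i++)                       /* build character masks */
        B[(unsigned char)pat[i]] |= 1ULL << i;

    for (i = 0; i < n; i++) {
        D = ((D << 1) | 1ULL) & B[(unsigned char)text[i]];
        if (D & (1ULL << (m - 1)))                /* highest bit set => full match */
            printf("match ending at %d\n", i);
    }
}

int main(void) { shift_and("aste", "lasteaed"); return 0; }

Running it on the slides' example (P = aste, T = lasteaed) reports one match ending at position 4.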

[Figure: the bit mask for character A over an 8-character pattern: 1 1 0 1 0 1 0 1]

Summary

Algorithm            Worst case   Avg. case               Preprocess
Brute force          O(mn)        O(n * (1+1/|Σ|+..))
Knuth-Morris-Pratt   O(n)         O(n)                    O(m)
Rabin-Karp           O(mn)        O(n)                    O(m)
Boyer-Moore          ?            O(n/m)
BM Horspool
Factor search
Shift-OR             O(n)         O(n)                    O( m|Σ| )

• R. Boyer, J. S. Moore: A fast string searching algorithm. CACM 20 (1977), 762-772 [PDF]
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf

Find occurrences in text

P

S

• Have we missed anything?

Find occurrences in text

P

S    ABCDE
     BBCDE

• What have we learned if we test for a potential match from the end?

Find occurrences in text

P

S A B

Bad character heuristics: maximal shift on S[i]

[Figure: the mismatching text character S[i] is aligned with its rightmost occurrence in P; if S[i] does not occur in P, the pattern is shifted entirely past it]

delta1( S[i] ) =
    m            if the pattern does not contain S[i]
    patlen - j   where j is the maximal j such that P[j] == S[i]

void bmInitocc() {
    char a; int j;
    /* The body of this routine was cut off in the slide export.
       A typical occurrence-table initialisation (occ[], p[], m, ALPHABET_SIZE
       assumed to be globals of the surrounding program) would be: */
    for (a = 0; a < ALPHABET_SIZE; a++) occ[a] = -1;        /* char not in pattern */
    for (j = 0; j < m; j++) occ[(unsigned char)p[j]] = j;   /* rightmost occurrence */
}

Good suffix heuristics

[Figure: after a mismatch, the already-matched suffix µ of the window is re-aligned with (1) its next occurrence inside P, or (2) a prefix µ' of P that matches a suffix of µ]

delta2( j ) – the minimal shift so that the matched region is again fully covered by P, or so that a suffix of the matched region is also a prefix of P

Boyer-Moore algorithm

Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S

preprocess_BM()                  // delta1 and delta2
i = m
while i <= n
    for( j=m ; j>0 and P[j]==S[i-m+j] ; j-- ) ;
    if j==0 report match at position i-m+1
    i = i + max( delta1[ S[i] ], delta2[ j ] )

• http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
• Animation: http://www-igm.univ-mlv.fr/~lecroq/string/

Simplifications of BM

• There are many variants of Boyer-Moore, and many scientific papers.
• On average the time complexity is sublinear.
• Algorithm speed can be improved while also simplifying the code.
• It is useful to use the last-character heuristic (Horspool (1980), Baeza-Yates (1989), Hume and Sunday (1991)).

Algorithm BMH (Boyer-Moore-Horspool)

• R. N. Horspool: Practical Fast Searching in Strings. Software – Practice and Experience, 10(6):501-506, 1980

Input: Text S[1..n] and pattern P[1..m]
Output: occurrences of P in S

1. for a in Σ do delta[a] = m
2. for j=1..m-1 do delta[P[j]] = m-j
3. i=m
4. while i <= n
5.     if S[i] == P[m]
6.         j = m-1
7.         while ( j>0 and P[j]==S[i-m+j] ) j = j-1 ;
8.         if j==0 report match at i-m+1
9.     i = i + delta[ S[i] ]
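A direct C rendering of the BMH pseudocode above (my own sketch; 0-based indices, byte alphabet):

#include <stdio.h>
#include <string.h>

/* Boyer-Moore-Horspool: report all 0-based occurrences of pat in text. */
void bmh_search(const char *pat, const char *text)
{
    int m = strlen(pat), n = strlen(text), i, j, delta[256];

    for (i = 0; i < 256; i++) delta[i] = m;      /* default shift */
    for (j = 0; j < m - 1; j++)                  /* last pattern char excluded */
        delta[(unsigned char)pat[j]] = m - 1 - j;

    for (i = m - 1; i < n; i += delta[(unsigned char)text[i]]) {
        for (j = m - 1; j >= 0 && pat[j] == text[i - m + 1 + j]; j--) ;
        if (j < 0) printf("match at %d\n", i - m + 1);
    }
}

int main(void) { bmh_search("TATG", "ACATGCTATGTGACA"); return 0; }

The shift is always taken on the last window character text[i], exactly as in line 9 of the pseudocode.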

String Matching: Horspool algorithm

• How is the comparison made?

[Figure: the pattern is compared against the text window from right to left – suffix search]

• What is the next position of the window?

[Figure: it depends on where the last letter of the window, say 'a', appears in the pattern; a preprocessing step determines the length of the shift]

Algorithm Boyer-Moore-Horspool-Hume-Sunday (BMHHS)

• Use delta in a tight skip loop
• On a match of the last character (delta==0), check the window and then apply the memorised original shift d

Input: Text S[1..n] and pattern P[1..m]
Output: occurrences of P in S

1.  for a in Σ do delta[a] = m
2.  for j=1..m-1 do delta[P[j]] = m-j
3.  d = delta[ P[m] ]            // memorize d for P[m]
4.  delta[ P[m] ] = 0            // ensure delta on a match of the last char is 0
5.  for ( i=m ; i <= n ; i = i+d )
6.      repeat                   // skip loop
7.          t = delta[ S[i] ] ; i = i + t
8.      until t==0
9.      for( j=m-1 ; j>0 and P[j]==S[i-m+j] ; j = j-1 ) ;
10.     if j==0 report match at i-m+1

BMHHS requires that the text is padded with P: S[n+1]..S[n+m] = P (so that the skip loop always terminates – there is at least one occurrence at the end).

• Daniel M. Sunday: A very fast substring search algorithm. Communications of the ACM, August 1990, Volume 33, Issue 8 [PDF]

• Loop unrolling:
• Avoid loop overhead (each iteration requires a test) by repeating the code within the loop body.
• Line 7 in the previous algorithm can be replaced by:

7. i += delta[ S[i] ]; i += delta[ S[i] ]; i += (t = delta[ S[i] ]) ;

Forward-Fast-Search: Another Fast Variant of the Boyer-Moore String Matching Algorithm

• The Prague Stringology Conference '03
• Domenico Cantone and Simone Faro
• Abstract: We present a variation of the Fast-Search string matching algorithm, a recent member of the large family of Boyer-Moore-like algorithms, and we compare it with some of the most effective string matching algorithms, such as Horspool, Quick Search, Tuned Boyer-Moore, Reverse Factor, Berry-Ravindran, and Fast-Search itself. All algorithms are compared in terms of run-time efficiency, number of text character inspections, and number of character comparisons. It turns out that our new proposed variant, though not linear, achieves very good results especially in the case of very short patterns or small alphabets.
• http://cs.felk.cvut.cz/psc/event/2003/p2.html
• PS.gz (local copy)

Factor based approach

• Optimal average-case algorithms
  – Assuming independent characters with equal probabilities
• Factor – a substring of the pattern
  – Any substring – (how many are there?)

Factor searches

[Figure: the text window is scanned backwards while its suffix u is still a factor of P]

Do not compare characters, but find the longest match to any subregion of the pattern.

Examples

• Backward DAWG Matching (BDM) – Crochemore et al 1994
• Backward Nondeterministic DAWG Matching (BNDM) – Navarro, Raffinot 2000
• Backward Oracle Matching (BOM) – Allauzen, Crochemore, Raffinot 2001

Backward DAWG Matching: BDM recognises all factors (and suffixes) of the pattern in O(n)

Do not compare characters, but find the longest match to any subregion of the pattern.

BNDM – simulate the factor automaton using bit-parallelism; the bits show where the factors have occurred so far

BNDM matches an NDA

[Figure: the non-deterministic automaton on the suffixes of 'announce']

Deterministic version of the same: Backward Factor Oracle

BNDM – Backward Non-Deterministic DAWG Matching
BOM – Backward Oracle Matching

String Matching of one pattern

CTACTACTACGTCTATACTGATCGTAGC

TACTACGGTATGACTAA

1. Prefix search

2. Suffix search

3. Factor search

Multiple patterns

S

{P} Why?

• Multiple patterns
• Highlight multiple different search words on a page
• Virus detection – filter for virus signatures
• Spam filters
• A scanner in a compiler needs to search for multiple keywords
• Filter out stop words or disallowed words
• Intrusion detection software
• Next-generation sequencing produces huge amounts (many millions) of short reads (20-100 bp) that need to be mapped to a genome!
• …

Algorithms

• Aho-Corasick (search for multiple words)
  – Generalization of Knuth-Morris-Pratt
• Commentz-Walter
  – Generalization of Boyer-Moore & AC
• Wu and Manber – improvement over C-W

• Additional methods, tricks and techniques

Aho-Corasick (AC)

• Alfred V. Aho and Margaret J. Corasick (Bell Labs, Murray Hill, NJ): Efficient string matching: An aid to bibliographic search. Communications of the ACM, Volume 18, Issue 6, p. 333-340 (June 1975) • ACM:DOI PDF
• ABSTRACT: This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.

• Generalization of KMP for many patterns
• Text S as before
• Set of patterns P = { P1, .., Pk }
• Total length |P| = m = Σ_{i=1..k} m_i
• Problem: find all occurrences of any of the Pi ∈ P in S

Idea

1. Create an automaton from all patterns
2. Match the automaton against the text

• Use the PATRICIA trie for creating the main structure of the automaton

PATRICIA trie

• D. R. Morrison, "PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric", Journal of the ACM 15 (1968) 514-534.
• Abstract: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It does not require rearrangement of text or index as new material is added. It requires a minimum restriction of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others. • ACM:DOI PDF

• A word trie is a good way to represent a set of words (e.g. a dictionary).
• trie (data structure)
• Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
• See also: digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
• Note: The name comes from retrieval and is pronounced "tree."
• To test for a word p, only O(|p|) time is used, no matter how many words are in the dictionary.

Trie for P={he, she, his, hers}

[Figure: building the keyword trie for {he, she, his, hers} step by step. In the final trie, from the root (state 0) edge h leads to state 1 and s to state 3; h-e reaches state 2, h-i-s states 6 and 7, h-e-r-s states 8 and 9, and s-h-e states 4 and 5]

How to search for words like he, sheila, hi – do these occur in the trie?

[Figure: the same trie, with the query followed character by character from the root]

Aho-Corasick

1. Create an automaton MP for the set of strings P.
2. Finite state machine: read a character from the text and change the state of the automaton based on the state transitions.
3. Main links: goto[j,c] – on reading character c from the text, go from state j to state goto[j,c].
4. If there is no goto[j,c] link on character c from state j, use fail[j].
5. Report the output: report all words that have been found in state j.

AC Automaton (vs KMP)

[Figure: the trie for {he, she, his, hers} extended with fail links, e.g. goto[1,i]=6, fail[5]=2, fail[7]=3, fail[8]=0; the root loops on every character NOT in {h, s}]

Output table:
state   output[j]
2       he
5       she, he
7       his
9       hers

AC – matching

Input: Text S[1..n] and an AC automaton M for pattern set P
Output: Occurrences of patterns from P in S (last position)

1. state = 0
2. for i = 1..n do
3.     while (goto[state,S[i]] == ∅) and (fail[state] != state) do
4.         state = fail[state]
5.     state = goto[state,S[i]]
6.     if ( output[state] not empty )
7.         then report matches output[state] at position i

Algorithm Aho-Corasick preprocessing I (TRIE)

Input: P = { P1, ..., Pk }
Output: goto[] and partial output[]
Assume: output(s) is empty when a state s is created; goto[s,a] is initially undefined.

begin
10.  news = 0
11.  for i=1 to k do enter( Pi )
12.  for a ∈ Σ do
13.      if goto[0,a] = ∅ then goto[0,a] = 0
end

procedure enter(a1, ..., am)      /* Pi = a1 ... am */
begin
1.  s=0; j=1
2.  while goto[s,aj] ≠ ∅ do       // follow the existing path
3.      s = goto[s,aj]
4.      j = j+1
5.  for p=j to m do               // add a new path (new states)
6.      news = news+1
7.      goto[s,ap] = news
8.      s = news
9.  output[s] = { a1 ... am }
end

Preprocessing II for AC (FAIL)

queue = ∅
for a ∈ Σ do
    if goto[0,a] ≠ 0 then
        enqueue( queue, goto[0,a] )
        fail[ goto[0,a] ] = 0

while queue ≠ ∅
    r = take( queue )
    for a ∈ Σ do
        if goto[r,a] ≠ ∅ then
            s = goto[ r, a ]
            enqueue( queue, s )             // breadth-first search
            state = fail[r]
            while goto[state,a] = ∅ do state = fail[state]
            fail[s] = goto[state,a]
            output[s] = output[s] ∪ output[ fail[s] ]
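A self-contained C sketch of the whole construction and search (my own code, not the original slides'): it builds the trie, computes fail links by BFS, and – like the "Advanced AC" variant mentioned a few slides below – precomputes a full goto transition for every byte, so the search loop never follows fail links explicitly. Sizes and names (MAXSTATES, out as a bitmask of pattern ids) are illustrative choices.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXSTATES 1000
#define SIGMA 256

static int go[MAXSTATES][SIGMA];     /* -1 = undefined */
static int fail_[MAXSTATES];
static int out[MAXSTATES];           /* bitmask of pattern ids ending in this state */
static int nstates;

static int newstate(void) {
    memset(go[nstates], -1, sizeof go[nstates]);
    out[nstates] = 0;
    return nstates++;
}

static void build(const char **pats, int k) {
    int i, a;
    nstates = 0;
    newstate();                                    /* root = state 0 */
    for (i = 0; i < k; i++) {                      /* phase I: keyword trie */
        const char *p = pats[i];
        int s = 0;
        for ( ; *p; p++) {
            int c = (unsigned char)*p;
            if (go[s][c] == -1) go[s][c] = newstate();
            s = go[s][c];
        }
        out[s] |= 1 << i;
    }
    int *queue = malloc(nstates * sizeof(int)), head = 0, tail = 0;
    for (a = 0; a < SIGMA; a++) {                  /* phase II: fail links by BFS */
        if (go[0][a] == -1) go[0][a] = 0;          /* missing root arcs loop back */
        else { fail_[go[0][a]] = 0; queue[tail++] = go[0][a]; }
    }
    while (head < tail) {
        int r = queue[head++];
        for (a = 0; a < SIGMA; a++) {
            int s = go[r][a];
            if (s == -1) { go[r][a] = go[fail_[r]][a]; continue; }
            fail_[s] = go[fail_[r]][a];
            out[s] |= out[fail_[s]];               /* inherit outputs via fail link */
            queue[tail++] = s;
        }
    }
    free(queue);
}

static void search(const char *text, const char **pats, int k) {
    int s = 0, i, j;
    for (i = 0; text[i]; i++) {
        s = go[s][(unsigned char)text[i]];
        for (j = 0; j < k; j++)
            if (out[s] & (1 << j))
                printf("'%s' ends at %d\n", pats[j], i);
    }
}

int main(void) {
    const char *pats[] = { "he", "she", "his", "hers" };
    build(pats, 4);
    search("ushers", pats, 4);
    return 0;
}

On the text "ushers" this reports he and she ending at position 3 and hers ending at position 5, matching the output table above.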

Correctness

• Let string t "point" from the initial state to state j.
• Must show that fail[j] points to the longest proper suffix of t that is also a prefix of some word in P.
• See the article for the proof...

AC matching time complexity

• Theorem: For matching the automaton MP on text S, |S|=n, fewer than 2n transitions within M are made.
• Proof: Compare to KMP.
  • There are at most n goto steps.
  • There cannot be more than n fail steps.
  • In total – fewer than 2n transitions in M.

Individual node (goto)

• Full table
• List
• Binary search tree(?)
• Some other index?

AC thoughts

• Scales to many strings simultaneously.
• For very many patterns the search time (of grep) improves(??) – see the Wu-Manber article
• When k grows, more fail[] transitions are made (why?)
• But always fewer than n.
• If all goto[j,a] are indexed in an array, then the size is |MP|*|Σ|, and the running time of AC is O(n).
• When k and c are big, one can use lists or trees for storing the transition functions.
• Then, O(n log(min(k,c))).

Advanced AC

• Precalculate the next state transition correctly for every possible character of the alphabet
• Can be good for short patterns

Problems of AC?

• Need to rebuild on adding / removing patterns

• Details of branching in each node(?)

Commentz-Walter

• Generalization of Boyer-Moore for multiple pattern search
• Beate Commentz-Walter: A String Matching Algorithm Fast on the Average. Proceedings of the 6th Colloquium on Automata, Languages and Programming. Lecture Notes In Computer Science, Vol. 71, 1979, pp. 118-132, Springer-Verlag
• http://www.-albsig.de/win/personen/professoren.php?RID=36 • "You can download here my algorithm StringMatchingFastOnTheAverage (PDF, ~17,2 MB) or here StringMatchingFastOnTheAverage (extended abstract) (PDF, ~3 MB)"

C-W description

• Aho and Corasick [AC75] presented a linear-time algorithm for this problem, based on an automata approach. This algorithm serves as the basis for the UNIX tool fgrep. A linear-time algorithm is optimal in the worst case, but as the regular string-searching algorithm by Boyer and Moore [BM77] demonstrated, it is possible to actually skip a large portion of the text while searching, leading to faster than linear algorithms in the average case.

Commentz-Walter [CW79]

• Commentz-Walter [CW79] presented an algorithm for the multi-pattern matching problem that combines the Boyer-Moore technique with the Aho-Corasick algorithm. The Commentz-Walter algorithm is substantially faster than the Aho-Corasick algorithm in practice. Hume [Hu91] designed a tool called gre based on this algorithm, and version 2.0 of fgrep by the GNU project [Ha93] is using it.
• Baeza-Yates [Ba89] also gave an algorithm that combines the Boyer-Moore-Horspool algorithm [Ho80] (which is a slight variation of the classical Boyer-Moore algorithm) with the Aho-Corasick algorithm.

Idea of C-W

• Build a backward trie of all keywords

• Match from the end until a mismatch...
• Determine the shift based on a combination of heuristics

Horspool for many patterns

Search for ATGTATG, TATG, ATAAT, ATGTG
1. Build the trie of the inverted (reversed) patterns
[Figure: trie of the reversed patterns]
2. lmin = 4
3. Table of shifts: A 1, C 4 (lmin), G 2, T 1
4. Start the search on the text ACATGCTATGTGACA…

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)


Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Horspool for many paerns Search for ATGTATG,TATG,ATAAT,ATGTG T G T A A T G G T A T A A T A A 1 A C 4 (lmin) The text ACATGCTATGTGACA… G 2 T 1 Short Shifts!

What are the possible limitations for C-W?

• Many patterns, small alphabet – minimal skips
• What can be done differently?

Wu-Manber

• Wu S., and U. Manber, "A Fast Algorithm for Multi-Pattern Searching," Technical Report TR-94-17, Department of Computer Science, University of Arizona (May 1993). • Citeseer: http://citeseer.ist.psu.edu/wu94fast.html [Postscript]
• "We present a different approach that also uses the ideas of Boyer and Moore. Our algorithm is quite simple, and the main engine of it is given later in the paper. An earlier version of this algorithm was part of the second version of agrep [WM92a, WM92b], although the algorithm has not been discussed in [WM92b] and only briefly in [WM92a]. The current version is used in glimpse [MW94]. The design of the algorithm concentrates on typical searches rather than on worst-case behavior. This allows us to make some engineering decisions that we believe are crucial to making the algorithm significantly faster than other algorithms in practice."

Key idea

• The main problem with Boyer-Moore and many patterns is that the more patterns there are, the shorter the possible shifts become...
• Wu and Manber: check several characters simultaneously, i.e. increase the alphabet.
• Instead of looking at characters from the text one by one, we consider them in blocks of size B.
• A good block size is about log_c(2M); in practice, B = 2 or B = 3 is used.
• The SHIFT table plays the same role as in the regular Boyer-Moore algorithm, except that it determines the shift based on the last B characters rather than just one character.

Horspool to Wu-Manber: How can we increase the length of the shifts?

With a shift table of l-mers for the patterns ATGTATG, TATG, ATAAT, ATGTG:

1 symbol            2 symbols
A 1                 AA 1    CA 3
C 4 (lmin)          AC 3    CC 3
G 2                 AG 3    CG 3
T 1                 AT 1    TA 2
                    GT 1    TG 2   …   (shift = lmin - l + 1 for blocks that do not occur)

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Wu-Manber algorithm

Search for ATGTATG, TATG, ATAAT, ATGTG with 2-symbol blocks (AA 1, AT 1, GT 1, TA 2, TG 2, …) in the text ACATGCTATGTGACATAATA

Experimental block length: log_|Σ| (2 * lmin * r)

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Backward Oracle

• Set Backward Oracle matching: SBDM, SBOM
• Pages 68-72

String matching of many patterns

[Figure: experimental comparison of Wu-Manber and SBOM for 5, 10, 100 and 1000 patterns, plotted over the alphabet size |Σ| (2-8) and lmin (5-45). Slides courtesy of Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)]

Factor Oracle
Factor Oracle: safe shift
Factor Oracle:

Shift to match a prefix of P2?

Factor oracle / Construction of the factor oracle

• Allauzen, C., Crochemore, M., and Raffinot, M. 1999. Factor Oracle: A New Structure for Pattern Matching. In Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics (November 27 - December 04, 1999). J. Pavelka, G. Tel, and M. Bartosek, Eds. Lecture Notes In Computer Science, vol. 1725. Springer-Verlag, London, 295-310.
• http://portal.acm.org/citation.cfm?id=647009.712672&coll=GUIDE&dl=GUIDE&CFID=31549541&CFTOKEN=61811641#
• http://www-igm.univ-mlv.fr/~allauzen/work/sofsem.ps

So far

• Generalised KMP -> Aho-Corasick
• Generalised Horspool -> Commentz-Walter, Wu-Manber
• BDM, BOM -> Set Backward Oracle Matching…
• Other generalisations?

Multiple Shift-AND

• P = {P1, P2, P3, P4}. Generalize Shift-AND:
• Concatenate the bit vectors of all patterns: Bits = P4 P3 P2 P1
• Start mask = 1 at the first position of each pattern
• Match mask = 1 at the last position of each pattern

Text Algorithms (6EAP)

Full text indexing

Jaak Vilo 2010 fall

Jaak Vilo MTAT.03.190 Text Algorithms

Problem

• Given P and S – find all exact or approximate occurrences of P in S

• You are allowed to preprocess S (and P, of course)

• Goal: to speed up the searches

E.g. the Dictionary problem

• Does P belong to a dictionary D = {d1,…,dn}?
  – Build a binary search tree of D
  – B-tree of D
  – Hashing
  – Sorting + binary search (see the sketch after the word-list figure below)
• Build a keyword trie: search in O(|P|)
  – Assuming the alphabet has at most a constant size c
  – See the Aho-Corasick algorithm, trie construction

Sorted array and binary search

[Figure: a sorted array of 13 dictionary words (global, happy, he, head, header, hers, his, index, info, informal, search, show, stop), indexed 1..13; binary search halves the range at each step]

O( |P| log n )
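A small C illustration of the sorted-array dictionary lookup (my own sketch): bsearch from the C standard library performs the O(log n) halving, and each comparison costs O(|P|), giving the O(|P| log n) bound above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* comparison callback for bsearch over an array of char* */
static int cmp(const void *a, const void *b) {
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

int main(void) {
    /* dictionary D, kept in sorted order */
    const char *D[] = { "global", "happy", "he", "head", "header", "hers",
                        "his", "index", "info", "informal", "search",
                        "show", "stop" };
    size_t n = sizeof D / sizeof D[0];
    const char *query = "head", *key = query;

    const char **hit = bsearch(&key, D, n, sizeof D[0], cmp);
    printf("%s: %s\n", query, hit ? "found" : "not found");
    return 0;
}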

Trie for D={ he, hers, his, she }

[Figure: the keyword trie for {he, hers, his, she}, as on the earlier Aho-Corasick slides; a lookup follows one edge per query character]

O( |P| )

S != set of words

• S of length n

• How to index?

• Index from every position of the text
• The prefix of every possible suffix is important

Trie(babaab)

[Figure: the trie of all suffixes of babaab: babaab, abaab, baab, aab, ab, b]

Suffix tree

• Definition: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents.
• Definition (suffix tree). A suffix tree T for a string S (with n = |S|) is a rooted, labeled tree with a leaf for each non-empty suffix of S. Furthermore, a suffix tree satisfies the following properties:
  • Each internal node, other than the root, has at least two children;
  • Each edge leaving a particular node is labeled with a non-empty substring of S of which the first symbol is unique among all first symbols of the edge labels of the edges leaving this particular node;
  • For any leaf in the tree, the concatenation of the edge labels on the path from the root to this leaf exactly spells out a non-empty suffix of S.
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198.

Literature on suffix trees

• http://en.wikipedia.org/wiki/Suffix_tree
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge Univ Pr, 1997. ISBN 0521585198. (pages 89-208)
• E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-260, 1995. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751
• Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. "Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January 2005. http://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3
• CPM articles archive: http://www.cs.ucr.edu/~stelo/cpm/
• Mark Nelson. Fast String Searching With Suffix Trees. Dr. Dobb's Journal, August 1996. http://www.dogma.net/markn/articles/suffixt/suffixt.htm
• STXXL: Standard Template Library for Extra Large Data Sets. http://stxxl.sourceforge.net/
• ANSI C implementation of a Suffix Tree: http://yeda.cs.technion.ac.il/~yona/suffix_tree/

These are old links – check for newer…
http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english

The suffix tree Tree(T) of T

• The data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
• linear size: |Tree(T)| = O(|T|)
• can be constructed in linear time O(|T|)
• has myriad virtues (A. Apostolico)
• is well-known: 366 000 Google hits

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

Suffix trie and suffix tree

[Figure: Trie(abaab) and Tree(abaab) side by side, representing the suffixes abaab, baab, aab, ab, b; in the tree, unary paths are compressed into single edges]

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

Suffix tree and suffix array techniques for pattern analysis in strings

Esko Ukkonen, Univ Helsinki, Erice School, 30 Oct 2005

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

Algorithms for combinatorial string matching?
• deep beauty? +-
• shallow beauty? +
• applications? ++
• intensive algorithmic miniatures
• sources of new problems: text processing, DNA, music,…

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

High-throughput genome-scale sequence analysis and mapping using compressed data structures
Veli Mäkinen, Department of Computer Science, University of Helsinki
[Figure: an excerpt of raw DNA sequence]

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

Analysis of a string of symbols

• T = h a t t i v a t t i   'text'
• P = a t t   'pattern'
• Find the occurrences of P in T: h a t t i v a t t i
• Pattern synthesis: #(t) = 4, #(a) = 2, #(t****t) = 2

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt

Solution: backtracking with suffix tree

...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG.....

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Pattern finding & synthesis problems

• T = t1t2 … tn, P = p1p2 … pm, strings of symbols over a finite alphabet
• Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast – static text, any given pattern P
• Pattern synthesis problem: Learn from T new patterns that occur surprisingly often
• What is a pattern? An exact substring, an approximate substring, with generalized symbols, with gaps, …

Outline: 1. Suffix tree  2. Suffix array  3. Some applications  4. Finding motifs

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt


Trie(T) can be large

• |Trie(T)| = O(|T|^2)
• bad example: T = a^n b^n
• Trie(T) can be seen as a DFA: the language accepted = the suffixes of T
• minimize the DFA => directed acyclic word graph ('DAWG')

Tree(T) is of linear size

• only the internal branching nodes and the leaves are represented explicitly
• edges are labeled by substrings of T
• v = node(α) if the path from the root to v spells α
• one-to-one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes
• |Tree(T)| = O(|T| + size(edge labels))

Tree(hattivatti)

[Figure: the suffix tree of hattivatti with explicit edge labels such as hattivatti, attivatti, tti, ti, i, vatti]

Tree(hattivatti)

[Figure: the same tree with the substring labels of edges represented as pairs of pointers into hattivatti, e.g. (3,3), (4,5), (2,5), (6,10); leaves are numbered by the starting positions 1..10 of the suffixes]

Tree(T) is a full text index

[Figure: searching P in Tree(T); P occurs in T at locations 8, 31, …]

P occurs in T <=> P is a prefix of some suffix of T <=> a path for P exists in Tree(T)
All occurrences of P in time O(|P| + #occ)

Find 'a' from Tree(hattivatti)

[Figure: the suffix tree of hattivatti; following the edge labelled a finds the two occurrences of 'a', at positions 7 and 2]

Linear time construction of Tree(T)

[Figure: the suffixes hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i]

Weiner (1973), McCreight (1976): 'algorithm of the year'
'on-line' algorithm (Ukkonen 1992)

On-line construction of Trie(T)

• T = t1t2 … tn$

• Pi = t1t2 … ti, the i-th prefix of T
• on-line idea: update Trie(Pi) to Trie(Pi+1)
• => very simple construction

Trie(abaab)

[Figure: the growing tries Trie(a), Trie(ab), Trie(aba), Trie(abaa); a chain of suffix links connects the end points of the current suffixes]

[Animation: to go from Trie(abaa) to Trie(abaab), the next symbol b is added along the suffix-link chain of end points; from the point where a b-arc already exists, nothing more needs to be added]

What happens in Trie(Pi) => Trie(Pi+1)?

[Figure: before the update, the suffix-link chain of end points is followed; from the point where the ti-arc already exists the updating stops; after the update, new nodes and new suffix links have been added along the chain]

• time: O(size of Trie(T))
• suffix links: slink(node(aα)) = node(α)

On-line procedure for suffix trie

1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root
2. for i := 2 to n do begin
3.     vi-1 := the leaf of Trie(t1…ti-1) for the string t1…ti-1 (i.e., the deepest leaf)
4.     v := vi-1; v´ := 0
5.     while node v has no outgoing arc for ti do begin
6.         create a new node v´´ and an arc son(v,ti) = v´´
7.         if v´ ≠ 0 then slink(v´) := v´´
8.         v := slink(v); v´ := v´´
       end
9.     for the node v´´ such that v´´ = son(v,ti) do
           if v´´ = v´ then slink(v´) := root else slink(v´) := v´´
   end

Suffix trees on-line

• 'compacted version' of the on-line trie construction: simulate the construction on the linear-size tree instead of the trie => time O(|T|)
• all trie nodes are conceptually still needed => implicit and real nodes

Implicit and real nodes

• A pair (v,α) is an implicit node in Tree(T) if v is a node of Tree(T) and α is a (proper) prefix of the label of some arc from v. If α is the empty string then (v,α) is a 'real' node (= v).
• Let v = node(α´) in Tree(T). Then the implicit node (v,α) represents node(α´α) of Trie(T)

Implicit node

[Figure: an implicit node (v, α): α is a prefix of the label of an arc leaving v]

Suffix links and open arcs

[Figure: slink(v) points from node(aα) to node(α); an open arc to a leaf w is labeled [i,*] instead of [i,j], where * is the currently scanned position of T]

Big picture

[Figure: per added character, a suffix-link path is traversed (total work O(n)) and new arcs and nodes are created (total work O(size(Tree(T))))]

On-line procedure for suffix tree

Input: string T = t1t2 … tn$
Output: Tree(T)
Notation: son(v,α) = w iff there is an arc from v to w with label α; son(v,ε) = v

Function Canonize(v, α):
    while son(v, α´) ≠ 0, where α = α´α´´, |α´| > 0 do
        v := son(v, α´); α := α´´
    return (v, α)

Suffix-tree on-line: main procedure

Create Tree(t1); slink(root) := root
(v, α) := (root, ε)                     /* (v, α) is the start node */
for i := 2 to n+1 do
    v´ := 0
    while there is no arc from v with label prefix αti do
        if α ≠ ε then                   /* divide the arc w = son(v, αη) into two */
            son(v, α) := v´´; son(v´´,ti) := v´´´; son(v´´,η) := w
        else
            son(v,ti) := v´´´; v´´ := v
        if v´ ≠ 0 then slink(v´) := v´´
        v´ := v´´; v := slink(v); (v, α) := Canonize(v, α)
    if v´ ≠ 0 then slink(v´) := v
    (v, α) := Canonize(v, αti)          /* (v, α) = the start node of the next round */

http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english

The actual time and space

• |Tree(T)| is about 20|T| in practice
• brute-force construction is O(|T| log |T|) for random strings, as the average depth of internal nodes is O(log |T|)
• the difference between linear and brute-force constructions is not necessarily large (Giegerich & Kurtz)
• truncated suffix trees: only the k-symbol prefix of each suffix is represented (Na et al. 2003)
• alphabet-independent linear time (Farach 1997)

[Figure: example suffix tree of abcabxabcd]

Applications of Suffix Trees

• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge Univ Pr, 1997. ISBN 0521585198 – book

• APL1: Exact String Matching. Search for P in text S. Solution 1: build STree(S) – one achieves the same O(n+m) as Knuth-Morris-Pratt, for example!
• Search from the suffix tree is O(|P|)
• APL2: Exact set matching. Search for a set of patterns P

Back to backtracking

[Figure: searching ACA with 1 mismatch by backtracking in the suffix tree; the reported leaves are 4 2 1 5 6 3]

The same idea can be used for many other forms of approximate search, like Smith-Waterman, position-restricted scoring matrices, etc.

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Applications of Suffix Trees

• APL3: Substring problem for a database of patterns. Given a set of strings S = S1, ..., Sn – a database – find all Si that have P as a substring.
• A generalized suffix tree contains all suffixes of all Si
• Query in time O(|P|), and it can identify the LONGEST common prefix of P in all Si

Applications of Suffix Trees

• APL4: Longest common substring of two strings
• Find the longest common substring of S and T.
• Overall there are potentially O(n^2) such substrings, if n is the length of the shorter of S and T
• Donald Knuth once (1970) conjectured that a linear-time algorithm is impossible.
• Solution: construct STree(S+T) and find the deepest node in the tree that has suffixes from both S and T among its subtree leaves.
• Ex: S = superiorcalifornialives and T = sealiver both contain the substring alive.

Simple analysis task: LCSS

• Let LCSS(A,B) denote the longest common substring of two sequences A and B. E.g.:
  – LCSS(AGATCTATCT, CGCCTCTATG) = TCTAT
• A good solution is to build the suffix tree of the shorter sequence and make a descending suffix walk with the other sequence.
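For contrast with the suffix-tree solution above, here is a simple O(|A|·|B|) dynamic-programming sketch of LCSS (my own illustration, not the method advocated on the slide): L[i][j] holds the length of the longest common suffix of A[..i] and B[..j], and the maximum over all cells is the answer.

#include <stdio.h>
#include <string.h>

/* Longest common substring by dynamic programming, O(|a|*|b|) time. */
void lcss(const char *a, const char *b)
{
    int n = strlen(a), m = strlen(b), i, j, best = 0, end = 0;
    static int L[1000][1000];      /* L[i][j]: longest common suffix of a[..i-1], b[..j-1] */

    for (i = 1; i <= n; i++)
        for (j = 1; j <= m; j++) {
            L[i][j] = (a[i-1] == b[j-1]) ? L[i-1][j-1] + 1 : 0;
            if (L[i][j] > best) { best = L[i][j]; end = i; }
        }
    printf("LCSS = %.*s (length %d)\n", best, a + end - best, best);
}

int main(void) { lcss("AGATCTATCT", "CGCCTCTATG"); return 0; }

On the slide's example it prints TCTAT (length 5); the suffix-tree approach reaches the same answer in linear time.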

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Suffix link

[Figure: the suffix link points from the node reached by aX to the node reached by X]

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Descending suffix walk

[Figure: suffix tree of A] Read B left-to-right, always going down in the tree when possible. If the next symbol of B does not match any edge label at the current position, take the suffix link and try again. (The suffix link from the root to itself consumes a symbol.) The node v encountered with the largest string depth is the solution.

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Another common tool: Generalized suffix tree

[Figure: one suffix tree over the concatenation ACCTTA....ACCT#CACATT..CAT#TGTCGT...GTA#TCACCACC...C$ of several sequences separated by #; a node stores info such as subtree size (47813871) and sequence count (87)]

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Generalized suffix tree application

[Figure: a node stores, e.g., subtree size 4398, blue sequences 12/15, red sequences 2/62, for a substring such as ACC occurring in the concatenated, #-separated sequences]

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Case study continued

[Figure: suffix tree of the genome; a candidate motif such as TAC... has 5 blue and 1 red occurrences, where the blue/red classes come from genome regions with and without ChIP-seq matches]

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Properties of suffix tree

• A suffix tree has n leaves and at most n-1 internal nodes, where n is the total length of all sequences indexed.
• Each node requires a constant number of integers (pointers to first child, sibling, parent, the text range of the incoming edge, statistics counters, etc.).
• Can be constructed in linear time.

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Properties of suffix tree... in practice

• Huge overhead due to the pointer structure:
  – A standard implementation of the suffix tree for the human genome requires over 200 GB of memory!
  – A careful implementation (using log n -bit fields for each value and an array layout for the tree) still requires over 40 GB.
  – The human genome itself takes less than 1 GB using 2 bits per bp.

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Applications of Suffix Trees

Applications of Suffix Trees

• APL5: Recognizing DNA contamination. Related to DNA sequencing: search for the longest strings (longer than a threshold) that are present in the DB of sequences of other genomes.
• APL6: Common substrings of more than two strings. Generalization of APL4; can be done in time linear in the total length of all strings.

Applications of Suffix Trees

• APL7: Building a directed graph for exact matching: suffix graph – directed acyclic word graph (DAWG), the smallest finite state automaton recognizing all suffixes of a string S. This automaton can recognize membership, but cannot tell which suffix was matched.
• Construction: merge isomorphic subtrees.
• Subtrees of the suffix tree are isomorphic when there exists a suffix link path and the subtrees have an equal number of leaves.

Applications of Suffix Trees

• APL8: A reverse role for suffix trees, and a major space reduction. Index the pattern, not the tree...
• Matching statistics.
• APL10: All-pairs suffix-prefix matching. For all pairs Si, Sj, find the longest matching suffix-prefix pair. Used in shortest common superstring generation (e.g. DNA sequence assembly), EST alignment, etc.

Applications of Suffix Trees

• APL11: Finding all maximal repetitive structures in linear time
• APL12: Circular string linearization. E.g. for circular chemical molecules in a database one wants to linearize them in a canonical way...
• APL13: Suffix arrays – more space reduction; we will touch on that separately

Applications of Suffix Trees

• APL14: Suffix trees in genome-scale projects
• APL15: A Boyer-Moore approach to exact set matching
• APL16: Ziv-Lempel data compression
• APL17: Minimum length encoding of DNA

Applications of Suffix Trees

• Additional applications – mostly exercises...
• Extra feature: CONSTANT-time lowest common ancestor (LCA) retrieval. A data structure that allows finding the lowest common ancestor in constant time (it corresponds to the longest common prefix!) can be built in linear time.
• APL: Longest common extension: a bridge to inexact matching
• APL: Finding all maximal palindromes in linear time. A palindrome reads the same to the left and to the right from its central position. E.g.: kirik, saippuakivikauppias.
• Build the suffix tree of S and of reversed S (aabcbad => aabcbad#dabcbaa), and using LCA one can ask, for any position pair (i, 2i-1), for the longest common prefix in constant time.
• The whole problem can be solved in O(n).

• APL: Exact matching with wild cards
• APL: The k-mismatch problem
• Approximate palindromes and repeats
• Faster methods for tandem repeats
• A linear-time solution to the multiple common substring problem
• And many, many more ...

Outline: 1. Suffix tree  2. Suffix array  3. Some applications  4. Finding motifs

Suffixes – sorted

[Figure: the suffixes of hattivatti (hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε) listed next to the same suffixes in sorted order (ε, atti, attivatti, hattivatti, i, ivatti, ti, tivatti, tti, ttivatti, vatti)]

• Sort all suffixes. This allows binary search!

Suffix array: example

Sorted order: ε(11) atti(7) attivatti(2) hattivatti(1) i(10) ivatti(5) ti(9) tivatti(4) tti(8) ttivatti(3) vatti(6)

• suffix array = the lexicographic order of the suffixes; for hattivatti, SA = (11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6)

Suffix array construction: sort!

• suffix array = the lexicographic order of the suffixes

Suffix array

• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
• space requirement: 5|T|
• practitioners like suffix arrays (simplicity, space efficiency)
• theoreticians like suffix trees (explicit structure)
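A minimal C sketch of the naive "just sort the suffixes" construction shown above (my own code; O(n^2 log n) in the worst case because each comparison may scan O(n) characters, which is fine for small inputs):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;                 /* the indexed text */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* Fill sa[0..n-1] with the starting positions of the suffixes of t in sorted order. */
void build_sa(const char *t, int *sa) {
    int i, n = strlen(t);
    text = t;
    for (i = 0; i < n; i++) sa[i] = i;
    qsort(sa, n, sizeof(int), cmp_suffix);
}

int main(void) {
    const char *t = "hattivatti";
    int sa[64], i, n = strlen(t);
    build_sa(t, sa);
    for (i = 0; i < n; i++)
        printf("%2d %s\n", sa[i] + 1, t + sa[i]);   /* 1-based positions, as on the slides */
    return 0;
}

For hattivatti this prints the positions 7 2 1 10 5 9 4 8 3 6, i.e. the SA above without the empty suffix; the linear-time constructions discussed below avoid the repeated character comparisons.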

Reducing space: suffix array

[Figure: the suffix tree of CATACT with edge labels stored as text ranges such as [2,2], [3,6], [6,6]; reading the leaves left to right gives the suffix array 4 2 1 5 6 3 for C A T A C T (positions 1..6)]

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Suffix array

• Many algorithms on the suffix tree can be simulated using the suffix array...
  – ... and a couple of additional arrays...
  – ... forming the so-called enhanced suffix array...
  – ... leading to a space requirement similar to a careful implementation of the suffix tree
• Not a satisfactory solution to the space issue.

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Pattern search from suffix array

[Figure: binary search for the pattern att over the sorted suffixes of hattivatti (suffix array 11 7 2 1 10 5 9 4 8 3 6); a code sketch follows below]
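A hedged C sketch of the O(|P| log n) binary search over a suffix array, as illustrated above (my own code, reusing the 0-based array produced by the earlier build_sa sketch): it finds the contiguous range of suffixes that start with the pattern.

#include <stdio.h>
#include <string.h>

/* Report the occurrences of pat in t using its suffix array sa[0..n-1]. */
void sa_search(const char *t, const int *sa, int n, const char *pat)
{
    int m = strlen(pat), lo, hi, mid, first, last;

    /* lower bound: first suffix whose first m chars are >= pat */
    for (lo = 0, hi = n; lo < hi; ) {
        mid = (lo + hi) / 2;
        if (strncmp(t + sa[mid], pat, m) < 0) lo = mid + 1; else hi = mid;
    }
    first = lo;
    /* upper bound: first suffix whose first m chars are > pat */
    for (hi = n; lo < hi; ) {
        mid = (lo + hi) / 2;
        if (strncmp(t + sa[mid], pat, m) <= 0) lo = mid + 1; else hi = mid;
    }
    last = lo;
    for (mid = first; mid < last; mid++)
        printf("occurrence at %d\n", sa[mid] + 1);   /* 1-based positions */
}

int main(void) {
    const char *t = "hattivatti";
    int sa[] = { 6, 1, 0, 9, 4, 8, 3, 7, 2, 5 };     /* 0-based SA of hattivatti */
    sa_search(t, sa, 10, "att");
    return 0;
}

Searching att in hattivatti reports the occurrences at positions 7 and 2, matching the figure.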

What do we learn today?

• We learn that it is possible to replace suffix trees with compressed suffix trees that take 8.8 GB for the human genome.
• We learn that backtracking can be done using compressed suffix arrays requiring only 2.1 GB for the human genome.
• We learn that discovering interesting motif seeds from the human genome takes 40 hours and requires 9.3 GB of space.

ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."

Recent suffix array constructions

• Manber & Myers (1990): O(|T| log |T|)
• linear time via the suffix tree
• January / June 2003: direct linear-time construction of the suffix array
  - Kim, Sim, Park, Park (CPM03)
  - Kärkkäinen & Sanders (ICALP03)
  - Ko & Aluru (CPM03)

Kärkkäinen-Sanders algorithm

1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively. 2. Construct the suffix array of the remaining suffixes using the result of the first step. 3. Merge the two suffix arrays into one.

Notation

• string T = T[0,n) = t0 t1 … tn-1
• suffix Si = T[i,n) = ti ti+1 … tn-1
• for C ⊂ [0,n]: SC = { Si | i ∈ C }
• the suffix array SA[0,n] of T is a permutation of [0,n] satisfying S_SA[0] < S_SA[1] < … < S_SA[n]

Running example

Position: 0 1 2 3 4 5 6 7 8 9 10 11
• T[0,n) = y a b b a d a b b a d o, padded with 0 0 …
• SA = (12, 1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0)

Step 0: Construct a sample

• for k = 0,1,2: Bk = { i ∈ [0,n] | i mod 3 = k }
• C = B1 ∪ B2 – the sample positions
• SC – the sample suffixes
• Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}

Step 1: Sort sample suffixes

• for k = 1,2, construct
  Rk = [tk tk+1 tk+2] [tk+3 tk+4 tk+5] … [t_maxBk t_maxBk+1 t_maxBk+2]
  R = R1 ^ R2 – the concatenation of R1 and R2

Suffixes of R correspond to SC: suffix [ti ti+1 ti+2]… corresponds to Si; the correspondence is order preserving.

Sort the suffixes of R: radix sort the characters and rename them with ranks to obtain R´. If all characters are different, their order directly gives the order of the suffixes. Otherwise, sort the suffixes of R´ recursively using Kärkkäinen-Sanders. Note: |R´| = 2n/3.

Step 1 (cont.)

• once the sample suffixes are sorted, assign a rank to each: rank(Si) = the rank of Si in SC; rank(Sn+1) = rank(Sn+2) = 0
• Example:
  R = [abb][ada][bba][do0][bba][dab][bad][o00]
  R´ = (1, 2, 4, 6, 4, 5, 3, 7)
  SA_R´ = (8, 0, 1, 6, 4, 2, 5, 3, 7)
  rank(Si): - 1 4 - 2 6 - 5 3 - 7 8 - 0 0   ('-' marks the non-sample positions i mod 3 = 0)

Step 2: Sort nonsample suffixes

• for each non-sample Si є SB0 (note that rank(Si+1) is always defined for i є B0): Si ≤ Sj ↔ (ti,rank(Si+1)) ≤ (tj,rank(Sj+1))

• radix sort the pairs (ti,rank(Si+1)).

• Example: S12 < S6 < S9 < S3 < S0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)

Step 3: Merge

• merge the two sorted sets of suffixes using standard comparison-based merging:
• to compare Si ∈ SC with Sj ∈ SB0, distinguish two cases:
  • i ∈ B1: Si ≤ Sj ↔ (ti, rank(Si+1)) ≤ (tj, rank(Sj+1))
  • i ∈ B2: Si ≤ Sj ↔ (ti, ti+1, rank(Si+2)) ≤ (tj, tj+1, rank(Sj+2))

• note that the ranks are defined in all cases!

• S1 < S6 as (a,4) < (a,5) and S3 < S8 as (b,a,6) < (b,a,7)

Running time O(n)

• excluding the recursive call, everything can be done in linear time
• the recursion is on a string of length 2n/3
• thus the time is given by the recurrence T(n) = T(2n/3) + O(n)
• hence T(n) = O(n)

Implementation

• about 50 lines of C++
• code available e.g. via Juha Kärkkäinen's home page

LCP table

• Longest Common Prefix of successive elements of the suffix array:
• LCP[i] = the length of the longest common prefix of the suffixes S_SA[i] and S_SA[i+1]
• build the inverse array SA^-1 from SA in linear time
• then the LCP table from SA^-1 in linear time (Kasai et al, CPM 2001)
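A short C sketch of Kasai's linear-time LCP computation just described (my own code; sa is 0-based and lcp[r] is the common prefix length of the suffixes of ranks r and r+1):

#include <stdio.h>
#include <string.h>

/* Kasai et al.: LCP of consecutive suffix-array entries in O(n). */
void kasai_lcp(const char *t, const int *sa, int n, int *lcp)
{
    int i, k = 0, rank[64];                         /* rank = SA^-1 (small demo size) */

    for (i = 0; i < n; i++) rank[sa[i]] = i;
    for (i = 0; i < n; i++) {
        if (rank[i] == n - 1) { k = 0; continue; }  /* last suffix in sorted order */
        int j = sa[rank[i] + 1];                    /* the next suffix in sorted order */
        while (t[i + k] && t[i + k] == t[j + k]) k++;
        lcp[rank[i]] = k;
        if (k > 0) k--;                             /* the LCP can drop by at most 1 */
    }
}

int main(void) {
    const char *t = "hattivatti";
    int sa[] = { 6, 1, 0, 9, 4, 8, 3, 7, 2, 5 };    /* 0-based SA of hattivatti */
    int lcp[10] = {0}, r;
    kasai_lcp(t, sa, 10, lcp);
    for (r = 0; r < 9; r++) printf("lcp[%d] = %d\n", r, lcp[r]);
    return 0;
}

The amortised argument is the same as for KMP: k decreases by at most 1 per text position, so the total work of the while loop is O(n).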

• Oxford English Dictionary http://www.oed.com/
• Example – Word of the Day, "Fourth": http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/L7_SuffixTrees/wotd_fourth.html http://www.oed.com/cgi/display/wotd
• PAT index – by Gaston Gonnet (he is also one of the creators of the Maple software, and later a developer of molecular biology software packages)
• The PAT index is essentially a suffix array. To save space, only the first character of every word is indexed.
• XML tagging (or SGML, at that time!) was also indexed
• Bit vectors were used to mark certain fields of the XML.
• Main concern – improve the speed of search on the CD – minimize random accesses.
• For a slow medium even 15-20 accesses is too slow...
• G. H. Gonnet, R. A. Baeza-Yates, and T. Snider, Lexicographical indices for text: Inverted files vs. PAT trees, Technical Report OED-91-01, Centre for the New OED, University of Waterloo, 1991.

Suffix tree vs suffix array

• suffix tree <=> suffix array + LCP table

Outline: 1. Suffix tree  2. Suffix array  3. Some applications  4. Finding motifs

Substring motifs of string T

• string T = t1 … tn over alphabet A.
• Problem: what are the frequently occurring (ungapped) substrings of T? The longest substring that occurs at least q times?
• Thm: The suffix tree Tree(T) gives complete occurrence counts of all substring motifs of T in O(n) time (although T may have O(n^2) substrings!)

Counting the substring motifs

• internal nodes of Tree(T) ↔ repeating substrings of T
• the number of leaves in the subtree of the node for string P = the number of occurrences of P in T

Substring motifs of hattivatti

[Figure: the suffix tree of hattivatti with occurrence counts at the internal nodes, e.g. t: 4; atti, tti, ti, i: 2 each. Counts for the O(n) maximal motifs shown]

Finding repeats in DNA

• human chromosome 3
• the first 48 999 930 bases
• 31 min CPU time (8 processors, 4 GB)
• Human genome: 3×10^9 bases
• Tree(HumanGenome) feasible

Longest repeat?

Occurrences at: 28395980, 28401554r Length: 2559 ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaat gctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaa catgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcat agtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacat acgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaat caccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgc cattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttctttt gagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagt aggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggct tttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatg taagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaagga agggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatc agatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagt atagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttc caattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatga gcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtatttt attctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgag actttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctct tttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgt gccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaa tttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtac gtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattt tattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatga gttagg 237 Ten occurrences? ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggat ctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcc caagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagt agagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccg cccgcctcggcctcccaaagtgctgggattacaggcgt

Length: 277

Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125 In the reversed complement at: 17858493, 41463059, 42431718, 42580925

Using suffix trees: plagiarism

• find the longest common substring of strings X and Y
• build Tree(X$Y) and find the deepest node which has a leaf pointing into X and another pointing into Y

Using suffix trees: approximate matching

Edit operations: insertions, deletions, changes

• STOCKHOLM vs TUKHOLMA

String distance/similarity functions

STOCKHOLM vs TUKHOLMA

S T O C K H O L M _
_ T U _ K H O L M A    => 2 deletions, 1 insertion, 1 change

Approximate string matching

• A: S T O C K H O L M   B: T U K H O L M A
• minimum number of 'mutation' steps: a -> b, a -> ε, ε -> b, …

• d_ID(A,B) = 5, d_Levenshtein(A,B) = 4
• mutation costs => probabilistic modeling
• evaluation by dynamic programming => alignment

Dynamic programming

d_{i,j} = min( if a_i = b_j then d_{i-1,j-1} else ∞,  d_{i-1,j} + 1,  d_{i,j-1} + 1 )
        = the distance between the i-prefix of A and the j-prefix of B (substitution excluded)

[Figure: an m×n table d; cell d_{i,j} is computed from d_{i-1,j-1}, d_{i-1,j}+1 and d_{i,j-1}+1; the answer is d_{m,n}]

d_{i,j} = min( if a_i = b_j then d_{i-1,j-1} else ∞,  d_{i-1,j} + 1,  d_{i,j-1} + 1 )

A\B     s t o c k h o l m
      0 1 2 3 4 5 6 7 8 9
  t   1 2 1 2 3 4 5 6 7 8
  u   2 3 2 3 4 5 6 7 8 9
  k   3 4 3 4 5 4 5 6 7 8
  h   4 5 4 5 6 5 4 5 6 7
  o   5 6 5 4 5 6 5 4 5 6
  l   6 7 6 5 6 7 6 5 4 5
  m   7 8 7 6 7 8 7 6 5 4
  a   8 9 8 7 8 9 8 7 6 5

Optimal alignment by trace-back; the bottom-right cell gives d_ID(A,B) = 5. A code sketch follows below.
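A small C sketch of the recurrence above (my own code, computing the indel distance d_ID in which a mismatching substitution is forbidden, exactly as in the table shown):

#include <stdio.h>
#include <string.h>

#define INF 1000000

/* Indel distance d_ID per the slide's recurrence (no substitutions allowed). */
int d_id(const char *a, const char *b)
{
    int m = strlen(a), n = strlen(b), i, j;
    static int d[100][100];

    for (i = 0; i <= m; i++) d[i][0] = i;
    for (j = 0; j <= n; j++) d[0][j] = j;
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
            int diag = (a[i-1] == b[j-1]) ? d[i-1][j-1] : INF;
            int del  = d[i-1][j] + 1;          /* delete a[i] */
            int ins  = d[i][j-1] + 1;          /* insert b[j] */
            d[i][j] = diag < del ? (diag < ins ? diag : ins) : (del < ins ? del : ins);
        }
    return d[m][n];
}

int main(void) {
    printf("d_ID = %d\n", d_id("tukholma", "stockholm"));   /* prints 5, as in the table */
    return 0;
}

Allowing a substitution with cost 1 in the diag branch would turn this into the Levenshtein distance (4 for this pair).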

Search problem

• find approximate occurrences of pattern P in text T: substrings P' of T such that d(P,P') is small
• dynamic programming with a small modification: O(mn)
• lots of (practical) improvement tricks

[Figure: pattern P compared against a substring P' of T]

Index for approximate searching?

• dynamic programming: P x Tree(T) with backtracking

Tree(T)

P
