02.01.15

Exact paern matching

• S = s1 s2 … sn (text) |S| = n (length) Text Algorithms • P = p1p2..pm (paern) |P| = m

• Σ - alphabet | Σ| = c Jaak Vilo 2014 fall • Does S contain P? – Does S = S' P S" fo some strings S' ja S"?

– Usually m << n and n can be (very) large Jaak Vilo MTAT.03.190 Text Algorithms 1

Find occurrences in text Algorithms

P One-paern Mul-paern • Brute force • Aho Corasick S • Knuth-Morris-Pra • Commentz-Walter • Karp-Rabin • Shi-OR, Shi-AND • Boyer-Moore Indexing • • Factor searches (and suffix trie) • Suffix tree

Animaons Brute force: BAB in text?

• hp://www-igm.univ-mlv.fr/~lecroq/string/ ABACABABBABBBA

• EXACT STRING MATCHING ALGORITHMS BAB Animaon in Java • Chrisan Charras - Thierry Lecroq Laboratoire d'Informaque de Rouen Université de Rouen Faculté des Sciences et des Techniques 76821 Mont-Saint-Aignan Cedex FRANCE • e-mails: {Chrisan.Charras, Thierry.Lecroq} @laposte.net

1 02.01.15

Brute Force Brute force

attempt 1: gcatcgcagagagtatacagtacg P Algorithm Naive GCAg.... i i+j-1 attempt 2: Input: Text S[1..n] and gcatcgcagagagtatacagtacg S paern P[1..m] g......

attempt 3: Output: All posions i, where gcatcgcagagagtatacagtacg g...... j P occurs in S attempt 4: gcatcgcagagagtatacagtacg Idenfy the first mismatch! g......

for( i=1 ; i <= n-m+1 ; i++ ) attempt 5: gcatcgcagagagtatacagtacg for ( j=1 ; j <= m ; j++ ) g......

if( S[i+j-1] != P[j] ) break ; attempt 6: gcatcgcagagagtatacagtacg Queson: GCAGAGAG if ( j > m ) print i ; attempt 7: gcatcGCAGAGAGtatacagtacg

§Problems of this method? L g...... §Ideas to improve the search? J

Brute force or NaiveSearch C code

int bf_2( char* pat, char* text , int n ) /* n = textlen */ 1 funcon NaiveSearch(string s[1..n], string sub[1..m]) { int m, i, j ; 2 for i from 1 to n-m+1 int count = 0 ; m = strlen(pat); 3 for j from 1 to m for ( i=0 ; i + m <= n ; i++) { 4 if s[i+j-1] ≠ sub[j] for( j=0; j < m && pat[j] == text[i+j] ; j++) ; 5 jump to next iteraon of outer loop if( j == m ) 6 return i count++ ; } 7 return not found return(count); }

C code Main problem of Naive int bf_1( char* pat, char* text ) { P int m ; i i+j-1 int count = 0 ; char *tp; S

m = strlen(pat); j tp=text ; S

for( ; *tp ; tp++ ) { j if( strncmp( pat, tp, m ) == 0 ) { count++ ; } } • For the next possible locaon of P, check

return( count ); again the same posions of S }

2 02.01.15

D. Knuth, J. Morris, V. Pratt: Fast in strings. Goals Knuth-Morris-Pra SIAM Journal on Computing 6:323-350, 1977.

• Make sure that no comparisons “wasted” • Make sure only a constant nr of comparisons/ x operaons is made for each posion in S ≠ y – Move (only) from le to right in S

– How? • Aer such a mismatch we already know – Aer a test of S[i] <> P[j] what do we now ? exactly the values of green area in S !

D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in strings. Knuth-Morris-Pra SIAM Journal on Computing 6:323-350, 1977. Automaton for ABCABD

• Make sure that no comparisons “wasted” NOT A

A B C A B D prefix x 1 2 3 4 5 6 7 ≠ prefix y ≠ p z

• P – longest suffix of any prefix that is also a prefix of a paern • Example: ABCABD ABCABD

Automaton for ABCABD KMP matching

NOT A Input: Text S[1..n] and paern P[1..m] Output: First occurrence of P in S (if exists) A B C A B D 1 2 3 4 5 6 7 i=1; j=1; initfail(P) // Prepare fail links repeat if j==0 or S[i] == P[j] then i++ , j++ // advance in text and in pattern Fail: 0 1 1 1 2 3 1 else j = fail[j] // use fail link until j>m or i>n if j>m then report match at i-m Pattern: A B C A B D

1 2 3 4 5 6

3 02.01.15

Inializaon of fail links Inializaon of fail links i Algorithm: KMP_Iniail ABCABD i=1, j=0 , fail[1]= 0 j Input: Paern P[1..m] repeat Output: fail[] for paern P if j==0 or P[i] == P[j] then i++ , j++ , fail[i] = j Fail: 0 else j = fail[j] i=1, j=0 , fail[1]= 0 until i>=m repeat 0 1 if j==0 or P[i] == P[j] 0 1 1 1 then i++ , j++ , fail[i] = j else j = fail[j] ABCABD until i>=m

0 1 1 1 2

Time complexity of KMP matching? Analysis of me complexity

Input: Text S[1..n] and paern P[1..m] • At every cycle either i and j increase by 1 Output: First occurrence of P in S (if exists) • Or j decreases (j=fail[j]) i=1; j=1; initfail(P) // Prepare fail links repeat • i can increase n (or m) mes if j==0 or S[i] == P[j] • Q: How oen can j decrease? then i++ , j++ // advance in text and in pattern else j = fail[j] // use fail link – A: not more than nr of increases of i until j>m or i>n if j>m then report match at i-m • Amorsed analysis: O(n) , preprocess O(m)

Karp-Rabin Karp-Rabin R.Karp and M. Rabin: Efficient randomized pattern-matching algorithms. R.Karp and M. Rabin: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (1987), 249-260. IBM Journal of Research and Development 31 (1987), 249-260.

• Compare in O(1) a hash of P and S[i..i+m-1] • Compare in O(1) a hash of P and S[i..i+m-1]

i .. (i+m-1) i .. (i+m-1) i .. (i+m-1)

h(T[i.. i+m-1]) h(T[i+1..i+m])

h(P) h(P)

1..m 1..m • Goal: O(n). • Goal: O(n). • f( h(T[i.. i+m-1]) -> h(T[i+1..i+m]) ) = O(1) • f( h(T[i.. i+m-1]) -> h(T[i+1..i+m]) ) = O(1)

4 02.01.15

Hash Let’s use numbers

• “Remove” the effect of T[i] and “Introduce” • T= 57125677 the effect of T[i+m] – in O(1) • P= 125 (and for simplicity, h=125)

• Use base |Σ| arithmecs and treat charcters • H(T[1])=571 as numbers • H(T[2])= (571-5*100)*10 + 2 = 712

• In case of hash match – check all m posions • H(T[3]) = (H(T[2]) – ord(T[1])*10m)*10 + T[3+m-1] • Hash collisions => Worst case O(nm)

hash

• c – size of alphabet • hash(w[0 .. m-1])=(w[0]*2m-1+ w[1]*2m-2+···+ w[m-1]*20) mod q

• HSi = H( S[i..i+m-1] )

• H( S[i+1 .. i+m ] ) = ( HSi – ord(S[i])* cm-1 ) * c + ord( S[i+m] )

• Modulo arithmec – to fit value in a word!

Karp-Rabin More ways to ensure O( n ) ?

Input: Text S[1..n] and pattern P[1..m] Output: Occurrences of P in S 1. c=20; /* Size of the alphabet, say nr. of aminoacids */ 2. q = 33554393 /* q is a prime */ 3. cm = cm-1 mod q 4. hp = 0 ; hs = 0 5. for i = 1 .. m do hp = ( hp*c + ord(p[i]) ) mod q // H(P) 6. for i = 1 .. m do hs = ( hp*c + ord(s[i]) ) mod q // H(S[1..m]) 7. if hp == hs and P == S[1..m] report match at position

8. for i=2 .. n-m+1 9. hs = ( (hs - ord(s[i-1])*cm) * c + ord(s[i+m-1]) mod q 10. if hp == hs and P == S[i..i+m-1] 11. report match at position i

5 02.01.15

Shi-AND / Shi-OR Bit-operaons

• Ricardo Baeza-Yates , Gaston H. Gonnet • Maintain a set of all prefixes that have so far A new approach to text searching had a perfect match Communicaons of the ACM October 1992, Volume 35 Issue 10 [ACM Digital Library:hp://doi.acm.org/10.1145/135239.135243] [DOI] • On the next character in text update all previous pointers to a new set • PDF

• Bit vector: for every possible character

State: which prefixes match? Move to next:

0 1

0 0

1

Track posions of prefix matches Vectors for every char in Σ

• P=aste

0 1 0 1 0 1 a s t e b c d .. z 1 0 1 0 1 1 Shift left << 1 0 0 0 0 ... Mask on char T[i] 1 0 0 0 1 1 Bitwise AND 0 1 0 0 0 ...

1 0 0 0 1 1 0 0 1 0 0 ... 0 0 0 1 0 ...

6 02.01.15

• T= lasteaed • T= lasteaed

l a s t e a e d l a s t e a e d 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

• T= lasteaed • T= lasteaed

l a s t e a e d l a s t e a e d 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0

Summary

Algorithm Worst case Ave. Case Preprocess

Brute force O(mn) O(n * (1+1/|Σ|+..)

Knuth-Morris-Pra O(n) O(n) O(m)

Rabin-Karp O(mn) O(n) O(m)

Boyer-Moore O(n/m) ?

BM Horspool

Factor search

Shi-OR O(n) O(n) O( m|Σ| )

7 02.01.15

Find occurrences in text

• R. Boyer, S.Moore: A fast string searching P algorithm. CACM 20 (1977), 762-772 [PDF] • http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf S

• Have we missed anything?

43 44

Find occurrences in text

P

S ABCDE BBCDE

• What have we learned if we test for a potenal match from the end?

45 46

Bad character heuriscs Find occurrences in text maximal shi on S[i]

P P S[i]

S A S A X B B

S X X First x in pattern (from end)

delta1( S[i] ) – |m| if pattern does not contain S[i] patlen-j max j so that P[j] == S[i]

47 48

8 02.01.15

Good suffix heuriscs

void bmInitocc() { P

char a; int j; S A µ for(a=0; a

49 50

Boyer-Moore algorithm

Input: Text S[1..n] and pattern P[1..m] • hp://www.i.-flensburg.de/lang/algorithmen/paern/bmen.htm Output: Occurrences of P in S

• hp://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Arcles/Exact/Boyer-Moore-original-p762-boyer.pdf preprocess_BM() // delta1 and delta2 i=m • Animaon: hp://www-igm.univ-mlv.fr/~lecroq/string/ while i <= n for( j=m; j>0 and P[j]==S[i-m+j]; j-- ) ; if j==0 report match at position i-m+1 i = i+ max( delta1[ S[i] ], delta2[ j ] )

51 52

Simplificaons of BM Algorithm BMH (Boyer-Moore-Horspool)

• There are many variants of Boyer-Moore, and • RN Horspool - Praccal Fast Searching in Strings Soware - Pracce and Experience, 10(6):501-506 1980 many scienfic papers. Input: Text S[1..n] and pattern P[1..m] • On average the me complexity is sublinear Output: occurrences of P in S 1. for a in Σ do delta[a] = m • Algorithm speed can be improved and yet 2. for j=1..m-1 do delta[P[j]] = m-j

simplify the code. 3. i=m 4. while i <= n • It is useful to use the last character heuriscs 5. if S[i] == P[m] 6. j = m-1 (Horspool (1980), Baeza-Yates(1989), Hume 7. while ( j>0 and P[j]==S[i-m+j] ) j = j-1 ; and Sunday(1991)). 8. if j==0 report match at i-m+1 9. i = i + delta[ S[i] ]

53 54

9 02.01.15

String Matching: Horspool algorithm Algorithm Boyer-Moore-Horspool-Hume-Sunday (BMHHS) • How the comparison is made? • Use delta in a ght loop Text : • If match (delta==0) then check and apply original delta d Pattern : From right to left: suffix search Input: Text S[1..n] and pattern P[1..m] Output: occurrences of P in S 1. for a in Σ do delta[a] = m • Which is the next position of the window? 2. for j=1..m-1 do delta[P[j]] = m-j 3. d = delta[ P[ m ] ]; // memorize d on P[m] Text : a 4. delta[ P[ m ] ] = 0; // ensure delta on match of last char is 0 5. for ( i=m ; i<= n ; i = i+d ) Pattern : 6. repeat // skip loop It depends of where appears the last letter of the text, say it ‘a’, in the pattern: 7. t=delta[ S[i] ] ; i = i + t 8. until t==0 a a a 9. for( j=m-1 ; j> 0 and P[j]==S[i-m+j] ; j = j-1 ) ; 10. if j==0 report match at i-m+1 a a a a a a BMHHS requires that the text is padded by P: S[n+1]..S[n+m] = P (in order for the algorithm to finish correctly – at least one occurrence!). Then it is necessary a preprocess that determines the length of the shift. 56

• Daniel M. Sunday: A very fast search algorithm [PDF] Communicaons of the ACM August 1990, Volume 33 Issue 8

• Loop unrolling: • Avoid too many loops (each loop requires tests) by just repeang code within the loop. • Line 7 in previous algorithm can be replaced by:

7. i += delta[ S[i] ]; i += delta[ S[i] ]; i += (t = delta[ S[i] ]) ;

57 58

Forward-Fast-Search: Another Fast Variant of the Boyer-Moore String Matching Algorithm Factor based approach • The Prague Stringology Conference '03 • Opmal average-case algorithms • Domenico Cantone and Simone Faro • Abstract: We present a variaon of the Fast-Search string matching – Assuming independent characters, same probability algorithm, a recent member of the large family of Boyer-Moore-like algorithms, and we compare it with some of the most effecve string matching algorithms, such as Horspool, Quick Search, Tuned Boyer- Moore, Reverse Factor, Berry-Ravindran, and Fast-Search itself. All algorithms are compared in terms of run-me efficiency, number of text • Factor – a substring of a paern character inspecons, and number of character comparisons. It turns out that our new proposed variant, though not linear, achieves very good – Any substring results especially in the case of very short paerns or small alphabets. – (how many?) • hp://cs.felk.cvut.cz/psc/event/2003/p2.html • PS.gz (local copy)

59 60

10 02.01.15

Factor based approach Factor searches

• Opmal average-case algorithms P – Assuming independent characters, same probability u S X

Do not compare characters, but find the longest match to any subregion of the pattern.

61 62

Examples Backward DAWG Matching BDM recognises all factors (and suffixes) in O(n) • Backward DAWG Matching (BDM) – Crochemore et al 1994 • Backward Nondeterminisc DAWG Matching (BNDM) – Navarro, Raffinot 2000 • Backward Oracle Matching (BOM) – Allauzen, Crochermore, Raffinot 2001

Do not compare characters, but find the longest match to any 63 subregion of the pattern. 64

BNDM – simulate using bitparallelism BNDM matches an NDA Bits – show where the factors have occurred so far NDA on the suffixes of ‘announce’

65 66

11 02.01.15

Determinisc version of the same

BNDM – Backward Non-Deterministic DAWG Matching Backward Factor Oracle BOM - Backward Oracle matching

67 68

String Matching of one paern Mulple paerns

S

CTACTACTACGTCTATACTGATCGTAGC

TACTACGGTATGACTAA

1. Prefix search {P}

2. Suffix search

3. Factor search

Why? Algorithms

• Mulple paerns • Aho-Corasick (search for mulple words) • Highlight mulple different search words on the page – Generalizaon of Knuth-Morris-Pra • Virus detecon – filter for virus signatures • • Spam filters Commentz-Walter • Scanner in compiler needs to search for mulple keywords – Generalizaon of Boyer-Moore & AC • Filter out stop words or disallowed words • Wu and Manber • Intrusion detecon soware – improvement over C-W • Next-generaon sequencing produces huge amounts (many millions) of short reads (20-100 bp) that need to be mapped to genome! • Addional methods, tricks and techniques • …

12 02.01.15

Aho-Corasick (AC)

• Alfred V. Aho and Margaret J. Corasick (Bell Labs, Murray Hill, NJ) • Efficient string matching. An aid to bibliographic search. Generalizaon of KMP for many paerns Communicaons of the ACM, Volume 18 , Issue 6, p333-340 (June 1975) • Text S like before. • ACM:DOI PDF • ABSTRACT This paper describes a simple, efficient algorithm to locate all • Set of paerns P = { P1 , .., Pk } occurrences of any of a finite number of keywords in a string of text. The algorithm consists of construcng a finite state paern matching machine • Total length |P| = m = Σ i=1..k mi from the keywords and then using the paern matching machine to process the text string in a single pass. Construcon of the paern • Problem: find all occurrences of any of the matching machine takes me proporonal to the sum of the lengths of the keywords. The number of state transions made by the paern matching Pi ∈ P from S machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.

References:

Idea PATRICIA trie

• D. R. Morrison, "PATRICIA: Praccal Algorithm To Retrieve Informaon 1. Create an automaton from all paerns Coded In Alphanumeric", Journal of the ACM 15 (1968) 514-534. • Abstract PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving informaon in a large file, which is 2. Match the automaton economical of index space and of reindexing me. It does not require rearrangement of text or index as new material is added. It requires a minimum restricon of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves informaon in response to keys furnished by the user with a quanty of computaon which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variaons as FORTRAN • Use the PATRICIA trie for creang the main structure programs for the CDC-3600, ulizing disk file storage of text. It has been of the automaton applied to several large informaon-retrieval problems and will be applied to others. • ACM:DOI PDF

• Word trie - a good to represent a set of words (e.g. a diconary). • trie (data structure)

• Definion: A tree for storing strings in which there is one node for every common preffix. The strings are stored in extra leaf nodes. • See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree. • Note: The name comes from retrieval and is pronounced, "tree." • To test for a word p, only O(|p|) me is used no maer how many words are in the diconary...

13 02.01.15

Trie for P={he, she, his, hers} Trie for P={he, she, his, hers}

0

h s 0 0

h h s 1 3 i e h 1 1 3

e e h 2 6 4 r s e 2 2 4

e 8 5 s 7

5 9

How to search for words like he, sheila, hi. Do these occur in the trie? Aho-Corasick

0 1. Create an automaton MP for a set of strings P. h s 2. Finite state machine: read a character from text,

1 3 and change the state of the automaton based on i the state transions... e h 3. Main links: goto[j,c] - read a character c from text 2 6 4 and go from a state j to state goto[j,c]. r s e 4. If there are no goto[j,c] links on character c from

8 state j, use fail[j]. 7 5 s 5. Report the output. Report all words that have been found in state j. 9

AC Automaton (vs KMP) AC - matching

NOT { h, s } Input: Text S[1..n] and an AC automaton M for paern set P goto[1,i] = 6. ; Output: Occurrences of paerns from P in S (last posion) fail[7] = 3, 0 1. state = 0 fail[8] = 0 , h s fail[5]=2. 2. for i = 1..n do 1 3 e i h 3. while (goto[state,S[i]]== ∅ ) and (fail[state]!=state) do 2 Output table 4 4. state = fail[state] r state output[j] e 5. state = goto[state,S[i]] 2 he 6 5 she, he 8 5 s s 6. if ( output[state] not empty ) 7 his 9 hers 7. then report matches output[state] at posion i 9 7

14 02.01.15

Algorithm Aho-Corasick preprocessing I (TRIE) Input: P = { P , ... , P } Preprocessing II for AC (FAIL) 1 k Output: goto[] and paral output[] Assume: output(s) is empty when a state s is created; queue = ∅ goto[s,a] is not defined. for a ∈ Σ do

if goto[0,a] ≠ 0 then procedure enter(a1, ... , am) /* Pi = a1, ... , am */ begin begin 10. news = 0 enqueue( queue, goto[0,a] ) fail[ goto[0,a] ] = 0 1. s=0; j=1; 11. for i=1 to k do enter( Pi ) 12. for a ∈ Σ do 2. while goto[s,aj] ≠ ∅ do // follow exisng path 3. s = goto[s,a ]; while queue ≠ ∅ j ∅ 4. j = j+1; 13. if goto[0,a] = then goto[0,a] = 0 ; r = take( queue ) end 5. for p=j to m do // add new path (states) for a ∈ Σ do

6. news = news+1 ; if goto[r,a] ≠ ∅ then s = goto[ r, a ]

7. goto[s,ap] = news; enqueue( queue, s ) // breadth first search 8. s = news; state = fail[r]

9. output[s] = a1, ... , am while goto[state,a] = ∅ do state = fail[state] end fail[s] = goto[state,a] output[s] = output[s] + output[ fail[s] ]

Correctness AC matching me complexity

• Theorem For matching the MP on text S, |S|=n, • Let string t "point" from inial state to state j. less than 2n transions within M are made. • Proof Compare to KMP. • Must show that fail[j] points to longest suffix • There is at most n goto steps. that is also a prefix of some word in P. • Cannot be more than n Fail-steps. • In total -- there can be less than 2n transions in • Look at the arcle... M.

Individual node (goto) AC thoughts

• Full table • Scales for many strings simultaneously. • For very many paerns – search me (of grep) improves(??) – See Wu-Manber arcle • List • When k grows, then more fail[] transions are made (why?) • But always less than n. • If all goto[j,a] are indexed in an array, then the size is

• Binary search tree(?) |MP|*|Σ|, and the running me of AC is O(n). • When k and c are big, one can use lists or trees for storing transion funcons. • Some other index? • Then, O(n log(min(k,c)) ).

15 02.01.15

Advanced AC Problems of AC?

• Precalculate the next state transion correctly for every possible character in alphabet • Need to rebuild on adding / removing paerns

• Can be good for short paerns • Details of branching on each node(?)

C-W descripon Commentz-Walter

• Generalizaon of Boyer-Moore for mulple • Aho and Corasick [AC75] presented a linear-me sequence search algorithm for this problem, based on an automata • Beate Commentz-Walter approach. This algorithm serves as the basis for the A String Matching Algorithm Fast on the Average UNIX tool fgrep. A linear-me algorithm is opmal in Proceedings of the 6th Colloquium, on Automata, the worst case, but as the regular string-searching Languages and Programming. Lecture Notes In algorithm by Boyer and Moore [BM77] Computer Science; Vol. 71, 1979. pp. 118 - 132, demonstrated, it is possible to actually skip a large Springer-Verlag poron of the text while searching, leading to faster than linear algorithms in the average case. • hp://www.-albsig.de/win/personen/professoren.php?RID=36 • You can download here my algorithm StringMatchingFastOnTheAverage (PDF, ~17,2 MB) or here StringMatchingFastOnTheAverage (extended abstract) (PDF, ~3 MB)

Commentz-Walter [CW79] Idea of C-W

• Commentz-Walter [CW79] presented an algorithm for the • Build a backward trie of all keywords mul-paern matching problem that combines the Boyer- Moore technique with the Aho-Corasick algorithm. The Commentz-Walter algorithm is substanally faster than the • Match from the end unl mismatch... Aho-Corasick algorithm in pracce. Hume [Hu91] designed a tool called gre based on this algorithm, and version 2.0 of fgrep by the GNU project [Ha93] is using it. • Determine the shi based on the combinaon • Baeza-Yates [Ba89] also gave an algorithm that combines the of heuriscs Boyer-Moore-Horspool algorithm [Ho80] (which is a slight variaon of the classical Boyer-Moore algorithm) with the Aho-Corasick algorithm.

16 02.01.15

Horspool for many paerns Horspool for many paerns Search for ATGTATG,TATG,ATAAT,ATGTG Search for ATGTATG,TATG,ATAAT,ATGTG T G T A A 1. Build the trie of the inverted patterns T G G T A

T G T A T A A T A A A 1 T A C 4 (lmin) G G T A The text ACATGCTATGTGACA… G 2 T A A T A T 1 A 2. lmin=4 A 1 3. Table of shifts C 4 (lmin) G 2 T 1 4. Start the search

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Horspool for many paerns Horspool for many paerns Search for ATGTATG,TATG,ATAAT,ATGTG Search for ATGTATG,TATG,ATAAT,ATGTG T G T A T G T A A A T T G G T A G G T A T A A T A A T A A 1 T A A 1 A C 4 (lmin) A C 4 (lmin) The text ACATGCTATGTGACA… G 2 The text ACATGCTATGTGACA… G 2 T 1 T 1

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Horspool for many paerns Horspool for many paerns Search for ATGTATG,TATG,ATAAT,ATGTG Search for ATGTATG,TATG,ATAAT,ATGTG T G T A T G T A A A T T G G T A G G T A T A A T A A T A A 1 T A A 1 A C 4 (lmin) A C 4 (lmin) The text ACATGCTATGTGACA… G 2 The text ACATGCTATGTGACA… G 2 T 1 T 1

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

17 02.01.15

Horspool for many paerns Horspool for many paerns Search for ATGTATG,TATG,ATAAT,ATGTG Search for ATGTATG,TATG,ATAAT,ATGTG T G T A T G T A A A T T G G T A G G T A T A A T A A T A A 1 T A A 1 A C 4 (lmin) A C 4 (lmin) The text ACATGCTATGTGACA… G 2 The text ACATGCTATGTGACA… G 2 T 1 T 1 Short Shifts!

… …

Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

What are the possible limitaons Wu-Manber for C-W? • Wu S., and U. Manber, "A Fast Algorithm for Mul-Paern Searching," Technical Report TR-94-17, Department of Computer Science, University • Many paerns, small alphabet – minimal skips of Arizona (May 1993). • Citeseer: hp://citeseer.ist.psu.edu/wu94fast.html [Postscript] • We present a different approach that also uses the ideas of Boyer and Moore. Our algorithm is quite simple, and the main engine of it is given • What can be done differently? later in the paper. An earlier version of this algorithm was part of the second version of agrep [WM92a, WM92b], although the algorithm has not been discussed in [WM92b] and only briefly in [WM92a]. The current version is used in glimpse [MW94]. The design of the algorithm concentrates on typical searches rather than on worst-case behavior. This allows us to make some engineering decisions that we believe are crucial to making the algorithm significantly faster than other algorithms in pracce.

Key idea

• Main problem with Boyer-Moore and many • Instead of looking at characters from the text one by paerns is that, the more there are paerns, one, we consider them in blocks of size B. the shorter become the possible shis... • logc2M; in pracce, we use either B = 2 or B = 3. • The SHIFT table plays the same role as in the regular Boyer-Moore algorithm, except that it determines • Wu and Manber: check several characters the shi based on the last B characters rather than simultaneously, i.e. increase the alphabet. just one character.

18 02.01.15

Horspool to Wu-Manber Wu-Manber algorithm How do we can increase the length of the shifts? Search for ATGTATG,TATG,ATAAT,ATGTG T G T A A T With a table shift of l-mers with the patterns G G T A ATGTATG,TATG,ATAAT,ATGTG T A A T A AA 1 1 símbol A AT 1 2 símbols GT 1 A 1 into the text: ACATGCTATGTGACATAATA TA 2 C 4 (lmin) AA 1 G 2 AC 3 (LMIN-L+1) TG 2 T 1 AG 3 AT 1 AA 1 CA 3 AT 1 CC 3 GT 1 CG 3 TA 2 … TG 2 …

Experimental length: log|Σ| 2*lmin*r Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Backward Oracle

• Set Backwards oracle SBDM, SBOM

• Pages 68-72

String matching of many paerns String matching of many paerns | Σ| | Σ| 8 (5 patterns) 8 (5 patterns) Wu-Manber Wu-Manber 4 4 SBOM SBOM 2 Lmin 2 Lmin 5 10 15 20 25 30 35 40 45 5 10 15 20 25 30 35 40 45

8 8 Wu-Manber Wu-Manber (10 patterns) (10 patterns) 4 4 SBOM SBOM 2 2 5 10 15 20 25 30 35 40 45 5 10 15 20 25 30 35 40 45 (1000 patterns) 8 Wu-Manber 8 SBOM (100 patterns) (100 patterns) 4 4 SBOM 2 2 Slides courtesy of: 5Xavier Messeguer 10 Peypoch (http:// 15www.lsi.upc.es /~alggen 20 ) 25 30 35 40 45 Slides courtesy of: 5Xavier Messeguer 10 Peypoch (http:// 15www.lsi.upc.es /~alggen 20 ) 25 30 35 40 45

19 02.01.15

5 strings 10 strings

100 strings 1000 strings

Factor Oracle Factor Oracle: safe shi

20 02.01.15

Factor Oracle: Factor oracle

Shift to match prefix of P2?

Construcon of factor Oracle Factor oracle

• Allauzen, C., Crochemore, M., and Raffinot, M. 1999. Factor Oracle: A New Structure for Paern Matching. In Proceedings of the 26th Conference on Current Trends in theory and Pracce of informacs on theory and Pracce of informacs (November 27 - December 04, 1999). J. Pavelka, G. Tel, and M. Bartosek, Eds. Lecture Notes In Computer Science, vol. 1725. Springer- Verlag, London, 295-310. • hp://portal.acm.org/citaon.cfm? id=647009.712672&coll=GUIDE&dl=GUIDE&CFID=31549541&CFTOKEN=6 1811641#

• hp://www-igm.univ-mlv.fr/~allauzen/work/sofsem.ps

21 02.01.15

So far Mulple Shi-AND

• Generalised KMP -> AhoCorasick • P={P1, P2, P3, P4}. Generalize Shi-AND • Generalised Horspool -> CommentzWalter, WuManber • Bits = P4 P3 P2 P1 • BDM, BOM -> Set Backward Oracle Matching… • Start = 1 1 1 1

• Other generalisaons? • Match= 1 1 1 1

Problem

• Given P and S – find all exact or approximate Text Algorithms (6EAP) occurrences of P in S

Full text indexing • You are allowed to preprocess S (and P, of course) Jaak Vilo 2010 fall • Goal: to speed up the searches

Jaak Vilo MTAT.03.190 Text Algorithms 129

E.g. Diconary problem Sorted array and binary search

• Does P belong to a diconary D={d1,…,dn} 13 – Build a binary search tree of D 1 – B-Tree of D – Hashing global info – Sorng + Binary search head search stop • he Build a keyword trie: search in O(|P|) header informal hers – Assuming alphabet has up to a constant size c index happy his show – See Aho-Corasick algorithm, Trie construcon

22 02.01.15

Sorted array and binary search Trie for D={ he, hers, his, she}

0

1 13 h s

1 3 i e h

global info search head 2 6 4 he stop r s header informal e hers index happy his show 8 5 s 7

O( |P| log n ) 9 O( |P| )

S != set of words

• S of length n Trie(babaab)

a b • How to index? babaab b a abaab a a b baab b • Index from every posion of a text aab a a a ab b b a b • Prefix of every possible suffix is important b

Suffix tree Literature on suffix trees • • Definion: A compact representaon of a trie corresponding to the hp://en.wikipedia.org/wiki/Suffix_tree suffixes of a given string where all nodes with one child are merged with • Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer their parents. Science and Computaonal Biology. Hardcover - 534 pages 1st edion • Definion ( suffix tree ). A suffix tree T for a string S (with n = |S|) is a (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. (pages: rooted, labeled tree with a leaf for each non-empty suffix of S. 89--208) Furthermore, a suffix tree sasfies the following properes: • E. Ukkonen. On-line construcon of suffix trees. Algorithmica, 14:249-60, • Each internal node, other than the root, has at least two children; 1995. hp://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751 • • Each edge leaving a parcular node is labeled with a non-empty substring Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. "Construcng Suffix Tree of S of which the first symbol is unique among all first symbols of the edge for Gigabyte Sequences with Megabyte Memory," IEEE Transacons on labels of the edges leaving this parcular node; Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January, 2005. hp://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3 • For any leaf in the tree, the concatenaon of the edge labels on the path • CPM arcles archive: hp://www.cs.ucr.edu/~stelo/cpm/ from the root to this leaf exactly spells out a non-empty suffix of s. • Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and • Mark Nelson. Fast String Searching With Suffix Trees Dr. Computaonal Biology. Hardcover - 534 pages 1st edion (January 15, 1997). Dobb's Journal, August, 1996. Cambridge Univ Pr (Short); ISBN: 0521585198. hp://www.dogma.net/markn/arcles/suffixt/suffixt.htm

23 02.01.15

• STXXL : Standard Template Library for Extra Large Data Sets. http://stackoverflow.com/questions/9452701/ • hp://stxxl.sourceforge.net/ ukkonens-suffix-tree-algorithm-in-plain- english • ANSI C implementaon of a Suffix Tree • hp://yeda.cs.technion.ac.il/~yona/suffix_tree/

These are old links – check for newer…

The suffix tree Tree(T) of T Suffix trie and suffix tree

abaab • data structure suffix tree, Tree(T), is baab compacted trie that represents all the suffixes aab of string T ab b • linear size: |Tree(T)| = O(|T|) Trie(abaab) Tree(abaab)

• can be constructed in linear me O(|T|) a b a baab b • has myriad virtues (A. Apostolico) a a ab a baab • is well-known: 366 000 Google hits b a a b b

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt141 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt142

Algorithms for combinatorial string Suffix tree and suffix array matching? techniques for paern analysis in • deep beauty? +- strings • shallow beauty? + • applicaons? ++ • intensive algorithmic miniatures Esko Ukkonen • sources of new problems: Univ Helsinki text processing, DNA, music,… Erice School 30 Oct 2005

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt143 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt144

24 02.01.15

ttttttttttttttgagacggagtctcgctctg tcgcccaggctggagtgcagtggcggg High-throughput genome-scale atctcggctcactgcaagctccgcctcc sequence analysis and mapping cgggttcacgccattctcctgcctcagcc using compressed data tcccaagtagctgggactacaggcgcc Veli Mäkinen structures cgccactacgcccggctaattttttgtattt Department of Computer Science ttagtagagacggggtttcaccgttttagc University of Helsinki cgggatggtctcgatctcctgacctcgtg atccgcccgcctcggcctcccaaagtgc tgggattacaggcgt E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt146

Soluon: backtracking with suffix Analysis of a string of symbols tree • T = h a t t i v a t t i ’text’ • P = a t t ’paern’

• Find the occurrences of P in T: h a t t i v a t t i

• Paern synthesis: #(t) = 4 #(a) = 2

#(t****t) = 2 ...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG.....

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt147 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 148

Paern finding & synthesis problems

• T = t1t2 … tn, P = p1 p 2 … pn , strings of symbols in finite alphabet 1. • Indexing problem: Preprocess T (build an index structure) such that the occurrences of different 2. paerns P can be found fast – stac text, any given paern P 3. Some applications • Paern synthesis problem: Learn from T new paerns that occur surprisingly oen 4. Finding motifs

• What is a paern? Exact substring, approximate substring, with generalized symbols, with gaps, … 149 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt150

25 02.01.15

The suffix tree Tree(T) of T Suffix trie and suffix tree

abaab • data structure suffix tree, Tree(T), is baab compacted trie that represents all the suffixes aab of string T ab b • linear size: |Tree(T)| = O(|T|) Trie(abaab) Tree(abaab)

• can be constructed in linear me O(|T|) a b a baab b • has myriad virtues (A. Apostolico) a a ab a baab • is well-known: 366 000 Google hits b a a b b

E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt151 152

Trie(T) can be large Tree(T) is of linear size

• |Trie(T)| = O(|T|2) • only the internal branching nodes and the • bad example: T = anbn leaves represented explicitly • Trie(T) can be seen as a DFA: language • edges labeled by of T accepted = the suffixes of T • v = node(α) if the path from root to v spells α • minimize the DFA => directed cyclic word • one-to-one correspondence of leaves and graph (’DAWG’) suffixes • |T| leaves, hence < |T| internal nodes • |Tree(T)| = O(|T| + size(edge labels))

153 154

Tree(hava) Tree(hava) hattivatti hattivatti vatti vatti attivatti attivatti t i vatti t i vatti ttivatti ttivatti ti i ti i tivatti atti i tivatti atti i vatti vatti ivatti hattivatti ivatti hattivatti ivatti ivatti vatti ti vatti ti vatti vatti vatti vatti atti vatti tti atti vatti tti tti atti tti atti tivatti tivatti ti hattivatti ttivatti ti hattivatti ttivatti attivatti attivatti i i substring labels of edges represented as hattivatti pairs of pointers 155 156

26 02.01.15

Tree(hava) Tree(T) is full text index hattivatti vatti Tree(T) attivatti i 3,3 6 ttivatti 4,5 10 P tivatti 2,5 i vatti ivatti hattivatti P occurs in T at locations 8, 31, … vatti 9 5 6,10 vatti atti 6,10 8 tti 7 31 8 4 ti 1 3 2 P occurs in T ó P is a prefix of some suffix of T i ó Path for P exists in Tree(T) hattivatti All occurrences of P in time O(|P| + #occ) 157 158

Find a from Tree(hava) Linear me construcon of Tree(T) hattivatti vatti attivatti hattivatti t vatti ttivatti attivatti i ti ttivatti tivatti atti i vatti ivatti hattivatti tivatti ivatti vatti ti ivatti Weiner McCreight vatti vatti vatti (1973), (1976) atti vatti tti atti tti atti ’algorithm 7 tivatti tti ti hattivatti ttivatti of the attivatti ti year’ i 2 ’on-line’ algorithm i

159 (Ukkonen 1992) 160

On-line construcon of Trie(T) Trie(abaab)

Trie(a) Trie(ab) Trie(aba) • T = t1t2 … tn$ abaa a a b a b baa • Pi = t1t2 … ti i:th prefix of T b b aa a εa • on-line idea: update Trie(Pi) to Trie(Pi+1) a ε • => very simple construcon chain of links connects the end points of current suffixes

161 162

27 02.01.15

Trie(abaab) Trie(abaab)

Trie(abaa) Trie(abaa)

a a b a b a b a a b a b a b b b b b b b a a a a a a a a a a a a a a

Add next symbol = b

163 164

Trie(abaab) Trie(abaab)

Trie(abaa)

a a b a b a b a a b a b a b b b b b b b a a a a a a a a a a a a a a Trie(abaab) a b b Add next symbol = b a a a b a From here on b-arc already exists a b b

165 166

What happens in Trie(Pi) => Trie(Pi+1) ? What happens in Trie(Pi) => Trie(Pi+1) ?

Before • me: O(size of Trie(T)) a From here on the ai-arc i exists already => stop • suffix links: updating here slink(node(aα)) = node(α)

After ai ai ai ai ai New suffix links New nodes

167 168

28 02.01.15

On-line procedure for suffix trie Suffix trees on-line

1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix • ’compacted version’ of the on-line trie links slink(v) := root and slink(root) := root construcon: simulate the construcon on the 2. for i := 2 to n do begin linear size tree instead of the trie => me O(| 3. v := leaf of Trie(t …t ) for string t …t (i.e., the deepest leaf) i-1 1 i-1 1 i-1 T|) 4. v := vi-1; v´ := 0

5. while node v has no outgoing arc for ti do begin • all trie nodes are conceptually sll needed =>

6. Create a new node v´´ and an arc son(v,ti) = v´´ implicit and real nodes 7. if v´ ≠ 0 then slink(v) := v´´ 8. v := slink(v); v´ := v´´ end

9. for the node v´´ such that v´´= son(v,ti) do if v´´ = v´ then slink(v’) := root else slink(v´) := v´´

169 170

Implicit and real nodes Implicit node

• Pair (v,α) is an implicit node in Tree(T) if v is a node of Tree and α is a (proper) prefix of the label of some arc from v. If α is the empty … α´ string then (v, α) is a ’real’ node (= v). • v Let v = node(α´) in Tree(T). Then implicit node … (v, α) represents node(α´α) of Trie(T) α (v, α)

171 172

Suffix links and open arcs Big picture

root suffix link path traversed: total work O(n) α … aα …

slink(v) v … … label [i,*] instead of [i,j] if … … … w is a leaf and j is the scanned position of T

w new arcs and nodes created:

173 total work O(size(Tree(T)) 174

29 02.01.15

Suffix-tree on-line: main procedure On-line procedure for suffix tree

Create Tree(t1); slink(root) := root (v, α) := (root, ε) /* (v, α) is the start node */ Input: string T = t t … t $ 1 2 n for i := 2 to n+1 do Output: Tree(T) v´ := 0

Notation: son(v,α) = w iff there is an arc from v to w with label α while there is no arc from v with label prefix αti do son(v,ε) = v if α ≠ ε then /* divide the arc w = son(v, αη) into two */

son(v, α) := v´´; son(v´´,ti) := v´´´; son(v´´,η) := w else Function Canonize(v, α): son(v,t ) := v´´´; v´´ := v while son(v, α´) ≠ 0 where α = α´ α´´, | α´| > 0 do i if v´ ≠ 0 then slink(v´) := v´´ v := son(v, α´); α := α´´ v´ := v´´; v := slink(v); (v, α) := Canonize(v, α) return (v, α) if v´ ≠ 0 then slink(v´) := v

(v, α) := Canonize(v, αti) /* (v, α) = start node of the next round */ 175 176

The actual me and space

• |Tree(T)| is about 20|T| in pracce • brute-force construcon is O(|T|log|T|) for random http://stackoverflow.com/questions/9452701/ strings as the average depth of internal nodes is ukkonens-suffix-tree-algorithm-in-plain- O(log|T|) english • difference between linear and brute-force construcons not necessarily large (Giegerich & Kurtz) • truncated suffix trees: k symbols long prefix of each suffix represented (Na et al. 2003) • alphabet independent linear me (Farach 1997)

178

abcabxabcd abc

30 02.01.15

Applicaons of Suffix Trees Back to backtracking

• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: ACA, 1 mismatch Computer Science and Computaonal Biology. Hardcover - A T 534 pages 1st edion (January 15, 1997). Cambridge Univ Pr C (Short); ISBN: 0521585198. - book A T C A AT T A T C C C • APL1: Exact String Matching Search for P from text S. Soluon T T T 1: build STree(S) - one achieves the same O(n+m) as Knuth- 4 2 1 5 6 3 Morris-Pra, for example! • Search from the suffix tree is O(|P|) Same idea can be used to many other forms of approximate search, like • APL2: Exact set matching Search for a set of paerns P Smith-Waterman, position-restricted scoring matrices, search, etc.

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 186

31 02.01.15

Applicaons of Suffix Trees Applicaons of Suffix Trees

• APL3: substring problem for a database of paerns • APL4: Longest common substring of two strings Given a set of strings S=S1, ... , Sn --- a database Find • Find the longest common substring of S and T. all Si that have P as a substring • Overall there are potenally O( n2 ) such substrings, • Generalized suffix tree contains all suffixes of all Si if n is the length of a shorter of S and T • Query in me O(|P|), and can idenfy the LONGEST • Donald Knuth once (1970) conjectured that linear- common prefix of P in all Si me algorithm is impossible. • Soluon: construct the STree(S+T) and find the node deepest in the tree that has suffixes from both S and T in subtree leaves. • Ex: S= superiorcalifornialives T= sealiver have both a substring xxxxxx.

Simple analysis task: LCSS Suffix link

• Let LCSSA(A,B) denote the longest common substring two sequences A and B. E.g.: X – LCSS(AGATCTATCT,CGCCTCTATG)=TCTAT. • A good soluon is to build suffix tree for the aX shorter sequence and make a descending suffix link suffix walk with the other sequence.

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 189 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 190

Another common tool: Generalized Descending suffix walk suffix tree

suffix tree of A Read B left-to-right, A node info: always going down in the subtree size 47813871 tree when possible. C sequence count 87 If the next symbol of B does C not match any edge label on current position, take suffix link, and try again. (Suffix link in the root to itself emits a symbol). The node v encountered with largest string depth v is the solution. ACCTTA....ACCT#CACATT..CAT#TGTCGT...GTA#TCACCACC...C$

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 191 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 192

32 02.01.15

Generalized suffix tree applicaon Case study connued

suffix tree of genome

motif? TAC...... T A node info: C subtree size 4398 5 blue blue sequences 12/15 1 red C red sequences 2/62 .....

...ACC..#...ACC...#...ACC...ACC..ACC..#..ACC..ACC...#...ACC...#... genome ...#....#...#...#...ACC...#...#...#...#...#...#..#..ACC..ACC...#...... #... regions with ChIP-seq matches

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 193 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 194

Properes of suffix tree... in Properes of suffix tree pracce • Suffix tree has n leaves and at most n-1 • Huge overhead due to pointer structure: internal nodes, where n is the total length of – Standard implementaon of suffix tree for human all sequences indexed. genome requires over 200 GB memory! • Each node requires constant number of – A careful implementaon (using log n -bit fields integers (pointers to first child, sibling, parent, for each value and array layout for the tree) sll requires over 40 GB. text range of incoming edge, stascs – counters, etc.). Human genome itself takes less than 1 GB using 2- bits per bp. • Can be constructed in linear me.

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 195 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 196

Applicaons of Suffix Trees Applicaons of Suffix Trees

• APL4: Longest common substring of two strings • APL5: Recognizing DNA contaminaon Related • Find the longest common substring of S and T. to DNA sequencing, search for longest strings • Overall there are potenally O( n2 ) such substrings, (longer than threshold) that are present in the if n is the length of a shorter of S and T DB of sequences of other genomes. • Donald Knuth once (1970) conjectured that linear- • APL6: Common substrings of more than two me algorithm is impossible. strings Generalizaon of APL4, can be done in • Soluon: construct the STree(S+T) and find the node linear (in total length of all strings) me deepest in the tree that has suffixes from both S and T in subtree leaves. • Ex: S= superiorcalifornialives T= sealiver have both a substring alive.

33 02.01.15

Applicaons of Suffix Trees Applicaons of Suffix Trees

• APL7: Building a directed graph for exact • APL8: A reverse role for suffix trees, and major matching: Suffix graph - directed acyclic word space reducon Index the paern, not tree... graph (DAWG), a smallest finite state • Matching stascs. automaton recognizing all suffixes of a string • APL10: All-pairs suffix-prefix matching For all S. This automaton can recognize membership, pairs Si, Sj, find the longest matching suffix- but not tell which suffix was matched. prefix pair. Used in shortest common • Construcon: merge isomorfic subtrees. superstring generaon (e.g. DNA sequence • Isomorfic in Suffix Tree when exists suffix link assembly), EST alignmentm etc. path, and subtrees have equal nr. of leaves.

Applicaons of Suffix Trees Applicaons of Suffix Trees

• APL11: Finding all maximal repeve • APL14: Suffix trees in genome-scale projects structures in linear me • APL15: A Boyer-Moore approach to exact set • APL12: Circular string linearizaon e.g. circular matching chemical molecules in the database, one • APL16: Ziv-Lempel data compression wants to lienarize them in a canonical way... • APL17: Minimum length encoding of DNA • APL13: Suffix arrays - more space reducon will touch that separately

Applicaons of Suffix Trees Applicaons of Suffix Trees

• Addional applicaons Mostly exercises... • APL: Exact matching with wild cards • Extra feature: CONSTANT me lowest common ancestor retrieval (LCA) Andmestruktuur mis võimaldab leida konstantse ajaga alumist ühist • APL: The k-mismatch problem vanemat (see vastab pikimale ühisele prefixile!) on võimalik koostada lineaarse ajaga. • Approximate palindromes and repeats • APL: Longest common extension: a bridge to inexact matching • APL: Finding all maximal palindromes in linear me • Faster methods for tandem repeats Palindrome reads from central posion the same to le and right. E.g.: kirik, saippuakivikauppias. • A linear-me soluon to the mulple common • Build the suffix tree of S and inverted S (aabcbad => aabcbad#dabcbaa ) substring problem and using the LCA one can ask for any posion pair (i, 2i-1), the longest common prefix in constant me. • And many-many more ... • The whole problem can be solved in O(n).

34 02.01.15

Suffixes - sorted hattivatti ε attivatti atti 1. Suffix tree ttivatti attivatti tivatti hattivatti 2. Suffix array ivatti i vatti ivatti 3. Some applications atti ti tti tivatti ti tti 4. Finding motifs i ttivatti ε vatti

• Sort all suffixes. Allows to perform binary search! 205 206

Suffix array: example Suffix array construcon: sort! 1 hattivatti ε 11 11 hattivatti 1 11 11 2 attivatti atti 7 7 attivatti 2 7 7 3 ttivatti attivatti 2 2 ttivatti 3 2 2 4 tivatti hattivatti 1 1 tivatti 4 1 1 5 ivatti i 10 10 ivatti 5 10 10 6 vatti ivatti 5 5 vatti 6 5 5 7 atti ti 9 9 atti 7 9 9 8 tti tivatti 4 4 tti 8 4 4 9 ti tti 8 8 ti 9 8 8 10 i ttivatti 3 3 i 10 3 3 11 ε vatti 6 ε 11 6

• suffix array = lexicographic order of the suffixes • suffix array = lexicographic order of the suffixes 207 208

Suffix array Reducing space: suffix array

• suffix array SA(T) = an array giving the =[2,2] A T =[3,3] lexicographic order of the suffixes of T C

• space requirement: 5|T| =[5,6] A C T =[3,6] T A A A T =[6,6] =[4,6] T C C =[2,6] C • praconers like suffix arrays (simplicity, T T T space efficiency) suffix 4 2 1 5 6 3 array • theorecians like suffix trees (explicit structure) C A T A C T 1 2 3 4 5 6

209 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 210

35 02.01.15

Suffix array Paern search from suffix array hattivatti ε 11 • attivatti atti 7 Many algorithms on suffix tree can be att simulated using suffix array... ttivatti attivatti 2 binary search tivatti hattivatti 1 – ... and couple of addional arrays... ivatti i 10 – ... forming so-called enhanced suffix array... vatti ivatti 5 – ... leading to the similar space requirement as atti ti 9 careful implementaon of suffix tree tti tivatti 4 ti tti 8 • Not a sasfactory soluon to the space issue. i ttivatti 3 ε vatti 6

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 211 212

What we learn today? Recent suffix array construcons

• We learn that it is possible to replace suffix • Manber&Myers (1990): O(|T|log|T|) trees with compressed suffix trees that take • linear me via suffix tree 8.8 GB for the human genome. • January / June 2003: direct linear me • We learn that backtracking can be done using construcon of suffix array compressed suffix arrays requiring only 2.1 GB - Kim, Sim, Park, Park (CPM03) for the human genome. - Kärkkäinen & Sanders (ICALP03) • We learn that discovering interesng mof - Ko & Aluru (CPM03) seeds from the human genome takes 40 hours and requires 9.3 GB space.

ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 213 214

Kärkkäinen-Sanders algorithm Notaon

1. Construct the suffix array of the suffixes • string T = T[0,n) = t0t1 … tn-1 starting at positions i mod 3 ≠ 0. This is • done by reduction to the suffix array suffix Si = T[i,0) = titi+1 … tn-1 construction of a string of two thirds the • for C \subset [0,n]: SC = {Si | i in C} length, which is solved recursively. 2. Construct the suffix array of the remaining suffixes using the result of the first step. • suffix array SA[0,n] of T is a permutaon of

3. Merge the two suffix arrays into one. [0,n] sasfying SSA[0] < SSA[1] < … < SSA[n]

215 216

36 02.01.15

Running example Step 0: Construct a sample

0 1 2 3 4 5 6 7 8 9 10 11 • for k = 0,1,2 • T[0,n) = y a b b a d a b b a d o 0 0 … Bk = {i є [0,n] | i mod 3 = k} • C = B1 U B2 sample posions

• SA = (12,1,6,4,9,3,8,2,7,5,10,11,0) • SC sample suffixes

• Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}

217 218

Step 1: Sort sample suffixes Step 1 (cont.) • for k = 1,2, construct • once the sample suffixes are sorted, assign a rank to Rk = [t t t ] [t t t ]… [t t t ] k k+1 k+2 k+3 k+4 k+5 maxBk maxBk+1 maxBk+2 each: rank(S ) = the rank of S in S ; rank(S ) = i i C n+1 rank(S ) = 0 R = R1 ^ R2 concatenaon of R1 and R2 n+2 • Example:

Suffixes of R correspond to SC: suffix [titi+1ti+2]… corresponds to Si ; R = [abb][ada][bba][do0][bba][dab][bad][o00] correspondence is order preserving. R´ = (1,2,4,6,4,5,3,7) Sort the suffixes of R: radix sort the characters and rename with SAR´ = (8,0,1,6,4,2,5,3,7) ranks to obtain R´. If all characters different, their order directly rank(S ) - 1 4 - 2 6 - 5 3 – 7 8 – 0 0 gives the order of suffixes. Otherwise, sort the suffixes of R´ using i Kärkkäinen-Sanders. Note: |R´| = 2n/3.

219 220

Step 2: Sort nonsample suffixes Step 3: Merge • merge the two sorted sets of suffixes using a standard • for each non-sample Si є SB0 (note that comparison-based merging:

rank(Si+1) is always defined for i є B0): • to compare Si є SC with Sj є SB0, disnguish two cases: Si ≤ Sj ↔ (ti,rank(Si+1)) ≤ (tj,rank(Sj+1)) • i є B1: S ≤ S ↔ (t ,rank(S )) ≤ (t ,rank(S )) • radix sort the pairs (ti,rank(Si+1)). i j i i+1 j j+1 • i є B2: Si ≤ Sj ↔ (ti,ti+1,rank(Si+2)) ≤ (tj,tj+1,rank(Sj+2))

• Example: S12 < S6 < S9 < S3 < S0 because (0,0) < • note that the ranks are defined in all cases! (a,5) < (a,7) < (b,2) < (y,1) • S1 < S6 as (a,4) < (a,5) and S3 < S8 as (b,a,6) < (b,a,7)

221 222

37 02.01.15

Running me O(n) Implementaon

• excluding the recursive call, everything can be • about 50 lines of C++ done in linear me • code available e.g. via Juha Kärkkäinen’s home • the recursion is on a string of length 2n/3 page • thus the me is given by recurrence T(n) = T(2n/3) + O(n) • hence T(n) = O(n)

223 224

LCP table

• Longest Common Prefix of successive • Oxford English Disconary hp://www.oed.com/ • Example - Word of the Day , Fourth elements of suffix array: hp://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/L7_SuffixTrees/wotd_fourth.html hp://www.oed.com/cgi/display/wotd • LCP[i] = length of the longest common prefix • PAT index - by Gaston Gonnet (ta on samu Maple tarkvara üks loojatest ning hiljem molekulaarbioloogia tarkvarapake väljatöötajaid) of suffixes SSA[i] and SSA[i+1] • PAT index is essenally a suffix array. To save space, indexed only from • build inverse array SA-1 from SA in linear me first character of every word • XML-tagging (or SGML, at that me!) also indexed • then LCP table from SA-1 in linear me (Kasai • To mark certain fields of XML, the bit vectors were used. • Main concern - improve the speed of search on the CD - minimize random et al, CPM2001) accesses. • For slow medium even 15-20 accesses is too slow... • G. H. Gonnet, R. A. Baeza-Yates, and T. Snider, Lexicographical indices for text: Inverted files vs. PAT trees, Technical Report OED-91-01, Centre for 225 the New OED, University of Waterloo, 1991.

Suffix tree vs suffix array

1. Suffix tree • suffix tree ó suffix array + LCP table 2. Suffix array

3. Some applications

4. Finding motifs

227 228

38 02.01.15

Substring mofs of string T Counng the substring mofs

• string T = t1 … tn in alphabet A. • internal nodes of Tree(T) ↔ repeang • Problem: what are the frequently occurring substrings of T (ungapped) substrings of T? Longest substring • number of leaves of the subtree of a node for that occurs at least q mes? string P = number of occurrences of P in T • Thm: Suffix tree Tree(T) gives complete occurrence counts of all substring mofs of T in O(n) me (although T may have O(n2) substrings!)

229 230

Substring mofs of hava Finding repeats in DNA

vatti t i vatti • human chromosome 3 4 2 ti i • the first 48 999 930 bases atti i hattivatti 2 2 vatti • 31 min cpu me (8 processors, 4 GB) ivatti 2 ti vatti vatti 9 vatti tti • Human genome: 3x10 bases atti tivatti • Tree(HumanGenome) feasible hattivatti ttivatti attivatti

Counts for the O(n) maximal motifs shown 231 232

Longest repeat? Ten occurrences?

Occurrences at: 28395980, 28401554r Length: 2559 ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggat

ctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcc ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaat gctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaa caagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagt catgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcat agtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacat agagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccg acgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaat cccgcctcggcctcccaaagtgctgggattacaggcgt caccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgc cattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttctttt gagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagt aggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggct tttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatg taagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaagga Length: 277 agggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatc agatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagt atagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttc caattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatga Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, gcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtatttt 47398125 attctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgag actttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctct In the reversed complement at: 17858493, 41463059, 42431718, tttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgt 42580925 gccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaa tttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtac gtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattt tattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatga gttagg 233 234

39 02.01.15

Using suffix trees: approximate Using suffix trees: plagiarism matching • find longest common substring of strings X and Y • : inserons, deleons, changes • build Tree(X$Y) and find the deepest node which has a leaf poinng to X and another • STOCKHOLM vs TUKHOLMA poinng to Y

235 236

String distance/similarity funcons Approximate string matching

• A: S T O C K H O L M STOCKHOLM vs TUKHOLMA B: STOCKHOLM_ T U K H O L M A _TU_ • KHOLMA minimum number of ’mutaon’ steps: a -> b a -> є є -> b … => 2 deletions, 1 insertion, 1 change • dID(A,B) = 5 dLevenshtein(A,B)= 4 • mutaon costs => probabilisc modeling • evaluaon by dynamic programming => alignment 237 238

Dynamic programming di,j = min(if ai=bj then di-1,j-1 else ∞, di,j = min(if ai=bj then di-1,j-1 else ∞, di-1,j + 1, di-1,j + 1, di,j-1 + 1) di,j-1 + 1) = distance between i-prefix of A and j-prefix of B A\B s t o c k h o l m (substitution excluded) 0 1 2 3 4 5 6 7 8 9 B t 1 2 1 2 3 4 5 6 7 8 mxn table d bj u 2 3 2 3 4 5 6 7 8 9 k 3 4 3 4 5 4 5 6 7 8 h 4 5 4 5 6 5 4 5 6 7 d d i-1,j-1 i-1,j o 5 6 5 4 5 6 5 4 5 6 A +1 a d d l 6 7 6 5 6 7 6 5 4 5 i i,j-1 +1 i,j m 7 8 7 6 7 8 7 6 5 4 a 8 9 8 7 8 9 8 7 6 5

dm,n 239 optimal alignment by trace-back 240 dID(A,B)

40 02.01.15

Search problem Index for approximate searching?

• find approximate occurrences of paern P in • dynamic programming: P x Tree(T) with backtracking text T: substrings P’ of T such that d(P,P’) small • dyn progr with small modificaon: O(mn) Tree(T) • lots of (praccal) improvement tricks

P

T P’

P

241 242

41