Advanced Algorithms / T. Shibuya

Advanced Algorithms: Text Algorithms

Tetsuo Shibuya

Human Genome Center, Institute of Medical Science
(Adjunct at Department of Computer Science), University of Tokyo
http://www.hgc.jp/~tshibuya

Self Introduction
- Affiliation: Laboratory of Sequence Analysis, Human Genome Center, Institute of Medical Science
  - Adjunct at Department of Computer Science
- Research interests: algorithms for bioinformatics / combinatorial problems / big data

Our lab is located on the 4th floor.

The topics of this part
- Text matching/indexing algorithms (today's topic)
  - Knuth-Morris-Pratt / Boyer-Moore / suffix arrays / etc.
- Text communication algorithms
  - Hamming coding
- Text compression algorithms
  - Huffman coding / arithmetic coding / block sorting / etc.
- Text models
  - Markov models / etc.

An assignment for this part will be given in the last (i.e., the 3rd) week

- Submit 1 for Prof. Imai's part (compulsory), AND
- Submit 1 for one of the remaining 3 parts, i.e., my part, Hirahara-san's part, or May Szedak-san's part

Textbooks
- D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.
  - The most famous book on text processing algorithms, but many parts are out of date.
- W. Sung, Algorithms in Bioinformatics, CRC Press, 2009.
  - A good introduction to bioinformatics algorithms (mainly on text processing).
- D. Salomon, G. Motta, Handbook of Data Compression, Springer, 2010.
- T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991.

Today's topic
- Text matching algorithms
  - Brute-force algorithm
  - Rabin-Karp algorithm
  - Knuth-Morris-Pratt algorithm
  - Boyer-Moore algorithm
  - Matching automaton
- Text indexing algorithms
  - Suffix arrays
  - FM-index

Text matching

- Problem
  - Given: a text string T and a pattern (query) P
  - Output: all positions in T whose substrings are exactly the same as P, if any
  - Exact matching: no insertion / deletion / modification (mutation)
- Two approaches: matching and indexing
  - Preprocess only the query pattern (matching)
  - Preprocess the text beforehand (indexing): needs extra data structures

Text GGTGAGAAGTTATGATACAGGGTAGTTG TGTCCTTAAGGTGTATAACGATGACATC ACAGGCAGCTCTAATCTCTTGCTATGAG TGATGTAAGATTTATAAGTACGCAAATT

Pattern (Query) TATAA

Two types of text matching algorithms
- Brute-force: naive algorithm
- Fingerprinting (hash-based): Rabin-Karp
- Skipping positions unnecessary to compare:
  - Check from left: Knuth-Morris-Pratt, Aho-Corasick (for multiple queries)
  - Check from right: Boyer-Moore

Naive algorithm
- Just check one by one at each position
- O(nm) in the worst case, but linear time on average!
- Not so bad for cases when you have no time to implement :-)
- But still much slower than other sophisticated algorithms in practice.
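A minimal sketch of the naive algorithm (function and variable names are my own):

```python
def naive_match(text, pattern):
    """Check the pattern at every text position; O(nm) worst case,
    but expected linear time on random text."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:          # all m characters matched at position i
            hits.append(i)
    return hits
```

On a random 4-letter (DNA) text, the expected number of characters checked per position is the constant 4/3 discussed below, which is why the average case is linear.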

Example (for a random DNA sequence): checking the pattern CCGTATG one character at a time at each text position, the average number of characters compared per position is 1 + 1/4 + (1/4)^2 + ... = 4/3 (constant!).

Rabin-Karp (1)
- Based on fingerprinting (i.e., hashing)
- Compare the strings only at positions with the same fingerprint
- All the text fingerprints can be obtained in linear time if we use an appropriate fingerprint, e.g.,
  hash(x[0..n-1]) = (x[0]·d^(n-1) + x[1]·d^(n-2) + x[2]·d^(n-3) + ... + x[n-1]) mod q
  (q: some prime number)

Text

hash(T[0..|P|-1]) → hash(T[1..|P|]) → hash(T[2..|P|+1]) → ... (each an O(1) update)
Compare each with hash(P) at first; only on a fingerprint match, compare the strings themselves.

Pattern P → hash(P)

Rabin-Karp (2)

Text

11001101110100101...
(16+8+1) mod 5 = 0                        O(1)
((0 - 1·16)·2 + 1) mod 5 = 4              O(1)
((4 - 1·16)·2 + 0) mod 5 = 1              O(1)
((1 - 0·16)·2 + 1) mod 5 = 3  check → NO  O(1)
((3 - 0·16)·2 + 1) mod 5 = 2              O(1)
((2 - 1·16)·2 + 1) mod 5 = 3  check → YES!
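The rolling-hash search can be sketched as follows (a minimal sketch; `d` and `q` here are illustrative parameters, not the slide's d = 2, q = 5, and all names are my own):

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Rabin-Karp: compare strings only when fingerprints collide."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                # weight of the leading character
    hp = ht = 0
    for c in range(m):                  # initial fingerprints of P and T[0..m-1]
        hp = (hp * d + ord(pattern[c])) % q
        ht = (ht * d + ord(text[c])) % q
    hits = []
    for i in range(n - m + 1):
        # verify by direct comparison only on a fingerprint match
        if ht == hp and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:                   # O(1) rolling update: drop T[i], add T[i+m]
            ht = ((ht - ord(text[i]) * h) * d + ord(text[i + m])) % q
    return hits
```

The update line mirrors the slide's arithmetic: subtract the leading character's contribution, shift by d, add the incoming character, all mod q.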

Pattern 10111: (16+4+2+1) mod 5 = 3

Knuth-Morris-Pratt (1)
- Another way of improving the brute-force algorithm
- The brute-force algorithm sometimes checks the same text position more than once, which can be a waste of time → Knuth-Morris-Pratt algorithm

Example: checking the pattern TAGTAGC from the left against the text AATACTAGTAGGCATGCCGGAT. After matching "TAGTAG" and then failing, we already know what the text says there, so shift positions that cannot possibly match are skipped before any comparison.

Knuth-Morris-Pratt (2)
- If P[0..i] matches the text but P[i+1] does not, then
  FailureLink[i+1] = max j s.t. P[0..j] ≡ P[i-j..i], P[j+1] ≠ P[i+1], and j < i
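The failure-link idea can be sketched as follows (a minimal sketch; for simplicity it uses the plain Morris-Pratt failure function, the longest-proper-border table, rather than Knuth's strengthened version defined above; all names are my own):

```python
def kmp_search(text, pattern):
    m = len(pattern)
    # fail[j] = length of the longest proper border of pattern[0..j-1]
    fail = [0] * (m + 1)
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j + 1] = k
    hits, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k]                 # skip: never re-check matched text
        if c == pattern[k]:
            k += 1
        if k == m:                      # full match ending at position i
            hits.append(i - m + 1)
            k = fail[k]
    return hits
```

Each text character is compared successfully at most once, and each failure strictly decreases k, giving the < 2n comparison bound discussed later.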

- The failure link points to the longest match with a prefix of the pattern; at the position where matching failed, the next pattern character should be different (← Knuth's strengthening).
- Skip! You don't have to check these positions again!

Knuth-Morris-Pratt (3)

Example: pattern CTGATCTGC against text CTACTGATCTGATCGCTAGATGC. If matching fails at the first position, just proceed by 1. After a longer partial match, MP shifts by the overlap of "CTG" (skipping only 4 positions), while KMP, knowing the next characters must differ, skips 5 positions.

Knuth-Morris-Pratt (4)
- Preprocessing
  - A naive algorithm requires O(m^2) or even O(m^3) time
  - Linear-time algorithms exist:
    - Use the KMP itself
    - Z algorithm [Gusfield 97]: not faster than the KMP, but easier to understand

Z Algorithm (1)

Zi Compute it for all i (i >0) Longest common prefix length of S[0..n-1] and S[i..n-1]

Failure links can be easily obtained from Zi values

righti

Max value of x+Zx-1 (x≤i )

lefti

x that takes the maximum value of x+Zx-1 (x≤i ) Initialization Z Z box right0=left0=0 i lefti righti

Z Algorithm (2)

- Computation of Z_{i+1}
  - In case i+1 ≤ right_i: we have already computed the match up to position right_i
    - Let i' = (i+1) - left_i. In case Z_{i'} < right_i - i, we can copy the answer in O(1): Z_{i+1} = Z_{i'}
    - Otherwise, compare naively after the position right_i (case ①)
  - In case i+1 > right_i: compare naively (case ②)
  - ① + ② can be done in linear time in total!
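The two cases above can be sketched as follows (a minimal sketch; names are my own):

```python
def z_array(s):
    n = len(s)
    Z = [0] * n
    Z[0] = n                            # convention: the whole string matches itself
    left = right = 0                    # rightmost Z-box found so far: s[left..right-1]
    for i in range(1, n):
        if i < right and Z[i - left] < right - i:
            Z[i] = Z[i - left]          # case (copy): answer lies inside the Z-box, O(1)
        else:
            # case (naive): compare from max(i, right), the first unknown position
            start = max(i, right)
            while start < n and s[start - i] == s[start]:
                start += 1
            Z[i] = start - i
            left, right = i, start      # the new rightmost Z-box
    return Z
```

The naive comparisons only ever extend `right`, so they total O(n) over the whole run, which is the "①+② linear in total" argument.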

Z Algorithm (3)
- Example

Let's compute Z_i for each position, having already computed up to some position:

Text ATGCGCATAATGCGCTGAATGGCCATAATCTGAA
Z_i  0000002016000000013000002012000011

The same text appears between left and right (the rightmost Z-box).

Just copy the numbers if they are small enough (here, smaller than 3).

Computational complexity of KMP
- Worst-case time complexity: O(m+n) (n: text length, m: pattern length)
- #comparisons < 2n
  - If a comparison succeeds, that text position is never compared again (i.e., at most n times)
  - If a comparison fails, the shift position always increases (i.e., at most n-1 times)
- But this algorithm requires access to all the positions in the text. Can we reduce it? → Boyer-Moore algorithm

Boyer-Moore (1)
- Idea: almost the same as KMP, but check from the right!
- Practically faster than KMP
- Better average-case time complexity, but worse worst-case time complexity

Text

Example: the pattern GTTCGTT checked from the right against the text AATTGTTCCGGCCATGCCGGAT. On failure, skip based on the information of the matched suffix ("GTT") or of the failed character ("G").

Boyer-Moore (2)
- Two rules:
  - Bad character rule: if the character at the failed position is x, we can shift so that the last x in the pattern moves to that position
    - The algorithm that uses only (a variation of) this rule is called the Horspool algorithm
  - (Strong) good suffix rule
    - Strong: the character before the reoccurring suffix must be different
    - This constraint was not used in the original BM algorithm
    - cf. Knuth's rule in KMP
- Do the larger shift of the above two
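A minimal sketch of the bad-character-rule-only variant (the Horspool algorithm mentioned above; names are my own):

```python
def horspool(text, pattern):
    """Horspool: check from the right, shift by the bad-character table.
    shift[c] = distance from the last occurrence of c in the pattern
    (excluding the final position) to the pattern's end."""
    n, m = len(text), len(pattern)
    shift = {}
    for j in range(m - 1):
        shift[pattern[j]] = m - 1 - j
    hits, i = [], 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:   # check from the right
            j -= 1
        if j < 0:                                     # whole pattern matched
            hits.append(i)
        # shift on the text character aligned with the pattern's last position;
        # characters absent from the pattern allow a full shift of m
        i += shift.get(text[i + m - 1], m)
    return hits
```

With a large alphabet the typical shift is close to m, which is where the O(n / min(m, alphabet size)) average case discussed below comes from.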

(Good suffix rule figure: the matched suffix reoccurs earlier in the pattern; a different preceding character = the strong rule.)

Boyer-Moore (3)
- Bad character rule example

Pattern TTCCAAGTCGCC (do not consider the last character when building the rule)

Text  CCCTGTCCATGCCGTCAGCCC
      TTCCAAGTCGCC   ← failed; shift so that the last T in the pattern aligns with the mismatched text character

Boyer-Moore (4)
- (Strong) good suffix rule example

Pattern CGTATATCCAATATC

Text  AGTCCCTCGGTCCGATATCGACCCTCCCG
      CGTATATCCAATATC   ← failed; shift by the good suffix rule
           CGTATATCCAATATC

Boyer-Moore (5)
- Preprocessing
  - Bad character rule: very easy
  - Good suffix rule: linear time, by using the Z algorithm backwards

Boyer-Moore (6)
- Computational time complexity
  - Average-case O(n / min(m, alphabet size)), i.e., the average skip length is O(min(m, alphabet size))
    - The Horspool algorithm has the same time complexity
  - Worst-case O(nm)
- Bad cases:
  - Many repeats → KMP is faster
  - Small alphabet size → Shift-Or is faster
- Linear time for finding only 1 occurrence
  - Good for grep in editors

KMP and an automaton
- The KMP can be represented by an automaton

(Automaton figure: states for the pattern ATATTG, with failure links.)

Aho-Corasick (1)
- The automaton can be extended for multiple queries!
- Linear-time construction! Linear-time searching!
(Keyword-tree figure with failure links; links go to the root if not specified.)

Aho-Corasick (2)
- Construction of the keyword tree
  - O(M) time (M: sum of the query string lengths; alphabet size: fixed)
  - Can be used for dictionary searching

Aho-Corasick (3)
- Computing the failure links: breadth-first search starting from the root
  - No failure link at the root
  - FailureLink(v): traverse the failure links of v's parent to find a node that has a child w with the same label, and let the nearest such w be FailureLink(v)
  - If no such node exists, let FailureLink(v) = root

Aho-Corasick (4)
- Why is it linear time?
  - Every failure link to be made points to a node representing a shorter suffix of some pattern, i.e., an existing path from the root in the tree
  - Each hop along a failure link shortens that suffix by at least 1, so over the whole construction we traverse at most O(m) nodes per pattern

Aho-Corasick (5)
- OutLink(v): a pointer to the node whose pattern v must also output
- Computation of OutLink():
  - Traverse the failure links to find a leaf (a node where a pattern ends), if any
  - If there is no such leaf, there is no need to set the outlink
  - Also in linear time
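The construction and search can be sketched as follows (a minimal sketch with dict-based nodes; names are my own, and for simplicity OutLink is folded into per-node output lists inherited along failure links):

```python
from collections import deque

def aho_corasick(patterns, text):
    goto, fail, out = [{}], [0], [[]]           # node 0 is the root
    for p in patterns:                          # build the keyword tree
        v = 0
        for c in p:
            if c not in goto[v]:
                goto[v][c] = len(goto)
                goto.append({}); fail.append(0); out.append([])
            v = goto[v][c]
        out[v].append(p)                        # pattern p ends at node v
    queue = deque(goto[0].values())             # BFS to set failure links
    while queue:
        v = queue.popleft()
        for c, w in goto[v].items():
            queue.append(w)
            f = fail[v]                         # walk parent's failure chain
            while f and c not in goto[f]:
                f = fail[f]
            fail[w] = goto[f][c] if c in goto[f] and goto[f][c] != w else 0
            out[w] += out[fail[w]]              # inherit outputs (OutLink role)
    hits, v = [], 0
    for i, c in enumerate(text):                # scan the text once
        while v and c not in goto[v]:
            v = fail[v]
        v = goto[v].get(c, 0)
        for p in out[v]:
            hits.append((i - len(p) + 1, p))
    return hits
```

Construction touches each tree node a constant number of times plus the failure-chain walks bounded as argued above, so both phases are linear.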

(Figure: keyword tree for the dictionary {together, ether, get, her, he}, with failure links.)

Regular expression search based on automata (1)
- Regular expressions
  - Concatenation: A, B → AB
  - Or: A, B → A+B
  - Repeat: A → A*
- Extension of Aho-Corasick
- Example: AB(A+B)(AB+CD)*B matches ABAB, ABBB, ABAABB, ABACDB, ABBABB, ABBCDB, ABAABABB, ABAABCDB, ...

Regular expression search based on automata (2)
- Construct the automaton for a regular expression, e.g., (A*B+AC)D
(NFA figure with ε-transitions between Start and End.)

(Figures: automaton building blocks for AB (concatenation), A+B (or), and A* (repeat), connected by ε-transitions.)

Regular expression search based on automata (3)
- Search by dynamic programming over the automaton states: O(nm) time
- Example: text CDAABCAAABDDACDAAC; a match can start anywhere, so track the set of reachable states (not including ε-states) at each text position until the End state is reached ("Found!")

But…
- All the algorithms mentioned so far require O(n·f(m)) time, which is too slow for VERY BIG DATA

(Figure: SRA Database Size, #bases (1E+10 to 1E+16) vs. year 2007-2013; the SRA database grows 10x in 18 months, while Moore's law gives only 2x in 18 months. Source: http://www.ncbi.nlm.nih.gov/sra)

Keyword Tree (cf. Aho-Corasick Algorithm)
- A tree for searching on a dictionary
- Linear-time constructible
- Linear-time searchable

(Keyword-tree figure.)

Suffix Automata
- We can efficiently search for any substring of a given text if we have the keyword tree for all the suffixes of the text
  - O(n) substrings of length O(n) → O(n^2) space?
- Add '$' at the end of the text, so that all the suffixes end at leaves
  - '$' is a character that does not appear anywhere in the text

(Figure: the suffix automaton (keyword tree) of 'mississippi$', built from all its suffixes: mississippi$, ississippi$, ssissippi$, sissippi$, issippi$, ssippi$, sippi$, ippi$, ppi$, pi$, i$, $.)

Suffix Trees
- O(n) representation of the suffix automaton
  - Eliminate nodes with only one child
  - Represent substrings by their indices in the text, e.g., T[3..5] = 'ssi'
  - O(1) memory requirement per node/edge

(Figure: the suffix tree of 'mississippi$'.)

Suffix tree construction algorithms
- History
  - Weiner '73: O(ns) (s: alphabet size)
  - McCreight '76: O(n log s)
  - Ukkonen '95: O(n log s), an on-line version of McCreight's algorithm
  - Farach '97: O(n) for the integer alphabet [1..n]
- Construction from suffix arrays
  - An O(n) suffix array construction algorithm ([Kärkkäinen & Sanders '03] etc.)
  - O(n) conversion from suffix arrays [Kasai et al. '01]

How to search for a number in an array of numbers
- Find a number in an array
- Requires O(n) time if we do not preprocess the array

3 14 23 5 4 11 99 38 26 22 15 17 31 18

If the numbers are sorted...
- O(log n) binary search
- Check the number in the middle; if it is larger than the query, go to the left half, otherwise go to the right. Repeat!

3 14 23 5 4 11 99 38 26 22 15 17 31 18

Sort

3 4 5 11 14 15 17 18 22 23 26 31 38 99

(① ② ③ mark successive probes of the binary search.)

Keyword Searching on Sorted Keywords
- Similarly, we can perform binary search on a sorted keyword list
- This can be an alternative to the keyword tree

Sorted Dictionary

Pattern

A little smarter searching
- A smarter binary search
  - Do not check the unnecessary (already matched) part of the keywords
  - The time complexity is the same, but practically it is much faster
  - It can be further improved to O(m + log n) with some additional data structure (or even O(m) in some special cases)

(Figure: maintain boundaries L and R; the part of the keywords known to be the same as the query need not be re-compared, so the comparison at the middle starts from there.)

All suffixes:        Sorted:
0: mississippi$      10: i$
1: ississippi$        7: ippi$
2: ssissippi$         4: issippi$
3: sissippi$          1: ississippi$
4: issippi$           0: mississippi$
5: ssippi$            9: pi$
6: sippi$             8: ppi$
7: ippi$              6: sippi$
8: ppi$               3: sissippi$
9: pi$                5: ssippi$
10: i$                2: ssissippi$
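A minimal sketch of a suffix array and the binary search on it (the construction here is the naive comparison sort, roughly O(n^2 log n), not one of the linear-time algorithms below; names are my own):

```python
def suffix_array(text):
    """Sort suffix start indices by the suffixes they denote."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_search(text, sa, pattern):
    """All occurrences of pattern: binary search for the block of
    suffixes that start with it."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                          # leftmost suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo])                 # collect the matching block
        lo += 1
    return sorted(hits)
```

Each binary-search probe compares up to m characters, giving the O(m log n) search; the O(m + log n) refinement mentioned above needs the height array introduced below.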

SA vs ST
- Suffix arrays correspond to the leaves of suffix trees
  - So we can construct suffix arrays from suffix trees very easily in linear time
  - But note that we can also construct suffix trees from suffix arrays in linear time, with Kasai et al.'s algorithm
- Suffix arrays are much smaller
  - 14n bytes vs 5n bytes (for the 32-bit / English alphabet case)
  - Thus it is not good to build suffix arrays via suffix trees
- Searching speed is not so different
  - Suffix arrays can achieve theoretically/practically almost the same performance as suffix trees, with some additional data structure(s)
- Applications: not so different; both have many applications.

(Figure: the suffix tree and the suffix array of 'mississippi$' side by side.)

Kasai-Lee-Arimura-Park Algorithm (1/3)
- Height array
  - LCP (longest common prefix) lengths of adjacent suffixes in a suffix array
  - e.g., the LCP between "ississippi" and "issippi" is 4
- Small values, so we can store it in about 1n byte of memory
- Many applications
  - Faster worst-case O(m + log n) search
  - Faster LCP computation between arbitrary suffixes
  - Linear-time construction of suffix trees from suffix arrays

Height array (each number = LCP with the next suffix):
10: i$            1
 7: ippi$         1
 4: issippi$      4
 1: ississippi$   0
 0: mississippi$  0
 9: pi$           1
 8: ppi$          0
 6: sippi$        2
 3: sissippi$     1
 5: ssippi$       3
 2: ssissippi$

Kasai-Lee-Arimura-Park Algorithm (2/3)
- Linear-time height array construction
  - Just compute it in the order of the original text positions
  - Utilize the inverse suffix array

- When moving from suffix T_i to T_{i+1}, use the result of the comparison for the previous suffix: the matched part h_i also exists somewhere else in T, so the LCP value can shrink by at most 1
- At most 2n comparisons in total
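This scan can be sketched as follows (a minimal sketch; `kasai_height` and the variable names are my own):

```python
def kasai_height(text, sa):
    """Height (LCP) array in O(n): height[k] = LCP of sa[k-1] and sa[k].
    Suffixes are visited in text order, reusing the previous LCP minus 1."""
    n = len(text)
    rank = [0] * n
    for k, i in enumerate(sa):
        rank[i] = k                     # inverse suffix array
    h, height = 0, [0] * n
    for i in range(n):                  # suffix T[i..] in original text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]         # suffix just before T[i..] in the SA
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1                  # extend the known common prefix
            height[rank[i]] = h
            if h > 0:
                h -= 1                  # LCP shrinks by at most 1 for T[i+1..]
        else:
            h = 0                       # the lexicographically first suffix
    return height
```

Since h decreases by at most 1 per step and never exceeds n, the total extension work is at most 2n comparisons, matching the bound above.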

(Figure: the matched part h_i of T_i exists somewhere else in T, so T_{i+1} can reuse it.)

Kasai-Lee-Arimura-Park Algorithm (3/3)
- Linear-time suffix tree construction using the height array
- Just construct it from left to right!

(Figure: inserting the leaves left to right; check upward from the previous leaf to find where to insert the next suffix.)

Suffix Array Construction Algorithms (1992-2009)
- Naively from suffix trees
  - O(n), but requires too much memory (i.e., 14n bytes)
  - No faster than direct construction algorithms (practically/theoretically)
- Ternary quick sort
  - Bentley & Sedgewick '97
  - Fast for random texts (O(n log n)); worst-case O(n^2)
- Doubling algorithm
  - Manber & Myers '92, Larsson & Sadakane '98
  - Worst-case O(n log n)
- Copy-based (BWT-like) algorithms
  - Itoh & Tanaka '99, Seward '00, Manzini & Ferragina '02, Schürmann & Stoye '05
  - Small memory; worst-case O(n^2 log n); practically fast
- Divide-and-merge algorithms (similar to Farach's algorithm)
  - Kärkkäinen & Sanders '03, Ko & Aluru '03, Kim et al. '03, Hon et al. '03, Na '05, Nong et al. '09 (induced sorting)
  - Worst-case O(n)
- Burkhardt & Kärkkäinen '03
  - Worst-case O(n log n), o(n) memory

Various Algorithms (1998-2009)

(Figure: benchmark comparison of suffix array construction algorithms: Larsson-Sadakane '98, Itoh-Tanaka '99, Kurtz '99 (best algorithm for suffix trees), Seward '00, Kärkkäinen-Sanders '03, Burkhardt-Kärkkäinen '03, Ko-Aluru '03, Kim-Jo-Park '04 (divide and merge), Manzini-Ferragina '04 (improvement over Seward '00, copy-based), Baron-Bresler '05, Schürmann-Stoye '05 (copy-based), Maniscalco-Puglisi '06a/b (copy-based). Nong-Zhang-Chan '09 (induced sorting) is around here; it could be the final answer!)

Fig. 8 from Simon J. Puglisi, W. F. Smyth & Andrew Turpin, "A taxonomy of suffix array construction algorithms", ACM Computing Surveys, 39-2 (2007).

Relationships to the Ordinary Sorting Problem
- The suffix sorting problem is an extension of the ordinary sorting problem
  - They are the same problem if there is no identical number in the input text
  - So it cannot be faster than O(n log n) in general (neither can suffix tree construction!)
  - An O(n) algorithm exists for the [1..n] integer alphabet (bucket sort)
- SA construction algorithms use normal sorting algorithms inside:
  - Quick sort: Bentley-Sedgewick, Larsson-Sadakane, etc.
  - Radix sort: Manber-Myers, Kärkkäinen-Sanders, Ko-Aluru, Larsson-Sadakane, etc.
  - Merge sort: Kärkkäinen-Sanders, Ko-Aluru, etc.

Quick Sort
- The most fundamental sorting algorithm
- Worst-case O(n^2) / average-case O(n log n)

1. Choose a pivot
2. Divide into two parts (smaller/larger)
3. Repeat!

Bentley & Sedgewick '97
- Ternary quick sort
  - Quick sort for sorting keywords, also applicable to suffix arrays
  - The easiest algorithm to implement!
- Time complexity
  - O(n log n) for random texts; worst-case O(n^2)
  - Especially bad for texts with many repeats
    - e.g., "ATATATATATATATATAT" in DNA, or many occurrences of "This is" in an English text
  - This cannot be avoided by random pivoting!

Bucket Sort
- A sorting algorithm for the [1..n] integer alphabet
- Just put each number into its corresponding bucket!
- Computable in linear time!

Input:   1 5 4 3 2 6 4 1 3 5 5 2 6
Buckets: 1 2 3 4 5 6

Radix Sort
- Sort from the rightmost digit
- Utilize the O(n+s) bucket sort (s: the alphabet size); keep it stable, i.e., do not change the order of equal keys
- O((n+s)·b) total (b: #digits)
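LSD radix sort via repeated stable bucket sort can be sketched as follows (a minimal sketch; names are my own, and the bucket count 256 just covers byte-sized characters):

```python
def bucket_sort_by_digit(items, key):
    """Stable bucket sort of items by key(item) in [0, 255]: O(n + s)."""
    buckets = [[] for _ in range(256)]
    for x in items:
        buckets[key(x)].append(x)       # appending preserves input order
    return [x for b in buckets for x in b]

def radix_sort(strings, length):
    """LSD radix sort of equal-length strings: O((n + s) * b)."""
    for d in range(length - 1, -1, -1): # sort from the rightmost digit
        strings = bucket_sort_by_digit(strings, lambda s: ord(s[d]))
    return strings
```

Stability is what makes the method work: after the pass on digit d, ties are still ordered by the digits to the right of d.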

(Example: sorting ATGC, TATA, GTCA, ATGA, GTTA, GGGC, CTGC, GGGG, TGTG, GTGT by stable bucket sort on digits d = 4, 3, 2, 1 in turn; after the d = 1 pass the list is sorted.)

Manber & Myers '92
- Doubling algorithm
  - Radix sort log n times
  - Replace each sorted substring with its rank in the sorted list
  - Linear time for each radix sort
- Sort substrings of length 1: A T A C G T A A C G T A A C T G

- Sort substrings of length 2, then length 4, then 8, then 16, ...

Manber & Myers '92 (Example)
- Radix sort
- Initial text: 3 0 4 1 5 9 2 3 5 2 3 1 4 5 2 3 1 2
- Replace all the substrings of length 2 with their ranks, obtained by a 2-digit radix sort (with $ at the end):
  04→0, 12→1, 14→2, 15→3, 2$→4, 23→5, 30→6, 31→7, 35→8, 41→9, 45→a, 52→b, 59→c, 92→d
- The text becomes: 6 0 9 3 c d 5 8 b 5 7 2 a b 5 7 1 4
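The doubling idea can be sketched as follows (a minimal sketch; it sorts the rank pairs with Python's comparison sort, so each round is O(n log n) rather than the O(n) radix sort of Manber & Myers; names are my own):

```python
def suffix_array_doubling(text):
    """Prefix doubling: after round k, suffixes are sorted and ranked
    by their first 2k characters."""
    n = len(text)
    rank = [ord(c) for c in text]       # ranks by the first character
    sa = list(range(n))
    k = 1
    while True:
        # sort by the pair (rank of first k chars, rank of next k chars)
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for t in range(1, n):           # re-rank; equal pairs share a rank
            new_rank[sa[t]] = new_rank[sa[t - 1]] + (key(sa[t]) != key(sa[t - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:       # all ranks distinct: fully sorted
            break
        k *= 2
    return sa
```

Each round doubles the sorted prefix length, so O(log n) rounds suffice, matching the "radix sort log n times" description above.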

- Sort these in the next step (with the same 2-digit radix sort)

Merge Sort
- We can merge two sorted lists in linear time
- We can sort an array in O(n log n) by doing this recursively

Example (comparing the heads: 14 < 19):
sorted array 1: 5 14 15 26 31 42 46
sorted array 2: 1 4 12 13 19 22 25
merged array:   1 4 5 12 13 14 ...

Kärkkäinen & Sanders '03 (1)
- Consider suffixes of 3 types, P, Q, R: suffixes that start at positions 3i, 3i+1, and 3i+2, respectively
- Construct the suffix array for (only) P and Q: f(2n/3) time
- Construct the suffix array for R, using the SA for P and Q: O(n) time
- Merge the two sorted arrays: O(n) time

(Figure: the text ATACGTAACGTAACTG with its positions classified into the types P, Q, R.)

Kärkkäinen, J., and Sanders, P. (2003) "Simple Linear Work Suffix Array Construction", Proc. ICALP, LNCS 2719, pp. 943-955.

Kärkkäinen & Sanders '03 (2)
- Total computation time: f(n) = f(2n/3) + O(n), i.e., O(n)

If f(n) = c1·f(c2·n) + O(n) with 0 < c = c1·c2 < 1, then f(n) = O(n):

f(n) < c1·f(c2·n) + a·n
     < c1^2·f(c2^2·n) + a(1+c)·n
     < ··· < c1^(log n)·f(1) + a(1+c+c^2+···)·n
     < a·n/(1-c) + const

Kärkkäinen & Sanders '03 (3)
(Figure: the text ATACGTAACTACCGTG; the length-3 substrings starting at the P and Q positions are replaced by their ranks, e.g., 3 4 0 8 4 5 and 8 6 2 1 7, and the SA is constructed for the combined string 340845$86217.)

Kärkkäinen & Sanders '03 (4)
- SA for R: a 1-digit radix sort using the SA for P; linear time
- The SA for P can be obtained from the SA for P+Q

(Figure: the text with P, Q, R positions; instead of comparing a Q-suffix q and an R-suffix r directly, compare their first characters and then the suffixes one position later, whose order is already known from the SA for P+Q.)

Kärkkäinen & Sanders '03 (5)
- Merge them in linear time!
  - Naively it takes O(n^2) time, as comparing two substrings requires O(n) time
  - But a Q-suffix and an R-suffix can be compared in O(1) time, using the result of the SA for P+Q
  - The same holds for comparing a P-suffix and an R-suffix

Induced Sorting (Nong-Zhang-Chan '09)
- A more sophisticated design of the sub-problem
  - S-position: T[i] s.t. T[i..n] < T[i+1..n] (lexicographically)
  - L-position: T[i] s.t. T[i..n] > T[i+1..n] (lexicographically)
  - LMS-position: the leftmost S-position among a consecutive run of S-positions
- At first, construct the SA only for the suffixes that start at the LMS-positions
- Much faster than the KS algorithm, mainly due to the smaller size of the subproblem

(Figure: a text annotated with < (S) and > (L) at each position; * marks the LMS positions, i.e., the leftmost < of each run.)

Applications of the Suffix Trees/Arrays
- Combinatorial pattern matching
  - Set matching problem
  - Matching statistics
  - Longest common substring
  - Multiple common substrings
  - Maximal repeats
  - Palindrome finding
- Compression algorithms
- Machine learning
  - String kernels
- Bioinformatics applications
  - Large-scale alignment
  - RNA palindrome detection
  - Tandem repeat detection
  - Motif finding
  - DNA assembly
  - Primer design

Summary
- Matching algorithms
- Indexing algorithms
- Next week: text communication / compression algorithms