Advanced Algorithms / T. Shibuya
Advanced Algorithms: Text Algorithms
Tetsuo Shibuya
Human Genome Center, Institute of Medical Science (Adjunct at Department of Computer Science), University of Tokyo
http://www.hgc.jp/~tshibuya

Self Introduction
- Affiliation: Laboratory of Sequence Analysis, Human Genome Center, Institute of Medical Science (Adjunct at Department of Computer Science)
- Research interests: algorithms for bioinformatics / combinatorial pattern matching / big data
- Our lab is located on the 4th floor.

The Topics of This Part
- Text matching/indexing algorithms (today's topic): Knuth-Morris-Pratt / Boyer-Moore / suffix arrays / etc.
- Text communication algorithms: Hamming codes
- Text compression algorithms: Huffman coding / arithmetic coding / block sorting / etc.
- Text models: Markov models / etc.
An assignment for this part will be given in the last (i.e., the 3rd) week.
- Submit 1 report for Prof. Imai's part (compulsory), AND
- Submit 1 report for 1 of the remaining 3 parts, i.e., my part, Hirahara-san's part, or May Szedak-san's part.

Textbooks
- D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997. (The most famous book on text processing algorithms, but many parts are out of date.)
- W. Sung, Algorithms in Bioinformatics, CRC Press, 2009. (A good introduction to bioinformatics algorithms, mainly on text processing.)
- D. Salomon, G. Motta, Handbook of Data Compression, Springer, 2010.
- T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991.

Today's Topic
- Text matching algorithms: brute-force algorithm, Rabin-Karp algorithm, Knuth-Morris-Pratt algorithm, Boyer-Moore algorithm, matching automata
- Text indexing algorithms: suffix arrays, FM-index

Text Matching
Problem
- Given: a text string T and a pattern (query) P
- Output: all substrings of T that are exactly the same as P, if any.
- Exact matching: no insertions / deletions / modifications (mutations).
Two approaches: matching and indexing
- Preprocess only the query pattern (matching)
- Preprocess the text beforehand (indexing) - needs extra data structures
Example
- Text: GGTGAGAAGTTATGATACAGGGTAGTTG TGTCCTTAAGGTGTATAACGATGACATC ACAGGCAGCTCTAATCTCTTGCTATGAG TGATGTAAGATTTATAAGTACGCAAATT
- Pattern (query): TATAA

Two Types of Text Matching Algorithms
- Brute force: naive algorithm
- Fingerprinting (hash-based): Rabin-Karp
- Skipping positions unnecessary to compare:
  - Check from left: Knuth-Morris-Pratt, Aho-Corasick (for multiple queries)
  - Check from right: Boyer-Moore

Naive Algorithm
- Just check the pattern one character at a time at each text position.
- O(nm) in the worst case, but linear time on average!
- Not so bad when you have no time to implement anything else :-)
- But still much slower in practice than the more sophisticated algorithms.
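The naive algorithm above can be written in a few lines. A minimal Python sketch (the function name and interface are illustrative, not from the lecture):

```python
def naive_search(text, pattern):
    """Check the pattern at every text position: O(nm) worst case,
    but expected O(n) total comparisons on random text."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        # Compare character by character (slice comparison stops early
        # on the first mismatch internally).
        if text[i:i + m] == pattern:
            hits.append(i)
    return hits
```

Note that occurrences may overlap, so every starting position is checked.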
Example (naive check, for a random DNA sequence): text GGGACCAAGTTCCGCACATGCCGGATAGAAT, pattern CCGTATG.
Average number of characters checked per position: 1 + 1/4 + (1/4)^2 + ... = 4/3 (a constant!).

Rabin-Karp (1)
- Based on fingerprinting (i.e., hashing): compare the pattern only at positions with the same fingerprint.
- All the text fingerprints can be obtained in linear time if we use an appropriate fingerprint, e.g.,
    hash(x[0..n-1]) = (x[0]·d^(n-1) + x[1]·d^(n-2) + x[2]·d^(n-3) + ... + x[n-1]) mod q
  (q: some prime number).
- Compute hash(T[0..|P|-1]), hash(T[1..|P|]), hash(T[2..|P|+1]), ..., each from the previous one in O(1) time, and first compare each against hash(P).

Rabin-Karp (2)
Example: text 11001101110100101, pattern 10111, with d = 2 and q = 5.
- Text windows:
  - (16+8+1) mod 5 = 0
  - ((0 - 1·16)·2 + 1) mod 5 = 4 (O(1))
  - ((4 - 1·16)·2 + 0) mod 5 = 1 (O(1))
  - ((1 - 0·16)·2 + 1) mod 5 = 3 → check → NO (O(1))
  - ((3 - 0·16)·2 + 1) mod 5 = 2 (O(1))
  - ((2 - 1·16)·2 + 1) mod 5 = 3 → check → YES!
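The rolling-hash search above can be sketched as follows (a minimal Python version; the radix d and prime q are illustrative choices, and candidate positions are verified to rule out hash collisions):

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Rolling-hash search: report positions whose window hash equals the
    pattern hash, then verify the match explicitly."""
    n, m = len(text), len(pattern)
    if m == 0 or n < m:
        return []
    h = pow(d, m - 1, q)                  # weight of the leading character
    ph = th = 0
    for i in range(m):                    # initial hashes in O(m)
        ph = (ph * d + ord(pattern[i])) % q
        th = (th * d + ord(text[i])) % q
    hits = []
    for i in range(n - m + 1):
        if th == ph and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:                     # slide the window in O(1)
            th = ((th - ord(text[i]) * h) * d + ord(text[i + m])) % q
    return hits
```

Because every hash hit is verified, the output is exact even when unrelated windows collide under mod q.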
- Pattern: hash(10111) = (16+4+2+1) mod 5 = 3.

Knuth-Morris-Pratt (1)
- Another way to improve the brute-force algorithm.
- The brute-force algorithm sometimes checks the same text position more than once, which can waste time → the Knuth-Morris-Pratt algorithm avoids this.
Example: text AATACTAGTAGGCATGCCGGAT, pattern TAGTAGC, checking from the left. After matching "TAGTAG" and failing at the next character, we already know the text we have just read, so positions where the pattern cannot possibly match can be skipped without any comparison.

Knuth-Morris-Pratt (2)
Failure links: if P[0..i] matches the text but P[i+1] does not, then
  FailureLink[i+1] = max j s.t. P[0..j] ≡ P[i-j..i], P[j+1] ≠ P[i+1], and j < i
i.e., the longest match with a prefix of the pattern, where the character after the matched prefix must be different from the character that just failed (← Knuth's refinement).
Skip! You don't have to check these positions again!
Knuth-Morris-Pratt (3)
Example: text CTACTGATCTGATCGCTAGATGC, pattern CTGATCTGC.
- Fail at the first position → just proceed (skip 1 position).
- After failing with an overlap of "CTG": MP skips only 4 positions, while KMP skips 5 positions (the overlap with the same following character is disallowed).

Knuth-Morris-Pratt (4)
Preprocessing (computing the failure links):
- A naive algorithm requires O(m^2) or even O(m^3) time.
- Linear-time algorithms exist: use KMP itself, or the Z algorithm [Gusfield 97] (not faster than KMP's own preprocessing, but easier to understand).
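The failure-link preprocessing and the matching loop above can be sketched in Python (a minimal version using the classic prefix-function form; the "strong" Knuth refinement is omitted for clarity, so this is the MP variant):

```python
def failure_links(p):
    """fail[i] = length of the longest proper border of p[:i]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]                  # fall back along failure links
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail

def kmp_search(text, p):
    fail = failure_links(p)
    hits, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != p[k]:
            k = fail[k]                  # shift the pattern; text chars are never re-read
        if c == p[k]:
            k += 1
        if k == len(p):
            hits.append(i - len(p) + 1)
            k = fail[k]                  # continue searching for overlaps
    return hits
```

Each text character is consumed once and the pattern index only falls back along precomputed links, giving the O(m+n) bound discussed below.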
Z Algorithm (1)
- Z_i: the length of the longest common prefix of S[0..n-1] and S[i..n-1]. Compute it for all i > 0.
- Failure links can be easily obtained from the Z_i values.
- right_i: the maximum value of x + Z_x - 1 over x ≤ i (the right end of the rightmost "Z-box" found so far).
- left_i: the x that attains that maximum.
- Initialization: right_0 = left_0 = 0.

Z Algorithm (2)
Computation of Z_{i+1}:
- Case i+1 ≤ right_i: we have already compared the text up to position right_i.
  - If the corresponding prefix value Z_{i+1-left_i} is smaller than right_i - i, we can simply copy it in O(1) time.
  - Otherwise, compare naively from position right_i onward. --- (1)
- Case i+1 > right_i: compare naively. --- (2)
(1)+(2) can be done in linear time in total!

Z Algorithm (3)
Example
Suppose we have computed the Z values up to some position and want the value at the next position.
Text: ATGCGCATAATGCGCTGAATGGCCATAATCTGAA
Z_i:  0000002016000000013000002012000011
Inside a Z-box the text is identical to the corresponding prefix of the string, so we can just copy the earlier Z values whenever they are small enough (smaller than the distance to the box's right end).

Computational Time Complexity of KMP
- O(m+n) worst case (n: text length, m: pattern length).
- #comparisons < 2n:
  - If a comparison succeeds, that text character is never compared again → at most n successes.
  - If a comparison fails, the pattern's alignment position always advances → at most n-1 failures.
- But this algorithm still accesses all the positions in the text. Can we reduce that? → the Boyer-Moore algorithm.

Boyer-Moore (1)
Idea: almost the same as KMP, but check the pattern from the right!
- Practically faster than KMP.
- Better average-case time complexity, but worse worst-case time complexity.
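The Z algorithm described above can be sketched as follows (a minimal Python version; l and r play the roles of left_i and right_i, and the Z-box copy is the min() step):

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:];
    by convention z[0] = n here. Linear time overall."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    l = r = 0                            # current rightmost Z-box is s[l:r]
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])  # copy from the matching prefix position
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                    # extend naively past the box
        if i + z[i] > r:
            l, r = i, i + z[i]           # new rightmost Z-box
    return z
```

Each naive extension moves r strictly to the right, so the total work is O(n).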
Example: text AATTGTTCCGGCCATGCCGGAT, pattern GTTCGTT, compared from the right. After a mismatch we can skip ahead based on the information of the matched suffix "GTT", or of the mismatched text character "G".

Boyer-Moore (2)
Two rules (do the larger shift of the two):
- Bad character rule: if the text character at the failed position is x, we can shift the pattern so that the last occurrence of x in the pattern aligns with that position. The algorithm that uses only (a variation of) this rule is called the Horspool algorithm.
- (Strong) good suffix rule: shift the pattern to the previous occurrence of the matched suffix. "Strong" means the character before that occurrence must be different; this constraint was not used in the original BM algorithm (cf. Knuth's rule in KMP).
Boyer-Moore (3)
Bad character rule example:
Text:    CCCTGTCCATGCCGTCAGCCC
Pattern: TTCCAAGTCGCC
After a mismatch, shift so that the last occurrence of the failed text character in the pattern (not counting the pattern's last position) aligns with the failed position.

Boyer-Moore (4)
(Strong) good suffix rule example:
Text:    AGTCCCTCGGTCCGATATCGACCCTCCCG
Pattern: CGTATATCCAATATC
After matching a suffix of the pattern and then failing, shift the pattern to the previous occurrence of that suffix (whose preceding character must differ, under the strong rule).

Boyer-Moore (5)
Preprocessing:
- Bad character rule: very easy.
- Good suffix rule: linear time, by running the Z algorithm backwards.

Boyer-Moore (6)
Computational time complexity:
- Average case: O(n / min(m, alphabet size)), i.e., the average skip length is O(min(m, alphabet size)). The Horspool algorithm has the same average-case complexity.
- Worst case: O(nm). Bad cases: many repeats (KMP is faster), small alphabet size (Shift-Or is faster).
- Linear time for finding only 1 occurrence - good for grep in editors.

KMP and an Automaton
The KMP algorithm can be represented by an automaton.
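As a concrete illustration of the bad character rule above, here is a minimal Python sketch of the Horspool variant, which uses (a variation of) that rule alone; the good suffix rule is omitted, and the function name is illustrative:

```python
def horspool(text, pattern):
    """Bad-character-rule-only matching (Horspool variant).
    shift[c] = distance from the last occurrence of c in pattern[:-1]
    to the pattern's right end; unseen characters shift the full length."""
    n, m = len(text), len(pattern)
    if m == 0 or n < m:
        return []
    shift = {}
    for j, c in enumerate(pattern[:-1]):
        shift[c] = m - 1 - j
    hits, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:          # conceptually compared right-to-left
            hits.append(i)
        i += shift.get(text[i + m - 1], m)    # skip based on the window's last character
    return hits
```

On large alphabets the typical shift is close to m, which is where the sublinear average-case behavior comes from.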
(Automaton for the pattern ATATTG: each state has a goto edge labeled with the next pattern character, plus a failure link.)

Aho-Corasick (1)
- The automaton can be extended to multiple queries!
- Linear-time construction! Linear-time searching!
- Build a keyword tree of the queries and add failure links; links go to the root if not specified.

Aho-Corasick (2)
Construction of the keyword tree:
- O(M) time (M: sum of the query string lengths; alphabet size: fixed).
- Can be used for dictionary searching.

Aho-Corasick (3)
Computing the failure links by breadth-first search:
- Start from the root (the root has no failure link).
- FailureLink(v): traverse the failure links of v's parent to find a node that has a child w with the same edge label as v, and let the nearest such w be FailureLink(v).
- If no such node exists, let FailureLink(v) = root.
Aho-Corasick (4)
Why is it linear time? Every failure link points to a strictly shorter suffix (an existing path from the root, since all target strings are suffixes of some pattern), and the depth decrease over a whole pattern is bounded by its length, so building all failure links traverses at most O(M) nodes in total.

Aho-Corasick (5)
OutLink(v): a pointer to the node whose pattern v must also output (a pattern that is a suffix of the string spelled out by v).
Computation of OutLink():
- Traverse the failure links to find a leaf, if any.
- If there is no such leaf, there is no need to set the outlink.
- Also computable in linear time.
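The keyword tree, BFS failure links, and output inheritance above can be sketched compactly in Python (an illustrative dict-based version; for simplicity, outputs are merged into each node along failure links rather than kept as separate OutLink pointers):

```python
from collections import deque

def build_ac(patterns):
    """Keyword tree + failure links. Node 0 is the root; goto[v][c] = child."""
    goto, fail, out = [{}], [0], [set()]
    for pid, p in enumerate(patterns):
        v = 0
        for c in p:                              # insert into the keyword tree
            if c not in goto[v]:
                goto.append({}); fail.append(0); out.append(set())
                goto[v][c] = len(goto) - 1
            v = goto[v][c]
        out[v].add(pid)
    q = deque(goto[0].values())                  # root children: fail = root
    while q:                                     # BFS: parents are done first
        v = q.popleft()
        for c, w in goto[v].items():
            f = fail[v]
            while f and c not in goto[f]:
                f = fail[f]                      # climb the parent's failure links
            fail[w] = goto[f].get(c, 0)
            out[w] |= out[fail[w]]               # inherit outputs (cf. OutLink)
            q.append(w)
    return goto, fail, out

def ac_search(text, patterns):
    goto, fail, out = build_ac(patterns)
    v, hits = 0, []
    for i, c in enumerate(text):
        while v and c not in goto[v]:
            v = fail[v]
        v = goto[v].get(c, 0)
        for pid in out[v]:
            hits.append((i - len(patterns[pid]) + 1, patterns[pid]))
    return sorted(hits)
```

Searching consumes each text character once; the failure-link climbs are amortized just as in KMP.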
Example: the keyword tree for {together, ether, get, her, he}, with output links from nodes whose spelled-out string ends with another keyword.

Regular Expression Search Based on Automata (1)
Regular expressions:
- Concatenation: A, B → AB
- Or: A, B → A+B
- Repeat: A → A*
Extension of Aho-Corasick: e.g., AB(A+B)(AB+CD)*B matches ABAB, ABBB, ABAABB, ABACDB, ABBABB, ABBCDB, ABAABABB, ABAABCDB, ...

Regular Expression Search Based on Automata (2)
Construct the automaton for a regular expression (e.g., (A*B+AC)D) with Start and End states and ε-transitions, using the standard constructions for AB, A+B, and A*.
Regular Expression Search Based on Automata (3)
Searching: simulate the automaton over the text by dynamic programming, maintaining the set of reachable states (not including ε-states) at each text position; you can start anywhere in the text. O(nm) time.

But…
All the algorithms mentioned so far require O(n·f(m)) time, i.e., time at least linear in the text length, which is too slow for VERY BIG DATA.
(Chart: size of the SRA database in #bases, 2007-2013, growing about 10x per 18 months, versus Moore's law's 2x per 18 months. Source: http://www.ncbi.nlm.nih.gov/sra)

Keyword Tree (cf. Aho-Corasick Algorithm)
- A tree data structure for searching in a dictionary.
- Linear-time constructible. Linear-time searchable.
Suffix Automata
- We can efficiently search for any substring of a given text if we have the keyword tree for all the suffixes of the text.
- O(n) substrings of length O(n) → O(n^2) space?
- Add '$' at the end of the text, so that all the suffixes end at leaves ('$' is a character that does not appear anywhere in the text).
Example: the suffix automaton (keyword tree) of 'mississippi$', containing all the suffixes: mississippi$, ississippi$, ssissippi$, sissippi$, issippi$, ssippi$, sippi$, ippi$, ppi$, pi$, i$, $.

Suffix Trees
- An O(n) representation of the suffix automaton.
- Eliminate nodes with only one child.
- Represent substrings by their indices: O(1) memory per node/edge, e.g., T[3..5] = 'ssi'.
Example: the suffix tree of 'mississippi$' (the same suffixes, with unary paths contracted into single labeled edges).

Suffix Tree Construction Algorithms
History:
- Weiner '73: O(n·s) (s: alphabet size)
- McCreight '76: O(n log s)
- Ukkonen '95: O(n log s), an on-line version of the McCreight algorithm
- Farach '97: O(n) for the integer alphabet [1..n]
- Construction from suffix arrays: an O(n) suffix array construction algorithm ([Kärkkäinen & Sanders '03] etc.) plus O(n) conversion from suffix arrays [Kasai et al. '01]

How to Search for a Number in an Array of Numbers
Finding a number in an array requires O(n) time if we do not preprocess the array:
3 14 23 5 4 11 99 38 26 22 15 17 31 18

If the Numbers Are Sorted…
O(log n) binary search: check the number in the middle; if it is larger than the query, go left, otherwise go right; repeat!

Sorted: 3 4 5 11 14 15 17 18 22 23 26 31 38 99

Keyword Searching on Sorted Keywords
- Similarly, we can perform binary search on a sorted keyword list.
- This can be an alternative to the keyword tree.
A Little Smarter Searching
- A smarter binary search over the sorted dictionary: keep track of how much of the query already matches the left (L) and right (R) boundary keywords, and start each middle comparison from the part known to match.
- The worst-case time complexity is the same, but in practice it is much faster.
- It can be further improved to O(m + log n) with some additional data structure (or even to O(m) in some special cases).

Suffix Arrays
- The sorted list of all the suffixes of the text.
- n suffixes of O(n) length → O(n^2) space?
- An O(n) representation is possible: represent the suffixes by their start indices (as with the edges of suffix trees)!
All suffixes of 'mississippi$' (0: mississippi$, 1: ississippi$, 2: ssissippi$, ..., 10: i$), sorted:
  10: i$
   7: ippi$
   4: issippi$
   1: ississippi$
   0: mississippi$
   9: pi$
   8: ppi$
   6: sippi$
   3: sissippi$
   5: ssippi$
   2: ssissippi$

SA vs ST
- The suffix array corresponds to the leaves of the suffix tree, so we can construct the suffix array from the suffix tree very easily in linear time.
- But note that we can also construct the suffix tree from the suffix array in linear time with Kasai et al.'s algorithm.
- Suffix arrays are much smaller: 14n bytes vs 5n bytes (for the 32-bit / English alphabet case). Thus it is not good to build suffix arrays via suffix trees.
- Searching speed is not so different: with some additional data structure(s), suffix arrays achieve theoretically/practically almost the same performance as suffix trees.
- Both have many applications.
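A direct construction plus the binary search can be sketched in Python (illustrative only: the sort below is not linear time, and unlike the slide's listing this version also keeps the lone '$' suffix at index 11):

```python
def suffix_array(t):
    """Build the suffix array by directly sorting suffix start indices.
    O(n^2 log n) worst case -- fine for a demo; the linear-time
    algorithms discussed later replace this step."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def sa_search(t, sa, p):
    """All occurrence positions of p in t, via binary search on the
    suffix array: O(m log n) character comparisons."""
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:                        # lower bound of the p-block
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                        # upper bound of the p-block
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

All suffixes beginning with p form one contiguous block of the array, which is why two binary searches suffice.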
(Figure: the suffix tree and the suffix array of 'mississippi$' side by side; the suffix array lists the leaves of the suffix tree in left-to-right order.)

Kasai-Lee-Arimura-Park Algorithm (1/3)
Height array:
- The LCP (longest common prefix) lengths of adjacent suffixes in the suffix array, e.g., the LCP of "ississippi" and "issippi" is 4.
- The values are small, so we can store the array in about 1n bytes.
- Many applications: faster worst-case O(m + log n) search, fast LCP computation between arbitrary suffixes, linear-time construction of suffix trees from suffix arrays.
Height array of 'mississippi$' (each value is the LCP with the next suffix):
  10: i$           1
   7: ippi$        1
   4: issippi$     4
   1: ississippi$  0
   0: mississippi$ 0
   9: pi$          1
   8: ppi$         0
   6: sippi$       2
   3: sissippi$    1
   5: ssippi$      3
   2: ssissippi$

Kasai-Lee-Arimura-Park Algorithm (2/3)
Linear-time height array construction:
- Compute the values in the order of the original text positions.
- Utilize the inverse suffix array.
Key observation: if h_i is the LCP value computed for suffix T_i, then the matched part minus its first character also exists somewhere else in T, so the LCP for T_{i+1} is at least h_i - 1, and we can reuse the result of the previous comparison. This bounds the total work by at most 2n character comparisons.

Kasai-Lee-Arimura-Park Algorithm (3/3)
Linear-time suffix tree construction using the height array: just construct the tree from left to right, checking upward from the previous leaf to find where to insert the next one.
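The linear-time height-array construction above can be sketched as follows (a minimal Python version; here height[k] denotes the LCP of sa[k-1] and sa[k], which matches the adjacent-pair values in the example, just indexed from the lower neighbor):

```python
def kasai_height(t, sa):
    """Height array in O(n): LCPs are computed in the order of the
    original text positions so the previous value can be reused."""
    n = len(t)
    rank = [0] * n
    for k, i in enumerate(sa):
        rank[i] = k                     # inverse suffix array
    height = [0] * n
    h = 0
    for i in range(n):                  # text order, not SA order
        if rank[i] > 0:
            j = sa[rank[i] - 1]         # suffix just above T[i..] in the SA
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1                  # extend the previous match
            height[rank[i]] = h
            if h:
                h -= 1                  # LCP can shrink by at most 1
        else:
            h = 0
    return height
```

The counter h decreases by at most 1 per step and never exceeds n, so the total extension work is O(n).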
(Figure: leaves are inserted left to right; checking from the previous leaf upward finds where to insert the next one.)

Suffix Array Construction Algorithms (1992-2009)
- Naively from suffix trees: O(n), but requires too much memory (i.e., 14n bytes); no faster than direct construction algorithms, practically or theoretically.
- Ternary quick sort: Bentley & Sedgewick '97. Fast for random texts (O(n log n)); worst-case O(n^2).
- Doubling algorithm: Manber & Myers '92, Larsson & Sadakane '98. Worst-case O(n log n).
- Copy-based (BWT-like) algorithms: Itoh-Tanaka '99, Seward '00, Manzini-Ferragina '02, Schürmann & Stoye '05. Small memory; worst-case O(n^2 log n); practically fast.
- Divide-and-merge algorithms (similar to Farach's algorithm): Kärkkäinen & Sanders '03, Ko & Aluru '03, Kim et al. '03, Hon et al. '03, Na '05, Nong et al. '09 (induced sorting). Worst-case O(n).
- Burkhardt & Kärkkäinen '03: worst-case O(n log n), o(n) memory.
Various Algorithms (1998-2009)
(Taxonomy chart of suffix array construction algorithms: Larsson-Sadakane '98, Itoh-Tanaka '99, Kurtz '99 (best algorithm for suffix trees), Seward '00, Kärkkäinen-Sanders '03, Burkhardt-Kärkkäinen '03, Ko-Aluru '03 (improvement over Baron-Bresler '05), Kim-Jo-Park '04 (divide and merge), Manzini-Ferragina '04, Schürmann-Stoye '05 (copy), Maniscalco-Puglisi '06a/b (copy). Nong-Zhang-Chan '09 (induced sorting) is around here - it could be the final answer! Fig. 8 from Simon J. Puglisi, W. F. Smyth & Andrew Turpin, "A taxonomy of suffix array construction algorithms", ACM Computing Surveys, 39-2 (2007).)

Relationships to the Ordinary Sorting Problem
- Suffix sorting is an extension of the ordinary sorting problem: they are the same problem if no character appears twice in the input text.
- So it cannot be faster than O(n log n) in general - neither can suffix tree construction! An O(n) algorithm (bucket sort) exists for the [1..n] integer alphabet.
- SA construction algorithms use ordinary sorting algorithms inside:
  - Quick sort: Bentley-Sedgewick, Larsson-Sadakane, etc.
  - Radix sort: Manber-Myers, Kärkkäinen-Sanders, Ko-Aluru, Larsson-Sadakane, etc.
  - Merge sort: Kärkkäinen-Sanders, Ko-Aluru, etc.

Quick Sort
The most fundamental sorting algorithm: worst-case O(n^2) / average-case O(n log n).
1. Choose a pivot.
2. Divide the array into two parts (smaller / larger than the pivot).
3. Repeat on each part!

Bentley & Sedgewick '97
Ternary quick sort: quick sort adapted to sorting keywords (also applicable to suffix arrays). The easiest algorithm to implement!
Time complexity:
- O(n log n) for random texts.
- Worst-case O(n^2): especially bad for texts with many repeats, e.g., "ATATATATATATATATAT" in DNA, or many occurrences of "This is" in an English text. This cannot be avoided by random pivoting!

Bucket Sort
- A sorting algorithm for the [1..n] integer alphabet.
- Just put each number into its corresponding bucket! Computable in linear time.
Example: 4 1 3 5 5 2 6 → buckets 1 2 3 4 5 6 → 1 2 3 4 5 5 6

Radix Sort
- Sort from the rightmost digit, one digit per pass, using the O(n+s) bucket sort (s: the alphabet size) and keeping the order of ties (stability).
- O((n+s)·b) time (b: #digits).
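The bucket sort and the radix sort built on top of it can be sketched in Python (illustrative names and interfaces; the stable per-digit pass is exactly the "do not change the order of ties" rule):

```python
def bucket_sort(pairs, s):
    """Stable bucket sort of (key, value) pairs with keys in [0, s): O(n+s)."""
    buckets = [[] for _ in range(s)]
    for k, v in pairs:
        buckets[k].append((k, v))       # append preserves input order (stable)
    return [kv for b in buckets for kv in b]

def radix_sort(words, alphabet):
    """Sort equal-length strings over the given alphabet, one stable
    bucket-sort pass per position, rightmost digit first: O((n+s)*b)."""
    idx = {c: k for k, c in enumerate(alphabet)}   # e.g. alphabet = "ACGT"
    order = list(words)
    for d in range(len(words[0]) - 1, -1, -1):
        pairs = [(idx[w[d]], w) for w in order]
        order = [w for _, w in bucket_sort(pairs, len(alphabet))]
    return order
```

After the pass on the leftmost digit, stability guarantees the full lexicographic order.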
Example: sorting DNA 4-mers (ATGC, TATA, GTCA, GTTA, GGGC, CTGC, GGGG, ATGA, GTGT, TGTG, ...) digit by digit, d = 4, 3, 2, 1; each pass is a stable bucket sort that does not change the order of ties. Sorted after the last (leftmost) pass!

Manber & Myers '92
Doubling algorithm:
- Radix sort log n times; after each round, replace each sorted substring with its rank in the sorted list.
- Each radix sort takes linear time.
- Sort substrings of length 1 (e.g., of A T A C G T A A C G T A A C T G),
  then substrings of length 2, then 4, then 8,
  then substrings of length 16.

Manber & Myers '92 (Example)
Radix sort on the initial text 3 0 4 1 5 9 2 3 5 2 3 1 4 5 2 3 1 2.
Replace all the substrings of length 2 with their ranks (found by a 2-digit radix sort):
  04→0, 12→1, 14→2, 15→3, 2$→4, 23→5, 30→6, 31→7, 35→8, 41→9, 45→a, 52→b, 59→c, 92→d
New text: 6 0 9 3 c d 5 8 b 5 7 2 a b 5 7 1 4
Sort these in the next step (with the same 2-digit radix sort).

Merge Sort
- We can merge two sorted lists in linear time.
- We can sort an array in O(n log n) time by doing this recursively.
Example: merging sorted array 1 (5 14 15 26 31 42 46) and sorted array 2 (1 4 12 13 19 22 25): repeatedly compare the heads (e.g., 14 < 19) and move the smaller one to the merged array (1 4 5 12 13 14 ...).

Kärkkäinen & Sanders '03 (1)
- Consider suffixes of 3 types P, Q, R: those starting at positions 3i, 3i+1, and 3i+2, respectively.
- Construct the suffix array for (only) P and Q: f(2n/3) time.
- Construct the suffix array for R, using the SA for P and Q: O(n) time.
- Merge the two sorted arrays: O(n) time.
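The linear-time merge used by merge sort, and as the last step above, can be sketched as follows (a minimal Python version):

```python
def merge(a, b):
    """Merge two sorted lists in O(len(a) + len(b)) time."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:                 # move the smaller head
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]           # append whichever tail remains
```

In the suffix-array setting the comparison of two heads must itself be O(1), which is exactly what steps (4)-(5) of the KS algorithm arrange.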
Example text: A T A C G T A A C G T A A C T G, partitioned into type-P, type-Q, and type-R positions.
Kärkkäinen, J., and Sanders, P. (2003) "Simple Linear Work Suffix Array Construction", Proc. ICALP, LNCS 2719, pp. 943-955.

Kärkkäinen & Sanders '03 (2)
Total computation time: f(n) = f(2n/3) + O(n), i.e., O(n).
Proof sketch: if f(n) = c1·f(c2·n) + O(n) with c = c1·c2 < 1, then
  f(n) < c1·f(c2·n) + a·n < c1^2·f(c2^2·n) + a(1+c)·n < ···
       < c1^(log n)·f(1) + a(1+c+c^2+···)·n < a·n/(1-c) + const,
i.e., f(n) = O(n).

Kärkkäinen & Sanders '03 (3)
Construct the SA for the P and Q suffixes: encode each position by the rank of its length-3 substring and recursively construct the SA of the resulting string of length 2n/3 (e.g., "340845$86217" in the slide's example for A T A C G T A A C T A C C G T G).

Kärkkäinen & Sanders '03 (4)
SA for R: a 1-digit radix sort, using the SA for P (which can be obtained from the SA for P+Q). Linear time.

Kärkkäinen & Sanders '03 (5)
Merge the two arrays in linear time!
- Naively it would take O(n^2) time, as comparing two substrings requires O(n) time.
- But a Q-suffix and an R-suffix can be compared in O(1) time using the result of the SA for P+Q (compare one character, then compare the following P- and Q-suffixes instead); the same holds for comparing a P-suffix and an R-suffix.

Induced Sorting (Nong-Zhang-Chan '09)
A more sophisticated design of the sub-problem:
- S-position: T[i] s.t. T[i..n] < T[i+1..n] (lexicographically)
- L-position: T[i] s.t. T[i..n] > T[i+1..n] (lexicographically)
- LMS-position: the leftmost S-position among a consecutive run of S-positions.
At first, construct the SA only for the suffixes that start at the LMS-positions. This is much faster than the KS algorithm, mainly due to the smaller size of the subproblem.

Applications of Suffix Trees/Arrays
- Combinatorial pattern matching: set matching, matching statistics, longest common substrings, multiple common substrings, maximal repeats, palindrome finding
- Compression algorithms
- Machine learning: string kernels
- Bioinformatics applications: large-scale alignment, RNA palindrome detection, tandem repeat detection, motif finding, DNA assembly, primer design

Summary
- Matching algorithms
- Indexing algorithms
Next week: text communication / compression algorithms.