Advanced Algorithms / T

Advanced Algorithms / T. Shibuya Advanced Algorithms: Text Algorithms Tetsuo Shibuya Human Genome Center, Institute of Medical Science (Adjunct at Department of Computer Science) University of Tokyo http://www.hgc.jp/~tshibuya Self Introduction Advanced Algorithms / T. Shibuya Affiliation: Laboratory of Sequence Analysis, Human Genome Center, Institute of Medical Science Adjunct at Department of Computer Science Research Interest Algorithms for bioinformatics/combinatorial pattern matching/big data Our lab is located at the 4th floor The topics of this part Advanced Algorithms / T. Shibuya Text matching/indexing algorithms (today’s topic) Knuth-Morris-Pratt / Boyer-Moore / suffix arrays / etc Text communication algorithms Hamming coding Text compression algorithms Huffman coding / arithmetic coding / block sorting / etc Text models Markov models / etc An assignment of this part will be given on the last (i.e., the 3rd) week - Submit 1 for Prof. Imai’s part (compulsory) AND - Submit 1 for 1 of the remaining 3 parts, i.e., - my part, Hirahara-san’s part, or May Szedak-san’s part Textbooks Advanced Algorithms / T. Shibuya D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge Press, 1997. The most famous book on text processing algorithms, but many parts are out of date. W. Sung, Algorithms in Bioinformatics, CRC Press, 2009. Good introduction for bioinformatics algorithms (mainly on text processing) D. Salomon, G. Motta, Handbook of Data Compression, Springer, 2010. T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991. Today's topic Advanced Algorithms / T. Shibuya Text matching algorithms Brute-force algorithm Rabin-Karp algorithm Knuth-Morris-Pratt algorithm Boyer-Moore algorithm Matching automaton Text Indexing algorithms Suffix arrays FM-index Text matching Advanced Algorithms / T. Shibuya Problem Given Text string T and a pattern (query) P Output Substrings of T that are exactly same as P, if any. exact matching： no insertion / deletion / modification(mutation) Two approaches: Matching and Indexing Preprocess only the query pattern (matching) Preprocess the text beforehand (indexing) - needs extra data structures Text GGTGAGAAGTTATGATACAGGGTAGTTG TGTCCTTAAGGTGTATAACGATGACATC ACAGGCAGCTCTAATCTCTTGCTATGAG TGATGTAAGATTTATAAGTACGCAAATT Pattern (Query) TATAA Two types of text matching algorithms Advanced Algorithms / T. Shibuya Brute-force Naive algorithm Fingerprinting (Hash-based) algorithm Rabin-Karp Skipping positions unnecessary to compare Check from left Knuth-Morris-Pratt Aho-Corasick (for multiple queries) Check from right Boyer-Moore Naive algorithm Advanced Algorithms / T. Shibuya Just check one by one at each position O(nm) in the worst case, but... Linear time in average! Not so bad for cases when you have no time to implement:-) But still it's much slower than other sophisticated algorithms in practice. Text GGGACCAAGTTCCGCACATGCCGGATAGAAT c Pattern c CCGTATG c Average length to check c 2 Check one by one 1+1/4+(1/4) +... = 4/3 (constant!) CCg (for random DNA sequence) .... CCGt .... Rabin-Karp (1) Advanced Algorithms / T. Shibuya Based on fingerprinting (i.e., hashing) Check only if the position has the same fingerprint All the text fingerprints can be obtained in linear time if we use an appropriate fingerprint e.g., hash(x[0..n-1]) = (x[0]dn-1 + x[1]dn-2 + x[2]dn-3 + … + x[n-1]) mod q q : some prime number Text hash(T[0..|P|-1]) O(1) computation hash(T[1..|P|]) compare with hash(p) at first for each hash(T[2..|P|+1]) Pattern p → hash(p) Rabin-Karp (2) Advanced Algorithms / T. Shibuya Text 11001101110100101...(16+8+1) mod 5 = 0 O(1) ((0-1·16)·2+1) mod 5 = 4 O(1) ((4-1·16)·2+0) mod 5 = 1 O(1) ((1-0·16)·2+1) mod 5 = 3 check → NO O(1) ((3-0·16)·2+1) mod 5 = 2 O(1) ((2-1·16)·2+1) mod 5 = 3 check → YES! Pattern 10111 (16+4+2+1) mod 5 = 3 Knuth-Morris-Pratt(1) Advanced Algorithms / T. Shibuya Another way of improvement of the brute-force algorithm The brute-force algorithm sometimes checks the same position more than once, which could be a waste of time → Knuth-Morris-Pratt Algorithm Text AATACTAGTAGGCATGCCGGAT t t Pattern TAg skip TAGTAGC t t Check from left TAGTAGc t skip We already know the text is "TAGTAG" and t cannot match with the pattern in these TAGt positions before comparison ... Knuth-Morris-Pratt (2) Advanced Algorithms / T. Shibuya P[0..i] matches the text but P[i+1] does not, then FailureLink[i+1]= max j s.t. P[0..j]≡P[i j..i], P[j+1]≠P[i+1], and j <i if no such j exists, let FL[i]= 2 if P[i+1]=P[0], otherwise let FL[i]= 1. FailureLink[i] can be computed before searching− the text! − − We can skip i+1 FailureLink[i+1] characters − Longest match with the prefix Failed matching HERE Should be different（←Knuth） Falure Link Skip! You don't have to check these positions again! Knuth-Morris-Pratt (3) Advanced Algorithms / T. Shibuya Text CTACTGATCTGATCGCTAGATGC CTGATCTGC Skip 1 position CTGATCTGC Failed at the first position, so just proceed Pattern CTGATCTGC Overlap of "CTG" CTGATCTGC No overlap CTGATCGC MP skips only 4 positions KMP skips 5 positions Knuth-Morris-Pratt (4) Advanced Algorithms / T. Shibuya Preprocessing A naive algorithm requires O(m2) or even O(m3) time Linear time algorithm exists Use the KMP itself Z algorithm [Gusfield 97] Not faster than the KMP, but easier to understand Z Algorithm (1) Advanced Algorithms / T. Shibuya Zi Compute it for all i (i >0) Longest common prefix length of S[0..n-1] and S[i..n-1] Failure links can be easily obtained from Zi values righti Max value of x+Zx-1 (x≤i ) lefti x that takes the maximum value of x+Zx-1 (x≤i ) Initialization Z Z box right0=left0=0 i lefti righti 0 Zi i Zleft_i Z Algorithm (2) Advanced Algorithms / T. Shibuya Computation of Zi +1 In case i +1≤righti We have already computed until the position righti In case Zi < righti -i , we can copy the answer in O(1) Otherwise compare naively after the position righti ― ① In case i +1>righti Compare naively ― ② ①+② can be done in linear time in total! Zi+1=Zi'+1 Zi+1 0 righti-lefti lefti righti i' i'+1 i i+1 Zleft_i Zleft_i Z Algorithm (3) Advanced Algorithms / T. Shibuya Example Let's compute Zi for this position! We have done to this position Text ATGCGCATAATGCGCTGAATGGCCATAATCTGAA Zi 0000002016000000013000002012000011 The same text left right Just copy the numbers if the numbers are smaller than 3 Computational Time Complexity of KMP Advanced Algorithms / T. Shibuya O(m+n) n: text length, m: pattern length Worst-case time complexity #comparison < 2n If comparison succeeds, it will be never compared again. » i.e., at most n times If comparison fails, i always increases » i.e., at most n-1 times But this algorithm requires access to all the positions in the text Can we reduce it? Boyer-Moore Algorithm Boyer-Moore (1) Advanced Algorithms / T. Shibuya Idea Almost the same as KMP, but check from right! Practically faster than KMP Better average-case time complexity butWorse worst-case time complexity Text AATTGTTCCGGCCATGCCGGAT ......T Pattern .....TT GTTCGTT ....GTT failed ...cGTT Skip based on the information of "GTT" gtt...t failed failed Skip based on the ....g.t information of "G" Boyer-Moore (2) Advanced Algorithms / T. Shibuya Two rules Bad character rule If the character at the failed position is x, we can move the last x in the pattern to the position The algorithm that uses only (a variation of) this rule is called Horspool Algorithm (Strong) Good suffix rule Strong: the character before the same substring must be different This constraint was not used in the original BM algorithm cf. Knuth's rule in KMP Do the larger shift of the above two Success Success Failed Failed Different = strong Boyer-Moore (3) Advanced Algorithms / T. Shibuya Bad character rule example Pattern TTCCAAGTCGCC Do not consider the last character Failed Text CCCTGTCCATGCCGTCAGCCC TTCCAAGTCGCC TTCCAAGLastT TCGCC Boyer-Moore (4) Advanced Algorithms / T. Shibuya (Strong) Good suffix rule example Pattern CGTATATCCAATATC Failed Text AGTCCCTCGGTCCGATATCGACCCTCCCG CGTATATCCAATATC CGTATATCCAATATC Boyer-Moore (5) Advanced Algorithms / T. Shibuya Preprocess Bad character rule Very easy Good suffix rule Linear time by using the Z algorithm from backward Boyer-Moore (6) Advanced Algorithms / T. Shibuya Computational time complexity Average-case O(n/min (m, alphabet size)) i.e., average-case skip length is O(min(m, alphabet size)) Horspool algorithm has the same time complexity Worst-case O(nm) Bad for cases: Many repeats » KMP is faster Small alphabet size » Shift-Or is faster Linear time for finding only 1 occurrence Good for grep in editors KMP and an automaton Advanced Algorithms / T. Shibuya The KMP can be represented by an automaton A T A T T G Failure link Aho-Corasick (1) Advanced Algorithms / T. Shibuya The automaton can be extended for multiple queries! Linear time construction! Linear time searching! T C T Failure Link T A G C T G C T T Link to the root if not specified Aho-Corasick (2) Advanced Algorithms / T. Shibuya Construction of the keyword tree O(M) time M: Sum of query string lengths Alphabet size: fixed Can be used for dictionary searching T C T T A G C T G C T T Aho-Corasick (3) Advanced Algorithms / T. Shibuya Breadth-first searching Start from the root v a No failure link at the root FailureLink(v) b Traverse FailureLinks of v's parent to find a node that have a a child w with the same label, w and let (the nearest) w be c FailureLink(v) If no such node exists, let FailureLink(v) = root a b Aho-Corasick (4) Advanced Algorithms / T.

Load more