Suffix Trees

Suffix Trees

Suffix Trees Outline for Today ● Review from Last Time ● A quick refresher on tries. ● Suffix Tries ● A simple data structure for string searching. ● Suffix Trees ● A compact, powerful, and flexible data structure for string algorithms. ● Generalized Suffix Trees ● An even more flexible data structure. Review from Last Time A D B C B D A E A I O A R D T N T K U G D N A E D T T E I I A O K T Tries ● A trie is a tree that stores a collection of strings over some alphabet Σ. ● Each node corresponds to a prefix of some string in the set. ● Tries are sometimes called “prefix trees.” ● If |Σ| = O(1), all insertions, deletions, and lookups take time O(|w|), where w is the string in question. ● Can also determine whether a string w is a prefix of some string in the trie in time O(|w|) by walking the trie and returning whether we didn't fall off. Aho-Corasick String Matching ● The Aho-Corasick string matching algorithm is an algorithm for finding all occurrences of a set of strings P₁, …, Pₖ inside a string T. ● Runtime is O(m + n + z), where ● m = |T|, ● n = |P₁| + … + |Pₖ| ● z is the number of matches. Aho-Corasick String Matching ● The runtime of Aho-Corasick can be split apart into two pieces: ● O(n) preprocessing time to build the matcher, and ● O(m + z) time to find all matches. ● Useful in the case where the patterns are fixed, but the text might change. Genomics Databases ● Many string algorithms these days pertain to computational genomics. ● Typically, have a huge database with many very large strings. ● More common problem: given a fixed string T to search and changing patterns P₁, …, Pₖ, find all matches in T. ● Question: Can we instead preprocess T to make it easy to search for variable patterns? Suffix Tries Substrings, Prefixes, and Suffixes ● Recall: If x is a substring of w, then x is a suffix of a prefix of w. ● Write w = αxω; then x is a suffix of αx. ● Fact: If x is a substring of w, then x is a prefix of a suffix of w. ● Write w = αxω; then x is a prefix of xω ● This second fact is of use because tries support efficient prefix searching. Suffix Tries e s ● A suffix trie of T is a trie n o of all the suffices of T. n o s n e ● In time O(n), can determine whether P₁, …, s n e s n Pₖ exist in T by searching e s n e s for each one in the trie. e s n e n e s s e e nonsense A Typical Transform $ s n ● Typically, we e o append some new $ n o s n e character $ ∉ Σ to the end of T, then s n e s $ n construct the trie e s $ n e s for T$. $ e s n e ● Leaf nodes correspond to n e s $ suffixes. s $ e ● Internal nodes correspond to e $ prefixes of those suffixes. $ nonsense$ Constructing Suffix Tries ● Once we build a single suffix trie for string T, we can efficiently detect whether patterns match in time O(n). ● Question: How long does it take to construct a suffix trie? ● Problem: There's an Ω(m2) lower bound on the worst-case complexity of any algorithm for building suffix tries. A Degenerate Case $ a b $ a b b … b b … a $ ambm$ b b … b b b … b $ b … b $ ThereThere are are Θ( Θ(mm)) … b $ copiescopies of of nodes nodes chainedchained together together as as b $ m bbm$$.. $ 2 SpaceSpace usage: usage: Ω( Ω(mm2).). Correcting the Problem ● Because suffix tries may have Ω(m2) nodes, all suffix trie algorithms must run in time Ω(m2) in the worst-case. ● Can we reduce the number of nodes in the trie? Patricia Tries $ s n ● A “silly” node in a e o trie is a node that $ n o s n e has exactly one child. s n e s $ n ● A Patricia trie (or e s $ n e s radix trie) is a trie where all “silly” $ e s n e nodes are merged n e s $ with their parents. s $ e e $ $ nonsense$ Patricia Tries ● A “silly” node in a $ e s trie is a node that n o e has exactly one n $ n o s s $ child. e n s n e s ● A Patricia trie (or e s n s e radix trie) is a trie $ e $ n $ n e where all “silly” s $ s e nodes are merged e $ with their parents. $ nonsense$ 012345678 Suffix Trees ● A suffix tree for $ e s a string T is an n o e Patricia trie of T$ n 8 $ n o s s $ where each leaf is e n s n e s labeled with the 7 e s n s 6 e index where the $ e $ n $ n e corresponding s $ 4 s 5 e suffix starts in T$. e $ 3 $ 1 2 0 nonsense$ 012345678 Properties of Suffix Trees ● If |T| = m, the $ e s suffix tree has n o e exactly m + 1 n 8 $ n o s s $ leaf nodes. e n s n e s 7 e s n ● For any T ≠ ε, all s 6 e $ e $ n $ internal nodes in n e s $ the suffix tree 4 s 5 e e $ 3 have at least two $ 1 children. 2 ● Number of nodes 0 in a suffix tree is nonsense$ 012345678 Θ( m). Suffix Tree Representations ● Suffix trees may have Θ(m) nodes, but the labels on the edges can have size ω(1). ● This means that a naïve representation of a suffix tree may take ω(m) space. ● Useful fact: Each edge in a suffix tree is labeled with a consecutive range of characters from w. ● Trick: Represent each edge label α as a pair of integers [start, end] representing where in the string α appears. Suffix Tree Representations $ e s n o e n $ s 8 n o s $ n s n e e n s 7 e s 6 e $ e n o s $ e $ s n e $ start 8 4 0 1 3 n s s $ end 4 5 e 8 4 0 8 4 e $ 3 child $ 1 2 0 nonsense$ 012345678 Building Suffix Trees ● Using this representation, suffix trees can be constructed using space Θ(m). ● Claim: There are Θ(m)-time algorithms for building suffix trees. ● These algorithms are not trivial. We'll discuss one of them next time. An Application: String Matching String Matching ● Given a suffix tree, $ e s can search to see if n o e a pattern P exists in n 8 $ n o s s $ time O(n). e n s n e s ● Gives an O(m + n) 7 e s n s 6 e string-matching $ e $ n $ n e algorithm. s $ s e 4 e 5 3 ● T can be $ $ 1 preprocessed in time O(m) to 2 0 efficiently support binary string nonsense$ matching queries. 012345678 String Matching ● Claim: After $ e s spending O(m) time n o e preprocessing T$, n $ s can find all 8 n o s $ n s e e matches of a string n s 7 e s n P in time O(n + z), s 6 e $ e $ n $ where z is the n e s $ number of matches. 4 s 5 e e $ 3 $ 1 2 0 nonsense$ 012345678 String Matching ● Claim: After $ e s spending O(m) time n o e preprocessing T$, n $ s can find all 8 n o s $ n s e e matches of a string n s 7 e s n P in time O(n + z), s 6 e $ e $ n $ where z is the n e s $ number of matches. 4 s 5 e e $ 3 $ 1 Observation 1: Every Observation 1: Every 2 occurrenceoccurrence of of P P in in T T is is a a 0 prefixprefix of of some some suffix suffix of of T T.. nonsense$ 012345678 String Matching ● Claim: After $ e s spending O(m) time n o e preprocessing T$, n $ s can find all 8 n o s $ n s e e matches of a string n s 7 e s n P in time O(n + z), s 6 e $ e $ n $ where z is the n e s $ number of matches. 4 s 5 e e $ 3 $ 1 Observation 2: Because Observation 2: Because 2 thethe prefix prefix is is the the same same 0 eacheach time time (namely, (namely, P P),), all all thosethose suffixes suffixes will will be be in in nonsense$ thethe same same subtree. subtree. 012345678 String Matching ● Claim: After $ e s spending O(m) time n o e preprocessing T$, n $ s can find all 8 n o s $ n s e e matches of a string n s 7 e s n P in time O(n + z), s 6 e $ e $ n $ where z is the n e s $ number of matches. 4 s 5 e e $ 3 $ 1 2 0 nonsense$ 012345678 String Matching ● Claim: After $ e s spending O(m) time n o e preprocessing T$, n $ s can find all 8 n o s $ n s e e matches of a string n s 7 e s n P in time O(n + z), s 6 e $ e $ n $ where z is the n e s $ number of matches.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    76 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us