Suffix Trees

Suffix Trees

Suffix trees Ben Langmead You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me ([email protected]) and tell me brie!y how you’re using them. For original Keynote "les, email me. Suffix trie: making it smaller T = abaaba$ Idea 1: Coalesce non-branching paths into a single edge with a string label $ aba$ Reduces # nodes, edges, guarantees internal nodes have >1 child Suffix tree T = abaaba$ a ba $ With respect to m: How many leaves? m $ ba $ How many non-leaf nodes? ≤ m - 1 aba$ ≤ 2m -1 nodes total, or O(m) nodes $ aba$ aba$ No: total length of edge Is the total size O(m) now? labels is quadratic in m Suffix tree T = abaaba$ Idea 2: Store T itself in addition to the tree. Convert tree’s edge labels to (offset, length) pairs with respect to T. T = abaaba$ (6, 1) a $ (0, 1) ba (1, 2) $ (6, 1) ba $ (1, 2) (6, 1) aba$ (3, 4) (3, 4) $ aba$ (6, 1) aba$ (3, 4) Space required for suffix tree is now O(m) Suffix tree: leaves hold offsets T = abaaba$ T = abaaba$ (6, 1) (6, 1) (0, 1) (0, 1) (1, 2) (1, 2) 6 (6, 1) (6, 1) (1, 2) (6, 1) (1, 2) (6, 1) (3, 4) 5 (3, 4) (3, 4) (3, 4) 4 (6, 1) (6, 1) 1 (3, 4) (3, 4) 3 2 0 Suffix tree: labels Again, each node’s label equals the T = abaaba$ concatenated edge labels from the root to (6, 1) the node. These aren’t stored explicitly. (0, 1) (1, 2) 6 (6, 1) (1, 2) (6, 1) Label = “ba” 5 (3, 4) (3, 4) 4 (6, 1) 1 (3, 4) 3 2 Label = “aaba$” 0 Suffix tree: labels Because edges can have string labels, we T = abaaba$ must distinguish two notions of “depth” (6, 1) • Node depth: how many edges we must (0, 1) (1, 2) 6 follow from the root to reach the node (6, 1) (1, 2) (6, 1) • Label depth: total length of edge labels 5 (3, 4) for edges on path from root to node (3, 4) 4 (6, 1) 1 (3, 4) 3 2 0 Suffix tree: space caveat Minor point: T = abaaba$ We say the space taken by the edge labels is (6, 1) (0, 1) O(m), because we keep 2 integers per edge and (1, 2) 6 there are O(m) edges (6, 1) (1, 2) (6, 1) To store one such integer, we need enough bits 5 (3, 4) to distinguish m positions in T, i.e. ceil(log2 m) (3, 4) 4 (6, 1) bits. We usually ignore this factor, since 64 bits is 1 plenty for all practical purposes. (3, 4) 3 2 Similar argument for the pointers / references 0 used to distinguish tree nodes. Suffix tree: building Naive method 1: build a suffix trie, then coalesce non-branching paths and relabel (6, 1) edges (0, 1) (1, 2) 6 Naive method 2: build a single-edge tree (6, 1) representing only the longest suffix, then (1, 2) (6, 1) augment to include the 2nd-longest, then 5 (3, 4) (3, 4) 4 augment to include 3rd-longest, etc (6, 1) 1 (3, 4) 3 2 Both are O(m ) time, but "rst uses 2 O(m2) space while second uses O(m) 0 Naive method 2 is described in Gus"eld 5.4 Suffix tree: implementation class SuffixTree(object): class Node(object): def __init__(self, lab): 2 self.lab = lab # label on path leading to this node O(m ) time, O(m) space self.out = {} # outgoing edges; maps characters to nodes def __init__(self, s): """ Make suffix tree, without suffix links, from s in quadratic time and linear space """ s += '$' self.root = self.Node(None) self.root.out[s[0]] = self.Node(s) # trie for just longest suf Make 2-node tree for longest suffix # add the rest of the suffixes, from longest to shortest for i in xrange(1, len(s)): # start at root; we’ll walk down as far as we can go cur = self.root j = i while j < len(s): Add rest of suffixes from if s[j] in cur.out: child = cur.out[s[j]] long to short, adding 1 lab = child.lab # Walk along edge until we exhaust edge label or or 2 nodes for each # until we mismatch k = j+1 while k-j < len(lab) and s[k] == lab[k-j]: k += 1 if k-j == len(lab): cur = child # we exhausted the edge Most complex case: j = k ... ... else: # we fell off in middle of edge cExist, cNew = lab[k-j], s[k] # create “mid”: new node bisecting edge mid = self.Node(lab[:k-j]) mid.out[cNew] = self.Node(s[k:]) u # original child becomes mid’s child mid.out[cExist] = child uv # original child’s label is curtailed w$ child.lab = lab[k-j:] v # mid becomes new child of original parent cur.out[s[j]] = mid else: ... ... # Fell off tree at a node: make new edge hanging off it cur.out[s[j]] = self.Node(s[j:]) Suffix tree: implementation (still in class SuffixTree) def followPath(self, s): """ Follow path given by s. If we fall off tree, return None. If we finish mid-edge, return (node, offset) where 'node' is child and 'offset' is label offset. If we finish on a node, return (node, None). """ cur = self.root i = 0 while i < len(s): c = s[i] followPath if c not in cur.out: : Given a string, walk down return (None, None) # fell off at a node child = cur.out[s[i]] corresponding path. Return a special lab = child.lab j = i+1 value if we fall off, or a description of while j-i < len(lab) and j < len(s) and s[j] == lab[j-i]: j += 1 where we end up otherwise. if j-i == len(lab): cur = child # exhausted edge i = j elif j == len(s): return (child, j-i) # exhausted query string in middle of edge else: return (None, None) # fell off in the middle of the edge return (cur, None) # exhausted query string at internal node def hasSubstring(self, s): """ Return true iff s appears as a substring """ Has substring? Return true iff node, off = self.followPath(s) return node is not None followPath didn’t fall off. def hasSuffix(self, s): """ Return true iff s is a suffix """ node, off = self.followPath(s) if node is None: Has suffix? Return true iff return False # fell off the tree if off is None: followPath didn’t fall off and we # finished on top of a node return '$' in node.out ended just above a “$”. else: # finished at offset 'off' within an edge leading to 'node' return node.lab[off] == '$' Suffix tree: implementation Python example here: http://nbviewer.ipython.org/6665861 Suffix tree: actual growth ● ● 2m ●●● ●●● 1000 ●●● Built suffix trees for the #rst ● ●●● actual ●●● ●●● ● ●●● m ●●● ●●● 500 pre#xes of the lambda ●●● ●●● ●●● ●●● ●●● phage virus genome ●●● ●●● ●●● ●● ●●● ●●●● 800 ●●● ●●● ●●● ●●●●● ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● Black curve shows # nodes ●●● ●●●●● ●●● ●●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●● increasing with pre#x length ●●● ●●●● ●●● ●●●● ●●● ●●● ●●● ●●●● 600 ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●●● ●●● ●●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● Compare with suffix trie: ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● 400 ● ● ●●● ●●● ●●●●●● # suffix tree nodes ● ● ●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●● ●●●●●● ● ●● ●● ●●● ●●● m^2 ●● ●● ●●● ●●● ●● ●● ●● ●●● ● ●● ●● ●●● ●●● 250000 ● ● ● actual ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ● ●● ●● ●●● ●●● m ●● ●● ●● ●●● ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ● ● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● 200000 ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●●● ●●●● ●● ●● ● ●●● ●● ●● ●● ●●● 200 ● ● ● ●● ●● ●● ●●● ●● ●● ●● ●●●● ●● ●● ●● ●●● ●● ●● ● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● 150000 ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●● ●●● ●● ●●● ●● ●● ●●● ●● ●●● ● ●● ●●● ●● ●●● ●● ●●● ●●● ●● ●●● ●● ●● ●●● ●● ●●● ●● ●● ●●● ●●● ●●● ● ●● ●●● ●● ●●● ●●●●●●●● ●●● ●●● ●●●●●●● ●● ●●●● ●●●●●●●● ●●● ●●● ●●●●● # suffix trie nodes ●● ●●● ●●●●●●● ●● ●●●● ●●●●●●●● 100000 ●●● ●●● ●●●●●● ●●● ●●●● ●●●●● ●●● ●●●● ●●●● ●●● ●●●● 0 ● ●●● ●●●● ●●● ●●●● 123 K ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●●● ●●● ●●●●● ●●● ●●●●● ●●●● ●●●●● ●●●● ●●●●● 50000 ●● ●●●● ●●●● ●●●●● ●●●● ●●●●●● nodes ●●● ●●●● ●●●● ●●●●●● ●●●●● ●●●●●● ●●●●● ●●●●●●● 0 100 200 300 400 500 ●●●●● ●●●●●●● ●●●●● ●●●●●●●● ●●●●●● ●●●●●●●● ●●●●●●● ●●●●●●●●● ●●●●●●● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● Length prefix over which suffix tree was built 0 100 200 300 400 500 Length prefix over which suffix trie was built Suffix tree: building Method of choice: Ukkonen’s algorithm Ukkonen, Esko. "On-line construction of suffix trees." Algorithmica 14.3 (1995): 249-260. O(m) time and space Has online property: if T arrives one character at a time, algorithm efficiently updates suffix tree upon each arrival We won’t cover it here; see Gus"eld Ch. 6 for details Suffix tree How do we check whether a string S is a substring of T? a ba $ Essentially same procedure as for 6 suffix trie, except we have to deal with S = baa coalesced edges aba$ ba $ aba$ $ Yes, it’s a 2 5 1 substring4 aba$ $ 0 3 Suffix tree How do we check whether a string S is a suffix of T? a ba $ Essentially same procedure as for 6 suffix trie, except we have to deal with coalesced edges aba$ ba $ aba$ $ S = aba 2 Yes,5 it’s a su1 ffix 4 aba$ $ 0 3 Suffix tree How do we count the number of times a string S occurs as a substring of T? a ba $ 6 Same procedure as for suffix trie aba$ ba $ aba$ $ 2 S = 5aba 1 4 Occurs twice aba$ $ 0 3 Suffix tree: applications With suffix tree of T, we can "nd all matches of P to T. Let k = # matches. E.g., P = ab, T = abaaba$ Step 1: walk down ab path a $ O(n) ba If we “fall off” there are no matches 6 $ baba $ 5 aba$ Step 2: visit all leaf nodes below 4 O(k) $ aba$ Report each leaf offset as match offset 1 aba$ 3 2 O(n + k) time 0 abaaba ab ab Suffix tree application: "nd long common substrings Dots are maximal unique matches (MUMs), a kind of long substring shared by two sequences Red = match was between like strands, green = different strands Axes show different strains of Helicobacter pylori, a bacterium found in the stomach and associated with gastric ulcers Suffix tree application: #nd longest common substring To "nd the longest common substring (LCS) of X and Y, make a new string X # Y $ where # and $ are both terminal symbols.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    26 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us