Suffix Trees

Suffix Trees

Suffix Trees Ben Langmead Department of Computer Science Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briefly how you are using the slides. For original Keynote files, email me ([email protected]). Suffix trie We saw the suffix trie, but m(m+1)/2 actual we also saw its size grows 120000 2m+2 quadratically with the length of the string 100000 · 9 Human genome is 3 10 80000 bases long. 60000 If m = 3 ·109, m2 is way # suffix trie nodes huge, far beyond what we 40000 can store in memory 20000 0 0 100 200 300 400 500 Length prefix over which suffix trie was built Suffix trie: making it smaller T = abaaba$ Suffix trie: making it smaller T = abaaba$ Idea 1: Coalesce non-branching paths into a single edge with a string label $ aba$ Reduces # nodes, edges, guarantees non-leaf nodes have >1 ? child Suffix trie: making it smaller T = abaaba$ a ba $ $ ba $ aba$ $ aba$ aba$ Suffix tree T = abaaba$ | T | = m a ba $ # leaves? m $ ba $ # non-leaf nodes (bound)? ≤ m - 1 aba$ ≤ 2m -1 nodes total — O(m) $ aba$ aba$ No: total length of edge Is total size O(m) now? labels grows with m2 Suffix tree Idea 2: Store T itself in addition to the tree. Convert tree’s edge labels to (offset, length) pairs with respect to T. T = abaaba$ (6, 1) a $ (0, 1) ba (1, 2) $ (6, 1) ba $ (1, 2) (6, 1) aba$ (3, 4) (3, 4) $ aba$ (6, 1) aba$ (3, 4) Suffix tree Idea 2: Store T itself in addition to the tree. Convert tree’s edge labels to (offset, length) pairs with respect to T. T = abaaba$ (6, 1) a $ (0, 1) ba (1, 2) $ (6, 1) ba $ (1, 2) (6, 1) aba$ (3, 4) (3, 4) $ aba$ (6, 1) aba$ (3, 4) Space is now O(m) Suffix trie was O(m2)! Suffix tree: leaves hold offsets T = abaaba$ T = abaaba$ (6, 1) (6, 1) (0, 1) (0, 1) (1, 2) (1, 2) 6 (6, 1) (6, 1) (1, 2) (6, 1) (1, 2) (6, 1) (3, 4) 5 (3, 4) (3, 4) (3, 4) 4 (6, 1) (6, 1) 1 (3, 4) (3, 4) 3 2 0 Suffix tree: leaves hold offsets T = abaaba$ T = abaaba$ (6, 1) (6, 1) (0, 1) (0, 1) (1, 2) (1, 2) 6 (6, 1) (6, 1) (1, 2) (6, 1) (1, 2) (6, 1) (3, 4) 5 (3, 4) (3, 4) (3, 4) 4 (6, 1) (6, 1) 1 (3, 4) (3, 4) 3 2 Label = “aaba$” 0 Suffix tree: leaves hold offsets Offset 2 T = abaaba$ T = abaaba$ (6, 1) (6, 1) (0, 1) (0, 1) (1, 2) (1, 2) 6 (6, 1) (6, 1) (1, 2) (6, 1) (1, 2) (6, 1) (3, 4) 5 (3, 4) (3, 4) (3, 4) 4 (6, 1) (6, 1) 1 (3, 4) (3, 4) 3 2 Label = “aaba$” 0 Suffix tree: labels T = abaaba$ Two notions of depth: (1, 2) • Node depth: # edges from root to node 6 • Label depth: total length of edge labels 5 (3, 4) from root to node 4 1 3 2 0 Node depth = 2 Label depth = 2 + 4 = 6 Suffix tree: building Method 1: build suffix trie, coalesce non- branching paths, relabel edges 2 2 (6, 1) O(m ) time, O(m ) space (0, 1) (1, 2) 6 (6, 1) Method 2: build single-edge tree (1, 2) (6, 1) 5 (3, 4) representing longest suffix, augment to (3, 4) 4 include the 2nd-longest, augment to (6, 1) 1 include 3rd-longest, etc (Gusfield 5.4) (3, 4) 3 2 2 O(m ) time, O(m) space 0 Suffix tree: implementation http://bit.ly/CG_SuffixTree Suffix tree: building Canonical method: Ukkonen’s algorithm Ukkonen, Esko. "On-line construction of suffix trees." Algorithmica 14.3 (1995): 249-260. O(m) time and space! Won’t cover it in class; see Gusfield Ch. 6 for details Canonical algorithm for O(m) time & space suffix tree construction Suffix tree: actual growth ● ● 2m ●●● ●●● 1000 ●●● Built suffix trees for the first ● ●●● actual ●●● ●●● ● ●●● m ●●● ●●● 500 prefixes of the lambda ●●● ●●● ●●● ●●● ●●● phage virus genome ●●● ●●● ●●● ●● ●●● ●●●● 800 ●●● ●●● ●●● ●●●●● ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● Black curve shows # nodes ●●● ●●●●● ●●● ●●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●● ●●● ●●●● increasing with prefix length ●●● ●●●● ●●● ●●● ●●● ●●●● 600 ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●●● ●●● ●●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● Remember suffix trie plot: ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● 400 ● ● ●●● ●●● ●●●●●● # suffix tree nodes ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● m(m+1)/2 ●●● ●●●● ●●●●● ●●● ●●●● ●●●●●● actual ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● 120000 2m+2 ●●● ●●● ●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● 100000 ● ●● ●● 123 K ●●● ●●● ●●●●●● 200 ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● 80000 ●● ●●● ●●●●● nodes ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●●● ●●●●●● ●●●●●●●●●●●●● 60000 ●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●●● # suffix trie nodes ●●●●●●● ●●●●●● 0 ● 40000 20000 0 100 200 300 400 500 0 Length prefix over which suffix tree was built 0 100 200 300 400 500 Length prefix over which suffix trie was built Suffix tree: actual growth ● ● 2m ●●● ●●● 1000 ●●● Built suffix trees for the first ● ●●● actual ●●● ●●● ● ●●● m ●●● ●●● 500 prefixes of the lambda ●●● ●●● ●●● ●●● ●●● phage virus genome ●●● ●●● ●●● ●● ●●● ●●●● 800 ●●● ●●● ●●● ●●●●● ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● Black curve shows # nodes ●●● ●●●●● ●●● ●●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●● ●●● ●●●● increasing with prefix length ●●● ●●●● ●●● ●●● ●●● ●●●● 600 ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●●● ●●● ●●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● Remember suffix trie plot: ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● 400 ● ● ●●● ●●● ●●●●●● # suffix tree nodes ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● m(m+1)/2 ●●● ●●●● ●●●●● ●●● ●●●● ●●●●●● actual ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● 120000 2m+2 ●●● ●●● ●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● 100000 ● ●● ●● 123 K ●●● ●●● ●●●●●● 200 ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● 80000 ●● ●●● ●●●●● nodes ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●●● ●●●●●● ●●●●●●●●●●●●● 60000 ●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●●● # suffix trie nodes ●●●●●●● ●●●●●● 0 ● 40000 20000 0 100 200 300 400 500 0 Length prefix over which suffix tree was built 0 100 200 300 400 500 Length prefix over which suffix trie was built Suffix trie Suffix tree >100K nodes <1K nodes ● ● 2m ●●● ●●● m(m+1)/2 1000 ● ●●● ● actual ●●● ●●● actual ●●● ● ●●● m ●●● 120000 2m+2 ●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●● ●●● ●●●● 800 ●● ●● ●●● ●●●● ●●● ●●●● ●●● ●●●● 100000 ● ● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●● 80000 ●● ●● ●●● ●●●● 600 ● ● ●●● ●●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●●● ●●● ●●● ●●● ●●●● ● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● 60000 ●● ●●● ●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● 400 ● ●● # suffix trie nodes ●● ●●● ●●●●● # suffix tree nodes ● ● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● 40000 ●●● ●●●● ●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● 200 ● ● ● ●●● ●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● 20000 ●● ●● ●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●●●● ●●●●●● ●●● ●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●●● 0 ●●●●●●● ●●●●●● 0 ● 0 100 200 300 400 500 0 100 200 300 400 500 Length prefix over which suffix trie was built Length prefix over which suffix tree was built Suffix tree How do we check whether a string S is a substring of T? a ba $ S = baa Same procedure as for suffix trie, but $ ba $ Yes, it’s a we have to deal with coalesced edges aba$ substring $ aba$ aba$ Suffix tree How do we check whether a string S is a suffix of T? a ba $ Same procedure as for suffix trie, but $ ba $ we have to deal with coalesced edges aba$ $ aba$S = aba Yes, it’s a suffix aba$ Suffix tree How do we check whether a string S is a suffix of T? a ba $ Same procedure as for suffix trie, but $ ba $ we have to deal with coalesced edges aba$ $ aba$ aba$ S = abaaba Yes, it’s a suffix Suffix tree How do we check whether a string S is a suffix of T? a ba $ Same procedure as for suffix trie, but $ ba $ we have to deal with coalesced edges S = ab aba$ No, not a suffix $ aba$ aba$ Suffix tree How do we count the number of times a string S occurs as a substring of T? a ba $ Same procedure as for suffix trie $ ba $ aba$ S = aba Occurs twice $ aba$ aba$ Suffix tree We can also count or find all the matches of P to T. Let k = # matches. E.g., P = ab, T = abaaba$ Step 1: walk down ab path a $ O(n) ba If we “fall off” there are no matches 6 $ baba $ 5 aba$ Step 2: visit all leaf nodes below 4 O(k) $ aba$ Report each leaf offset as match offset 1 aba$ 3 # leaves in subtree is is k, 2 0 # non-leaves is ≤ k-1 abaaba O(n + k) time overall ab ab Suffix tree: some bounds Suffix tree Time: Does P occur? O(n) Time: Count k occurrences of P O(n + k) Time: Report k locations of P O(n + k) Space O(m) m = | T |, n = | P |, k = # occurrences of P in T Suffix tree: long common substrings Dots are maximal unique matches (MUMs), a kind of long substring shared by two sequences Red = match between like strands green = different strands Axes are strains of Helicobacter pylori, bacterium found in stomach & associated with ulcers Suffix tree application: find longest common substring Find longest common substring (LCS) of X and Y, make a new string X#Y$ where #, $ are both terminal symbols. Build a suffix tree for X#Y$. X = xabxa Y = babxba X#Y$ = xabxa#babxba$ a x #babxba$ b $ 5 12 #babxba$ bx $ a ba$ a x 4 11 9 a#babxba$

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    35 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us