Suffix Trees

Ben Langmead
Department of Computer Science

Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briefly how you are using the slides. For original Keynote files, email me ([email protected]).

Suffix trie

We saw the suffix trie, but we also saw its size grows quadratically with the length of the string.
Human genome is 3 · 10^9 bases long. If m = 3 · 10^9, m^2 is way huge, far beyond what we can store in memory.

[Plot: # suffix trie nodes (0 to 120000) vs. length of prefix over which suffix trie was built (0 to 500). Curves: m(m+1)/2, actual, 2m+2.]

Suffix trie: making it smaller

T = abaaba$

Idea 1: Coalesce non-branching paths into a single edge with a string label.
Reduces # nodes, edges; guarantees non-leaf nodes have > 1 child.

Resulting tree (edge labels, indented by depth):

root
  a
    $
    ba
      $
      aba$
    aba$
  ba
    $
    aba$
  $

Suffix tree

T = abaaba$, |T| = m

# leaves? m
# non-leaf nodes (bound)? ≤ m - 1
≤ 2m - 1 nodes total, so O(m) nodes

Is total size O(m) now? No: total length of edge labels grows with m^2.

Idea 2: Store T itself in addition to the tree. Convert tree's edge labels to (offset, length) pairs with respect to T.

T = abaaba$

a = (0, 1), ba = (1, 2), aba$ = (3, 4), $ = (6, 1)

Space is now O(m). Suffix trie was O(m^2)!
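As a small illustration of Idea 2, the sketch below decodes the (offset, length) pairs back into edge-label strings for T = abaaba$. The helper name `decode` is introduced here for illustration only and does not come from the slides.

```python
T = "abaaba$"

# The four distinct (offset, length) edge labels that appear in the tree for T.
edge_labels = [(0, 1), (1, 2), (3, 4), (6, 1)]

def decode(t, offset, length):
    """Recover the string label of an edge from its (offset, length) pair."""
    return t[offset:offset + length]

for offset, length in edge_labels:
    # Each edge costs two integers, however long its string label is.
    print((offset, length), "->", decode(T, offset, length))
```

Since every edge stores two integers and there are fewer than 2m edges, the tree plus T itself fits in O(m) space.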
Suffix tree: leaves hold offsets

T = abaaba$

Each leaf holds the offset in T of the suffix spelled out on the path from the root to that leaf. E.g. the leaf whose path label is "aaba$" holds offset 2.

root
  (0, 1) a
    (6, 1) $          leaf 5
    (1, 2) ba
      (6, 1) $        leaf 3
      (3, 4) aba$     leaf 0
    (3, 4) aba$       leaf 2
  (1, 2) ba
    (6, 1) $          leaf 4
    (3, 4) aba$       leaf 1
  (6, 1) $            leaf 6

Suffix tree: labels

Two notions of depth:
• Node depth: # edges from root to node
• Label depth: total length of edge labels from root to node

E.g. for the leaf holding offset 1, reached by edges (1, 2) and (3, 4): node depth = 2, label depth = 2 + 4 = 6.

Suffix tree: building

Method 1: build suffix trie, coalesce non-branching paths, relabel edges.
O(m^2) time, O(m^2) space

Method 2: build single-edge tree representing longest suffix, augment to include the 2nd-longest, augment to include 3rd-longest, etc. (Gusfield 5.4)
O(m^2) time, O(m) space

Suffix tree: implementation
http://bit.ly/CG_SuffixTree

Canonical method: Ukkonen's algorithm, the canonical algorithm for O(m) time & space suffix tree construction.
Ukkonen, Esko. "On-line construction of suffix trees." Algorithmica 14.3 (1995): 249-260.
Won't cover it in class; see Gusfield Ch. 6 for details.

Suffix tree: actual growth

Built suffix trees for the first 500 prefixes of the lambda phage virus genome.
Black curve shows # nodes increasing with prefix length.

[Plot: # suffix tree nodes (0 to 1000) vs. length of prefix over which suffix tree was built (0 to 500). Curves: 2m, actual, m.]

Remember suffix trie plot: 123 K nodes.
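Method 2 and the node-count bounds above can be sketched as follows. This is a naive O(m^2)-time construction written for illustration. It is not Ukkonen's algorithm and not the implementation linked above. The names `Node`, `build_suffix_tree`, `leaf_depths` and `count_nodes` are assumptions introduced here, and a random DNA string stands in for the lambda phage genome.

```python
import random

class Node:
    def __init__(self, off, length, leaf_offset=None):
        self.off = off                  # edge label entering this node is t[off:off+length]
        self.length = length
        self.children = {}              # first character of a child's edge label -> child
        self.leaf_offset = leaf_offset  # offset of the suffix ending here (leaves only)

def build_suffix_tree(t):
    """Insert suffixes longest first. t must end in a unique terminal symbol."""
    root = Node(0, 0)
    for i in range(len(t)):
        cur, j = root, i                # j = next unmatched position of suffix t[i:]
        while True:
            child = cur.children.get(t[j])
            if child is None:           # fell off at a node: add a leaf edge
                cur.children[t[j]] = Node(j, len(t) - j, i)
                break
            matched = 0                 # characters matched along child's edge
            while matched < child.length and t[child.off + matched] == t[j + matched]:
                matched += 1
            if matched == child.length: # whole edge matched: keep descending
                cur, j = child, j + matched
                continue
            # Fell off mid-edge: split the edge and hang a new leaf at the split.
            mid = Node(child.off, matched)
            cur.children[t[j]] = mid
            child.off, child.length = child.off + matched, child.length - matched
            mid.children[t[child.off]] = child
            mid.children[t[j + matched]] = Node(j + matched, len(t) - j - matched, i)
            break
    return root

def leaf_depths(root):
    """Map each leaf offset to (node depth, label depth)."""
    out, stack = {}, [(root, 0, 0)]
    while stack:
        node, node_depth, label_depth = stack.pop()
        if node.leaf_offset is not None:
            out[node.leaf_offset] = (node_depth, label_depth)
        for child in node.children.values():
            stack.append((child, node_depth + 1, label_depth + child.length))
    return out

def count_nodes(root):
    """Return (# leaves, # non-leaf nodes including the root)."""
    leaves = non_leaves = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            non_leaves += 1
        else:
            leaves += 1
        stack.extend(node.children.values())
    return leaves, non_leaves

print(leaf_depths(build_suffix_tree("abaaba$"))[1])   # (2, 6)

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(500))  # stand-in data (assumption)
for prefix_len in (100, 300, 500):
    t = genome[:prefix_len] + "$"
    leaves, non_leaves = count_nodes(build_suffix_tree(t))
    print(len(t), leaves, non_leaves)   # expect leaves == |T| and non_leaves <= |T| - 1
```

Splitting an edge only rewrites two integers on the existing child, which is why the (offset, length) representation keeps the space at O(m) even though this build takes quadratic time.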
Suffix trie vs. suffix tree

Suffix trie: >100K nodes
Suffix tree: <1K nodes

[Plots, side by side: # suffix trie nodes (curves m(m+1)/2, actual, 2m+2) and # suffix tree nodes (curves 2m, actual, m), each vs. length of prefix over which the structure was built (0 to 500).]

Suffix tree

How do we check whether a string S is a substring of T?
Same procedure as for suffix trie, but we have to deal with coalesced edges.
S = baa: Yes, it's a substring

How do we check whether a string S is a suffix of T?
Same procedure as for suffix trie, but we have to deal with coalesced edges.
S = aba: Yes, it's a suffix
S = abaaba: Yes, it's a suffix
S = ab: No, not a suffix

How do we count the number of times a string S occurs as a substring of T?
Same procedure as for suffix trie.
S = aba: Occurs twice

We can also count or find all the matches of P to T. Let k = # matches.
E.g., P = ab, T = abaaba$

Step 1: walk down ab path. O(n)
If we "fall off" there are no matches.

Step 2: visit all leaf nodes below. O(k)
Report each leaf offset as match offset.
# leaves in subtree is k, # non-leaves is ≤ k - 1

O(n + k) time overall. Here ab matches abaaba at offsets 0 and 3.

Suffix tree: some bounds

                                     Suffix tree
Time: Does P occur?                  O(n)
Time: Count k occurrences of P       O(n + k)
Time: Report k locations of P        O(n + k)
Space                                O(m)

m = |T|, n = |P|, k = # occurrences of P in T

Suffix tree: long common substrings

[Figure: dot plot comparing two genomes.]
Dots are maximal unique matches (MUMs), a kind of long substring shared by two sequences.
Red = match between like strands, green = different strands.
Axes are strains of Helicobacter pylori, bacterium found in stomach & associated with ulcers.

Suffix tree application: find longest common substring

To find the longest common substring (LCS) of X and Y, make a new string X#Y$ where # and $ are both terminal symbols. Build a suffix tree for X#Y$.

X = xabxa
Y = babxba
X#Y$ = xabxa#babxba$

[Figure: suffix tree for X#Y$ = xabxa#babxba$, with leaves labeled by suffix offset.]
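The queries and the longest-common-substring application above can be sketched as follows. The tree is built with a naive O(m^2)-time insertion, not Ukkonen's algorithm, and all names (`Node`, `build`, `walk`, `occurrences`, `is_suffix`, `longest_common_substring`) are introduced here for illustration. The slides do not spell out how the tree for X#Y$ is searched. This sketch assumes one common approach, which takes the internal node of greatest label depth that has leaves from both X and Y below it.

```python
class Node:
    def __init__(self, off, length, leaf_offset=None):
        self.off, self.length = off, length  # edge label is t[off:off+length]
        self.children = {}                   # first label character -> child
        self.leaf_offset = leaf_offset       # suffix offset (leaves only)

def build(t):
    """Naive quadratic-time build; terminal symbols in t must be unique."""
    root = Node(0, 0)
    for i in range(len(t)):
        cur, j = root, i
        while True:
            child = cur.children.get(t[j])
            if child is None:                # fell off at a node: new leaf
                cur.children[t[j]] = Node(j, len(t) - j, i)
                break
            matched = 0
            while matched < child.length and t[child.off + matched] == t[j + matched]:
                matched += 1
            if matched == child.length:      # whole edge matched: descend
                cur, j = child, j + matched
                continue
            mid = Node(child.off, matched)   # fell off mid-edge: split it
            cur.children[t[j]] = mid
            child.off, child.length = child.off + matched, child.length - matched
            mid.children[t[child.off]] = child
            mid.children[t[j + matched]] = Node(j + matched, len(t) - j - matched, i)
            break
    return root

def walk(root, t, p):
    """Step 1: follow p from the root. Returns the node at or just below the
    end of p, or None if we fall off."""
    cur, i = root, 0
    while i < len(p):
        child = cur.children.get(p[i])
        if child is None:
            return None
        n = min(child.length, len(p) - i)    # coalesced edge: compare a whole run
        if t[child.off:child.off + n] != p[i:i + n]:
            return None
        cur, i = child, i + n
    return cur

def occurrences(root, t, p):
    """Step 2: report the offsets stored at the leaves below the end of p."""
    top = walk(root, t, p)
    out, stack = [], [top] if top is not None else []
    while stack:
        node = stack.pop()
        if node.leaf_offset is not None:
            out.append(node.leaf_offset)
        stack.extend(node.children.values())
    return sorted(out)

def is_suffix(root, t, p):
    """p is a suffix of t exactly when p followed by the terminal symbol occurs."""
    return walk(root, t, p + t[-1]) is not None

def longest_common_substring(x, y):
    t = x + "#" + y + "$"                    # assumes '#' and '$' occur in neither string
    best = (0, 0)                            # (label depth, offset in t)
    def visit(node, depth):
        nonlocal best
        if node.leaf_offset is not None:     # leaf: does its suffix start in x or in y?
            return node.leaf_offset < len(x), node.leaf_offset > len(x), node.leaf_offset
        has_x = has_y = False
        for child in node.children.values():
            cx, cy, off = visit(child, depth + child.length)
            has_x, has_y = has_x or cx, has_y or cy
        if has_x and has_y and depth > best[0]:
            best = (depth, off)
        return has_x, has_y, off
    visit(build(t), 0)
    return t[best[1]:best[1] + best[0]]

T = "abaaba$"
tree = build(T)
print(walk(tree, T, "baa") is not None)                          # True
print(occurrences(tree, T, "ab"))                                # [0, 3]
print([is_suffix(tree, T, s) for s in ("aba", "abaaba", "ab")])  # [True, True, False]
print(longest_common_substring("xabxa", "babxba"))               # abx
```

`walk` compares at most n characters in total, and `occurrences` touches k leaves and at most k - 1 non-leaves, matching the O(n + k) bound. The recursive `visit` is fine for short strings like these. Long inputs would need an explicit stack.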
