Suffix Trees
Total Page:16
File Type:pdf, Size:1020Kb
CSE 549: Suffix Tries & Suffix Trees All slides in this lecture not marked with “*” of Ben Langmead. KMP is great, but |T| = m |P| = n (note: m,n are opposite from previous lecture) Without pre- Given pre- Without pre- Given pre- processing processing processing processing (KMP) (KMP) (ST) (ST) Find an O(m+n) O(m) O(m+n) O(n) occurrence of P Find all occurrences of O(m+n) O(m) O(m + n + k) O(n+k) P Find an occurrence of O(l(m+n)) O(l(m)) O( m + ln) O(ln) P1,…,Pl If the text is constant over many patterns, pre-processing the text rather than the pattern is better (and allows other efficient queries). * Tries A trie (pronounced “try”) is a rootedtree representing tree representing a collection a collection of strings of with strings with one node per common prefx Smallest tree such that: Each edge is labeled with a character c ∈ Σ A node has at most one outgoing edge labeled c, for c ∈ Σ Each key is “spelled out” along some path starting at the root Natural way to represent either a set or a map where keys are strings This structure is also known as a Σ-tree Tries: example Represent this map with a trie: i Key Value instant 1 n s t internal 2 internet 3 t e a r The smallest tree such that: n n Each edge is labeled with a character c ∈ Σ t a e A node has at most one outgoing edge 1 labeled c, for c ∈ Σ l t Each key is “spelled out” along some path 2 3 starting at the root Tries: example i Checking for presence of a key P, where n = | P |, is ???O(n ) time n t s If total length of all keys is N, trie has O ???(N ) nodes t e a r What about | Σ | ? n n Depends how we represent outgoing t a e edges. If we don’t assume | Σ | is a 1 small constant, it shows up in one or l t both bounds. 2 3 Tries: another example We can index T with a trie. The trie maps substrings to ofsets where they occur c 4 g 8 t ac 4 a ag 8 14 at 14 c c cc 12 t 12, 2 cc 2 root: g ct 6 6 gt 18 t t gt 0 18, 0 ta 10 a tt 16 10 t 16 Indexing with sufxes SomeUntil indices now, (e.g. our indexes the inverted have beenindex) based are based on extracting on extracting substrings substrings from T from T A very diferent approach is to extract sufxes from T. This will lead us to some interesting and practical index data structures: 6 $ $ B A N A N A 5 A$ A $ B A N A N 3 ANA$ A N A $ B A N 1 ANANA$ A N A N A $ B 0 BANANA$ B A N A N A $ 4 NA$ N A $ B A N A 2 NANA$ N A N A $ B A Sufx Trie Sufx Tree Sufx Array FM Index Trie Definitions A Σ-tree (trie) is a rooted tree where each edge is labeled with a single character c ϵ Σ, such that no node has two outgoing edges labeled with the same character. • for a node v in T, depth(v) or node-depth(v) is the distance from v to the root. • node-depth(r) = 0 • string(v) = concatenation of all characters on the path r ⤳ v • string-depth(v) = |string(v)| (note: string-depth(v) ≥ node-depth(v)) • for a string x, if ∃ node v with string(v) = x, we say node(x) = v • T displays string x if ∃ node v and string y such that xy = string(v) • words(T) = { x | T displays x} • A suffix trie of string s is a Σ-tree such that words(T) = {s’ | s’ is a substring of s} • An internal/leaf edge leads to an internal/leaf node Defs. from: http://profs.sci.univr.it/~liptak/ALBioinfo/files/sequence_analysis.pdf * Sufx trie Build a trie containing all sufxes of a text T T: GTTATAGCTGATCGCGGCGTAGCGG$ GTTATAGCTGATCGCGGCGTAGCGG$ TTATAGCTGATCGCGGCGTAGCGG$ TATAGCTGATCGCGGCGTAGCGG$ ATAGCTGATCGCGGCGTAGCGG$ TAGCTGATCGCGGCGTAGCGG$ AGCTGATCGCGGCGTAGCGG$ GCTGATCGCGGCGTAGCGG$ CTGATCGCGGCGTAGCGG$ TGATCGCGGCGTAGCGG$ GATCGCGGCGTAGCGG$ ATCGCGGCGTAGCGG$m(m+1)/2 TCGCGGCGTAGCGG$ CGCGGCGTAGCGG$chars GCGGCGTAGCGG$ CGGCGTAGCGG$ GGCGTAGCGG$ GCGTAGCGG$ CGTAGCGG$ GTAGCGG$ TAGCGG$ AGCGG$ GCGG$ CGG$ GG$ G$ $ Sufx trie First add special terminal character $ to the end of T $ is a character that does not appear elsewhere in T, and we defne it to be less than other characters (for DNA: $ < A < C < G < T) $ enforces a rule we’re all used to using: e.g. “as” comes before “ash” in the dictionary. $ also guarantees no sufx is a prefx of any other sufx. T: GTTATAGCTGATCGCGGCGTAGCGG$ GTTATAGCTGATCGCGGCGTAGCGG$ TTATAGCTGATCGCGGCGTAGCGG$ TATAGCTGATCGCGGCGTAGCGG$ ATAGCTGATCGCGGCGTAGCGG$ TAGCTGATCGCGGCGTAGCGG$ AGCTGATCGCGGCGTAGCGG$ GCTGATCGCGGCGTAGCGG$ CTGATCGCGGCGTAGCGG$ TGATCGCGGCGTAGCGG$ GATCGCGGCGTAGCGG$ ATCGCGGCGTAGCGG$ TCGCGGCGTAGCGG$ CGCGGCGTAGCGG$ GCGGCGTAGCGG$ CGGCGTAGCGG$ GGCGTAGCGG$ GCGTAGCGG$ CGTAGCGG$ GTAGCGG$ TAGCGG$ AGCGG$ GCGG$ CGG$ GG$ G$ $ Sufx trie a b $ Shortest T: abaaba T$: abaaba$ (non-empty) sufx a b $ a Each path from root to leaf represents a sufx; each sufx is represented by some b a a $ path from root to leaf a a $ b Would this still be the case if we hadn’t added $? $ b a a $ $ Longest sufx Sufx trie a b T: abaaba Each path from root to leaf represents a a b a sufx; each sufx is represented by some path from root to leaf b a a Would this still be the case if we hadn’t added $? No a a b b a a Sufx trie a b $ We can think of nodes as having labels, where the label spells out characters on the a b $ a path from the root to the node b a a $ a a $ b baa $ b a a $ $ Sufx trie a b $ How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T. a a $ b S = baa Yes, it’s a substring Start at the root and follow the edges labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie a b $ How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T. a a $ b S = baa Yes, it’s a substring Start at the root and follow the edges labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie a b $ How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T. a a $ b Start at the root and follow the edges labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T S = abaaba Yes, it’s a substring If we exhaust S without falling of, S is a substring of T $ Sufx trie a b $ How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T. a a $ b Start at the root and follow the edges labeled with the characters of S x $ b a S = baabb If we “fall of” the trie -- i.e. there is no No, not a substring outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie a b $ How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but additionally check whether the fnal node in b a a $ the walk has an outgoing edge labeled $ a a $ b S = baa Not a sufx $ b a a $ $ Sufx trie a b $ How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but additionally check whether the fnal node in b a a $ the walk has an outgoing edge labeled $ a a $ b S = baa Not a sufx $ b a a $ $ Sufx trie a b $ How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but b a a $ additionally check whether the fnal node in S = aba the walk has an outgoing edge labeled $ Is a sufx a a $ b $ b a a $ $ Sufx trie a b $ How do we count the number of times a string S occurs as a substring of T? a b $ a Follow path corresponding to S.