CSE 549: Suffix Tries & Suffix Trees
All slides in this lecture not marked with “*” of Ben Langmead. KMP is great, but |T| = m |P| = n (note: m,n are opposite from previous lecture) Without pre- Given pre- Without pre- Given pre- processing processing processing processing (KMP) (KMP) (ST) (ST)
Find an O(m+n) O(m) O(m+n) O(n) occurrence of P
Find all occurrences of O(m+n) O(m) O(m + n + k) O(n+k) P
Find an occurrence of O(l(m+n)) O(l(m)) O( m + ln) O(ln) P1,…,Pl
If the text is constant over many patterns, pre-processing the text rather than the pattern is better (and allows other efficient queries). * Tries
A trie (pronounced “try”) is a rootedtree representing tree representing a collection a collection of strings of with strings with one node per common prefx
Smallest tree such that:
Each edge is labeled with a character c ∈ Σ
A node has at most one outgoing edge labeled c, for c ∈ Σ
Each key is “spelled out” along some path starting at the root
Natural way to represent either a set or a map where keys are strings
This structure is also known as a Σ-tree Tries: example
Represent this map with a trie: i Key Value instant 1 n s t internal 2 internet 3 t e a r The smallest tree such that: n n Each edge is labeled with a character c ∈ Σ t a e A node has at most one outgoing edge 1 labeled c, for c ∈ Σ l t Each key is “spelled out” along some path 2 3 starting at the root Tries: example
i Checking for presence of a key P, where n = | P |, is ???O(n ) time n t s If total length of all keys is N, trie has O ???(N ) nodes t e a r What about | Σ | ? n n Depends how we represent outgoing t a e edges. If we don’t assume | Σ | is a 1 small constant, it shows up in one or l t both bounds. 2 3 Tries: another example
We can index T with a trie. The trie maps substrings to ofsets where they occur
c 4 g 8 t ac 4 a ag 8 14 at 14 c c cc 12 t 12, 2 cc 2 root: g ct 6 6 gt 18 t t gt 0 18, 0 ta 10 a tt 16 10 t 16 Indexing with sufxes
SomeUntil indices now, (e.g. our indexes the inverted have beenindex) based are based on extracting on extracting substrings substrings from T from T
A very diferent approach is to extract sufxes from T. This will lead us to some interesting and practical index data structures:
6 $ $ B A N A N A 5 A$ A $ B A N A N 3 ANA$ A N A $ B A N 1 ANANA$ A N A N A $ B 0 BANANA$ B A N A N A $ 4 NA$ N A $ B A N A 2 NANA$ N A N A $ B A
Sufx Trie Sufx Tree Sufx Array FM Index Trie Definitions A Σ-tree (trie) is a rooted tree where each edge is labeled with a single character c ϵ Σ, such that no node has two outgoing edges labeled with the same character. • for a node v in T, depth(v) or node-depth(v) is the distance from v to the root.
• node-depth(r) = 0
• string(v) = concatenation of all characters on the path r ⤳ v
• string-depth(v) = |string(v)| (note: string-depth(v) ≥ node-depth(v))
• for a string x, if ∃ node v with string(v) = x, we say node(x) = v
• T displays string x if ∃ node v and string y such that xy = string(v)
• words(T) = { x | T displays x}
• A suffix trie of string s is a Σ-tree such that words(T) = {s’ | s’ is a substring of s}
• An internal/leaf edge leads to an internal/leaf node
Defs. from: http://profs.sci.univr.it/~liptak/ALBioinfo/files/sequence_analysis.pdf * Sufx trie
Build a trie containing all sufxes of a text T
T: GTTATAGCTGATCGCGGCGTAGCGG$ GTTATAGCTGATCGCGGCGTAGCGG$ TTATAGCTGATCGCGGCGTAGCGG$ TATAGCTGATCGCGGCGTAGCGG$ ATAGCTGATCGCGGCGTAGCGG$ TAGCTGATCGCGGCGTAGCGG$ AGCTGATCGCGGCGTAGCGG$ GCTGATCGCGGCGTAGCGG$ CTGATCGCGGCGTAGCGG$ TGATCGCGGCGTAGCGG$ GATCGCGGCGTAGCGG$ ATCGCGGCGTAGCGG$m(m+1)/2 TCGCGGCGTAGCGG$ CGCGGCGTAGCGG$chars GCGGCGTAGCGG$ CGGCGTAGCGG$ GGCGTAGCGG$ GCGTAGCGG$ CGTAGCGG$ GTAGCGG$ TAGCGG$ AGCGG$ GCGG$ CGG$ GG$ G$ $ Sufx trie
First add special terminal character $ to the end of T
$ is a character that does not appear elsewhere in T, and we defne it to be less than other characters (for DNA: $ < A < C < G < T)
$ enforces a rule we’re all used to using: e.g. “as” comes before “ash” in the dictionary. $ also guarantees no sufx is a prefx of any other sufx.
T: GTTATAGCTGATCGCGGCGTAGCGG$ GTTATAGCTGATCGCGGCGTAGCGG$ TTATAGCTGATCGCGGCGTAGCGG$ TATAGCTGATCGCGGCGTAGCGG$ ATAGCTGATCGCGGCGTAGCGG$ TAGCTGATCGCGGCGTAGCGG$ AGCTGATCGCGGCGTAGCGG$ GCTGATCGCGGCGTAGCGG$ CTGATCGCGGCGTAGCGG$ TGATCGCGGCGTAGCGG$ GATCGCGGCGTAGCGG$ ATCGCGGCGTAGCGG$ TCGCGGCGTAGCGG$ CGCGGCGTAGCGG$ GCGGCGTAGCGG$ CGGCGTAGCGG$ GGCGTAGCGG$ GCGTAGCGG$ CGTAGCGG$ GTAGCGG$ TAGCGG$ AGCGG$ GCGG$ CGG$ GG$ G$ $ Sufx trie
a b $ Shortest T: abaaba T$: abaaba$ (non-empty) sufx a b $ a
Each path from root to leaf represents a
sufx; each sufx is represented by some b a a $ path from root to leaf
a a $ b Would this still be the case if we hadn’t added $?
$ b a
a $
$
Longest sufx Sufx trie
a b T: abaaba
Each path from root to leaf represents a a b a sufx; each sufx is represented by some path from root to leaf b a a Would this still be the case if we hadn’t added $? No a a b
b a
a Sufx trie
a b $ We can think of nodes as having labels, where the label spells out characters on the a b $ a path from the root to the node
b a a $
a a $ b baa
$ b a
a $
$ Sufx trie
a b $
How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T.
a a $ b S = baa Yes, it’s a substring Start at the root and follow the edges
labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie
a b $
How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T.
a a $ b S = baa Yes, it’s a substring Start at the root and follow the edges
labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie
a b $
How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T.
a a $ b Start at the root and follow the edges
labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T S = abaaba Yes, it’s a substring If we exhaust S without falling of, S is a substring of T $ Sufx trie
a b $
How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T.
a a $ b Start at the root and follow the edges labeled with the characters of S x $ b a S = baabb If we “fall of” the trie -- i.e. there is no No, not a substring outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie
a b $
How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but additionally check whether the fnal node in b a a $ the walk has an outgoing edge labeled $
a a $ b S = baa Not a sufx
$ b a
a $
$ Sufx trie
a b $
How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but additionally check whether the fnal node in b a a $ the walk has an outgoing edge labeled $
a a $ b S = baa Not a sufx
$ b a
a $
$ Sufx trie
a b $
How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but b a a $ additionally check whether the fnal node in S = aba the walk has an outgoing edge labeled $ Is a sufx
a a $ b
$ b a
a $
$ Sufx trie
a b $
How do we count the number of times a string S occurs as a substring of T? a b $ a
Follow path corresponding to S. b a a $ Either we fall of, in which case S = aba answer is 0, or we end up at node n n 2 occurrences
and the answer = # of leaf nodes in a a $ b the subtree rooted at n.
Leaves can be counted with depth-frst $ b a traversal.
a $
$ Sufx trie
a b $
How do we count the number of times a string S occurs as a substring of T? a b $ a
Follow path corresponding to S. b a a $ Either we fall of, in which case S = aba answer is 0, or we end up at node n n 2 occurrences
and the answer = # of leaf nodes in a a $ b the subtree rooted at n.
Leaves can be counted with depth-frst $ b a traversal.
a $
$ Sufx trie
a b $
How do we fnd the longest repeated substring of T? a b $ a
Find the deepest node with more than one child b a a $ aba
a a $ b
$ b a
a $
$ Sufx trie
a b $
How do we fnd the longest repeated substring of T? a b $ a
Find the deepest node with more than one child b a a $ aba
a a $ b
$ b a
a $
$ Sufx trie
How many nodes does the sufx trie have? T = aaaa a $ Is there a class of string where the number of sufx trie nodes grows linearly with m?
a $ Yes: e.g. a string of m a’s in a row (am)
a $
a $ • 1 Root • m nodes with incoming a edge • m + 1 nodes with $ incoming $ edge 2m + 2 nodes Sufx trie
How many nodes does the sufx trie have? T = aaaa a $ Is there a class of string where the number of sufx trie nodes grows linearly with m?
a $ Yes: e.g. a string of m a’s in a row (am)
a $
a $ • 1 Root • m nodes with incoming a edge • m + 1 nodes with $ incoming $ edge 2m + 2 nodes Sufx trie
Is there a class of string where the number of sufx trie nodes grows with m2?
Yes: anbn
• 1 root n nodes along “b chain,” right • Figure & example n nodes along “a chain,” middle • by Carl Kingsford • n chains of n “b” nodes hanging of each“a chain” node • 2n + 1 $ leaves (not shown)
n2 + 4n + 2 nodes, where m = 2n Sufx trie
Is there a class of string where the number of sufx trie nodes grows with m2?
Yes: anbn
• 1 root n nodes along “b chain,” right • Figure & example n nodes along “a chain,” middle • by Carl Kingsford • n chains of n “b” nodes hanging of each“a chain” node • 2n + 1 $ leaves (not shown)
n2 + 4n + 2 nodes, where m = 2n Sufx trie: upper bound on size
Could worst-case # nodes be worse than O(m2)?
Root
Max # nodes from top to bottom = length of longest sufx + 1 Sufx trie = m + 1
Deepest leaf
Max # nodes from left to right 2 = max # distinct substrings of any length O(m ) is worst case ≤ m Sufx trie: actual growth k
● ●● m^2 ●● ●● ● ●● 250000 ● actual ●● ●● Built sufx tries for the frst ● ●● m ●● ●● ●● ●● ●● ●● 500 prefxes of the lambda ●● ●● ●● ●● ●● phage virus genome ●● ●● 200000 ● ●● ●● ●● ●● ●● ●● ●● ●● ●● Black curve shows how # ●● ●● ●● ●● ●● ●● 150000 ●● nodes increases with prefx ●● ●● ●● ●● ●● ●● ●● ●●● length ●● ●●● ●● ●●● ●● ●●● ●● ●●● ●●● ●●● ●● ●●● ●●● ●●● ●● ●●●● ●●● ●●● # suffix trie nodes ●● ●●● ●● ●●●● 100000 ●●● ●●● ●●● ●●●● ●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●●● ●●● ●●●●● ●●● ●●●●● ●●●● ●●●●● ●●●● ●●●●● 50000 ●● ●● ●●●● ●●●●● ●●●● ●●●●●● ●●●● ●●●●●● ●●●● ●●●●●● ●●●●● ●●●●●● ●●●●● ●●●●●●● ●●●●● ●●●●●●● ●●●●● ●●●●●●●● ●●●●●● ●●●●●●●● ●●●●●●● ●●●●●●●●● ●●●●●●● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
0 100 200 300 400 500
Length prefix over which suffix trie was built Suffix tries -> Suffix trees Suffix Tree Definitions A Σ+-tree is a rooted tree, T, where each edge is labeled with non-empty strings, where no node has two outgoing edges labeled with strings having the same first character. T is compact if all internal nodes have ≥ 2 children.
• for a node v in T, depth(v) or node-depth(v) is the distance from v to the root.
• node-depth(r) = 0
• string(v) = concatenation of all characters on the path r ⤳ v
• string-depth(v) = |string(v)| (note: string-depth(v) ≥ node-depth(v))
• for a string x, if ∃ node v with string(v) = x, we say node(x) = v
• T displays string x if ∃ node v and string y such that xy = string(v)
• words(T) = { x | T displays x}
• A suffix tree of string s is a compact Σ+-tree such that words(T) = {s’ | s’ is a substring of s}
Defs. from: http://profs.sci.univr.it/~liptak/ALBioinfo/files/sequence_analysis.pdf * Sufx trie: making it smaller
T = abaaba$
Idea 1: Coalesce non-branching paths into a single edge with a string label
$ aba$
Reduces # nodes, edges, guarantees internal nodes have >1 child Sufx tree L leaves, I internal nodes, E edges
E = L + I -1 T = abaaba$ E ≥ 2I (each internal node branches) L + I -1 ≥ 2I ⇒ I ≤ L - 1 a ba $ With respect to m: Howbut many leaves? m $ ba $ HowL many≤ m (atnon-leaf most nodes?m suffixes)≤ m - 1 aba$ ≤ 2m -1 nodes total, or O(m) nodes $ aba$ I ≤ m -1 aba$ E = L + I - 1 ≤ 2m - 2 E + L + I ≤ 4m - 3 ∈ O(m)
No: total length of edge Is the total size O(m) now? Is the total size O(m)labels now? is quadratic in m Sufx tree L leaves, I internal nodes, E edges
E = L + I -1 T = abaaba$ E ≥ 2I (each internal node branches) L + I -1 ≥ 2I ⇒ I ≤ L - 1 a ba $ With respect to m: Howbut many leaves? m $ ba $ HowL many≤ m (atnon-leaf most nodes?m suffixes)≤ m - 1 aba$ ≤ 2m -1 nodes total, or O(m) nodes $ aba$ I ≤ m -1 aba$ E = L + I - 1 ≤ 2m - 2 E + L + I ≤ 4m - 3 ∈ O(m)
No: total length of edge Is the total size O(m) now? Is the total size O(m)labels now? is quadratic in m
NO: The total length of edge labels is quadratic in m. Sufx tree
T = abaaba$ Idea 2: Store T itself in addition to the tree. Convert tree’s edge labels to (ofset, length) pairs with respect to T.
T = abaaba$
(6, 1) a $ (0, 1) ba (1, 2)
$ (6, 1) ba $ (1, 2) (6, 1) aba$ (3, 4) (3, 4) $ aba$ (6, 1) aba$ (3, 4)
Space required for sufx tree is now O(m) Sufx tree: leaves hold ofsets where suffixes begin
T = abaaba$ T = abaaba$
(6, 1) (6, 1) (0, 1) (0, 1) (1, 2) (1, 2) 6
(6, 1) (6, 1) (1, 2) (6, 1) (1, 2) (6, 1) (3, 4) 5 (3, 4) (3, 4) (3, 4) 4 (6, 1) (6, 1) 1 (3, 4) (3, 4) 3 2 0 Sufx tree: labels
Again, each node’s label equals the T = abaaba$ concatenated edge labels from the root to (6, 1) the node. These aren’t stored explicitly. (0, 1) (1, 2) 6
(6, 1) (1, 2) (6, 1) Label = “ba” 5 (3, 4) (3, 4) 4 (6, 1) 1 (3, 4) 3 2 Label = “aaba$” 0 Sufx tree: labels
Because edges can have string labels, we T = abaaba$ must distinguish two notions of “depth”
(6, 1) • Node depth: how many edges we must (0, 1) (1, 2) 6 follow from the root to reach the node
(6, 1) (1, 2) (6, 1) • Label depth: total length of edge labels 5 (3, 4) for edges on path from root to node (3, 4) 4 (6, 1) 1 (3, 4) 3 2 0 Sufx tree: space caveat
Minor point: T = abaaba$ We say the space taken by the edge labels is (6, 1) (0, 1) O(m), because we keep 2 integers per edge and (1, 2) 6 there are O(m) edges
(6, 1) (1, 2) (6, 1) To store one such integer, we need enough bits 5 (3, 4) to distinguish m positions in T, i.e. ceil(log2 m) (3, 4) 4 (6, 1) bits. We usually ignore this factor, since 64 bits is 1 plenty for all practical purposes. (3, 4) 3 2 Similar argument for the pointers / references 0 used to distinguish tree nodes. Sufx tree: building
Naive method 1: build a sufx trie, then coalesce non-branching paths and relabel (6, 1) edges (0, 1) (1, 2) 6 Naive method 2: build a single-edge tree (6, 1) representing only the longest sufx, then (1, 2) (6, 1) augment to include the 2nd-longest, then 5 (3, 4) (3, 4) 4 augment to include 3rd-longest, etc (6, 1) 1 (3, 4) 3 2 Both are O(m ) time, but frst uses 2 O(m2) spaceSuf whilex tree: second implementationuses O(m) 0 Naive method 2 is described in Gusfeld 5.4 PythonPython implementation example here: at: http://nbviewer.ipython.org/6665861 WOTD (Write-Only Top-Down) Construction
Giegerich, Robert, and Stefan Kurtz. "A comparison of imperative and purely functional suffix tree constructions." Science of Computer Programming 25.2 (1995): 187-218. Build a suffix tree for string s$ Recursive construction: For every branching node node(u), subtree of node(u) is determined by all suffixes of s$ where u is a prefix. Recursively construct subtree for all suffixes where u is a prefix.
Definition: remaining suffixes of u
R(node(u)) = { v | uv is a suffix of s$ } * WOTD (Write-Only Top-Down) Construction Build a suffix tree for string s$ Recursive construction:
For every branching node node(u), subtree of node(u) is determined by all suffixes of s$ where u is a prefix.
Recursively construct subtree for all suffixes where u is a prefix.
Definition: remaining suffixes of u R(node(u)) = { v | uv is a suffix of s$ }
Definition: c-group of node(u) group(node(u), c) = { w ∈ Σ* | cw ∈ R(node(u))}
* WOTD (Write-Only Top-Down) Construction
def WOTD(T : tree, node(u): node): for each c ∈ Σ ⋃ {$}: G = group(node(u), c) non-branching suffix ucv = lcp(G) if |G| == 1: add leaf node(ucv) as a child of node(u) else: add inner node(ucv) as a child of node(u) WOTD(T, node(ucv)) branching suffix
Start the algorithm by calling WOTD(T, node(�))
root node * WOTD Example s = ttatctctta$
suffixes are read top-to-bottom
example from: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture4Materials/wotd.pdf * WOTD Example s = ttatctctta
example from: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture4Materials/wotd.pdf * WOTD Example s = ttatctctta
example from: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture4Materials/wotd.pdf * WOTD Example s = ttatctctta
example from: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture4Materials/wotd.pdf * WOTD Example s = ttatctctta
example from: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture4Materials/wotd.pdf WOTD Properties
• Worst case time still ∈ O(|T|2)
• Expected case time ∈ O(|T| log |T|)
• Write-only property & recursive construction lends itself well to parallelism
• Good caching properties (locality of reference for substrings belonging to a subtree)
• Top-down construction order allows lazy construction as discussed in:
Giegerich, Robert, Stefan Kurtz, and Jens Stoye. "Efficient implementation of lazy suffix trees." Software: Practice and Experience 33.11 (2003): 1035-1049. Sufx tree: building
OtherMethod methods of choice: Ukkonen’sfor construction: algorithm Ukkonen, Esko. "On-line construction of sufx trees." Algorithmica 14.3 (1995): 249-260.
O(m) time and space
Has online property: if T arrives one character at a time, algorithm efciently updates sufx tree upon each arrival
We won’t cover it here; see Gusfeld Ch. 6 for details Or just Google “Ukkonen’s algorithm” Sufx tree: actual growth
● ● 2m ●●● ●●● 1000 ●●● Built sufx trees for the frst ● ●●● actual ●●● ●●● ● ●●● m ●●● ●●● 500 prefxes of the lambda ●●● ●●● ●●● ●●● ●●● phage virus genome ●●● ●●● ●●● ●● ●●● ●●●● 800 ●●● ●●● ●●● ●●●●● ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● Black curve shows # nodes ●●● ●●●●● ●●● ●●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●● increasing with prefx length ●●● ●●●● ●●● ●●●● ●●● ●●● ●●● ●●●● 600 ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●●● ●●● ●●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● Compare with sufx trie: ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●●
400 ● ● ●●● ●●● ●●●●●● # suffix tree nodes ● ● ●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●● ●●●●●● ● ●● ●● ●●● ●●● m^2 ●● ●● ●●● ●●● ●● ●● ●● ●●● ● ●● ●● ●●● ●●● 250000 ● ● ● actual ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ● ●● ●● ●●● ●●● m ●● ●● ●● ●●● ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ● ● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● 200000 ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●●● ●●●● ●● ●● ● ●●● ●● ●● ●● ●●●
200 ● ● ● ●● ●● ●● ●●● ●● ●● ●● ●●●● ●● ●● ●● ●●● ●● ●● ● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● 150000 ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●● ●●● ●● ●●● ●● ●● ●●● ●● ●●● ● ●● ●●● ●● ●●● ●● ●●● ●●● ●● ●●● ●● ●● ●●● ●● ●●● ●● ●● ●●● ●●● ●●● ● ●● ●●● ●● ●●● ●●●●●●●● ●●● ●●● ●●●●●●● ●● ●●●● ●●●●●●●● ●●● ●●● ●●●●● # suffix trie nodes ●● ●●● ●●●●●●● ●● ●●●● ●●●●●●●● 100000 ●●● ●●● ●●●●●● ●●● ●●●● ●●●●● ●●● ●●●● ●●●● ●●● ●●●● 0 ● ●●● ●●●● ●●● ●●●● 123 K ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●●● ●●● ●●●●● ●●● ●●●●● ●●●● ●●●●● ●●●● ●●●●● 50000 ●● ●●●● ●●●● ●●●●● ●●●● ●●●●●● nodes ●●● ●●●● ●●●● ●●●●●● ●●●●● ●●●●●● ●●●●● ●●●●●●● 0 100 200 300 400 500 ●●●●● ●●●●●●● ●●●●● ●●●●●●●● ●●●●●● ●●●●●●●● ●●●●●●● ●●●●●●●●● ●●●●●●● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● Length prefix over which suffix tree was built 0 100 200 300 400 500
Length prefix over which suffix trie was built Sufx tree
How do we check whether a string S is a substring of T? a ba $
Essentially same procedure as for 6 sufx trie, except we have to deal with S = baa coalesced edges aba$ ba $ aba$ $ Yes, it’s a 2 5 1 substring4
aba$ $
0 3 Sufx tree
How do we check whether a string S is a substring of T? a ba $
Essentially same procedure as for 6 sufx trie, except we have to deal with S = baa coalesced edges aba$ ba $ aba$ $ Yes, it’s a 2 5 1 substring4
aba$ $
0 3 Sufx tree
How do we check whether a string S is a sufx of T? a ba $
Essentially same procedure as for 6 sufx trie, except we have to deal with coalesced edges aba$ ba $ aba$ $ S = aba 2 Yes,5 it’s a su1 fx 4
aba$ $
0 3 Sufx tree
How do we count the number of times a string S occurs as a substring of T? a ba $
6 Same procedure as for sufx trie
aba$ ba $ aba$ $
2 S = 5aba 1 4 Occurs twice aba$ $
0 3 Sufx tree: applications
With sufx tree of T, we can fnd all matches of P to T. Let k = # matches.
E.g., P = ab, T = abaaba$
Step 1: walk down ab path a $ O(n) ba If we “fall of” there are no matches 6 $ baba $ 5 aba$ Step 2: visit all leaf nodes below 4 O(k) $ aba$ Report each leaf ofset as match ofset 1 aba$ 3 2 O(n + k) time 0 abaaba ab ab Sufx tree application: fnd long common substrings
Dots are maximal unique matches (MUMs), a kind of long substring shared by two sequences
Red = match was between like strands, green = diferent strands
Axes show diferent strains of Helicobacter pylori, a bacterium found in the stomach and associated with gastric ulcers Sufx tree application: fnd longest common substring
To fnd the longest common substring (LCS) of X and Y, make a new string X # Y $ where # and $ are both terminal symbols. Build a sufx tree for X # Y $ . X Y
X = xabxa a x #babxba$ b $ Y = babxba X#Y$ = xabxa#babxba$ X Y X Y 5 X Y 12 abx #babxba$ bx $ a ba$ a x
Consider leaves: 4 X Y 11 X 9 Y X Y
ofsets in [0, 4] are a#babxba$ ba$ #babxba$ bxa#babxba$ bxba$ $ a#babxba$ ba$ sufxes of X, ofsets in [6, 11] are sufxes of Y 1 7 3 0 6 10 2 8
Traverse the tree and annotate each node according to whether leaves below it include sufxes of X, Y or both The deepest node annotated with both X and Y has LCS as its label. O(| X | + | Y |) time and space. Sufx tree application: generalized sufx trees
This is one example of many applications where it is useful to build a sufx tree over many strings at once
Such a tree is called a generalized sufx tree. These are introduced in Gusfeld 6.4.
X Y
a x #babxba$ b $
X Y X Y 5 X Y 12 abx #babxba$ bx $ a ba$ a x
4 X Y 11 X 9 Y X Y
a#babxba$ ba$ #babxba$ bxa#babxba$ bxba$ $ a#babxba$ ba$
1 7 3 0 6 10 2 8 Longest Common Extension
Longest common extension: We are given strings S and T. In the future, many pairs (i,j) will be provided as queries, and we want to quickly find:
the longest substring of S starting at i that matches a substring of T starting at j. LCE(i,j) LCE(i,j) S T
i j
Build generalized suffix tree for S and T. O(|S| + |T|) Preprocess tree so that lowest common ancestors (LCA) can be found in constant time. This can be done using range-minimum queries O(|S| + |T|) (RMQ) LCA(i,j)
Create an array mapping suffix numbers to leaf O(|S| + |T|) nodes. Given query (i,j): j i Find the leaf nodes for i and j O(1) Return string of LCA for i and j O(LCE(i,j)) i j Sufx trees in the real world: MUMmer
FASTA fle containing “reference” (“text”) FASTA fle containing ALU string
Indexing phase: ~2 minutes
Matching phase: Columns: very fast 1. Match ofset in T 2. Match ofset in P 3. Length of exact match ... Sufx trees in the real world: MUMmer
MUMmer v3.32 time and memory scaling when indexing increasingly larger fractions of human chromosome 1
● 140 ●
● ● 3500 120
● ● 3000 ● 100
●
2500 ● 80
● ●
2000 ● 60 Time (seconds) ● ● 1500
● 40 ● Peak memory usage (megabytes) Peak 1000 ● ● 20
500 ● ●
0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0
Fraction of human chromosome 1 indexed Fraction of human chromosome 1 indexed
For whole chromosome 1, took 2m:14s and used 3.94 GB memory Sufx trees in the real world: MUMmer
Attempt to build index for whole human genome reference:
mummer:!suffix!tree!construction!failed:!textlen=3101804822! larger!than!maximal!textlen=536870908
We can predict it would have taken about 47 GB of memory Sufx trees in the real world: the constant factor
While O(m) is desirable, the constant in front of the m limits wider use of sufx trees in practice
Constant factor varies depending on implementation:
Estimate of MUMmer’s constant factor = 3.94 GB / 250 million nt ≈ 15.75 bytes per node
Literature reports implementations achieving as little as 8.5 bytes per node, but no implementation used in practice that I know of is better than ≈ 12.5 bytes per node
Kurtz, Stefan. "Reducing the space requirement of sufx trees." Software Practice and Experience 29.13 (1999): 1149-1171. Sufx tree: summary
GTTATAGCTGATCGCGGCGTAGCGG$ m chars Organizes all sufxes into an GTTATAGCTGATCGCGGCGTAGCGG$ TTATAGCTGATCGCGGCGTAGCGG$ incredibly useful, fexible data TATAGCTGATCGCGGCGTAGCGG$ structure, in O(m) time and space ATAGCTGATCGCGGCGTAGCGG$ TAGCTGATCGCGGCGTAGCGG$ AGCTGATCGCGGCGTAGCGG$ A naive method (e.g. sufx trie) GCTGATCGCGGCGTAGCGG$ could easily be quadratic or worse CTGATCGCGGCGTAGCGG$ TGATCGCGGCGTAGCGG$ Used in practice for whole genome alignment, GATCGCGGCGTAGCGG$ m(m+1)/2 ATCGCGGCGTAGCGG$ repeat identifcation, etc TCGCGGCGTAGCGG$ chars CGCGGCGTAGCGG$ (3, 1) (7, 1) (1, 1) (0, 1) (25, 1) GCGGCGTAGCGG$ 25 CGGCGTAGCGG$ (4, 1) (6, 2) (8, 18) (13, 1) (3, 1) (12, 14) (2, 24) (9, 17) (10, 16) (7, 1) (1, 1) (16, 1) (25, 1) GGCGTAGCGG$ 7 11 1 8 9 24 GCGTAGCGG$ (5, 21) (12, 14) (8, 18) (23, 3) (14, 12) (19, 7) (16, 1) (4, 22) (6, 2) (8, 18) (15, 1) (20, 6) (2, 24) (17, 9) (25, 1)
3 10 5 20 12 17 2 6 18 0 15 23 CGTAGCGG$
(17, 9) (25, 1) (8, 18) (23, 3) (19, 7) (16, 1) GTAGCGG$
14 22 4 19 16 TAGCGG$
(17, 9) (25, 1) AGCGG$
13 21 GCGG$ CGG$ Actual memory footprint (bytes per node) is GG$ quite high, limiting usefulness G$ $