CSE 549: Suffix & Suffix Trees

All slides in this lecture not marked with “*” of Ben Langmead. KMP is great, but |T| = m |P| = n (note: m,n are opposite from previous lecture) Without pre- Given pre- Without pre- Given pre- processing processing processing processing (KMP) (KMP) (ST) (ST)

Find an O(m+n) O(m) O(m+n) O(n) occurrence of P

Find all occurrences of O(m+n) O(m) O(m + n + k) O(n+k) P

Find an occurrence of O(l(m+n)) O(l(m)) O( m + ln) O(ln) P1,…,Pl

If the text is constant over many patterns, pre-processing the text rather than the pattern is better (and allows other efficient queries). * Tries

A (pronounced “try”) is a rootedtree representing representing a collection a collection of strings of with strings with one node per common prefx

Smallest tree such that:

Each edge is labeled with a character c ∈ Σ

A node has at most one outgoing edge labeled c, for c ∈ Σ

Each key is “spelled out” along some path starting at the root

Natural way to represent either a or a map where keys are strings

This structure is also known as a Σ-tree Tries: example

Represent this map with a trie: i Key Value instant 1 n s t internal 2 internet 3 t e a r The smallest tree such that: n n Each edge is labeled with a character c ∈ Σ t a e A node has at most one outgoing edge 1 labeled c, for c ∈ Σ l t Each key is “spelled out” along some path 2 3 starting at the root Tries: example

i Checking for presence of a key P, where n = | P |, is ???O(n ) time n t s If total length of all keys is N, trie has O ???(N ) nodes t e a r What about | Σ | ? n n Depends how we represent outgoing t a e edges. If we don’t assume | Σ | is a 1 small constant, it shows up in one or l t both bounds. 2 3 Tries: another example

We can index T with a trie. The trie maps to ofsets where they occur

c 4 g 8 t ac 4 a ag 8 14 at 14 c c cc 12 t 12, 2 cc 2 root: g ct 6 6 gt 18 t t gt 0 18, 0 ta 10 a tt 16 10 t 16 Indexing with sufxes

SomeUntil indices now, (e.g. our indexes the inverted have beenindex) based are based on extracting on extracting substrings substrings from T from T

A very diferent approach is to extract sufxes from T. This will lead us to some interesting and practical index data structures:

6 $ $ B A N A N A 5 A$ A $ B A N A N 3 ANA$ A N A $ B A N 1 ANANA$ A N A N A $ B 0 BANANA$ B A N A N A $ 4 NA$ N A $ B A N A 2 NANA$ N A N A $ B A

Sufx Trie Sufx Tree Sufx Array FM Index Trie Definitions A Σ-tree (trie) is a rooted tree where each edge is labeled with a single character c ϵ Σ, such that no node has two outgoing edges labeled with the same character. • for a node v in T, depth(v) or node-depth(v) is the distance from v to the root.

• node-depth(r) = 0

• string(v) = concatenation of all characters on the path r ⤳ v

• string-depth(v) = |string(v)| (note: string-depth(v) ≥ node-depth(v))

• for a string x, if ∃ node v with string(v) = x, we say node(x) = v

• T displays string x if ∃ node v and string y such that xy = string(v)

• words(T) = { x | T displays x}

• A suffix trie of string s is a Σ-tree such that words(T) = {s’ | s’ is a of s}

• An internal/leaf edge leads to an internal/leaf node

Defs. from: http://profs.sci.univr.it/~liptak/ALBioinfo/files/sequence_analysis.pdf * Sufx trie

Build a trie containing all sufxes of a text T

T: GTTATAGCTGATCGCGGCGTAGCGG$ GTTATAGCTGATCGCGGCGTAGCGG$ TTATAGCTGATCGCGGCGTAGCGG$ TATAGCTGATCGCGGCGTAGCGG$ ATAGCTGATCGCGGCGTAGCGG$ TAGCTGATCGCGGCGTAGCGG$ AGCTGATCGCGGCGTAGCGG$ GCTGATCGCGGCGTAGCGG$ CTGATCGCGGCGTAGCGG$ TGATCGCGGCGTAGCGG$ GATCGCGGCGTAGCGG$ ATCGCGGCGTAGCGG$m(m+1)/2 TCGCGGCGTAGCGG$ CGCGGCGTAGCGG$chars GCGGCGTAGCGG$ CGGCGTAGCGG$ GGCGTAGCGG$ GCGTAGCGG$ CGTAGCGG$ GTAGCGG$ TAGCGG$ AGCGG$ GCGG$ CGG$ GG$ G$ $ Sufx trie

First add special terminal character $ to the end of T

$ is a character that does not appear elsewhere in T, and we defne it to be less than other characters (for DNA: $ < A < C < G < T)

$ enforces a rule we’re all used to using: e.g. “as” comes before “ash” in the dictionary. $ also guarantees no sufx is a prefx of any other sufx.

T: GTTATAGCTGATCGCGGCGTAGCGG$ GTTATAGCTGATCGCGGCGTAGCGG$ TTATAGCTGATCGCGGCGTAGCGG$ TATAGCTGATCGCGGCGTAGCGG$ ATAGCTGATCGCGGCGTAGCGG$ TAGCTGATCGCGGCGTAGCGG$ AGCTGATCGCGGCGTAGCGG$ GCTGATCGCGGCGTAGCGG$ CTGATCGCGGCGTAGCGG$ TGATCGCGGCGTAGCGG$ GATCGCGGCGTAGCGG$ ATCGCGGCGTAGCGG$ TCGCGGCGTAGCGG$ CGCGGCGTAGCGG$ GCGGCGTAGCGG$ CGGCGTAGCGG$ GGCGTAGCGG$ GCGTAGCGG$ CGTAGCGG$ GTAGCGG$ TAGCGG$ AGCGG$ GCGG$ CGG$ GG$ G$ $ Sufx trie

a b $ Shortest T: abaaba T$: abaaba$ (non-empty) sufx a b $ a

Each path from root to leaf represents a

sufx; each sufx is represented by some b a a $ path from root to leaf

a a $ b Would this still be the case if we hadn’t added $?

$ b a

a $

$

Longest sufx Sufx trie

a b T: abaaba

Each path from root to leaf represents a a b a sufx; each sufx is represented by some path from root to leaf b a a Would this still be the case if we hadn’t added $? No a a b

b a

a Sufx trie

a b $ We can think of nodes as having labels, where the label spells out characters on the a b $ a path from the root to the node

b a a $

a a $ b baa

$ b a

a $

$ Sufx trie

a b $

How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T.

a a $ b S = baa Yes, it’s a substring Start at the root and follow the edges

labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie

a b $

How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T.

a a $ b S = baa Yes, it’s a substring Start at the root and follow the edges

labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie

a b $

How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T.

a a $ b Start at the root and follow the edges

labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T S = abaaba Yes, it’s a substring If we exhaust S without falling of, S is a substring of T $ Sufx trie

a b $

How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T.

a a $ b Start at the root and follow the edges labeled with the characters of S x $ b a S = baabb If we “fall of” the trie -- i.e. there is no No, not a substring outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie

a b $

How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but additionally check whether the fnal node in b a a $ the walk has an outgoing edge labeled $

a a $ b S = baa Not a sufx

$ b a

a $

$ Sufx trie

a b $

How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but additionally check whether the fnal node in b a a $ the walk has an outgoing edge labeled $

a a $ b S = baa Not a sufx

$ b a

a $

$ Sufx trie

a b $

How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but b a a $ additionally check whether the fnal node in S = aba the walk has an outgoing edge labeled $ Is a sufx

a a $ b

$ b a

a $

$ Sufx trie

a b $

How do we count the number of times a string S occurs as a substring of T? a b $ a

Follow path corresponding to S. b a a $ Either we fall of, in which case S = aba answer is 0, or we end up at node n n 2 occurrences

and the answer = # of leaf nodes in a a $ b the subtree rooted at n.

Leaves can be counted with depth-frst $ b a traversal.

a $

$ Sufx trie

a b $

How do we count the number of times a string S occurs as a substring of T? a b $ a

Follow path corresponding to S. b a a $ Either we fall of, in which case S = aba answer is 0, or we end up at node n n 2 occurrences

and the answer = # of leaf nodes in a a $ b the subtree rooted at n.

Leaves can be counted with depth-frst $ b a traversal.

a $

$ Sufx trie

a b $

How do we fnd the longest repeated substring of T? a b $ a

Find the deepest node with more than one child b a a $ aba

a a $ b

$ b a

a $

$ Sufx trie

a b $

How do we fnd the longest repeated substring of T? a b $ a

Find the deepest node with more than one child b a a $ aba

a a $ b

$ b a

a $

$ Sufx trie

How many nodes does the sufx trie have? T = aaaa a $ Is there a class of string where the number of sufx trie nodes grows linearly with m?

a $ Yes: e.g. a string of m a’s in a row (am)

a $

a $ • 1 Root • m nodes with incoming a edge • m + 1 nodes with $ incoming $ edge 2m + 2 nodes Sufx trie

How many nodes does the sufx trie have? T = aaaa a $ Is there a class of string where the number of sufx trie nodes grows linearly with m?

a $ Yes: e.g. a string of m a’s in a row (am)

a $

a $ • 1 Root • m nodes with incoming a edge • m + 1 nodes with $ incoming $ edge 2m + 2 nodes Sufx trie

Is there a class of string where the number of sufx trie nodes grows with m2?

Yes: anbn

• 1 root n nodes along “b chain,” right • Figure & example n nodes along “a chain,” middle • by Carl Kingsford • n chains of n “b” nodes hanging of each“a chain” node • 2n + 1 $ leaves (not shown)

n2 + 4n + 2 nodes, where m = 2n Sufx trie

Is there a class of string where the number of sufx trie nodes grows with m2?

Yes: anbn

• 1 root n nodes along “b chain,” right • Figure & example n nodes along “a chain,” middle • by Carl Kingsford • n chains of n “b” nodes hanging of each“a chain” node • 2n + 1 $ leaves (not shown)

n2 + 4n + 2 nodes, where m = 2n Sufx trie: upper bound on size

Could worst-case # nodes be worse than O(m2)?

Root

Max # nodes from top to bottom = length of longest sufx + 1 Sufx trie = m + 1

Deepest leaf

Max # nodes from left to right 2 = max # distinct substrings of any length O(m ) is worst case ≤ m Sufx trie: actual growth k

● ●● m^2 ●● ●● ● ●● 250000 ● actual ●● ●● Built sufx tries for the frst ● ●● m ●● ●● ●● ●● ●● ●● 500 prefxes of the lambda ●● ●● ●● ●● ●● phage virus genome ●● ●● 200000 ● ●● ●● ●● ●● ●● ●● ●● ●● ●● Black curve shows how # ●● ●● ●● ●● ●● ●● 150000 ●● nodes increases with prefx ●● ●● ●● ●● ●● ●● ●● ●●● length ●● ●●● ●● ●●● ●● ●●● ●● ●●● ●●● ●●● ●● ●●● ●●● ●●● ●● ●●●● ●●● ●●● # suffix trie nodes ●● ●●● ●● ●●●● 100000 ●●● ●●● ●●● ●●●● ●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●●● ●●● ●●●●● ●●● ●●●●● ●●●● ●●●●● ●●●● ●●●●● 50000 ●● ●● ●●●● ●●●●● ●●●● ●●●●●● ●●●● ●●●●●● ●●●● ●●●●●● ●●●●● ●●●●●● ●●●●● ●●●●●●● ●●●●● ●●●●●●● ●●●●● ●●●●●●●● ●●●●●● ●●●●●●●● ●●●●●●● ●●●●●●●●● ●●●●●●● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0 100 200 300 400 500

Length prefix over which suffix trie was built Suffix tries -> Suffix trees Suffix Tree Definitions A Σ+-tree is a rooted tree, T, where each edge is labeled with non-empty strings, where no node has two outgoing edges labeled with strings having the same first character. T is compact if all internal nodes have ≥ 2 children.

• for a node v in T, depth(v) or node-depth(v) is the distance from v to the root.

• node-depth(r) = 0

• string(v) = concatenation of all characters on the path r ⤳ v

• string-depth(v) = |string(v)| (note: string-depth(v) ≥ node-depth(v))

• for a string x, if ∃ node v with string(v) = x, we say node(x) = v

• T displays string x if ∃ node v and string y such that xy = string(v)

• words(T) = { x | T displays x}

• A suffix tree of string s is a compact Σ+-tree such that words(T) = {s’ | s’ is a substring of s}

Defs. from: http://profs.sci.univr.it/~liptak/ALBioinfo/files/sequence_analysis.pdf * Sufx trie: making it smaller

T = abaaba$

Idea 1: Coalesce non-branching paths into a single edge with a string label

$ aba$

Reduces # nodes, edges, guarantees internal nodes have >1 child Sufx tree L leaves, I internal nodes, E edges

E = L + I -1 T = abaaba$ E ≥ 2I (each internal node branches) L + I -1 ≥ 2I ⇒ I ≤ L - 1 a ba $ With respect to m: Howbut many leaves? m $ ba $ HowL many≤ m (atnon-leaf most nodes?m suffixes)≤ m - 1 aba$ ≤ 2m -1 nodes total, or O(m) nodes $ aba$ I ≤ m -1 aba$ E = L + I - 1 ≤ 2m - 2 E + L + I ≤ 4m - 3 ∈ O(m)

No: total length of edge Is the total size O(m) now? Is the total size O(m)labels now? is quadratic in m Sufx tree L leaves, I internal nodes, E edges

E = L + I -1 T = abaaba$ E ≥ 2I (each internal node branches) L + I -1 ≥ 2I ⇒ I ≤ L - 1 a ba $ With respect to m: Howbut many leaves? m $ ba $ HowL many≤ m (atnon-leaf most nodes?m suffixes)≤ m - 1 aba$ ≤ 2m -1 nodes total, or O(m) nodes $ aba$ I ≤ m -1 aba$ E = L + I - 1 ≤ 2m - 2 E + L + I ≤ 4m - 3 ∈ O(m)

No: total length of edge Is the total size O(m) now? Is the total size O(m)labels now? is quadratic in m

NO: The total length of edge labels is quadratic in m. Sufx tree

T = abaaba$ Idea 2: Store T itself in addition to the tree. Convert tree’s edge labels to (ofset, length) pairs with respect to T.

T = abaaba$

(6, 1) a $ (0, 1) ba (1, 2)

$ (6, 1) ba $ (1, 2) (6, 1) aba$ (3, 4) (3, 4) $ aba$ (6, 1) aba$ (3, 4)

Space required for sufx tree is now O(m) Sufx tree: leaves hold ofsets where suffixes begin

T = abaaba$ T = abaaba$

(6, 1) (6, 1) (0, 1) (0, 1) (1, 2) (1, 2) 6

(6, 1) (6, 1) (1, 2) (6, 1) (1, 2) (6, 1) (3, 4) 5 (3, 4) (3, 4) (3, 4) 4 (6, 1) (6, 1) 1 (3, 4) (3, 4) 3 2 0 Sufx tree: labels

Again, each node’s label equals the T = abaaba$ concatenated edge labels from the root to (6, 1) the node. These aren’t stored explicitly. (0, 1) (1, 2) 6

(6, 1) (1, 2) (6, 1) Label = “ba” 5 (3, 4) (3, 4) 4 (6, 1) 1 (3, 4) 3 2 Label = “aaba$” 0 Sufx tree: labels

Because edges can have string labels, we T = abaaba$ must distinguish two notions of “depth”

(6, 1) • Node depth: how many edges we must (0, 1) (1, 2) 6 follow from the root to reach the node

(6, 1) (1, 2) (6, 1) • Label depth: total length of edge labels 5 (3, 4) for edges on path from root to node (3, 4) 4 (6, 1) 1 (3, 4) 3 2 0 Sufx tree: space caveat

Minor point: T = abaaba$ We say the space taken by the edge labels is (6, 1) (0, 1) O(m), because we keep 2 integers per edge and (1, 2) 6 there are O(m) edges

(6, 1) (1, 2) (6, 1) To store one such integer, we need enough bits 5 (3, 4) to distinguish m positions in T, i.e. ceil(log2 m) (3, 4) 4 (6, 1) bits. We usually ignore this factor, since 64 bits is 1 plenty for all practical purposes. (3, 4) 3 2 Similar argument for the pointers / references 0 used to distinguish tree nodes. Sufx tree: building

Naive method 1: build a sufx trie, then coalesce non-branching paths and relabel (6, 1) edges (0, 1) (1, 2) 6 Naive method 2: build a single-edge tree (6, 1) representing only the longest sufx, then (1, 2) (6, 1) augment to include the 2nd-longest, then 5 (3, 4) (3, 4) 4 augment to include 3rd-longest, etc (6, 1) 1 (3, 4) 3 2 Both are O(m ) time, but frst uses 2 O(m2) spaceSuf whilex tree: second implementationuses O(m) 0 Naive method 2 is described in Gusfeld 5.4 PythonPython implementation example here: at: http://nbviewer.ipython.org/6665861 WOTD (Write-Only Top-Down) Construction

Giegerich, Robert, and Stefan Kurtz. "A comparison of imperative and purely functional suffix tree constructions." Science of Computer Programming 25.2 (1995): 187-218. Build a suffix tree for string s$ Recursive construction: For every branching node node(u), subtree of node(u) is determined by all suffixes of s$ where u is a prefix. Recursively construct subtree for all suffixes where u is a prefix.

Definition: remaining suffixes of u

R(node(u)) = { v | uv is a suffix of s$ } * WOTD (Write-Only Top-Down) Construction Build a suffix tree for string s$ Recursive construction:

For every branching node node(u), subtree of node(u) is determined by all suffixes of s$ where u is a prefix.

Recursively construct subtree for all suffixes where u is a prefix.

Definition: remaining suffixes of u R(node(u)) = { v | uv is a suffix of s$ }

Definition: c-group of node(u) group(node(u), c) = { w ∈ Σ* | cw ∈ R(node(u))}

* WOTD (Write-Only Top-Down) Construction

def WOTD(T : tree, node(u): node): for each c ∈ Σ ⋃ {$}: G = group(node(u), c) non-branching suffix ucv = lcp(G) if |G| == 1: add leaf node(ucv) as a child of node(u) else: add inner node(ucv) as a child of node(u) WOTD(T, node(ucv)) branching suffix

Start the algorithm by calling WOTD(T, node(�))

root node * WOTD Example s = ttatctctta$

suffixes are read top-to-bottom

example from: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture4Materials/wotd.pdf * WOTD Example s = ttatctctta

example from: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture4Materials/wotd.pdf * WOTD Example s = ttatctctta

example from: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture4Materials/wotd.pdf * WOTD Example s = ttatctctta

example from: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture4Materials/wotd.pdf * WOTD Example s = ttatctctta

example from: http://www.mi.fu-berlin.de/wiki/pub/ABI/SS13Lecture4Materials/wotd.pdf WOTD Properties

• Worst case time still ∈ O(|T|2)

• Expected case time ∈ O(|T| log |T|)

• Write-only property & recursive construction lends itself well to parallelism

• Good caching properties (locality of reference for substrings belonging to a subtree)

• Top-down construction order allows lazy construction as discussed in:

Giegerich, Robert, Stefan Kurtz, and Jens Stoye. "Efficient implementation of lazy suffix trees." Software: Practice and Experience 33.11 (2003): 1035-1049. Sufx tree: building

OtherMethod methods of choice: Ukkonen’sfor construction: algorithm Ukkonen, Esko. "On-line construction of sufx trees." Algorithmica 14.3 (1995): 249-260.

O(m) time and space

Has online property: if T arrives one character at a time, algorithm efciently updates sufx tree upon each arrival

We won’t cover it here; see Gusfeld Ch. 6 for details Or just Google “Ukkonen’s algorithm” Sufx tree: actual growth

● ● 2m ●●● ●●● 1000 ●●● Built sufx trees for the frst ● ●●● actual ●●● ●●● ● ●●● m ●●● ●●● 500 prefxes of the lambda ●●● ●●● ●●● ●●● ●●● phage virus genome ●●● ●●● ●●● ●● ●●● ●●●● 800 ●●● ●●● ●●● ●●●●● ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● Black curve shows # nodes ●●● ●●●●● ●●● ●●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●● increasing with prefx length ●●● ●●●● ●●● ●●●● ●●● ●●● ●●● ●●●● 600 ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●●● ●●● ●●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● Compare with sufx trie: ●●● ●●●● ●●●●●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●●● ●●●●●●

400 ● ● ●●● ●●● ●●●●●● # nodes ● ● ●● ●●● ●●●● ●●●●●● ●●● ●●● ●●●●●● ●●● ●●● ●●●●●● ● ●● ●● ●●● ●●● m^2 ●● ●● ●●● ●●● ●● ●● ●● ●●● ● ●● ●● ●●● ●●● 250000 ● ● ● actual ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ● ●● ●● ●●● ●●● m ●● ●● ●● ●●● ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ● ● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● 200000 ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●●● ●●●● ●● ●● ● ●●● ●● ●● ●● ●●●

200 ● ● ● ●● ●● ●● ●●● ●● ●● ●● ●●●● ●● ●● ●● ●●● ●● ●● ● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● 150000 ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●●● ●● ● ●● ●●● ●● ●● ●● ●●● ●● ●● ●● ●● ●●● ●● ●●● ●● ●● ●●● ●● ●●● ● ●● ●●● ●● ●●● ●● ●●● ●●● ●● ●●● ●● ●● ●●● ●● ●●● ●● ●● ●●● ●●● ●●● ● ●● ●●● ●● ●●● ●●●●●●●● ●●● ●●● ●●●●●●● ●● ●●●● ●●●●●●●● ●●● ●●● ●●●●● # suffix trie nodes ●● ●●● ●●●●●●● ●● ●●●● ●●●●●●●● 100000 ●●● ●●● ●●●●●● ●●● ●●●● ●●●●● ●●● ●●●● ●●●● ●●● ●●●● 0 ● ●●● ●●●● ●●● ●●●● 123 K ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●●● ●●● ●●●●● ●●● ●●●●● ●●●● ●●●●● ●●●● ●●●●● 50000 ●● ●●●● ●●●● ●●●●● ●●●● ●●●●●● nodes ●●● ●●●● ●●●● ●●●●●● ●●●●● ●●●●●● ●●●●● ●●●●●●● 0 100 200 300 400 500 ●●●●● ●●●●●●● ●●●●● ●●●●●●●● ●●●●●● ●●●●●●●● ●●●●●●● ●●●●●●●●● ●●●●●●● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● Length prefix over which suffix tree was built 0 100 200 300 400 500

Length prefix over which suffix trie was built Sufx tree

How do we check whether a string S is a substring of T? a ba $

Essentially same procedure as for 6 sufx trie, except we have to deal with S = baa coalesced edges aba$ ba $ aba$ $ Yes, it’s a 2 5 1 substring4

aba$ $

0 3 Sufx tree

How do we check whether a string S is a substring of T? a ba $

Essentially same procedure as for 6 sufx trie, except we have to deal with S = baa coalesced edges aba$ ba $ aba$ $ Yes, it’s a 2 5 1 substring4

aba$ $

0 3 Sufx tree

How do we check whether a string S is a sufx of T? a ba $

Essentially same procedure as for 6 sufx trie, except we have to deal with coalesced edges aba$ ba $ aba$ $ S = aba 2 Yes,5 it’s a su1 fx 4

aba$ $

0 3 Sufx tree

How do we count the number of times a string S occurs as a substring of T? a ba $

6 Same procedure as for sufx trie

aba$ ba $ aba$ $

2 S = 5aba 1 4 Occurs twice aba$ $

0 3 Sufx tree: applications

With sufx tree of T, we can fnd all matches of P to T. Let k = # matches.

E.g., P = ab, T = abaaba$

Step 1: walk down ab path a $ O(n) ba If we “fall of” there are no matches 6 $ baba $ 5 aba$ Step 2: visit all leaf nodes below 4 O(k) $ aba$ Report each leaf ofset as match ofset 1 aba$ 3 2 O(n + k) time 0 abaaba ab ab Sufx tree application: fnd long common substrings

Dots are maximal unique matches (MUMs), a kind of long substring shared by two sequences

Red = match was between like strands, green = diferent strands

Axes show diferent strains of Helicobacter pylori, a bacterium found in the stomach and associated with gastric ulcers Sufx tree application: fnd longest common substring

To fnd the longest common substring (LCS) of X and Y, make a new string X # Y $ where # and $ are both terminal symbols. Build a sufx tree for X # Y $ . X Y

X = xabxa a x #babxba$ b $ Y = babxba X#Y$ = xabxa#babxba$ X Y X Y 5 X Y 12 abx #babxba$ bx $ a ba$ a x

Consider leaves: 4 X Y 11 X 9 Y X Y

ofsets in [0, 4] are a#babxba$ ba$ #babxba$ bxa#babxba$ bxba$ $ a#babxba$ ba$ sufxes of X, ofsets in [6, 11] are sufxes of Y 1 7 3 0 6 10 2 8

Traverse the tree and annotate each node according to whether leaves below it include sufxes of X, Y or both The deepest node annotated with both X and Y has LCS as its label. O(| X | + | Y |) time and space. Sufx tree application: generalized sufx trees

This is one example of many applications where it is useful to build a sufx tree over many strings at once

Such a tree is called a generalized sufx tree. These are introduced in Gusfeld 6.4.

X Y

a x #babxba$ b $

X Y X Y 5 X Y 12 abx #babxba$ bx $ a ba$ a x

4 X Y 11 X 9 Y X Y

a#babxba$ ba$ #babxba$ bxa#babxba$ bxba$ $ a#babxba$ ba$

1 7 3 0 6 10 2 8 Longest Common Extension

Longest common extension: We are given strings S and T. In the future, many pairs (i,j) will be provided as queries, and we want to quickly find:

the longest substring of S starting at i that matches a substring of T starting at j. LCE(i,j) LCE(i,j) S T

i j

Build generalized suffix tree for S and T. O(|S| + |T|) Preprocess tree so that lowest common ancestors (LCA) can be found in constant time. This can be done using range-minimum queries O(|S| + |T|) (RMQ) LCA(i,j)

Create an array mapping suffix numbers to leaf O(|S| + |T|) nodes. Given query (i,j): j i Find the leaf nodes for i and j O(1) Return string of LCA for i and j O(LCE(i,j)) i j Sufx trees in the real world: MUMmer

FASTA fle containing “reference” (“text”) FASTA fle containing ALU string

Indexing phase: ~2 minutes

Matching phase: Columns: very fast 1. Match ofset in T 2. Match ofset in P 3. Length of exact match ... Sufx trees in the real world: MUMmer

MUMmer v3.32 time and memory scaling when indexing increasingly larger fractions of human chromosome 1

● 140 ●

● ● 3500 120

● ● 3000 ● 100

2500 ● 80

● ●

2000 ● 60 Time (seconds) ● ● 1500

● 40 ● Peak memory usage (megabytes) Peak 1000 ● ● 20

500 ● ●

0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0

Fraction of human chromosome 1 indexed Fraction of human chromosome 1 indexed

For whole chromosome 1, took 2m:14s and used 3.94 GB memory Sufx trees in the real world: MUMmer

Attempt to build index for whole reference:

mummer:!suffix!tree!construction!failed:!textlen=3101804822! larger!than!maximal!textlen=536870908

We can predict it would have taken about 47 GB of memory Sufx trees in the real world: the constant factor

While O(m) is desirable, the constant in front of the m limits wider use of sufx trees in practice

Constant factor varies depending on implementation:

Estimate of MUMmer’s constant factor = 3.94 GB / 250 million nt ≈ 15.75 bytes per node

Literature reports implementations achieving as little as 8.5 bytes per node, but no implementation used in practice that I know of is better than ≈ 12.5 bytes per node

Kurtz, Stefan. "Reducing the space requirement of sufx trees." Software Practice and Experience 29.13 (1999): 1149-1171. Sufx tree: summary

GTTATAGCTGATCGCGGCGTAGCGG$ m chars Organizes all sufxes into an GTTATAGCTGATCGCGGCGTAGCGG$ TTATAGCTGATCGCGGCGTAGCGG$ incredibly useful, fexible data TATAGCTGATCGCGGCGTAGCGG$ structure, in O(m) time and space ATAGCTGATCGCGGCGTAGCGG$ TAGCTGATCGCGGCGTAGCGG$ AGCTGATCGCGGCGTAGCGG$ A naive method (e.g. sufx trie) GCTGATCGCGGCGTAGCGG$ could easily be quadratic or worse CTGATCGCGGCGTAGCGG$ TGATCGCGGCGTAGCGG$ Used in practice for whole genome alignment, GATCGCGGCGTAGCGG$ m(m+1)/2 ATCGCGGCGTAGCGG$ repeat identifcation, etc TCGCGGCGTAGCGG$ chars CGCGGCGTAGCGG$ (3, 1) (7, 1) (1, 1) (0, 1) (25, 1) GCGGCGTAGCGG$ 25 CGGCGTAGCGG$ (4, 1) (6, 2) (8, 18) (13, 1) (3, 1) (12, 14) (2, 24) (9, 17) (10, 16) (7, 1) (1, 1) (16, 1) (25, 1) GGCGTAGCGG$ 7 11 1 8 9 24 GCGTAGCGG$ (5, 21) (12, 14) (8, 18) (23, 3) (14, 12) (19, 7) (16, 1) (4, 22) (6, 2) (8, 18) (15, 1) (20, 6) (2, 24) (17, 9) (25, 1)

3 10 5 20 12 17 2 6 18 0 15 23 CGTAGCGG$

(17, 9) (25, 1) (8, 18) (23, 3) (19, 7) (16, 1) GTAGCGG$

14 22 4 19 16 TAGCGG$

(17, 9) (25, 1) AGCGG$

13 21 GCGG$ CGG$ Actual memory footprint (bytes per node) is GG$ quite high, limiting usefulness G$ $