Suffix Trees

CSE 549: Suffix Tries & Suffix Trees All slides in this lecture not marked with “*” of Ben Langmead. KMP is great, but |T| = m |P| = n (note: m,n are opposite from previous lecture) Without pre- Given pre- Without pre- Given pre- processing processing processing processing (KMP) (KMP) (ST) (ST) Find an O(m+n) O(m) O(m+n) O(n) occurrence of P Find all occurrences of O(m+n) O(m) O(m + n + k) O(n+k) P Find an occurrence of O(l(m+n)) O(l(m)) O( m + ln) O(ln) P1,…,Pl If the text is constant over many patterns, pre-processing the text rather than the pattern is better (and allows other efficient queries). * Tries A trie (pronounced “try”) is a rootedtree representing tree representing a collection a collection of strings of with strings with one node per common prefx Smallest tree such that: Each edge is labeled with a character c ∈ Σ A node has at most one outgoing edge labeled c, for c ∈ Σ Each key is “spelled out” along some path starting at the root Natural way to represent either a set or a map where keys are strings This structure is also known as a Σ-tree Tries: example Represent this map with a trie: i Key Value instant 1 n s t internal 2 internet 3 t e a r The smallest tree such that: n n Each edge is labeled with a character c ∈ Σ t a e A node has at most one outgoing edge 1 labeled c, for c ∈ Σ l t Each key is “spelled out” along some path 2 3 starting at the root Tries: example i Checking for presence of a key P, where n = | P |, is ???O(n ) time n t s If total length of all keys is N, trie has O ???(N ) nodes t e a r What about | Σ | ? n n Depends how we represent outgoing t a e edges. If we don’t assume | Σ | is a 1 small constant, it shows up in one or l t both bounds. 2 3 Tries: another example We can index T with a trie. The trie maps substrings to ofsets where they occur c 4 g 8 t ac 4 a ag 8 14 at 14 c c cc 12 t 12, 2 cc 2 root: g ct 6 6 gt 18 t t gt 0 18, 0 ta 10 a tt 16 10 t 16 Indexing with sufxes SomeUntil indices now, (e.g. our indexes the inverted have beenindex) based are based on extracting on extracting substrings substrings from T from T A very diferent approach is to extract sufxes from T. This will lead us to some interesting and practical index data structures: 6 $ $ B A N A N A 5 A$ A $ B A N A N 3 ANA$ A N A $ B A N 1 ANANA$ A N A N A $ B 0 BANANA$ B A N A N A $ 4 NA$ N A $ B A N A 2 NANA$ N A N A $ B A Sufx Trie Sufx Tree Sufx Array FM Index Trie Definitions A Σ-tree (trie) is a rooted tree where each edge is labeled with a single character c ϵ Σ, such that no node has two outgoing edges labeled with the same character. • for a node v in T, depth(v) or node-depth(v) is the distance from v to the root. • node-depth(r) = 0 • string(v) = concatenation of all characters on the path r ⤳ v • string-depth(v) = |string(v)| (note: string-depth(v) ≥ node-depth(v)) • for a string x, if ∃ node v with string(v) = x, we say node(x) = v • T displays string x if ∃ node v and string y such that xy = string(v) • words(T) = { x | T displays x} • A suffix trie of string s is a Σ-tree such that words(T) = {s’ | s’ is a substring of s} • An internal/leaf edge leads to an internal/leaf node Defs. from: http://profs.sci.univr.it/~liptak/ALBioinfo/files/sequence_analysis.pdf * Sufx trie Build a trie containing all sufxes of a text T T: GTTATAGCTGATCGCGGCGTAGCGG$ GTTATAGCTGATCGCGGCGTAGCGG$ TTATAGCTGATCGCGGCGTAGCGG$ TATAGCTGATCGCGGCGTAGCGG$ ATAGCTGATCGCGGCGTAGCGG$ TAGCTGATCGCGGCGTAGCGG$ AGCTGATCGCGGCGTAGCGG$ GCTGATCGCGGCGTAGCGG$ CTGATCGCGGCGTAGCGG$ TGATCGCGGCGTAGCGG$ GATCGCGGCGTAGCGG$ ATCGCGGCGTAGCGG$m(m+1)/2 TCGCGGCGTAGCGG$ CGCGGCGTAGCGG$chars GCGGCGTAGCGG$ CGGCGTAGCGG$ GGCGTAGCGG$ GCGTAGCGG$ CGTAGCGG$ GTAGCGG$ TAGCGG$ AGCGG$ GCGG$ CGG$ GG$ G$ $ Sufx trie First add special terminal character $ to the end of T $ is a character that does not appear elsewhere in T, and we defne it to be less than other characters (for DNA: $ < A < C < G < T) $ enforces a rule we’re all used to using: e.g. “as” comes before “ash” in the dictionary. $ also guarantees no sufx is a prefx of any other sufx. T: GTTATAGCTGATCGCGGCGTAGCGG$ GTTATAGCTGATCGCGGCGTAGCGG$ TTATAGCTGATCGCGGCGTAGCGG$ TATAGCTGATCGCGGCGTAGCGG$ ATAGCTGATCGCGGCGTAGCGG$ TAGCTGATCGCGGCGTAGCGG$ AGCTGATCGCGGCGTAGCGG$ GCTGATCGCGGCGTAGCGG$ CTGATCGCGGCGTAGCGG$ TGATCGCGGCGTAGCGG$ GATCGCGGCGTAGCGG$ ATCGCGGCGTAGCGG$ TCGCGGCGTAGCGG$ CGCGGCGTAGCGG$ GCGGCGTAGCGG$ CGGCGTAGCGG$ GGCGTAGCGG$ GCGTAGCGG$ CGTAGCGG$ GTAGCGG$ TAGCGG$ AGCGG$ GCGG$ CGG$ GG$ G$ $ Sufx trie a b $ Shortest T: abaaba T$: abaaba$ (non-empty) sufx a b $ a Each path from root to leaf represents a sufx; each sufx is represented by some b a a $ path from root to leaf a a $ b Would this still be the case if we hadn’t added $? $ b a a $ $ Longest sufx Sufx trie a b T: abaaba Each path from root to leaf represents a a b a sufx; each sufx is represented by some path from root to leaf b a a Would this still be the case if we hadn’t added $? No a a b b a a Sufx trie a b $ We can think of nodes as having labels, where the label spells out characters on the a b $ a path from the root to the node b a a $ a a $ b baa $ b a a $ $ Sufx trie a b $ How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T. a a $ b S = baa Yes, it’s a substring Start at the root and follow the edges labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie a b $ How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T. a a $ b S = baa Yes, it’s a substring Start at the root and follow the edges labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie a b $ How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T. a a $ b Start at the root and follow the edges labeled with the characters of S $ b a If we “fall of” the trie -- i.e. there is no outgoing edge for next character of S, then a $ S is not a substring of T S = abaaba Yes, it’s a substring If we exhaust S without falling of, S is a substring of T $ Sufx trie a b $ How do we check whether a string S is a substring of T? a b $ a Note: Each of T’s substrings is spelled out along a path from the root. I.e., every b a a $ substring is a prefx of some sufx of T. a a $ b Start at the root and follow the edges labeled with the characters of S x $ b a S = baabb If we “fall of” the trie -- i.e. there is no No, not a substring outgoing edge for next character of S, then a $ S is not a substring of T If we exhaust S without falling of, S is a substring of T $ Sufx trie a b $ How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but additionally check whether the fnal node in b a a $ the walk has an outgoing edge labeled $ a a $ b S = baa Not a sufx $ b a a $ $ Sufx trie a b $ How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but additionally check whether the fnal node in b a a $ the walk has an outgoing edge labeled $ a a $ b S = baa Not a sufx $ b a a $ $ Sufx trie a b $ How do we check whether a string S is a sufx of T? a b $ a Same procedure as for substring, but b a a $ additionally check whether the fnal node in S = aba the walk has an outgoing edge labeled $ Is a sufx a a $ b $ b a a $ $ Sufx trie a b $ How do we count the number of times a string S occurs as a substring of T? a b $ a Follow path corresponding to S.

Suffix Trees

Compressed Suffix Trees with Full Functionality

Suffix Trees

Search Trees

Lecture 1: Introduction

Suffix Trees for Fast Sensor Data Forwarding

Tries and String Matching

Suffix Trees and Suffix Arrays in Primary and Secondary Storage Pang Ko Iowa State University

Finger Search Trees

Lecture Notes of CSCI5610 Advanced Data Structures

String Algorithm

Pointer-Machine Algorithms for Fully-Online Construction of Suffix

Algorithm for Character Recognition Based on the Trie Structure