
12/29/17 Key take home messages • Suffix trie, Suffix tree Text Algorithms (6EAP) – Application examples Full text indexing • Suffix array Jaak Vilo • Burrows Wheeler Transform (BWT) 2017 fall – Compact Suffix tree representation: BWT + succinct Jaak Vilo MTAT.03.190 Text Algorithms 1 Problem E.g. Dictionary problem • Given P and S – find all exact or approximate • Does P belong to a dictionary D={d1,…,dn} occurrences of P in S – Build a binary search tree of D – B-Tree of D • You are allowed to preprocess S (and P, of – Hashing course) – Sorting + Binary search • Build a keyword trie: search in O(|P|) • Goal: to speed up the searches – Assuming alphabet has up to a constant size c – See Aho-Corasick algorithm, Trie construction Sorted array and binary search Sorted array and binary search 1 13 1 13 global info global info head search head search he stop he stop header informal header informal hers hers index index happy his show happy his show O( |P| log n ) 1 12/29/17 Trie for D={ he, hers, his, she} S != set of words 0 • S of length n h s 1 3 • How to index? i e h 2 6 4 • Index from every position of a text r s e 8 5 s 7 • Prefix of every possible suffix is important 9 O( |P| ) Suffix tree • Definition: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with Trie(babaab) their parents. • Definition ( suffix tree ). A suffix tree T for a string S (with n = |S|) is a a b babaab rooted, labeled tree with a leaf for each non-empty suffix of S. b a Furthermore, a suffix tree satisfies the following properties: abaab a a b • Each internal node, other than the root, has at least two children; baab b • Each edge leaving a particular node is labeled with a non-empty substring aab a a a of S of which the first symbol is unique among all first symbols of the edge ab b b a labels of the edges leaving this particular node; b • For any leaf in the tree, the concatenation of the edge labels on the path b from the root to this leaf exactly spells out a non-empty suffix of s. • Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover - 534 pages 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. Literature on suffix trees • http://en.wikipedia.org/wiki/Suffix_tree • Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover - 534 pages 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. (pages: http://stackoverflow.com/questions/9452701/u 89--208) kkonens-suffix-tree-algorithm-in-plain-english • E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-60, 1995. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751 • Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. "Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January, 2005. http://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3 • CPM articles archive: http://www.cs.ucr.edu/~stelo/cpm/ • Mark Nelson. Fast String Searching With Suffix Trees Dr. Dobb's Journal, August, 1996. http://www.dogma.net/markn/articles/suffixt/suffixt.htm 2 12/29/17 Partly based on : Suffix tree and suffix array techniques for pattern ttttttttttttttgagacggagtctcgctctg analysis in strings tcgcccaggctggagtgcagtggcggg Esko Ukkonen atctcggctcactgcaagctccgcctcc Univ Helsinki cgggttcacgccattctcctgcctcagcc Erice School 30 Oct 2005 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt tcccaagtagctgggactacaggcgcc cgccactacgcccggctaattttttgtattt High-throughput genome-scale sequence analysis and ttagtagagacggggtttcaccgttttagc mapping using compressed data structures Veli Mäkinen cgggatggtctcgatctcctgacctcgtg Department of Computer Science atccgcccgcctcggcctcccaaagtgc University of Helsinki tgggattacaggcgt 13 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt14 Solution: backtracking with suffix Analysis of a string of symbols tree • T = h a t t i v a t t i ’text’ • P = a t t ’pattern’ • Find the occurrences of P in T: h a t t i v a t t i • Pattern synthesis: #(t) = 4 #(atti) = 2 #(t****t) = 2 ...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG..... E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt15 ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 16 Pattern finding & synthesis problems • T = t1t2 … tn, P = p1 p 2 … pn , strings of symbols in finite alphabet 1. Suffix tree • Indexing problem: Preprocess T (build an index structure) such that the occurrences of different 2. Suffix array patterns P can be found fast – static text, any given pattern P 3. Some applications • Pattern synthesis problem: Learn from T new patterns that occur surprisingly often 4. Finding motifs • What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, … 17 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt18 3 12/29/17 The suffix tree Tree(T) of T Suffix trie and suffix tree • abaab data structure suffix tree, Tree(T), is baab Trie(abaab) compacted trie that represents all the suffixes aab of string T ab a b b b • linear size: |Tree(T)| = O(|T|) a a a b a • can be constructed in linear time O(|T|) a b • has myriad virtues (A. Apostolico) b • is well-known: 366 000 Google hits E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt19 E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt20 Suffix trie and suffix tree Trie(T) can be large abaab • 2 baab |Trie(T)| = O(|T| ) aab • bad example: T = anbn ab b • Trie(T) can be seen as a DFA: language Trie(abaab) Tree(abaab) accepted = the suffixes of T a b a • minimize the DFA => directed cyclic word baab b a a ab graph (’DAWG’) a baab b a a b b 21 22 Tree(T) is of linear size Tree(hattivatti) hattivatti vatti • attivatti only the internal branching nodes and the t i vatti leaves represented explicitly ttivatti ti i tivatti atti i • edges labeled by substrings of T vatti ivatti hattivatti • v = node(α) if the path from root to v spells α ivatti vatti ti • vatti vatti one-to-one correspondence of leaves and atti vatti tti suffixes tti atti tivatti ti hattivatti ttivatti • |T| leaves, hence < |T| internal nodes attivatti • |Tree(T)| = O(|T| + size(edge labels)) i 23 24 4 12/29/17 Tree(hattivatti) Tree(hattivatti) hattivatti hattivatti vatti vatti attivatti i attivatti i t vatti 3,3 6 ttivatti ttivatti ti i 4,5 10 tivatti atti i tivatti 2,5 i hattivatti vatti vatti ivatti ivatti hattivatti ivatti vatti ti vatti 9 5 vatti vatti atti vatti 6,10 vatti tti atti 6,10 8 tti tti atti 7 tivatti 4 ti hattivatti ttivatti ti 1 3 attivatti 2 i substring labels of i edges represented as hattivatti pairs of pointers hattivatti 25 26 Tree(T) is full text index Find att from Tree(hattivatti) hattivatti Tree(T) vatti attivatti t vatti ttivatti P ti i tivatti atti i P occurs in T at vatti ivatti hattivatti locations 8, 31, … ivatti vatti ti vatti vatti atti vatti tti 31 8 tti atti 7 tivatti ti hattivatti ttivatti P occurs in T ó P is a prefix of some suffix of T attivatti ó Path for P exists in Tree(T) i 2 All occurrences of P in time O(|P| + #occ) 27 28 Linear time construction of Tree(T) On-line construction of Trie(T) hattivatti • T = t1t2 … tn$ attivatti • P = t t … t i:th prefix of T ttivatti i 1 2 i tivatti • on-line idea: update Trie(Pi) to Trie(Pi+1) ivatti Weiner McCreight • => very simple construction vatti (1973), (1976) atti ’algorithm tti of the ti year’ ’on-line’ algorithm i (Ukkonen 1992) 29 30 5 12/29/17 Trie(abaab) Trie(abaab) Trie(a) Trie(ab) Trie(aba) Trie(abaa) abaa a a b a b baa a a b a b a b aa b b b b b a a εa a a a ε a a a a chain of links connects the end points of current suffixes 31 32 Trie(abaab) Trie(abaab) Trie(abaa) Trie(abaa) a a b a b a b a a b a b a b b b b b b b a a a a a a a a a a a a a a Add next symbol = b Add next symbol = b From here on b-arc already exists 33 34 Trie(abaab) What happens in Trie(Pi) => Trie(Pi+1) ? Before a a b a b a b a From here on the ai-arc i exists already => stop b b b a a a updating here a a a a Trie(abaab) After ai a b ai ai a b a i a ai a b a New suffix links a b New nodes b 35 36 6 12/29/17 What happens in Trie(Pi) => Trie(Pi+1) ? On-line procedure for suffix trie • time: O(size of Trie(T)) 1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root • suffix links: 2. for i := 2 to n do begin slink(node(aα)) = node(α) 3. vi-1 := leaf of Trie(t1…ti-1) for string t1…ti-1 (i.e., the deepest leaf) 4. v := vi-1; v´ := 0 5.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages21 Page
-
File Size-