Lecture 135 Full Text New

12/29/17

Text Algorithms (6EAP)
Full text indexing
Jaak Vilo
2017 fall
MTAT.03.190 Text Algorithms

Key take home messages
• Suffix trie, Suffix tree
  – Application examples
• Suffix array
• Burrows Wheeler Transform (BWT)
  – Compact Suffix tree representation: BWT + succinct

Problem
• Given P and S, find all exact or approximate occurrences of P in S
• You are allowed to preprocess S (and P, of course)
• Goal: to speed up the searches

E.g. Dictionary problem
• Does P belong to a dictionary D = {d1,…,dn}?
  – Build a binary search tree of D
  – B-Tree of D
  – Hashing
  – Sorting + Binary search
• Build a keyword trie: search in O(|P|)
  – Assuming alphabet has up to a constant size c
  – See Aho-Corasick algorithm, Trie construction

Sorted array and binary search
[Figure: sorted array of 13 words, indexed 1 to 13: global, happy, he, head, header, hers, his, index, info, informal, search, show, stop]
• O( |P| log n )

Trie for D = {he, hers, his, she}
[Figure: keyword trie with nodes numbered 0 to 9]
• O( |P| )

S != set of words
• S of length n
• How to index?
• Index from every position of a text
• Prefix of every possible suffix is important

Suffix tree
• Definition: A compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents.
• Definition (suffix tree). A suffix tree T for a string S (with n = |S|) is a rooted, labeled tree with a leaf for each non-empty suffix of S. Furthermore, a suffix tree satisfies the following properties:
  – Each internal node, other than the root, has at least two children;
  – Each edge leaving a particular node is labeled with a non-empty substring of S of which the first symbol is unique among all first symbols of the edge labels of the edges leaving this particular node;
  – For any leaf in the tree, the concatenation of the edge labels on the path from the root to this leaf exactly spells out a non-empty suffix of S.
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198.
[Figure: Trie(babaab), built from the suffixes babaab, abaab, baab, aab, ab, b]

Literature on suffix trees
• http://en.wikipedia.org/wiki/Suffix_tree
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. (pages: 89--208)
• http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
• E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-60, 1995. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751
• Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. "Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January, 2005. http://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3
• CPM articles archive: http://www.cs.ucr.edu/~stelo/cpm/
• Mark Nelson. Fast String Searching With Suffix Trees. Dr. Dobb's Journal, August, 1996. http://www.dogma.net/markn/articles/suffixt/suffixt.htm

Partly based on:
• Suffix tree and suffix array techniques for pattern analysis in strings. Esko Ukkonen, Univ Helsinki. Erice School, 30 Oct 2005. E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
• High-throughput genome-scale sequence analysis and mapping using compressed data structures. Veli Mäkinen, Department of Computer Science, University of Helsinki. ISMB 2009 Tutorial.

[Slide: example DNA sequence]
ttttttttttttttgagacggagtctcgctctg
tcgcccaggctggagtgcagtggcggg
atctcggctcactgcaagctccgcctcc
cgggttcacgccattctcctgcctcagcc
tcccaagtagctgggactacaggcgcc
cgccactacgcccggctaattttttgtattt
ttagtagagacggggtttcaccgttttagc
cgggatggtctcgatctcctgacctcgtg
atccgcccgcctcggcctcccaaagtgc
tgggattacaggcgt

Solution: backtracking with suffix tree
[Figure: backtracking over a suffix tree of the sequence ...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG..... (Veli Mäkinen, ISMB 2009 Tutorial)]

Analysis of a string of symbols
• T = h a t t i v a t t i   ’text’
• P = a t t   ’pattern’
• Find the occurrences of P in T: h a t t i v a t t i
• Pattern synthesis: #(t) = 4, #(atti) = 2, #(t****t) = 2

Pattern finding & synthesis problems
• T = t1t2 … tn, P = p1p2 … pn, strings of symbols in finite alphabet
• Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast
  – static text, any given pattern P
• Pattern synthesis problem: Learn from T new patterns that occur surprisingly often
• What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, …

Outline
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs

The suffix tree Tree(T) of T
• data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
• linear size: |Tree(T)| = O(|T|)
• can be constructed in linear time O(|T|)
• has myriad virtues (A. Apostolico)
• is well-known: 366 000 Google hits

Suffix trie and suffix tree
[Figure: Trie(abaab) and Tree(abaab), built from the suffixes abaab, baab, aab, ab, b]

Trie(T) can be large
• |Trie(T)| = O(|T|^2)
• bad example: T = a^n b^n
• Trie(T) can be seen as a DFA: language accepted = the suffixes of T
• minimize the DFA => directed acyclic word graph (’DAWG’)

Tree(T) is of linear size
• only the internal branching nodes and the leaves represented explicitly
• edges labeled by substrings of T
• v = node(α) if the path from root to v spells α
• one-to-one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes
• |Tree(T)| = O(|T| + size(edge labels))

Tree(hattivatti)
[Figure: suffix tree of hattivatti, built from the suffixes hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i; leaves are numbered 1 to 10 by suffix start position]
• substring labels of edges represented as pairs of pointers (e.g. 3,3; 4,5; 2,5; 6,10)

Tree(T) is full text index
[Figure: the path for P in Tree(T) leads to a subtree whose leaves are 8, 31, …; P occurs in T at locations 8, 31, …]
• P occurs in T ⇔ P is a prefix of some suffix of T ⇔ Path for P exists in Tree(T)
• All occurrences of P in time O(|P| + #occ)

Find att from Tree(hattivatti)
[Figure: the path for att in Tree(hattivatti) leads to the leaves 7 and 2]

Linear time construction of Tree(T)
• Weiner (1973): ’algorithm of the year’
• McCreight (1976)
• ’on-line’ algorithm (Ukkonen 1992)

On-line construction of Trie(T)
• T = t1t2 … tn$
• Pi = t1t2 … ti, the i:th prefix of T
• on-line idea: update Trie(Pi) to Trie(Pi+1)
• => very simple construction

Trie(abaab)
[Figures: Trie(a), Trie(ab), Trie(aba), Trie(abaa); a chain of links connects the end points of current suffixes]
[Figure: Trie(abaa) to Trie(abaab): add next symbol = b; from here on the b-arc already exists]

What happens in Trie(Pi) => Trie(Pi+1)?
[Figure: before and after adding the symbol ai. New nodes and new suffix links are created along the chain of links; from the point where the ai-arc exists already, updating stops]
• time: O(size of Trie(T))
• suffix links: slink(node(aα)) = node(α)

On-line procedure for suffix trie
1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root
2. for i := 2 to n do begin
3. vi-1 := leaf of Trie(t1…ti-1) for string t1…ti-1 (i.e., the deepest leaf)
4. v := vi-1; v´ := 0
5.
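The on-line idea above (update Trie(Pi) to Trie(Pi+1) by walking the chain of suffix links from the deepest leaf) can be illustrated with a minimal Python sketch. It is not a transcription of the numbered procedure, which is cut off in these notes. The names `Node`, `build_suffix_trie` and `occurrences` are hypothetical, and as an assumption of this sketch the root's suffix link is `None` rather than the root itself. The lookup relies on the fact that P occurs in T exactly when P is a prefix of some suffix of T, and it assumes the text ends in a unique terminator such as `$`.

```python
class Node:
    """One node of the suffix trie (hypothetical helper class)."""
    def __init__(self):
        self.children = {}  # symbol -> Node, the son(v, a) arcs
        self.slink = None   # suffix link: node(a alpha) -> node(alpha)


def build_suffix_trie(text):
    """Build Trie(text) on-line, extending the trie one symbol at a time."""
    root = Node()           # assumption: root.slink stays None in this sketch
    deepest = root          # node for the whole prefix read so far
    for symbol in text:
        v = deepest
        prev_new = None     # node created in the previous step of this walk
        # Follow suffix links until a node already has an arc for symbol.
        while v is not None and symbol not in v.children:
            new = Node()
            v.children[symbol] = new
            if prev_new is not None:
                prev_new.slink = new
            prev_new = new
            v = v.slink
        # Close the chain of new suffix links.
        prev_new.slink = root if v is None else v.children[symbol]
        deepest = deepest.children[symbol]
    return root


def occurrences(root, pattern, n):
    """1-based start positions of pattern in a text of length n.

    Assumes the trie was built from a text ending in a unique terminator,
    so every suffix ends at a leaf.
    """
    v = root
    for symbol in pattern:
        v = v.children.get(symbol)
        if v is None:
            return []       # no path for the pattern: no occurrence
    result = []
    stack = [(v, len(pattern))]
    while stack:            # collect all leaves below node(pattern)
        node, depth = stack.pop()
        if not node.children:
            result.append(n - depth + 1)  # leaf depth = suffix length
        for child in node.children.values():
            stack.append((child, depth + 1))
    return sorted(result)


text = "hattivatti$"
trie = build_suffix_trie(text)
print(occurrences(trie, "att", len(text)))  # [2, 7]
```

The trie built this way has one node per distinct substring, so it can take quadratic space; the sketch is only meant for short strings such as the examples in the slides.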
