Sliding Suffix Tree
Recommended publications
-
Suffix Structure Lecture Moscow International Workshop ACM ICPC
Suffix structure lecture. Moscow International Workshop ACM ICPC 2015, 19 November 2015. 1 Introduction. Consider the problem of pattern matching. One possible statement is as follows: you are given a text T and n patterns S_i, and for each pattern you must say whether it occurs in the text. There are two ways of solving this problem. The first one (considered easier) is the Aho-Corasick algorithm, which constructs an automaton that recognizes strings containing some of the patterns as a substring. The second one (considered harder) is the use of suffix structures, mainly the suffix array, suffix automaton, or suffix tree. This lecture is dedicated to the last two, their relations, and their applications. 2 Naive solution. Consider an arbitrary substring s[pos; pos + len − 1]. It is the prefix of length len of the suffix that starts at position pos. Given this, we can use the following naive solution: add all suffixes of the string to a trie. Every prefix of every string in the trie then corresponds to exactly one vertex, hence we can check whether S_i is a substring of T in O(|S_i|). The disadvantage is obvious: this solution needs O(|T|^2) time and memory. There are two ways of solving this problem, which lead to the suffix tree and the suffix automaton. 3 Suffix tree. We can see that every top-down path in the trie is a substring of T. Hence we can remove from the trie all nodes that are neither the root nor a vertex corresponding to some suffix and whose degree equals 2 (i.e., nodes that are not crossroads: they have exactly one ingoing and exactly one outgoing edge). -
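The naive solution in the snippet above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `build_suffix_trie` and `occurs` are hypothetical, and nested dictionaries stand in for trie vertices. Building takes O(|T|^2) time and memory, and each query walks at most |S_i| edges.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (quadratic size)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Create the child vertex on first use, then descend into it.
            node = node.setdefault(ch, {})
    return root


def occurs(trie, pattern):
    """Return True if pattern labels a top-down path, i.e. is a substring."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("abracadabra")
print(occurs(trie, "cad"), occurs(trie, "abc"))
```

The quadratic size is exactly what the suffix tree avoids by contracting chains of single-child vertices.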
Algorithms on Strings
Algorithms on Strings. Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Cambridge University Press.
Table of contents
Preface
1 Tools: 1.1 Strings and automata; 1.2 Some combinatorics; 1.3 Algorithms and complexity; 1.4 Implementation of automata; 1.5 Basic pattern matching techniques; 1.6 Borders and prefixes tables
2 Pattern matching automata: 2.1 Trie of a dictionary; 2.2 Searching for several strings; 2.3 Implementation with failure function; 2.4 Implementation with successor by default; 2.5 Searching for one string; 2.6 Searching for one string and failure function; 2.7 Searching for one string and successor by default
3 String searching with a sliding window: 3.1 Searching without memory; 3.2 Searching time; 3.3 Computing the good suffix table; 3.4 Automaton of the best factor; 3.5 Searching with one memory; 3.6 Searching with several memories; 3.7 Dictionary searching
4 Suffix arrays: 4.1 Searching a list of strings; 4.2 Searching with common prefixes; 4.3 Preprocessing the list; 4.4 Sorting suffixes; 4.5 Sorting suffixes on bounded integer alphabets; 4.6 Common prefixes of the suffixes
5 Structures for index: 5.1 Suffix trie; 5.2 Suffix tree; 5.3 Contexts of factors; 5.4 Suffix automaton; 5.5 Compact suffix automaton
6 Indexes: 6.1 Implementing an index; 6.2 Basic operations; 6.3 Transducer of positions; 6.4 Repetitions -
Finite Automata Implementations Considering CPU Cache J
Acta Polytechnica Vol. 47 No. 6/2007. Finite Automata Implementations Considering CPU Cache. J. Holub. Finite automata are mathematical models for finite state systems. A more general finite automaton is the nondeterministic finite automaton (NFA), which cannot be used directly. It is usually transformed to a deterministic finite automaton (DFA), which then runs in time O(n), where n is the size of the input text. We present two main approaches to practical implementation of DFA considering CPU cache. The first approach (represented by Table Driven and Hard Coded implementations) is suitable for automata that are run very frequently, typically having cycles. The other approach is suitable for a collection of automata from which various automata are retrieved and then run. This second kind of automata is expected to be cycle-free. Keywords: deterministic finite automaton, CPU cache, implementation. 1 Introduction. The original formal study of finite state systems (neural nets) is from 1943 by McCulloch and Pitts [14]. In 1956 Kleene [13] modeled the neural nets of McCulloch and Pitts by finite automata. At that time similar models were presented by Huffman [12], Moore [17], and Mealy [15]. In 1959, Rabin and Scott introduced nondeterministic finite automata [...] usage. Section 3 then describes general techniques for DFA implementation. It is mostly suitable for a DFA that is run most of the time. Since a DFA has a finite set of states, this kind of DFA has to have cycles. Recent results in the implementation using CPU cache are discussed in Section 4. On the other hand we have a collection of DFAs each representing some document (e.g., in the form of a complete index in the case of factor or suffix automata). -
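As a rough illustration of what a table-driven DFA looks like, here is a minimal Python sketch. It is not the implementation from the article: the function `run_dfa`, the example table, and the choice of a DFA accepting strings over {a, b} that contain "ab" are all assumptions made for this example, and Python says nothing about the cache behaviour the article studies. It only shows the O(n) loop with one table lookup per input symbol.

```python
def run_dfa(table, accepting, text, start=0):
    """Run a table-driven DFA: one transition-table lookup per input symbol."""
    state = start
    for ch in text:
        state = table[state][ch]
    return state in accepting


# Hypothetical example DFA over {a, b} accepting strings that contain "ab".
TABLE = [
    {"a": 1, "b": 0},  # state 0: no progress yet
    {"a": 1, "b": 2},  # state 1: the last symbol was 'a'
    {"a": 2, "b": 2},  # state 2: "ab" has been seen (absorbing)
]
ACCEPTING = {2}

print(run_dfa(TABLE, ACCEPTING, "bbab"))
```

The cycles in this table (state 0 on 'b', state 1 on 'a', state 2 on anything) are the kind the abstract associates with automata that run for most of the time.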
Starting the Matching from the End Enables Long Shifts. • the Horspool
BNDM

Starting the matching from the end enables long shifts.
• The Horspool algorithm bases the shift on a single character.
• The Boyer–Moore algorithm uses the matching suffix and the mismatching character.
• Factor based algorithms continue matching until no pattern factor matches. This may require more comparisons but it enables longer shifts.

Example 2.14 (figure): Horspool shift, Boyer–Moore shift, and factor shift for the pattern ainaisen-ainainen on the text varmasti-aikaisen-ainainen.

Factor based algorithms use an automaton that accepts suffixes of the reverse pattern P^R (or equivalently reverse prefixes of the pattern P).
• BDM (Backward DAWG Matching) uses a deterministic automaton that accepts exactly the suffixes of P^R. DAWG (Directed Acyclic Word Graph) is also known as suffix automaton.
• BNDM (Backward Nondeterministic DAWG Matching) simulates a nondeterministic automaton.
Example 2.15 (figure): automaton for P = assi.
• BOM (Backward Oracle Matching) uses a much simpler deterministic automaton that accepts all suffixes of P^R but may also accept some other strings. This can cause shorter shifts but not incorrect behaviour.

Suppose we are currently comparing P against T[j..j + m). We use the automaton to scan the text backwards from T[j + m − 1]. When the automaton has scanned T[j + i..j + m):
• If the automaton is in an accept state, then T[j + i..j + m) is a prefix of P.
  ⇒ If i = 0, we found an occurrence.
  ⇒ Otherwise, mark the prefix match by setting shift = i. This is the length of the shift that would achieve a matching alignment. -
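For comparison with the factor based methods, the simplest of the shifts mentioned above, the Horspool shift, can be sketched as follows. This is a minimal sketch rather than code from the slides; the function name `horspool_search` is hypothetical. The shift depends only on the text character aligned with the last pattern position.

```python
def horspool_search(text, pattern):
    """Return all start positions of pattern in text using Horspool shifts."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character of pattern[:-1], the distance from its last
    # occurrence to the end of the pattern; other characters shift by m.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    j = 0
    while j <= n - m:
        # Compare backwards, starting from the end of the window.
        i = m - 1
        while i >= 0 and text[j + i] == pattern[i]:
            i -= 1
        if i < 0:
            hits.append(j)
        # Shift based on the single character under the last pattern position.
        j += shift.get(text[j + m - 1], m)
    return hits


print(horspool_search("varmasti-aikaisen-ainainen", "ainainen"))
```

Boyer–Moore and the factor based algorithms keep the same outer loop but compute the shift from more information, which is what makes longer shifts possible.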
Tries and String Matching
Tries and String Matching

Where We've Been
● Fundamental Data Structures
  ● Red/black trees, B-trees, RMQ, etc.
● Isometries
  ● Red/black trees ≡ 2-3-4 trees, binomial heaps ≡ binary numbers, etc.
● Amortized Analysis
  ● Aggregate, banker's, and potential methods.

Where We're Going
● String Data Structures
  ● Data structures for storing and manipulating text.
● Randomized Data Structures
  ● Using randomness as a building block.
● Integer Data Structures
  ● Breaking the Ω(n log n) sorting barrier.
● Dynamic Connectivity
  ● Maintaining connectivity in a changing world.

String Data Structures

Text Processing
● String processing shows up everywhere:
  ● Computational biology: Manipulating DNA sequences.
  ● NLP: Storing and organizing huge text databases.
  ● Computer security: Building antivirus databases.
● Many problems have polynomial-time solutions.
● Goal: Design theoretically and practically efficient algorithms that outperform brute-force approaches.

Outline for Today
● Tries
  ● A fundamental building block in string processing algorithms.
● Aho-Corasick String Matching
  ● A fast and elegant algorithm for searching large texts for known substrings.

Tries

Ordered Dictionaries
● Suppose we want to store a set of elements supporting the following operations:
  ● Insertion of new elements.
  ● Deletion of old elements.
  ● Membership queries.
  ● Successor queries.
  ● Predecessor queries.
  ● Min/max queries.
● Can use a standard red/black tree or splay tree to get (worst-case or expected) O(log n) implementations of each.

A Catch
● Suppose we want to store a set of strings.
● Comparing two strings of lengths r and s takes time O(min{r, s}).
● Operations on a balanced BST or splay tree now take time O(M log n), where M is the length of the longest string in the tree. -
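The catch described above is what motivates tries: a lookup costs time proportional to the length of the query string, independent of how many strings are stored. A minimal sketch, assuming a hypothetical `Trie` class with dictionary-based children (not the slides' own code), might look like this:

```python
class Trie:
    """A set of strings stored character by character along root-to-node paths."""

    def __init__(self):
        self.children = {}    # maps a character to the child Trie node
        self.is_word = False  # True if the path to this node spells a stored string

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


t = Trie()
for w in ["ant", "ante", "bee"]:
    t.insert(w)
print(t.contains("ante"), t.contains("an"))
```

Both operations touch one node per character, so they run in O(|word|) dictionary operations rather than O(M log n) character comparisons.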
Algorithm for Character Recognition Based on the Trie Structure
University of Montana, ScholarWorks at University of Montana. Graduate Student Theses, Dissertations, & Professional Papers, Graduate School, 1987. Algorithm for character recognition based on the trie structure. Mohammad N. Paryavi, The University of Montana. Follow this and additional works at: https://scholarworks.umt.edu/etd Let us know how access to this document benefits you. Recommended Citation: Paryavi, Mohammad N., "Algorithm for character recognition based on the trie structure" (1987). Graduate Student Theses, Dissertations, & Professional Papers. 5091. https://scholarworks.umt.edu/etd/5091 This Thesis is brought to you for free and open access by the Graduate School at ScholarWorks at University of Montana. It has been accepted for inclusion in Graduate Student Theses, Dissertations, & Professional Papers by an authorized administrator of ScholarWorks at University of Montana. For more information, please contact [email protected]. COPYRIGHT ACT OF 1976: This is an unpublished manuscript in which copyright subsists. Any further reprinting of its contents must be approved by the author. Mansfield Library, University of Montana. Date: 1987. AN ALGORITHM FOR CHARACTER RECOGNITION BASED ON THE TRIE STRUCTURE. By Mohammad N. Paryavi, B.A., University of Washington, 1983. Presented in partial fulfillment of the requirements for the degree of Master of Science, University of Montana, 1987. Approved by: Chairman, Board of Examiners; Dean, Graduate School. UMI Number: EP40555. All rights reserved. INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. -
Efficient Top-K Algorithms for Approximate Substring Matching
Efficient Top-k Algorithms for Approximate Substring Matching. Younghoon Kim, Seoul National University, Seoul, Korea, [email protected]; Kyuseok Shim, Seoul National University, Seoul, Korea, [email protected]. ABSTRACT. There is a wide range of applications that require to query a large database of texts to search for similar strings or substrings. Traditional approximate substring matching requests a user to specify a similarity threshold. Without top-k approximate substring matching, users have to try repeatedly different maximum distance threshold values when the proper threshold is unknown in advance. In our paper, we first propose the efficient algorithms for finding the top-k approximate substring matches with a given query string in a set of data strings. To reduce the number of expensive distance computations, the proposed algorithms utilize our novel filtering techniques which take advantages of q-grams and inverted q-gram indexes avail- [...] to deal with vast amounts of data, retrieving similar strings becomes a more challenging problem today. To handle approximate substring matching, various string (dis)similarity measures, such as edit distance, Hamming distance, Jaccard coefficient, cosine similarity and Jaro-Winkler distance have been considered [2, 10, 11, 19]. Among them, the edit distance is one of the most widely accepted distance measures for database applications where domain specific knowledge is not really available [10, 12, 20, 27]. Based on the edit distance, the approximate string matching problem is to find every string from a string database whose edit distance to the query string is not larger than a given maximum distance threshold. There are diverse applications requiring approximate string matching. -
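The threshold-based problem stated in the snippet above can be sketched with the textbook dynamic program for edit distance. This is a baseline illustration only, not the paper's top-k algorithm or its q-gram filtering; the names `edit_distance` and `within_threshold` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, keeping two rows of the DP table."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def within_threshold(database, query, max_dist):
    """Every string whose edit distance to query is at most max_dist."""
    return [s for s in database if edit_distance(s, query) <= max_dist]


print(within_threshold(["kitten", "mitten", "sitting"], "kitten", 1))
```

Each distance computation costs O(|a|·|b|) time, which is why the abstract stresses filtering to reduce the number of such computations.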
Text Searching Algorithms
Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague. TEXT SEARCHING ALGORITHMS: SEMINARS. Bořivoj Melichar, Jan Antoš, Jan Holub, Tomáš Polcar, and Michal Voráček. December 21, 2005. Preface. "Practice makes perfect." The aim of this tutorial text is to facilitate deeper understanding of the principles and applications of text searching algorithms. It provides many exercises from different areas of this topic. The front part of this text is devoted to a review of basic notions, principles and algorithms from the theory of finite automata. The reason is that the following chapters, devoted to different aspects of text searching algorithms, are based on the use of finite automata. Chapter 1 contains a collection of definitions of notions used in this tutorial. Chapter 2 is devoted to the basic algorithms from the area of finite automata and has been written by Jan Antoš. Chapter 3 contains basic algorithms and operations with finite automata and has been partly written by Jan Antoš. Chapters 4 and 5 show the construction of string matching automata for exact and approximate string matching and have been partly written by Tomáš Polcar. Chapter 6, devoted to the simulation of string matching automata, has been written by Jan Holub. Chapters 7 and 8, concerning the construction of finite automata accepting parts of strings, have been partly written by Tomáš Polcar. Chapter 9, describing the computation of border arrays, has been written by Michal Voráček. -
Text Searching: Theory and Practice 1 Introduction
Text Searching: Theory and Practice. Ricardo A. Baeza-Yates and Gonzalo Navarro, Depto. de Ciencias de la Computación, Universidad de Chile, Casilla 2777, Santiago, Chile. E-mail: {rbaeza,gnavarro}@dcc.uchile.cl. Abstract. We present the state of the art of the main component of text retrieval systems: the search engine. We outline the main lines of research and issues involved. We survey the relevant techniques in use today for text searching and explore the gap between theoretical and practical algorithms. The main observation is that simpler ideas are better in practice. "In theory, there is no difference between theory and practice. In practice, there is." (Jan L.A. van de Snepscheut) "The best theory is inspired by practice. The best practice is inspired by theory." (Donald E. Knuth) 1 Introduction. Full text retrieval systems have become a popular way of providing support for text databases. The full text model permits locating the occurrences of any word, sentence, or simply substring in any document of the collection. Its main advantages, as compared to alternative text retrieval models, are simplicity and efficiency. From the end-user point of view, full text searching of online documents is appealing because valid query patterns are just any word or sentence of the documents. In the past, alternatives to the full-text model, such as semantic indexing, had received some attention. Nowadays, those alternatives are confined to very specific domains while most text retrieval systems used in practice rely, in one way or another, on the full text model. The main component of a full text retrieval system is the text search engine. -
Substring Alignment Using Suffix Trees
Substring Alignment using Suffix Trees. Martin Kay, Stanford University. Abstract. Alignment of the sentences of an original text and a translation is considerably better understood than alignment of smaller units such as words and phrases. This paper makes some preliminary proposals for solving the problem of aligning substrings that should be treated as basic translation units even though they may not begin and end at word boundaries. The proposals make crucial use of suffix trees as a way of identifying repeated substrings of the texts that occur significantly often. It is fitting that one should take advantage of the few occasions on which one is invited to address an important and prestigious professional meeting like this to depart from the standard practice of reporting results and instead to seek adherents to a new enterprise, even if it is one the details of which one can only partially discern. In this case, I intend to take that opportunity to propose a new direction for a line of work that I first became involved in in the early 1990's [3]. It had to do with the automatic alignment of the sentences of a text with those in its translation into another language. The problem is nontrivial because translators frequently translate one sentence with two, or two with one. Sometimes the departure from the expected one-to-one alignment is even greater. We undertook this work not so much because we thought it was of great importance but because it seemed to us rather audacious to attempt to establish these alignments on the basis of no a priori assumptions about the languages involved or about correspondences between pairs of words. -
Suffix and Factor Automata and Combinatorics on Words
Suffix and Factor Automata and Combinatorics on Words. Gabriele Fici. Workshop PRIN 2007–2009, Varese, 5 September 2011. The Suffix Automaton. Definition (A. Blumer et al. 85; M. Crochemore 86): The Suffix Automaton (SA) of the word w is the minimal deterministic automaton recognizing the suffixes of w. Example (figure): the SA of w = aabbabb, with states 0 to 7 and 3′, 3′′, 4′′. The SA has several applications, for example in pattern matching, music retrieval, spam detection, search of characteristic expressions in literary works, speech recordings alignment, ... Algorithmic Construction. The SA allows the search of a pattern v in a text w in time and space O(|v|). Moreover: Theorem (A. Blumer et al. 85; M. Crochemore 86): The SA of a word w over a fixed alphabet Σ can be built in time and space O(|w|). Determinize by subset construction (figure): the initial state is {0, 1, 2, ..., 7}. -
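To make the linear-time claim concrete, here is a sketch of the widely used online construction with suffix links and state cloning. Note that this swaps in a different technique from the subset construction shown on the slides; the function names and the dictionary-based state layout are assumptions for this example.

```python
def build_suffix_automaton(w):
    """Online construction of the suffix automaton of w (suffix links + cloning)."""
    states = [{"len": 0, "link": -1, "next": {}}]  # state 0 is the initial state
    last = 0
    for ch in w:
        cur = len(states)
        states.append({"len": states[last]["len"] + 1, "link": -1, "next": {}})
        p = last
        while p != -1 and ch not in states[p]["next"]:
            states[p]["next"][ch] = cur
            p = states[p]["link"]
        if p == -1:
            states[cur]["link"] = 0
        else:
            q = states[p]["next"][ch]
            if states[p]["len"] + 1 == states[q]["len"]:
                states[cur]["link"] = q
            else:
                # Split q: the clone takes over the shorter strings of q.
                clone = len(states)
                states.append({"len": states[p]["len"] + 1,
                               "link": states[q]["link"],
                               "next": dict(states[q]["next"])})
                while p != -1 and states[p]["next"].get(ch) == q:
                    states[p]["next"][ch] = clone
                    p = states[p]["link"]
                states[q]["link"] = clone
                states[cur]["link"] = clone
        last = cur
    # Terminal states are those on the suffix-link path from the last state.
    terminal = set()
    p = last
    while p != -1:
        terminal.add(p)
        p = states[p]["link"]
    return states, terminal


def walk(states, v):
    """Follow v from the initial state; return the reached state or None."""
    s = 0
    for ch in v:
        s = states[s]["next"].get(ch)
        if s is None:
            return None
    return s


def is_factor(states, v):
    return walk(states, v) is not None


def is_suffix(states, terminal, v):
    return walk(states, v) in terminal


states, terminal = build_suffix_automaton("aabbabb")
print(is_suffix(states, terminal, "abb"), is_factor(states, "bba"))
```

Searching for a pattern v reads one transition per character of v, which matches the O(|v|) search time quoted above.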
A Heuristic Algorithm for the N-LCS Problem1
Journal of the Applied Mathematics, Statistics and Informatics (JAMSI), 4 (2008), No. 1. A Heuristic Algorithm for the N-LCS Problem. MOURAD ELLOUMI AND AHMED MOKADDEM. Abstract. Let f = {w1, w2, …, wN} be a set of N strings. A string s is an N-Common Subsequence (N-CS) of f if and only if s is a subsequence of each of the N strings of f. Hence, the N-Longest Common Subsequence (N-LCS) problem is defined as follows: given a set of N strings f = {w1, w2, …, wN}, find an N-CS of f of maximum length. When N > 2, the N-LCS problem becomes NP-hard [Maier 78] and even hard to approximate [Jiang and Li 95]. In this paper, we present a heuristic algorithm, adopting a regions approach, for the N-LCS problem, where N ≥ 2. Our algorithm is O(N*L*log(L)) in computing time, where L is the maximum length of a string. During each recursive call to our algorithm, we look for the longest common substrings that appear in the different strings. If we have more than one longest common substring, we do a filtering. Additional Key Words and Phrases: strings, common subsequences, divide-and-conquer strategy, algorithms, complexities. Mathematics Subject Classification 2000: 68T20, 68Q25. INTRODUCTION. Let s = s1s2 ... sm and w = w1w2 ... wn be two strings. We say that s is a subsequence of w if and only if there is a mapping F : {1, 2, ..., m} → {1, 2, ..., n} such that F(i) = k implies si = wk, and F is a strictly increasing function, i.e., F(i) = u, F(j) = v and i < j imply that u < v [Hirschberg 75].
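The subsequence definition above can be checked greedily. The following sketch is an illustration only, not the paper's heuristic; the names `subsequence_mapping` and `is_common_subsequence` are hypothetical. It returns one valid mapping F as a list of 1-based positions, or None when no strictly increasing mapping exists.

```python
def subsequence_mapping(s, w):
    """Return [F(1), ..., F(m)] (1-based positions in w) or None if s is not a subsequence."""
    mapping = []
    k = 0  # next unused position in w (0-based)
    for ch in s:
        # Greedily take the leftmost remaining occurrence of ch in w.
        while k < len(w) and w[k] != ch:
            k += 1
        if k == len(w):
            return None
        mapping.append(k + 1)
        k += 1
    return mapping


def is_common_subsequence(s, strings):
    """True if s is an N-CS of the given collection of strings."""
    return all(subsequence_mapping(s, w) is not None for w in strings)


print(subsequence_mapping("ace", "abcde"))
print(is_common_subsequence("ab", ["axb", "ab", "cacb"]))
```

Verifying a candidate N-CS this way takes time linear in the total length of the strings; the hard part of N-LCS is finding a longest candidate, not checking one.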