IMPROVED ALGORITHMS for STRING SEARCHING PROBLEMS Doctoral Dissertation

Total Page:16

File Type:pdf, Size:1020Kb

IMPROVED ALGORITHMS for STRING SEARCHING PROBLEMS Doctoral Dissertation View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Aaltodoc TKK Research Reports in Computer Science and Engineering A TKK-CSE-A1/09 Espoo 2009 IMPROVED ALGORITHMS FOR STRING SEARCHING PROBLEMS Doctoral Dissertation Leena Salmela Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Faculty of Information and Natural Sciences for public examination and debate in Auditorium T2 at Helsinki University of Technology (Espoo, Finland) on the 1st of June, 2009, at 12 noon. Helsinki University of Technology Faculty of Information and Natural Sciences Department of Computer Science and Engineering Teknillinen korkeakoulu Informaatio- ja luonnontieteiden tiedekunta Tietotekniikan laitos Distribution: Helsinki University of Technology Faculty of Information and Natural Sciences Department of Computer Science and Engineering P.O. Box 5400 FI-02015 TKK FINLAND URL: http://www.cse.tkk.fi/ Tel. +358 9 451 3228 Fax +358 9 451 3293 e-mail: [email protected].fi c Leena Salmela c Cover photo: Teemu J. Takanen ISBN 978-951-22-9887-7 ISBN 978-951-22-9888-4 (PDF) ISSN 1797-6928 ISSN 1797-6936 (PDF) URL: http://lib.tkk.fi/Diss/2009/isbn9789512298884/ Multiprint Oy Espoo 2009 AB ABSTRACT OF DOCTORAL DISSERTATION HELSINKI UNIVERSITY OF TECHNOLOGY P. O. BOX 1000, FI-02015 TKK http://www.tkk.fi/ Author Leena Salmela Name of the dissertation Improved Algorithms for String Searching Problems Manuscript submitted 09.02.2009 Manuscript revised 11.05.2009 Date of the defence 01.06.2009 X Monograph Article dissertation (summary + original articles) Faculty Faculty of Information and Natural Sciences Department Department of Computer Science and Engineering Field of research Software Systems Opponent(s) Prof. Maxime Crochemore Supervisor Prof. Jorma Tarhio Instructor Prof. Jorma Tarhio Abstract We present improved practically efficient algorithms for several string searching problems, where we search for a short string called the pattern in a longer string called the text. We are mainly interested in the online problem, where the text is not preprocessed, but we also present a light indexing approach to speed up exact searching of a single pattern. The new algorithms can be applied e.g. to many problems in bioinformatics and other content scanning and filtering problems. In addition to exact string matching, we develop algorithms for several other variations of the string matching problem. We study algorithms for approximate string matching, where a limited number of errors is allowed in the occurrences of the pattern, and parameterized string matching, where a substring of the text matches the pattern if the characters of the substring can be renamed in such a way that the renamed substring matches the pattern exactly. We also consider searching multiple patterns simultaneously and searching weighted patterns, where the weight of a character at a given position reflects the probability of that character occurring at that position. Many of the new algorithms use the backward matching principle, where the characters of the text that are aligned with the pattern are read backward, i.e. from right to left. Another common characteristic of the new algorithms is the use of q-grams, i.e. q consecutive characters are handled as a single character. Many of the new algorithms are bit parallel, i.e. they pack several variables to a single computer word and update all these variables with a single instruction. We show that the q-gram backward string matching algorithms that solve the exact, approximate, or multiple string matching problems are optimal on average. We also show that the q-gram backward string matching algorithm for the parameterized string matching problem is sublinear on average for a class of moderately repetitive patterns. All the presented algorithms are also shown to be fast in practice when compared to earlier algorithms. We also propose an alphabet sampling technique to speed up exact string matching. We choose a subset of the alphabet and select the corresponding subsequence of the text. String matching is then performed on this reduced subsequence and the found matches are verified in the original text. We show how to choose the sampled alphabet optimally and show that the technique speeds up string matching especially for moderate to long patterns. Keywords string matching, approximate string matching, multiple string matching, parameterized string matching, weighted string matching, q-grams, bit parallelism, text indexing ISBN (printed) 978-951-22-9887-7 ISSN (printed) 1797-6928 ISBN (pdf) 978-951-22-9888-4 ISSN (pdf) 1797-6936 Language English Number of pages 153 p. Publisher Department of Computer Science and Engineering Print distribution Department of Computer Science and Engineering X The dissertation can be read at http://lib.tkk.fi/Diss/2009/isbn9789512298884/ AB VÄITÖSKIRJAN TIIVISTELMÄ TEKNILLINEN KORKEAKOULU PL 1000, 02015 TKK http://www.tkk.fi/ Tekijä Leena Salmela Väitöskirjan nimi Parannettuja algoritmeja merkkijonohakuongelmiin Käsikirjoituksen päivämäärä 09.02.2009 Korjatun käsikirjoituksen päivämäärä 11.05.2009 Väitöstilaisuuden ajankohta 01.06.2009 X Monografia Yhdistelmäväitöskirja (yhteenveto + erillisartikkelit) Tiedekunta Informaatio- ja luonnontieteiden tiedekunta Laitos Tietotekniikan laitos Tutkimusala Ohjelmistojärjestelmät Vastaväittäjä(t) Prof. Maxime Crochemore Työn valvoja Prof. Jorma Tarhio Työn ohjaaja Prof. Jorma Tarhio Tiivistelmä Esitämme parannettuja käytännössä tehokkaita algoritmeja useisiin merkkijonohakuongelmiin, joissa etsitään lyhyttä merkkijonoa eli hahmoa pitkästä merkkijonosta eli tekstistä. Keskitymme pääasiassa ongelman muunnelmaan, missä tekstiä ei esikäsitellä, mutta esitämme myös kevyen hakemistorakenteen, joka nopeuttaa yhden hahmon tarkkaa hakua. Esitettyjä uusia algoritmeja voidaan soveltaa mm. moniin bioinformatiikan ongelmiin ja erilaisiin haku- ja suodatus- ongelmiin. Tarkan merkkijonohaun lisäksi kehitämme algoritmeja moniin muihin merkkijonohakuongelmiin. Käsittelemme merkkijonojen likimääräistä hakua, missä sallitaan rajattu määrä virheitä hahmon esiintymissä, ja parametrisoitua hakua, missä tekstin osajono täsmää hahmoon, jos osajonon merkit voidaan nimetä uudelleen siten, että hahmo täsmää tarkasti tähän uudelleennimettyyn osajonoon. Tarkastelemme myös usean hahmon yhtäaikaista hakua ja painotettu- jen hahmojen hakua, missä kunkin merkin paino kussakin positiossa kuvaa kyseisen merkin todennäköisyyttä esiintyä kyseisessä positiossa. Monet uusista algoritmeista lukevat hahmon kanssa kohdistetun tekstin osajonon taaksepäin eli oikealta vasemmalle. Toinen yhteinen piirre esitetyille algoritmeille on q-piirteiden käyttö eli algoritmit käsittelevät q:ta peräkkäistä merkkiä yhtenä merkkinä. Monet näistä uusista algoritmeista ovat bittirinnakkaisia eli ne pakkaavat monta muuttujaa samaan tietokoneen sanaan ja päivittävät kaikkia näitä muuttujia yhdellä käskyllä. Näytämme, että taaksepäin täsmäävät q-piirrealgoritmit, jotka ratkaisevat tarkan, likimääräisen tai usean hahmon merkkijonohaun, ovat keskimäärin optimaalisia. Lisäksi todistamme, että parametrisoidun haun taaksepäin täsmäävä q-piirrealgoritmi on keskimäärin alilineaarinen joukolle kohtalaisen toisteisia hahmoja. Näytämme myös, että kaikki esitetyt algoritmit ovat käytännössä nopeita verrattuna aikaisempiin algoritmeihin. Lopuksi esitämme aakkostonkarsintamenetelmän, joka nopeuttaa tarkkaa merkkijonohakua. Menetelmässä valitaan aakkoston osajoukko ja vastaava tekstin alijono. Hahmoa haetaan tästä lyhennetystä tekstistä ja löydetyt esiintymät tarkistetaan alkuperäisestä tekstistä. Näytämme, miten aakkoston osajoukko valitaan optimaalisesti ja että menetelmä nopeuttaa merkkijonohakua erityisesti kohtalaisen pitkillä hahmoilla. Asiasanat merkkijonohaku, likimääräinen merkkijonohaku, monen hahmon haku, parametrisoitu merkkijonohaku, painotettujen hahmojen haku, q-piirre, bittirinnakkaisuus, tekstin indeksointi ISBN (painettu) 978-951-22-9887-7 ISSN (painettu) 1797-6928 ISBN (pdf) 978-951-22-9888-4 ISSN (pdf) 1797-6936 Kieli Englanti Sivumäärä 153 s. Julkaisija Tietotekniikan laitos Painetun väitöskirjan jakelu Tietotekniikan laitos X Luettavissa verkossa osoitteessa http://lib.tkk.fi/Diss/2009/isbn9789512298884/ Preface First of all, I would like to thank my supervisor, Professor Jorma Tarhio. He recruited me to the String Algorithms Group when I was still an undergraduate student and introduced me to the many interesting problems in string algorithms. I would also like to thank Hannu Peltola who has always been ready to listen to my complaints and questions and to help with the more practical mysteries of university life. Also other members of SAG deserve their thanks. It has been a pleasure to work with you. This thesis was written while I was working at the Department of Computer Sci- ence and Engineering at Helsinki University of Technology. For financial support I would like to thank Helsinki Graduate School in Computer Science and Engineering and the Academy of Finland. I am also grateful for the opportunity to visit University of Chile in April 2008. It was very inspiring to work with Professor Gonzalo Navarro and Francisco Claude. The two weeks away from the usual distractions also proved to be very helpful in getting started on writing this thesis. I would also like to thank the pre-examiners, Professor Esko Ukkonen and Profes- sor Erkki Sutinen, for many useful comments that helped to improve this work. Finally, I would like to thank my family
Recommended publications
  • FM-Index Reveals the Reverse Suffix Array
    FM-Index Reveals the Reverse Suffix Array Arnab Ganguly Department of Computer Science, University of Wisconsin - Whitewater, WI, USA [email protected] Daniel Gibney Department of Computer Science, University of Central Florida, Orlando, FL, USA [email protected] Sahar Hooshmand Department of Computer Science, University of Central Florida, Orlando, FL, USA [email protected] M. Oğuzhan Külekci Informatics Institute, Istanbul Technical University, Turkey [email protected] Sharma V. Thankachan Department of Computer Science, University of Central Florida, Orlando, FL, USA [email protected] Abstract Given a text T [1, n] over an alphabet Σ of size σ, the suffix array of T stores the lexicographic order of the suffixes of T . The suffix array needs Θ(n log n) bits of space compared to the n log σ bits needed to store T itself. A major breakthrough [FM–Index, FOCS’00] in the last two decades has been encoding the suffix array in near-optimal number of bits (≈ log σ bits per character). One can decode a suffix array value using the FM-Index in logO(1) n time. We study an extension of the problem in which we have to also decode the suffix array values of the reverse text. This problem has numerous applications such as in approximate pattern matching [Lam et al., BIBM’ 09]. Known approaches maintain the FM–Index of both the forward and the reverse text which drives up the space occupancy to 2n log σ bits (plus lower order terms). This brings in the natural question of whether we can decode the suffix array values of both the forward and the reverse text, but by using n log σ bits (plus lower order terms).
    [Show full text]
  • The One-Way Communication Complexity of Hamming Distance
    THEORY OF COMPUTING, Volume 4 (2008), pp. 129–135 http://theoryofcomputing.org The One-Way Communication Complexity of Hamming Distance T. S. Jayram Ravi Kumar∗ D. Sivakumar∗ Received: January 23, 2008; revised: September 12, 2008; published: October 11, 2008. Abstract: Consider the following version of the Hamming distance problem for ±1 vec- n √ n √ tors of length n: the promise is that the distance is either at least 2 + n or at most 2 − n, and the goal is to find out which of these two cases occurs. Woodruff (Proc. ACM-SIAM Symposium on Discrete Algorithms, 2004) gave a linear lower bound for the randomized one-way communication complexity of this problem. In this note we give a simple proof of this result. Our proof uses a simple reduction from the indexing problem and avoids the VC-dimension arguments used in the previous paper. As shown by Woodruff (loc. cit.), this implies an Ω(1/ε2)-space lower bound for approximating frequency moments within a factor 1 + ε in the data stream model. ACM Classification: F.2.2 AMS Classification: 68Q25 Key words and phrases: One-way communication complexity, indexing, Hamming distance 1 Introduction The Hamming distance H(x,y) between two vectors x and y is defined to be the number of positions i such that xi 6= yi. Let GapHD denote the following promise problem: the input consists of ±1 vectors n √ n √ x and y of length n, together with the promise that either H(x,y) ≤ 2 − n or H(x,y) ≥ 2 + n, and the goal is to find out which of these two cases occurs.
    [Show full text]
  • Optimal Time and Space Construction of Suffix Arrays and LCP
    Optimal Time and Space Construction of Suffix Arrays and LCP Arrays for Integer Alphabets Keisuke Goto Fujitsu Laboratories Ltd. Kawasaki, Japan [email protected] Abstract Suffix arrays and LCP arrays are one of the most fundamental data structures widely used for various kinds of string processing. We consider two problems for a read-only string of length N over an integer alphabet [1, . , σ] for 1 ≤ σ ≤ N, the string contains σ distinct characters, the construction of the suffix array, and a simultaneous construction of both the suffix array and LCP array. For the word RAM model, we propose algorithms to solve both of the problems in O(N) time by using O(1) extra words, which are optimal in time and space. Extra words means the required space except for the space of the input string and output suffix array and LCP array. Our contribution improves the previous most efficient algorithms, O(N) time using σ + O(1) extra words by [Nong, TOIS 2013] and O(N log N) time using O(1) extra words by [Franceschini and Muthukrishnan, ICALP 2007], for constructing suffix arrays, and it improves the previous most efficient solution that runs in O(N) time using σ + O(1) extra words for constructing both suffix arrays and LCP arrays through a combination of [Nong, TOIS 2013] and [Manzini, SWAT 2004]. 2012 ACM Subject Classification Theory of computation, Pattern matching Keywords and phrases Suffix Array, Longest Common Prefix Array, In-Place Algorithm Acknowledgements We wish to thank Takashi Kato, Shunsuke Inenaga, Hideo Bannai, Dominik Köppl, and anonymous reviewers for their many valuable suggestions on improving the quality of this paper, and especially, we also wish to thank an anonymous reviewer for giving us a simpler algorithm for computing LCP arrays as described in Section 5.
    [Show full text]
  • Fault Tolerant Computing CS 530 Information Redundancy: Coding Theory
    Fault Tolerant Computing CS 530 Information redundancy: Coding theory Yashwant K. Malaiya Colorado State University April 3, 2018 1 Information redundancy: Outline • Using a parity bit • Codes & code words • Hamming distance . Error detection capability . Error correction capability • Parity check codes and ECC systems • Cyclic codes . Polynomial division and LFSRs 4/3/2018 Fault Tolerant Computing ©YKM 2 Redundancy at the Bit level • Errors can bits to be flipped during transmission or storage. • An extra parity bit can detect if a bit in the word has flipped. • Some errors an be corrected if there is enough redundancy such that the correct word can be guessed. • Tukey: “bit” 1948 • Hamming codes: 1950s • Teletype, ASCII: 1960: 7+1 Parity bit • Codes are widely used. 304,805 letters in the Torah Redundancy at the Bit level Even/odd parity (1) • Errors can bits to be flipped during transmission/storage. • Even/odd parity: . is basic method for detecting if one bit (or an odd number of bits) has been switched by accident. • Odd parity: . The number of 1-bit must add up to an odd number • Even parity: . The number of 1-bit must add up to an even number Even/odd parity (2) • The it is known which parity it is being used. • If it uses an even parity: . If the number of of 1-bit add up to an odd number then it knows there was an error: • If it uses an odd: . If the number of of 1-bit add up to an even number then it knows there was an error: • However, If an even number of 1-bit is flipped the parity will still be the same.
    [Show full text]
  • Sequence Distance Embeddings
    Sequence Distance Embeddings by Graham Cormode Thesis Submitted to the University of Warwick for the degree of Doctor of Philosophy Computer Science January 2003 Contents List of Figures vi Acknowledgments viii Declarations ix Abstract xi Abbreviations xii Chapter 1 Starting 1 1.1 Sequence Distances . ....................................... 2 1.1.1 Metrics ............................................ 3 1.1.2 Editing Distances ...................................... 3 1.1.3 Embeddings . ....................................... 3 1.2 Sets and Vectors ........................................... 4 1.2.1 Set Difference and Set Union . .............................. 4 1.2.2 Symmetric Difference . .................................. 5 1.2.3 Intersection Size ...................................... 5 1.2.4 Vector Norms . ....................................... 5 1.3 Permutations ............................................ 6 1.3.1 Reversal Distance ...................................... 7 1.3.2 Transposition Distance . .................................. 7 1.3.3 Swap Distance ....................................... 8 1.3.4 Permutation Edit Distance . .............................. 8 1.3.5 Reversals, Indels, Transpositions, Edits (RITE) . .................... 9 1.4 Strings ................................................ 9 1.4.1 Hamming Distance . .................................. 10 1.4.2 Edit Distance . ....................................... 10 1.4.3 Block Edit Distances . .................................. 11 1.5 Sequence Distance Problems . .................................
    [Show full text]
  • Dictionary Look-Up Within Small Edit Distance
    Dictionary Lo okUp Within Small Edit Distance Ab dullah N Arslan and Omer Egeciog lu Department of Computer Science University of California Santa Barbara Santa Barbara CA USA farslanomergcsucsbedu Abstract Let W b e a dictionary consisting of n binary strings of length m each represented as a trie The usual dquery asks if there exists a string in W within Hamming distance d of a given binary query string q We present an algorithm to determine if there is a memb er in W within edit distance d of a given query string q of length m The metho d d+1 takes time O dm in the RAM mo del indep endent of n and requires O dm additional space Intro duction Let W b e a dictionary consisting of n binary strings of length m each A dquery asks if there exists a string in W within Hamming distance d of a given binary query string q Algorithms for answering dqueries eciently has b een a topic of interest for some time and have also b een studied as the approximate query and the approximate query retrieval problems in the literature The problem was originally p osed by Minsky and Pap ert in in which they asked if there is a data structure that supp orts fast dqueries The cases of small d and large d for this problem seem to require dierent techniques for their solutions The case when d is small was studied by Yao and Yao Dolev et al and Greene et al have made some progress when d is relatively large There are ecient algorithms only when d prop osed by Bro dal and Venkadesh Yao and Yao and Bro dal and Gasieniec The small d case has applications
    [Show full text]
  • Approximate String Matching with Reduced Alphabet
    Approximate String Matching with Reduced Alphabet Leena Salmela1 and Jorma Tarhio2 1 University of Helsinki, Department of Computer Science [email protected] 2 Aalto University Deptartment of Computer Science and Engineering [email protected] Abstract. We present a method to speed up approximate string match- ing by mapping the factual alphabet to a smaller alphabet. We apply the alphabet reduction scheme to a tuned version of the approximate Boyer– Moore algorithm utilizing the Four-Russians technique. Our experiments show that the alphabet reduction makes the algorithm faster. Especially in the k-mismatch case, the new variation is faster than earlier algorithms for English data with small values of k. 1 Introduction The approximate string matching problem is defined as follows. We have a pat- tern P [1...m]ofm characters drawn from an alphabet Σ of size σ,atextT [1...n] of n characters over the same alphabet, and an integer k. We need to find all such positions i of the text that the distance between the pattern and a sub- string of the text ending at that position is at most k.Inthek-difference problem the distance between two strings is the standard edit distance where substitu- tions, deletions, and insertions are allowed. The k-mismatch problem is a more restricted one using the Hamming distance where only substitutions are allowed. Among the most cited papers on approximate string matching are the classical articles [1,2] by Esko Ukkonen. Besides them he has studied this topic extensively [3,4,5,6,7,8,9,10,11].
    [Show full text]
  • Searching All Seeds of Strings with Hamming Distance Using Finite Automata
    Proceedings of the International MultiConference of Engineers and Computer Scientists 2009 Vol I IMECS 2009, March 18 - 20, 2009, Hong Kong Searching All Seeds of Strings with Hamming Distance using Finite Automata Ondřej Guth, Bořivoj Melichar ∗ Abstract—Seed is a type of a regularity of strings. Finite automata provide common formalism for many al- A restricted approximate seed w of string T is a fac- gorithms in the area of text processing (stringology), in- tor of T such that w covers a superstring of T under volving forward exact and approximate pattern match- some distance rule. In this paper, the problem of all ing and searching for borders, periods, and repetitions restricted seeds with the smallest Hamming distance [7], backward pattern matching [8], pattern matching in is studied and a polynomial time and space algorithm a compressed text [9], the longest common subsequence for solving the problem is presented. It searches for [10], exact and approximate 2D pattern matching [11], all restricted approximate seeds of a string with given limited approximation using Hamming distance and and already mentioned computing approximate covers it computes the smallest distance for each found seed. [5, 6] and exact covers [12] and seeds [2] in generalized The solution is based on a finite (suffix) automata ap- strings. Therefore, we would like to further extend the set proach that provides a straightforward way to design of problems solved using finite automata. Such a problem algorithms to many problems in stringology. There- is studied in this paper. fore, it is shown that the set of problems solvable using finite automata includes the one studied in this Finite automaton as a data structure may be easily imple- paper.
    [Show full text]
  • Approximate Boyer-Moore String Matching
    APPROXIMATE BOYER-MOORE STRING MATCHING JORMA TARHIO AND ESKO UKKONEN University of Helsinki, Department of Computer Science Teollisuuskatu 23, SF-00510 Helsinki, Finland Draft Abstract. The Boyer-Moore idea applied in exact string matching is generalized to approximate string matching. Two versions of the problem are considered. The k mismatches problem is to find all approximate occurrences of a pattern string (length m) in a text string (length n) with at most k mis- matches. Our generalized Boyer-Moore algorithm is shown (under a mild 1 independence assumption) to solve the problem in expected time O(kn( + m – k k)) where c is the size of the alphabet. A related algorithm is developed for the c k differences problem where the task is to find all approximate occurrences of a pattern in a text with ≤ k differences (insertions, deletions, changes). Experimental evaluation of the algorithms is reported showing that the new algorithms are often significantly faster than the old ones. Both algorithms are functionally equivalent with the Horspool version of the Boyer-Moore algorithm when k = 0. Key words: String matching, edit distance, Boyer-Moore algorithm, k mismatches problem, k differences problem AMS (MOS) subject classifications: 68C05, 68C25, 68H05 Abbreviated title: Approximate Boyer-Moore Matching 2 1. Introduction The fastest known exact string matching algorithms are based on the Boyer- Moore idea [BoM77, KMP77]. Such algorithms are “sublinear” on the average in the sense that it is not necessary to check every symbol in the text. The larger is the alphabet and the longer is the pattern, the faster the algorithm works.
    [Show full text]
  • B-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches
    b-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches Shunsuke Kanda Yasuo Tabei RIKEN Center for Advanced Intelligence Project RIKEN Center for Advanced Intelligence Project Tokyo, Japan Tokyo, Japan [email protected] [email protected] Abstract—Recently, randomly mapping vectorial data to algorithms intending to build sketches of non-negative inte- strings of discrete symbols (i.e., sketches) for fast and space- gers (i.e., b-bit sketches) have been proposed for efficiently efficient similarity searches has become popular. Such random approximating various similarity measures. Examples are b-bit mapping is called similarity-preserving hashing and approximates a similarity metric by using the Hamming distance. Although minwise hashing (minhash) [12]–[14] for Jaccard similarity, many efficient similarity searches have been proposed, most of 0-bit consistent weighted sampling (CWS) for min-max ker- them are designed for binary sketches. Similarity searches on nel [15], and 0-bit CWS for generalized min-max kernel [16]. integer sketches are in their infancy. In this paper, we present Thus, developing scalable similarity search methods for b-bit a novel space-efficient trie named b-bit sketch trie on integer sketches is a key issue in large-scale applications of similarity sketches for scalable similarity searches by leveraging the idea behind succinct data structures (i.e., space-efficient data structures search. while supporting various data operations in the compressed Similarity searches on binary sketches are classified
    [Show full text]
  • Problem Set 7 Solutions
    Introduction to Algorithms November 18, 2005 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik D. Demaine and Charles E. Leiserson Handout 25 Problem Set 7 Solutions Problem 7-1. Edit distance In this problem you will write a program to compute edit distance. This problem is mandatory. Failure to turn in a solution will result in a serious and negative impact on your term grade! We advise you to start this programming assignment as soon as possible, because getting all the details right in a program can take longer than you think. Many word processors and keyword search engines have a spelling correction feature. If you type in a misspelled word x, the word processor or search engine can suggest a correction y. The correction y should be a word that is close to x. One way to measure the similarity in spelling between two text strings is by “edit distance.” The notion of edit distance is useful in other fields as well. For example, biologists use edit distance to characterize the similarity of DNA or protein sequences. The edit distance d(x; y) of two strings of text, x[1 : : m] and y[1 : : n], is defined to be the minimum possible cost of a sequence of “transformation operations” (defined below) that transforms string x[1 : : m] into string y[1 : : n].1 To define the effect of the transformation operations, we use an auxiliary string z[1 : : s] that holds the intermediate results. At the beginning of the transformation sequence, s = m and z[1 : : s] = x[1 : : m] (i.e., we start with string x[1 : : m]).
    [Show full text]
  • A Quick Tour on Suffix Arrays and Compressed Suffix Arrays✩ Roberto Grossi Dipartimento Di Informatica, Università Di Pisa, Italy Article Info a B S T R a C T
    CORE Metadata, citation and similar papers at core.ac.uk Provided by Elsevier - Publisher Connector Theoretical Computer Science 412 (2011) 2964–2973 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs A quick tour on suffix arrays and compressed suffix arraysI Roberto Grossi Dipartimento di Informatica, Università di Pisa, Italy article info a b s t r a c t Keywords: Suffix arrays are a key data structure for solving a run of problems on texts and sequences, Pattern matching from data compression and information retrieval to biological sequence analysis and Suffix array pattern discovery. In their simplest version, they can just be seen as a permutation of Suffix tree f g Text indexing the elements in 1; 2;:::; n , encoding the sorted sequence of suffixes from a given text Data compression of length n, under the lexicographic order. Yet, they are on a par with ubiquitous and Space efficiency sophisticated suffix trees. Over the years, many interesting combinatorial properties have Implicitness and succinctness been devised for this special class of permutations: for instance, they can implicitly encode extra information, and they are a well characterized subset of the nW permutations. This paper gives a short tutorial on suffix arrays and their compressed version to explore and review some of their algorithmic features, discussing the space issues related to their usage in text indexing, combinatorial pattern matching, and data compression. ' 2011 Elsevier B.V. All rights reserved. 1. Suffix arrays Consider a text T ≡ T T1; nU of n symbols over an alphabet Σ, where T TnUD # is an endmarker not occurring elsewhere and smaller than any other symbol in Σ.
    [Show full text]