IMPROVED ALGORITHMS for STRING SEARCHING PROBLEMS Doctoral Dissertation

View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Aaltodoc TKK Research Reports in Computer Science and Engineering A TKK-CSE-A1/09 Espoo 2009 IMPROVED ALGORITHMS FOR STRING SEARCHING PROBLEMS Doctoral Dissertation Leena Salmela Dissertation for the degree of Doctor of Science in Technology to be presented with due permission of the Faculty of Information and Natural Sciences for public examination and debate in Auditorium T2 at Helsinki University of Technology (Espoo, Finland) on the 1st of June, 2009, at 12 noon. Helsinki University of Technology Faculty of Information and Natural Sciences Department of Computer Science and Engineering Teknillinen korkeakoulu Informaatio- ja luonnontieteiden tiedekunta Tietotekniikan laitos Distribution: Helsinki University of Technology Faculty of Information and Natural Sciences Department of Computer Science and Engineering P.O. Box 5400 FI-02015 TKK FINLAND URL: http://www.cse.tkk.fi/ Tel. +358 9 451 3228 Fax +358 9 451 3293 e-mail: [email protected].fi c Leena Salmela c Cover photo: Teemu J. Takanen ISBN 978-951-22-9887-7 ISBN 978-951-22-9888-4 (PDF) ISSN 1797-6928 ISSN 1797-6936 (PDF) URL: http://lib.tkk.fi/Diss/2009/isbn9789512298884/ Multiprint Oy Espoo 2009 AB ABSTRACT OF DOCTORAL DISSERTATION HELSINKI UNIVERSITY OF TECHNOLOGY P. O. BOX 1000, FI-02015 TKK http://www.tkk.fi/ Author Leena Salmela Name of the dissertation Improved Algorithms for String Searching Problems Manuscript submitted 09.02.2009 Manuscript revised 11.05.2009 Date of the defence 01.06.2009 X Monograph Article dissertation (summary + original articles) Faculty Faculty of Information and Natural Sciences Department Department of Computer Science and Engineering Field of research Software Systems Opponent(s) Prof. Maxime Crochemore Supervisor Prof. Jorma Tarhio Instructor Prof. Jorma Tarhio Abstract We present improved practically efficient algorithms for several string searching problems, where we search for a short string called the pattern in a longer string called the text. We are mainly interested in the online problem, where the text is not preprocessed, but we also present a light indexing approach to speed up exact searching of a single pattern. The new algorithms can be applied e.g. to many problems in bioinformatics and other content scanning and filtering problems. In addition to exact string matching, we develop algorithms for several other variations of the string matching problem. We study algorithms for approximate string matching, where a limited number of errors is allowed in the occurrences of the pattern, and parameterized string matching, where a substring of the text matches the pattern if the characters of the substring can be renamed in such a way that the renamed substring matches the pattern exactly. We also consider searching multiple patterns simultaneously and searching weighted patterns, where the weight of a character at a given position reflects the probability of that character occurring at that position. Many of the new algorithms use the backward matching principle, where the characters of the text that are aligned with the pattern are read backward, i.e. from right to left. Another common characteristic of the new algorithms is the use of q-grams, i.e. q consecutive characters are handled as a single character. Many of the new algorithms are bit parallel, i.e. they pack several variables to a single computer word and update all these variables with a single instruction. We show that the q-gram backward string matching algorithms that solve the exact, approximate, or multiple string matching problems are optimal on average. We also show that the q-gram backward string matching algorithm for the parameterized string matching problem is sublinear on average for a class of moderately repetitive patterns. All the presented algorithms are also shown to be fast in practice when compared to earlier algorithms. We also propose an alphabet sampling technique to speed up exact string matching. We choose a subset of the alphabet and select the corresponding subsequence of the text. String matching is then performed on this reduced subsequence and the found matches are verified in the original text. We show how to choose the sampled alphabet optimally and show that the technique speeds up string matching especially for moderate to long patterns. Keywords string matching, approximate string matching, multiple string matching, parameterized string matching, weighted string matching, q-grams, bit parallelism, text indexing ISBN (printed) 978-951-22-9887-7 ISSN (printed) 1797-6928 ISBN (pdf) 978-951-22-9888-4 ISSN (pdf) 1797-6936 Language English Number of pages 153 p. Publisher Department of Computer Science and Engineering Print distribution Department of Computer Science and Engineering X The dissertation can be read at http://lib.tkk.fi/Diss/2009/isbn9789512298884/ AB VÄITÖSKIRJAN TIIVISTELMÄ TEKNILLINEN KORKEAKOULU PL 1000, 02015 TKK http://www.tkk.fi/ Tekijä Leena Salmela Väitöskirjan nimi Parannettuja algoritmeja merkkijonohakuongelmiin Käsikirjoituksen päivämäärä 09.02.2009 Korjatun käsikirjoituksen päivämäärä 11.05.2009 Väitöstilaisuuden ajankohta 01.06.2009 X Monografia Yhdistelmäväitöskirja (yhteenveto + erillisartikkelit) Tiedekunta Informaatio- ja luonnontieteiden tiedekunta Laitos Tietotekniikan laitos Tutkimusala Ohjelmistojärjestelmät Vastaväittäjä(t) Prof. Maxime Crochemore Työn valvoja Prof. Jorma Tarhio Työn ohjaaja Prof. Jorma Tarhio Tiivistelmä Esitämme parannettuja käytännössä tehokkaita algoritmeja useisiin merkkijonohakuongelmiin, joissa etsitään lyhyttä merkkijonoa eli hahmoa pitkästä merkkijonosta eli tekstistä. Keskitymme pääasiassa ongelman muunnelmaan, missä tekstiä ei esikäsitellä, mutta esitämme myös kevyen hakemistorakenteen, joka nopeuttaa yhden hahmon tarkkaa hakua. Esitettyjä uusia algoritmeja voidaan soveltaa mm. moniin bioinformatiikan ongelmiin ja erilaisiin haku- ja suodatus- ongelmiin. Tarkan merkkijonohaun lisäksi kehitämme algoritmeja moniin muihin merkkijonohakuongelmiin. Käsittelemme merkkijonojen likimääräistä hakua, missä sallitaan rajattu määrä virheitä hahmon esiintymissä, ja parametrisoitua hakua, missä tekstin osajono täsmää hahmoon, jos osajonon merkit voidaan nimetä uudelleen siten, että hahmo täsmää tarkasti tähän uudelleennimettyyn osajonoon. Tarkastelemme myös usean hahmon yhtäaikaista hakua ja painotettujen hahmojen hakua, missä kunkin merkin paino kussakin positiossa kuvaa kyseisen merkin todennäköisyyttä esiintyä kyseisessä positiossa. Monet uusista algoritmeista lukevat hahmon kanssa kohdistetun tekstin osajonon taaksepäin eli oikealta vasemmalle. Toinen yhteinen piirre esitetyille algoritmeille on q-piirteiden käyttö eli algoritmit käsittelevät q:ta peräkkäistä merkkiä yhtenä merkkinä. Monet näistä uusista algoritmeista ovat bittirinnakkaisia eli ne pakkaavat monta muuttujaa samaan tietokoneen sanaan ja päivittävät kaikkia näitä muuttujia yhdellä käskyllä. Näytämme, että taaksepäin täsmäävät q-piirrealgoritmit, jotka ratkaisevat tarkan, likimääräisen tai usean hahmon merkkijonohaun, ovat keskimäärin optimaalisia. Lisäksi todistamme, että parametrisoidun haun taaksepäin täsmäävä q-piirrealgoritmi on keskimäärin alilineaarinen joukolle kohtalaisen toisteisia hahmoja. Näytämme myös, että kaikki esitetyt algoritmit ovat käytännössä nopeita verrattuna aikaisempiin algoritmeihin. Lopuksi esitämme aakkostonkarsintamenetelmän, joka nopeuttaa tarkkaa merkkijonohakua. Menetelmässä valitaan aakkoston osajoukko ja vastaava tekstin alijono. Hahmoa haetaan tästä lyhennetystä tekstistä ja löydetyt esiintymät tarkistetaan alkuperäisestä tekstistä. Näytämme, miten aakkoston osajoukko valitaan optimaalisesti ja että menetelmä nopeuttaa merkkijonohakua erityisesti kohtalaisen pitkillä hahmoilla. Asiasanat merkkijonohaku, likimääräinen merkkijonohaku, monen hahmon haku, parametrisoitu merkkijonohaku, painotettujen hahmojen haku, q-piirre, bittirinnakkaisuus, tekstin indeksointi ISBN (painettu) 978-951-22-9887-7 ISSN (painettu) 1797-6928 ISBN (pdf) 978-951-22-9888-4 ISSN (pdf) 1797-6936 Kieli Englanti Sivumäärä 153 s. Julkaisija Tietotekniikan laitos Painetun väitöskirjan jakelu Tietotekniikan laitos X Luettavissa verkossa osoitteessa http://lib.tkk.fi/Diss/2009/isbn9789512298884/ Preface First of all, I would like to thank my supervisor, Professor Jorma Tarhio. He recruited me to the String Algorithms Group when I was still an undergraduate student and introduced me to the many interesting problems in string algorithms. I would also like to thank Hannu Peltola who has always been ready to listen to my complaints and questions and to help with the more practical mysteries of university life. Also other members of SAG deserve their thanks. It has been a pleasure to work with you. This thesis was written while I was working at the Department of Computer Sci- ence and Engineering at Helsinki University of Technology. For financial support I would like to thank Helsinki Graduate School in Computer Science and Engineering and the Academy of Finland. I am also grateful for the opportunity to visit University of Chile in April 2008. It was very inspiring to work with Professor Gonzalo Navarro and Francisco Claude. The two weeks away from the usual distractions also proved to be very helpful in getting started on writing this thesis. I would also like to thank the pre-examiners, Professor Esko Ukkonen and Profes- sor Erkki Sutinen, for many useful comments that helped to improve this work. Finally, I would like to thank my family

IMPROVED ALGORITHMS for STRING SEARCHING PROBLEMS Doctoral Dissertation

FM-Index Reveals the Reverse Suffix Array

The One-Way Communication Complexity of Hamming Distance

Optimal Time and Space Construction of Suffix Arrays and LCP

Fault Tolerant Computing CS 530 Information Redundancy: Coding Theory

Sequence Distance Embeddings

Dictionary Look-Up Within Small Edit Distance

Approximate String Matching with Reduced Alphabet

Searching All Seeds of Strings with Hamming Distance Using Finite Automata

Approximate Boyer-Moore String Matching

B-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches

Problem Set 7 Solutions

A Quick Tour on Suffix Arrays and Compressed Suffix Arrays✩ Roberto Grossi Dipartimento Di Informatica, Università Di Pisa, Italy Article Info a B S T R a C T