Proceedings of the Prague Stringology Conference 2014

Edited by Jan Holub and Jan Žďárek

September 2014

PSC Prague Stringology Club http://www.stringology.org/

Conference Organisation

Program Committee

Amihood Amir (Bar-Ilan University, Israel)
Gabriela Andrejková (P. J. Šafárik University, Slovakia)
Maxime Crochemore (King's College London, United Kingdom)
Simone Faro (Università di Catania, Italy)
František Franěk (McMaster University, Canada)
Jan Holub, Co-chair (Czech Technical University in Prague, Czech Republic)
Costas S. Iliopoulos (King's College London, United Kingdom)
Shunsuke Inenaga (Kyushu University, Japan)
Shmuel T. Klein (Bar-Ilan University, Israel)
Thierry Lecroq, Co-chair (Université de Rouen, France)
Bořivoj Melichar, Honorary chair (Czech Technical University in Prague, Czech Republic)
Yoan J. Pinzón (Universidad Nacional de Colombia, Colombia)
Marie-France Sagot (INRIA Rhône-Alpes, France)
William F. Smyth (McMaster University, Canada)
Bruce W. Watson (FASTAR Group (Stellenbosch University and University of Pretoria, South Africa))
Jan Žďárek (Czech Technical University in Prague, Czech Republic)

Organising Committee

Miroslav Balík, Co-chair
Jan Janoušek
Bořivoj Melichar
Jan Holub, Co-chair
Jan Žďárek

External Referees

Jérémy Barbay
Juan Mendivelso
Elise Prieur-Gaston
Loek Cleophas
Yuto Nakashima
Ayumi Shinohara
Arnaud Lefebvre


Preface

The proceedings in your hands contain the papers presented at the Prague Stringology Conference 2014 (PSC 2014), held at the Czech Technical University in Prague, which organizes the event. The conference took place on September 1–3, 2014 and focused on stringology and related topics. Stringology is a discipline concerned with algorithmic processing of strings and sequences. The submitted papers were reviewed by the program committee, and eighteen papers were selected, based on originality and quality, as regular papers for presentation at the conference. This volume contains not only these selected papers but also an abstract of the invited talk "On the Number of Distinct Squares".

The Prague Stringology Conference has a long tradition. PSC 2014 is the eighteenth event of the Prague Stringology Club: it was preceded by the Prague Stringology Club Workshops (PSCW) in 1996–2000 and the Prague Stringology Conferences (PSC) in 2001–2006 and 2008–2013. The proceedings of these workshops and conferences have been published by the Czech Technical University in Prague and are available on the web pages of the Prague Stringology Club. Selected contributions were published in special issues of the journals Kybernetika, the Nordic Journal of Computing, the Journal of Automata, Languages and Combinatorics, the International Journal of Foundations of Computer Science, and Discrete Applied Mathematics.

The Prague Stringology Club was founded in 1996 as a research group at the Czech Technical University in Prague. The goal of the Prague Stringology Club is to study algorithms on strings, sequences, and trees with emphasis on automata theory. The first event organized by the Prague Stringology Club was the workshop PSCW'96, featuring only a handful of invited talks; since PSCW'97, however, the papers and talks have been selected by a rigorous peer review process. The objective is not only to present new results in stringology and related areas, but also to facilitate personal contacts among the people working on these problems.

We would like to thank all those who submitted papers for PSC 2014 as well as the reviewers. Special thanks go to all the members of the program committee, without whose efforts it would not have been possible to put together such a stimulating program for PSC 2014. Last, but not least, our thanks go to the members of the organizing committee for ensuring the smooth running of the conference.

In Prague, Czech Republic, September 2014
Jan Holub and Thierry Lecroq


Table of Contents

Invited Talk

On the Number of Distinct Squares by Frantisek Franek ...... 1

Contributed Talks

Fast Regular Expression Matching Based On Dual Glushkov NFA by Ryutaro Kurai, Norihito Yasuda, Hiroki Arimura, Shinobu Nagayama, and Shin-ichi Minato ...... 3

A Process-Oriented Implementation of Brzozowski’s DFA Construction Algorithm by Tinus Strauss, Derrick G. Kourie, Bruce W. Watson, and Loek Cleophas ...... 17

Efficient Online Abelian Pattern Matching in Strings by Simulating Reactive Multi-Automata by Domenico Cantone and Simone Faro ...... 30

Computing Abelian Covers and Abelian Runs by Shohei Matsuda, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda ...... 43

Two Squares Canonical Factorization by Haoyue Bai, Frantisek Franek, and William F. Smyth ...... 52

Multiple Pattern Matching Revisited by Robert Susik, Szymon Grabowski, and Kimmo Fredriksson ...... 59

Improved Two-Way Bit-parallel Search by Branislav Ďurian, Tamanna Chhabra, Sukhpal Singh Ghuman, Tommi Hirvola, Hannu Peltola, and Jorma Tarhio ...... 71

Using Correctness-by-Construction to Derive Dead-zone Algorithms by Bruce W. Watson, Loek Cleophas, and Derrick G. Kourie ...... 84

Random Access to Fibonacci Codes by Shmuel T. Klein and Dana Shapira.... 96

Speeding up Compressed Matching with SBNDM2 by Kerttu Pollari-Malmi, Jussi Rautio, and Jorma Tarhio ...... 110

Threshold Approximate Matching in Grammar-Compressed Strings by Alexander Tiskin ...... 124

Metric Preserving Dense SIFT Compression by Shmuel T. Klein and Dana Shapira ...... 139

Approximation of Greedy Algorithms for Max-ATSP, Maximal Compression, Maximal Cycle Cover, and Shortest Cyclic Cover of Strings by Bastien Cazaux and Eric Rivals ...... 148

Closed Factorization by Golnaz Badkobeh, Hideo Bannai, Keisuke Goto, Tomohiro I, Costas S. Iliopoulos, Shunsuke Inenaga, Simon J. Puglisi, and Shiho Sugimoto ...... 162

Alternative Algorithms for Lyndon Factorization by Sukhpal Singh Ghuman, Emanuele Giaquinta, and Jorma Tarhio ...... 169

Two Simple Full-Text Indexes Based on the Suffix Array by Szymon Grabowski and Marcin Raniszewski ...... 179

Reducing Squares in Suffix Arrays by Peter Leupold ...... 192

New Tabulation and Sparse Dynamic Programming Based Techniques for Sequence Similarity Problems by Szymon Grabowski ...... 202

Author Index ...... 21

On the Number of Distinct Squares
(Abstract)

Frantisek Franek

Department of Computing and Software, McMaster University, Hamilton, Ontario, Canada
[email protected]

Abstract. Counting the number of types of squares rather than their occurrences, we consider the problem of bounding the maximum number of distinct squares in a string. Fraenkel and Simpson showed in 1998 that a string of length n contains at most 2n distinct squares and indicated that all evidence pointed to n being a natural universal upper bound. Ilie simplified the proof of Fraenkel-Simpson's key lemma in 2005 and presented in 2007 an asymptotic upper bound of 2n − Θ(log n). We show that a string of length n contains at most ⌊11n/6⌋ distinct squares for any n. This new universal upper bound is obtained by investigating the combinatorial structure of FS-double squares (named so in honour of Fraenkel and Simpson's pioneering work on the problem), i.e. two rightmost-occurring squares that start at the same position, and showing that a string of length n contains at most ⌊5n/6⌋ FS-double squares. We will also discuss a much more general approach to double-squares, i.e. two squares starting at the same position and satisfying certain size conditions. A complete, so-called canonical factorization of double-squares that was motivated by the work on the number of distinct squares is presented in a separate contributed talk at this conference. The work on the problem of the number of distinct squares is a joint effort with Antoine Deza and Adrien Thierry.
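To make the counting notion concrete: the bound concerns square types, so the several occurrences of aa in a string contribute a single distinct square. The short Go program below is an illustrative brute-force counter added here for clarity; it is not part of the talk, and the function name distinctSquares is an assumption made for this example.

package main

import "fmt"

// distinctSquares returns the number of distinct substrings of s
// that are squares, i.e. of the form ww for a non-empty word w.
// Each square type is counted once, regardless of how often it occurs.
func distinctSquares(s string) int {
	seen := make(map[string]bool)
	n := len(s)
	for i := 0; i < n; i++ {
		for l := 2; i+l <= n; l += 2 { // candidate square lengths are even
			half := l / 2
			if s[i:i+half] == s[i+half:i+l] {
				seen[s[i:i+l]] = true
			}
		}
	}
	return len(seen)
}

func main() {
	fmt.Println(distinctSquares("abaabaa")) // 3: "aa", "abaaba", "baabaa"
	fmt.Println(distinctSquares("aaaa"))    // 2: "aa", "aaaa"
}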

At the time of the presentation of this talk, the slides are also available at http://www.cas.mcmaster.ca/~franek/PSC2014/invited-talk-slides.pdf

This work was supported by the Natural Sciences and Engineering Research Council of Canada

References

1. M. Crochemore and W. Rytter: Squares, cubes, and time-space efficient string searching. Algorithmica, 13:405–425, 1995.
2. A. Deza and F. Franek: A d-step approach to the maximum number of distinct squares and runs in strings. Discrete Applied Mathematics, 163:268–274, 2014.
3. A. Deza, F. Franek, and M. Jiang: A computational framework for determining square-maximal strings. In J. Holub and J. Žďárek, editors, Proceedings of the Prague Stringology Conference 2012, pp. 111–119, Czech Technical University in Prague, Czech Republic, 2012.
4. A. S. Fraenkel and J. Simpson: How many squares can a string contain? Journal of Combinatorial Theory, Series A, 82(1):112–120, 1998.
5. F. Franek, R. C. G. Fuller, J. Simpson, and W. F. Smyth: More results on overlapping squares. Journal of Discrete Algorithms, 17:2–8, 2012.
6. L. Ilie: A simple proof that a word of length n has at most 2n distinct squares. Journal of Combinatorial Theory, Series A, 112(1):163–163, 2005.
7. L. Ilie: A note on the number of squares in a word. Theoretical Computer Science, 380(3):373–376, 2007.
8. E. Kopylova and W. F. Smyth: The three squares lemma revisited. Journal of Discrete Algorithms, 11:3–14, 2012.



9. M. Kubica, J. Radoszewski, W. Rytter, and T. Waleń: On the maximum number of cubic subwords in a word. European Journal of Combinatorics, 34:27–37, 2013.
10. N. H. Lam: On the number of squares in a string. AdvOL-Report 2013/2, McMaster University, 2013.
11. M. J. Liu: Combinatorial optimization approaches to discrete problems. Ph.D. thesis, Department of Computing and Software, McMaster University, 2013.

Fast Regular Expression Matching Based On Dual Glushkov NFA

Ryutaro Kurai1,2, Norihito Yasuda1, Hiroki Arimura2, Shinobu Nagayama3, and Shin-ichi Minato1,2

1 JST ERATO MINATO Discrete Structure Manipulation System Project, 060-0814 Sapporo, Japan
{kurai, yasuda, minato}@erato.ist.hokudai.ac.jp
2 Graduate School of Information Science and Technology, Hokkaido University, 060-0814 Sapporo, Japan
[email protected]
3 Department of Computer and Network Engineering, Hiroshima City University, 731-3194 Hiroshima, Japan
s [email protected]

Abstract. This paper presents a new regular expression matching method using the Dual Glushkov NFA. The Dual Glushkov NFA is a variant of the Glushkov NFA with the strong property that all the outgoing transitions from a state have the same label. We propose a new matching method, Look Ahead Matching, that is suited to the Dual Glushkov NFA structure. This method executes the NFA simulation reading two input symbols at a time, using the next symbol to narrow down the active states during the NFA simulation. Applying Look Ahead Matching to an ordinary Thompson NFA costs additional working memory; however, the method can be used with no additional memory space if it is combined with the Dual Glushkov NFA. Experiments also indicate that the combination of the Dual Glushkov NFA with Look Ahead Matching outperforms the other methods on NFAs converted from practical regular expressions.

Keywords: regular expression matching, non-deterministic finite automata, ε-transition removal, Thompson NFA, Glushkov NFA

1 Introduction

1.1 Background
Regular expression matching is one of the fundamental research topics in computer science [13], since it plays such an important role in emerging applications in large-scale information processing fields, such as network intrusion detection systems (NIDS), bioinformatics search engines, linguistic vocabulary, and pattern matching in cloud computing [10,12,15].

1.2 Problems with previous approaches
For regular expression matching, there are three well-known approaches: backtracking, DFA, and NFA. Among them, backtracking is the most widely used in practical applications. However, this approach becomes very slow on certain difficult patterns and texts, such as the pattern a?ⁿaⁿ matched against the text aⁿ, which triggers extensive backtracking on the input text [4]. The deterministic finite automaton (DFA) approach is extremely fast if the input regular expression can be compiled into a DFA of small size, but it is not practical if a given regular expression causes an exponential explosion in the number of states.



The Nondeterministic Finite Automaton (NFA) approach can avoid such an explosion in the number of states, and is known to be faster than the naive backtracking approach in cases where backtracking suffers from many near-misses. Unfortunately, the NFA approach is not so fast in practice. One of the major reasons is the cost of maintaining a set of active states: every time the next input symbol arrives, the NFA has to update all the active states to their next states or discard them. If the number of active states becomes large, the updating cost also increases.
We can further classify NFAs into three types: the Thompson NFA, the Glushkov NFA, and the Dual Glushkov NFA. The most popular one is the Thompson NFA, which is easy to construct, and whose numbers of transitions and states are a constant multiple of the associated regular expression's length. The Thompson NFA, however, contains many ε-transitions, and those transitions produce many active states when such an NFA is simulated. The Glushkov NFA is the other popular NFA. It has the strong property that it has no ε-transitions and that all the incoming transitions to a state have the same label. The Dual Glushkov NFA is a variant of the Glushkov NFA with a special feature that is worth our attention: in contrast to the Glushkov NFA, all the outgoing transitions from a state of the Dual Glushkov NFA have the same label.

1.3 Speed-up methods
We can simulate the Glushkov NFA faster than the Thompson NFA because it has no ε-transitions, which results in fewer active states. Nevertheless, for complex regular expressions we still have to manage a large number of active states, and this slows down the matching speed. To cope with this problem, we propose a new method, Look Ahead Matching, which reads two symbols of the input text at a time. We call the first of the two symbols the "current symbol" and the second one the "next symbol". Ordinary matching methods read only the current symbol and calculate the NFA active states from the current active states and that symbol. Our method uses the next symbol to narrow down the active state set: a state is set active only if it has an incoming transition labeled with the first symbol and an outgoing transition labeled with the second symbol. However, fast matching with two input symbols creates large memory demands if we use the Glushkov NFA. To deal with this problem, we employ the Dual Glushkov NFA. The structure of the Dual Glushkov NFA is similar to that of the Glushkov NFA, but it is better suited to building an index of transitions for look-ahead matching. To enable this new matching method we have to build a transition table indexed by combinations of two symbols; such a table can be generated without additional space if we use the above property of the Dual Glushkov NFA. Therefore, we propose a new look-ahead matching method based on the Dual Glushkov NFA.

1.4 Main Results
In this paper we propose a new matching method created by combining the Dual Glushkov NFA and look-ahead matching. We then compare our proposal against other methods such as the original Thompson NFA, the Glushkov NFA, and a combination of the Glushkov NFA and look-ahead matching. For reference, we also compare our method with NR-grep [11]. In most cases our method is faster than the Thompson NFA or the Glushkov NFA.

Our method is only slightly slower than the combination of Glushkov NFA and look ahead matching, but it uses far less memory.

1.5 Related Works
Many regular expression matching engines use the backtracking approach. They traverse the syntax tree of a regular expression, and backtrack if they fail to find a match. Backtracking has a small memory footprint and high throughput for small and simple regular expressions. However, in the worst case it takes time exponential in the size of the regular expression [4]. Another approach, compiling the regular expression, has been used since the 1970s [1]. Such an algorithm converts a regular expression into an NFA, which is then converted into a DFA. This intermediate NFA (called the Thompson NFA) has memory size linear in the length of the input regular expression. However, the final subset-constructed DFA takes space exponential in the size of the NFA and can overflow the main memory of even recent computers.
In recent years, NFA evaluation for regular expression matching has been attracting much attention. Calculations performed on GPUs, FPGAs, or other special hardware cannot use abundant memory, but their calculation speed is much higher and their concurrency larger than those of typical computers; therefore, the basic NFA approach is adopted in some fields [3,7]. Cox proposed the NFA-based regular expression library RE2 [14]. For fast evaluation, it caches a DFA generated from an NFA on the fly. RE2 appears to be the NFA-based library that is most widely used; it has guaranteed computation time due to its NFA-oriented execution model.
Berry and Sethi popularized the Glushkov NFA [2]. Watson has researched the Glushkov NFA and its applications in detail; he also showed the relationship between the Thompson NFA and the Glushkov NFA [16–18].

1.6 Organization
Sec. 2 briefly introduces regular expression matching, non-deterministic automata, and their evaluation. Sec. 3 presents our methods including preprocessing and runtime methods. Sec. 4 shows the results of computer experiments on NFA evaluation. Sec. 5 concludes this paper.

2 Preliminaries

2.1 Regular Expression
In this study, we consider regular expressions as follows. This definition is from [9]. Let Σ be an alphabet. The regular expressions over Σ and the sets that they denote are defined recursively as follows.
1. ∅ is a regular expression and denotes the empty set.
2. ε is a regular expression and denotes the set {ε}.
3. Each symbol a in Σ is a regular expression and denotes the set {a}.
4. If r and s are regular expressions denoting the languages L(r) and L(s), respectively, then (r|s), (rs), and (r∗) are regular expressions that denote the sets L(r) ∪ L(s), L(r)L(s), and L(r)∗, respectively.
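As a concrete reading of this recursive definition, a regular expression can be stored as a small expression tree with one node kind per rule above. The Go sketch below is illustrative only; the paper does not prescribe a representation, and the names are assumptions for this example.

package regexast

// Kind enumerates the cases of the recursive definition of regular expressions.
type Kind int

const (
	Empty  Kind = iota // ∅: denotes the empty set
	Eps                // ε: denotes {ε}
	Symbol             // a single symbol a ∈ Σ: denotes {a}
	Union              // (r|s): denotes L(r) ∪ L(s)
	Concat             // (rs): denotes L(r)L(s)
	Star               // (r∗): denotes L(r)∗
)

// Node is one node of the expression tree.
type Node struct {
	Kind        Kind
	Sym         byte  // used when Kind == Symbol
	Left, Right *Node // Right is nil for Star; both are nil for leaves
}

// Example: (AT|GA)((AG|TAA)∗) is the concatenation of a union of two
// concatenations with the star of another union.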

Figure 1. T-NFA for R = (AT|GA)((AG|TAA)∗)

2.2 Thompson NFA
The NFA constructed by Thompson's algorithm for a regular expression R is called the Thompson NFA (or T-NFA, for short). It precisely handles the language Σ∗L(R), which represents the substring match of R against a substring of a text. Formally, the T-NFA for R is a 5-tuple N_R = (V, Σ, E, I, F), where V is a set of states, Σ is an alphabet of symbols, E ⊆ V × (Σ ∪ {ε}) × V is a set of symbol- and ε-transitions, called the transition relation, and I and F ⊆ V are the sets of initial and final states, respectively. Each transition e in E is called a symbol-transition if its label is a symbol c in Σ, and an ε-transition if the label is ε. Each transition is described as (s, char, t) ∈ E; in this expression, s and t are the source state and the target state, and char is the label of the transition.
The T-NFA N_R for R has a nested structure associated with the syntax tree for R. Let the length of the associated regular expression be m. This length is the number of all symbols that appear in the associated regular expression, including the special symbols like "*", "(", or "+". For instance, the length of the regular expression "abb*" is 4. The T-NFA has at most 2m states, and every state has in-degree and out-degree of two or less. Specifically, a state s in V can have at most one symbol-transition or two ε-transitions from s. We show an example of a T-NFA for R = (AT|GA)((AG|TAA)∗) in Fig. 1.
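A direct data-structure rendering of the 5-tuple N_R = (V, Σ, E, I, F) is sketched below in Go; it is an assumption added for illustration (the authors' implementation is not shown in the paper), with the zero byte standing in for the ε label.

package nfa

// Epsilon is the label used for ε-transitions.
const Epsilon byte = 0

// Transition is one element (s, char, t) of the transition relation E.
type Transition struct {
	Src  int  // source state s
	Char byte // label: a symbol of Σ, or Epsilon for an ε-transition
	Dst  int  // target state t
}

// NFA is the 5-tuple (V, Σ, E, I, F); states are numbered 0 .. NumStates-1.
type NFA struct {
	NumStates int          // |V|
	Alphabet  []byte       // Σ
	Trans     []Transition // E
	Initial   map[int]bool // I
	Final     map[int]bool // F
}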

2.3 Glushkov NFA
Another popular method of constructing an NFA from a regular expression is Glushkov's algorithm [2]. We call the automaton constructed by this method the Glushkov NFA (also known as the position automaton), abbreviated here to G-NFA. However, a G-NFA can also be converted from a T-NFA by using the algorithm in Fig. 4 (as Watson showed in [18]). This algorithm removes the ε-transitions from the T-NFA. For instance, we show the new transitions that skip ε-transitions as bold transitions in Fig. 2, and the fully converted G-NFA obtained from the T-NFA in Fig. 3. Both examples show NFAs that precisely handle R = (AT|GA)((AG|TAA)∗).

2.4 G-NFA Properties
The G-NFA has some very interesting properties.
– It has no ε-transitions. We call this property ε-free.
– For any state, the state's incoming transitions are labeled by the same symbol.
– It has only one initial state.
– It has one or more final states.

Figure 2. T-NFA and skip transitions of ε-transitions

Figure 3. G-NFA for R = (AT|GA)((AG|TAA)∗)

– Its number of states is m̃ + 1, where m̃ is the length of the associated regular expression, but this number excludes the special symbols like "∗", "(", or "+".
– The number of transitions is m̃² at worst.

2.5 Dual Glushkov NFA
The Dual Glushkov NFA is known as a variation of the G-NFA [16]. We call it the Dual G-NFA for short. The algorithm that converts a T-NFA into a Dual G-NFA (Fig. 5) is similar to the algorithm that converts a T-NFA into a G-NFA (Fig. 4). The base algorithm of Fig. 5 was also shown by Watson [18].
When we convert a T-NFA into a G-NFA, we generate skip transitions as follows. First, we search for every path that starts with an ε-path and ends with exactly one symbol-transition. Then we create a skip transition from the start state to the end state of each such path. The label of the new skip transition is taken from the last transition of the path.
When we convert a T-NFA into a Dual G-NFA, we generate skip transitions as follows. First, we search for every path that starts with exactly one symbol-transition and ends with an ε-path. Then we create a skip transition from the start state to the end state of each such path. The label of the new skip transition is taken from the first transition of the path.
In addition, an original T-NFA has the same number of outgoing and incoming transitions in every section of the T-NFA: in fact, the Conjunction, Concatenation, and Kleene Closure sections of a T-NFA have the same in-degree and out-degree.

procedure BuildG-NFA(N = (V, Σ, E, I, F))
    V′ ← ∅, E′ ← ∅, F′ ← F
    GlushkovState ← V \ {s′ : ∃s ∈ V, (s, ε, s′) ∈ E}
    for all s ∈ GlushkovState do
        for s′ ∈ Eclose(s) do
            if s′ ∈ F then
                F′ ← F′ ∪ {s′}
            end if
            for (s′, char, t) ∈ E do
                if char ≠ ε then
                    E′ ← E′ ∪ {(s, char, t)}
                end if
            end for
        end for
    end for
    return (V′, Σ, E′, I, F′)
end procedure

procedure Eclose(s ∈ V)
    Closure ← {s}
    for (s, char, t) ∈ E do
        if char = ε then
            Closure ← Closure ∪ Eclose(t)
        end if
    end for
    return Closure
end procedure

Figure 4. Algorithm Converting T-NFA to G-NFA

procedure BuildDualG-NFA(N = (V, Σ, E, I, F))
    V′ ← ∅, E′ ← ∅, I′ ← ∅
    DualGlushkovState ← V \ {s′ : ∃s ∈ V, (s′, ε, s) ∈ E}
    for s ∈ Eclose(I) do
        if s ∈ DualGlushkovState then
            I′ ← I′ ∪ {s}
        end if
    end for
    for all s ∈ DualGlushkovState do
        for (s, char, t) ∈ E do
            for t′ ∈ Eclose(t) do
                if t ≠ t′ and char ≠ ε and t′ ∈ DualGlushkovState then
                    E′ ← E′ ∪ {(s, char, t′)}
                end if
            end for
            E′ ← E′ ∪ {(s, char, t)}
        end for
    end for
    return (V′, Σ, E′, I′, F)
end procedure

Figure 5. Algorithm Converting T-NFA to Dual G-NFA

Figure 6. T-NFA to Dual G-NFA

Figure 7. Dual G-NFA for R = (AT|GA)((AG|TAA)∗)

Because of this property of the T-NFA, we can consider the "Dual G-NFA" above to be the dual of the "G-NFA". For instance, we show the new transitions that skip ε-transitions as bold transitions in Fig. 6, and the fully converted Dual G-NFA obtained from the T-NFA in Fig. 7. Both examples show NFAs that precisely handle R = (AT|GA)((AG|TAA)∗).
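Both conversions rest on following ε-paths from a state. The Go sketch below is an illustration (not code from the paper) that reuses the NFA representation assumed earlier; it computes the ε-closure of a state and the G-NFA-style skip transitions of Fig. 4. The Dual variant is symmetric, following the ε-path before the symbol-transition instead of after it.

// eclose returns the set of states reachable from s via ε-transitions only
// (including s itself), mirroring procedure Eclose in Fig. 4.
func eclose(n *NFA, s int) map[int]bool {
	closure := map[int]bool{s: true}
	stack := []int{s}
	for len(stack) > 0 {
		cur := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, tr := range n.Trans {
			if tr.Src == cur && tr.Char == Epsilon && !closure[tr.Dst] {
				closure[tr.Dst] = true
				stack = append(stack, tr.Dst)
			}
		}
	}
	return closure
}

// gSkipTransitions generates the G-NFA skip transitions from state s:
// for every s' in eclose(s) and every symbol-transition (s', a, t),
// a new transition (s, a, t) is created, labelled by the last (symbol)
// transition of the path.
func gSkipTransitions(n *NFA, s int) []Transition {
	var out []Transition
	for sp := range eclose(n, s) {
		for _, tr := range n.Trans {
			if tr.Src == sp && tr.Char != Epsilon {
				out = append(out, Transition{Src: s, Char: tr.Char, Dst: tr.Dst})
			}
		}
	}
	return out
}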

2.6 Dual G-NFA Properties

Dual G-NFA has properties similar to those of G-NFA.

– It is ε-free.
– For any state, the state's outgoing transitions are labeled by the same symbol.
– It has only one final state.
– It has one or more initial states.
– Its number of states is m̃ + 1.
– The number of transitions is m̃² at worst.

There is a duality between G-NFA and Dual G-NFA in the sense of the properties of initial states, final states, and labels of transitions.

procedure G-NFACountMatching(N = (V, Σ, E, I, F), T = t1 t2 t3 ... tn)
    CurrentActive ← ∅, NextActive ← ∅
    MatchCount ← 0
    Index ← BuildIndex(E)
    for pos ∈ 1, ..., n do
        CurrentActive ← CurrentActive ∪ I
        for s ∈ CurrentActive do
            NextActive ← NextActive ∪ Index[t_pos][s]
        end for
        if NextActive ∩ F ≠ ∅ then
            MatchCount ← MatchCount + 1
        end if
        CurrentActive ← NextActive
        NextActive ← ∅
    end for
    return MatchCount
end procedure

procedure BuildIndex(V, Σ, E)
    for s ∈ V do
        for char ∈ Σ do
            Index[char][s] ← ∅
        end for
    end for
    for (s, char, t) ∈ E do
        Index[char][s] ← Index[char][s] ∪ {t}
    end for
    return Index
end procedure

Figure 8. Regular Expression Matching Using NFA

2.7 Regular Expression Matching Method
Both the G-NFA and the Dual G-NFA are ε-free NFAs and share the same simulation algorithm, shown in Fig. 8. The basic idea of this algorithm was also shown by Watson [17]. The algorithm reads the input symbols t_i one by one, and then searches the current active state set (CurrentActive in Fig. 8) for states that have an outgoing transition labeled t_i. For a fast search we use the index created by BuildIndex. If such states are found, we add their target states to the next state set (NextActive in Fig. 8). At the end of a step, we check whether NextActive includes a final state. If a final state is found, we recognize that the input symbols read so far match the given regular expression.
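A minimal Go rendering of the simulation of Fig. 8 is sketched below. It reuses the illustrative NFA representation assumed in the earlier snippets and is not the authors' code; the index maps a symbol and a source state to the list of target states.

// buildIndex precomputes, for each (symbol, state) pair, the set of target
// states, as in procedure BuildIndex of Fig. 8.
func buildIndex(n *NFA) map[byte]map[int][]int {
	index := make(map[byte]map[int][]int)
	for _, c := range n.Alphabet {
		index[c] = make(map[int][]int)
	}
	for _, tr := range n.Trans {
		if index[tr.Char] == nil {
			index[tr.Char] = make(map[int][]int)
		}
		index[tr.Char][tr.Src] = append(index[tr.Char][tr.Src], tr.Dst)
	}
	return index
}

// countMatching simulates an ε-free NFA over text and counts the positions
// at which a final state is reached, as in Fig. 8.
func countMatching(n *NFA, text []byte) int {
	index := buildIndex(n)
	matchCount := 0
	current := make(map[int]bool)
	for _, c := range text {
		for s := range n.Initial { // CurrentActive := CurrentActive ∪ I
			current[s] = true
		}
		next := make(map[int]bool)
		for s := range current {
			for _, t := range index[c][s] {
				next[t] = true
			}
		}
		for t := range next {
			if n.Final[t] { // a final state is active: record a match
				matchCount++
				break
			}
		}
		current = next
	}
	return matchCount
}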

3 Our Method

3.1 Look ahead matching
The above NFA simulation method reads input symbols one by one and calculates state transitions. However, it is quite easy to also read the next input symbol, so we consider how to calculate state transitions more effectively. Let the current input symbol be t_i and the next input symbol t_{i+1}. When we know t_{i+1}, we want to treat as active only the states that satisfy the following formula.

LookAheadActive(s, t_i, t_{i+1}) = { s′ : (s, t_i, s′) ∈ E, (s′, t_{i+1}, s″) ∈ E }

And we formally define normal active states as follows.

Active(s, t_i) = { s′ : (s, t_i, s′) ∈ E }

For any s, t_i, and t_{i+1}, the size of LookAheadActive(s, t_i, t_{i+1}) is less than or equal to the size of Active(s, t_i). Because of this difference in the size of the active state sets, we expect that look-ahead matching can calculate transitions faster than normal matching. We show this algorithm formally in Fig. 9. The look-ahead matching idea has been used in some previous studies [5,6].
The problem with this matching method is the large size of the state transition table associated with t_i and t_{i+1}. The state transition table has duplicate transitions and costs O(|E|²) space to build from a G-NFA. For example, we show the transition table of Fig. 1 as Table 1. This table has 19 records, more than the number of the original G-NFA's transitions. The difference is due to the duplication of transitions.

id  t_i  t_{i+1}  source state  target state
1   T    A        3             5
2   T    A        5             12
3   T    A        6             12
4   T    A        13            12
5   T    A        15            12
6   T    T        3             5
7   A    A        4             6
8   A    A        12            14
9   A    A        14            15
10  A    T        4             6
11  A    T        14            15
12  A    T        0             3
13  G    A        11            13
14  G    A        0             4
15  G    T        11            13
16  A    G        5             11
17  A    G        6             11
18  A    G        13            11
19  A    G        15            11

Table 1. Look Ahead Transition Table for G-NFA and Dual G-NFA

id  t_i  t_{i+1}  source state  target state
1   A    T        1             3
2   A    T        4             10
3   A    T        14            10
4   G    A        2             4
5   G    A        11            9
6   A    G        9             11
7   T    A        10            12
8   T    A        3             9
9   A    A        12            14
10  A    A        4             9
11  A    A        14            9
12  T    *        3             7
13  T    T        3             10
14  A    *        4             7
15  A    *        14            7
16  G    *        11            7
17  G    T        11            10

Table 2. Look Ahead Transition Table for Dual G-NFA

3.2 Dual G-NFA Look Ahead Transition Function
As shown in the above section, the Dual G-NFA has the very desirable property that all outgoing transitions of a state have the same label. Because of this property, when the source state and t_i are given, the pairs of t_{i+1} and the target state are determined uniquely. Therefore, the transition table's size is O(|E|), which is effectively smaller than the G-NFA's table size of O(|E|²).
For instance, we show the transition table of Fig. 7 in Table 2. This table has 17 records, equaling the number of the original Dual G-NFA's transitions. A final state of the Dual G-NFA has no outgoing transition, so we show "∗" in the t_{i+1} column for the transitions that go to a final state.

procedure DualG-NFACountLookAheadMatching(N = (V, Σ, E, I, F), T = t1 t2 t3 ... tn)
    CurrentActive ← ∅, NextActive ← ∅
    MatchCount ← 0
    (Index, FinalIndex) ← BuildLookAheadIndex(E)
    for pos ∈ 1, ..., n − 1 do
        CurrentActive ← CurrentActive ∪ I
        for s ∈ CurrentActive do
            for t ∈ Index[t_pos][t_{pos+1}][s] do
                NextActive ← NextActive ∪ {t}
            end for
        end for
        for s ∈ CurrentActive do
            for t ∈ FinalIndex[t_pos][s] do
                NextActive ← NextActive ∪ {t}
            end for
        end for
        if NextActive ∩ F ≠ ∅ then
            MatchCount ← MatchCount + 1
        end if
        CurrentActive ← NextActive
        NextActive ← ∅
    end for
    return MatchCount
end procedure

procedure BuildLookAheadIndex(V, Σ, E, F)
    for s ∈ V do
        for char1 ∈ Σ do
            for char2 ∈ Σ do
                Index[char1][char2][s] ← ∅
            end for
            FinalIndex[char1][s] ← ∅
        end for
    end for
    for (s, char1, t) ∈ E do
        for (t, char2, t′) ∈ E do
            Index[char1][char2][s] ← Index[char1][char2][s] ∪ {t}
        end for
        if t ∩ F ≠ ∅ then
            FinalIndex[char1][s] ← FinalIndex[char1][s] ∪ {t}
        end if
    end for
    return (Index, FinalIndex)
end procedure

Figure 9. Look Ahead Regular Expression Matching Using Dual G-NFA
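The benefit of the Dual G-NFA here is that the look-ahead pair (t_{i+1}, target state) is determined by the target state alone, because all of a state's outgoing transitions carry one common label, so no table indexed by two symbols needs to be materialised. The Go sketch below is an illustration of this idea only (not the authors' implementation); it reuses buildIndex and the NFA representation from the earlier sketches and filters the next active set by the look-ahead symbol.

// buildOutLabels records, for each state of a Dual G-NFA, the common label of
// all of its outgoing transitions (the defining property of the Dual G-NFA);
// the final state, having no outgoing transition, gets no entry.
func buildOutLabels(n *NFA) map[int]byte {
	out := make(map[int]byte)
	for _, tr := range n.Trans {
		out[tr.Src] = tr.Char // all outgoing labels of tr.Src are identical
	}
	return out
}

// lookAheadStep computes the next active set from the current one, reading the
// current symbol cur and the look-ahead symbol next. A target state is kept
// only if its outgoing label equals next, or if it is a final state (no
// look-ahead constraint applies there), mirroring Index/FinalIndex in Fig. 9.
func lookAheadStep(n *NFA, index map[byte]map[int][]int, outLabel map[int]byte,
	active map[int]bool, cur, next byte) map[int]bool {
	result := make(map[int]bool)
	for s := range active {
		for _, t := range index[cur][s] {
			lbl, hasOut := outLabel[t]
			if (hasOut && lbl == next) || n.Final[t] {
				result[t] = true
			}
		}
	}
	return result
}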

4 Experiments and Results

To confirm the efficiency of the Dual G-NFA with Look Ahead matching (for short, Dual G-NFA with LA), we conducted three experiments. All experiments use the "English.100MB" text from the Pizza&Chili Corpus [8] as the input text, and we compared our method with the G-NFA and with the G-NFA with Look Ahead matching (for short, G-NFA with LA). For reference, the results of a simple T-NFA implementation by Russ Cox [4] and of NR-grep, a bit-parallel implementation of the G-NFA by Gonzalo Navarro, are also shown. All experiments were executed 10 times and the average time is shown.
The first experiment examines fixed string patterns. In this experiment, patterns were generated as follows: n fixed strings were randomly chosen from a fixed-strings dictionary and then joined by the conjunction symbol "|". We used the /usr/share/dict/words file on Mac OS X 10.9.2 as the fixed-strings dictionary. None of the patterns included special symbols of regular expressions like "∗", "?", or "+"; thus, the Aho-Corasick algorithm is clearly the most suited method for this problem. However, we performed this experiment to measure the trends of our methods.
Table 3 shows the time (in seconds) needed to convert the regular expressions to NFAs. From Table 3, the conversion time is much shorter than the matching time. The T-NFA (by Cox) was too fast to measure the conversion time accurately (it took under a microsecond).

n    G-NFA         Dual G-NFA
20   immeasurable  immeasurable
40   0.01          0.01
60   0.01          0.01
80   0.02          0.02
100  0.02          0.02
120  0.03          0.03
140  0.04          0.04
160  0.05          0.05
180  0.06          0.06
200  0.07          0.07

Table 3. Time needed to convert regular expressions to NFAs, in seconds

Fig. 10 shows the time (in seconds) needed to match against the whole text of "English.100MB". From Fig. 10, the time taken increases linearly with the number of patterns for the T-NFA, the G-NFA, and the Dual G-NFA. In contrast, G-NFA with LA and Dual G-NFA with LA, which use the look-ahead matching method, took almost constant time regardless of n. We assume this is because look-ahead matching kept the active state size small. NR-grep could not handle large regular expressions, so we only measured its times for n = 20, 40, and 60. Fig. 11 shows the average active state size, i.e., the total number of active states divided by the number of input symbols. As the graph shows, there is a strong correlation between the average active state size and the matching time.
In the second experiment, we generated patterns as follows. We inserted special symbols of regular expressions such as "∗", "?", or "+" into the patterns used in the first experiment. Insert positions were randomly selected, excluding the first and last pattern positions. We then joined these generated regular expression patterns by conjunction. In this case, the Aho-Corasick algorithm is clearly the most suited method since the pattern is a set of fixed strings. However, we can see the basic speed of pattern matching by treating the pattern as a regular expression.

Figure 10. Time (sec) needed to match the whole text of "English.100MB" in the 1st experiment, plotted against the number of patterns, for TNFA, nrgrep, GNFA, GNFA-Look-Ahead, D-GNFA, and D-GNFA-Look-Ahead; the right plot is a scaled-up view of part of the left plot.

Figure 11. Average number of active states for the 1st experiment, plotted against the number of patterns, for GNFA, GNFA-Look-Ahead, D-GNFA, and D-GNFA-Look-Ahead.


Figure 12. Time (sec) needed to match the whole text of "English.100MB" in the 2nd experiment; the right plot is a scaled-up view of part of the left plot.

Fig. 12 shows the results. The trends resemble those of the first experiment: G-NFA-Look-Ahead and Dual G-NFA-Look-Ahead were superior in terms of calculation time.
In the third experiment, we challenged our method with the actual regular expression patterns listed in Table 4. The first pattern, "suffix", matches words that have some specific suffixes; there were 35 suffixes. The second pattern, "prefix", matches words that have some specific prefixes; there were 32 prefixes. The third pattern, "name", matches some people's names.

Figure 13. Average number of active states for the 2nd experiment, plotted against the number of patterns.

The names were combinations of ten common given names and ten common surnames. The fourth pattern, "user", matches a popular form of user and computer names. The fifth pattern, "title", matches strings composed of capitalized words, like a chapter title in a book. These patterns include symbol classes like "[a-zA-Z]".

name    pattern sample
suffix  [a-zA-Z]+(able|ible|al|...|ise)
prefix  (in|il|im|infra|...|under)[a-zA-Z]+
names   (Jackson|Aiden|...|Jack) (Smith|Johnson|...|Rodriguez|Wilson)
user    [a-zA-Z]+@[a-zA-Z]+
title   ([A-Z]+ )+

Table 4. Regular expression patterns used in third experiment.

As shown in Table 5, Dual G-NFA with LA is the fastest in some cases, once again due to the reduction in active state size. The look-ahead methods are never slower than the T-NFA, the G-NFA, or the Dual G-NFA. If the input consists of only a small regular expression, like the patterns "name", "user", or "title", NR-grep is the fastest: for such patterns, the bit-parallel method implemented in NR-grep can manipulate G-NFAs effectively.

pattern  T-NFA (by Cox)  NR-grep  G-NFA  G-NFA with LA  Dual G-NFA  Dual G-NFA with LA
suffix   113.48          20.24    9.74   7.51           106.64      3.35
prefix   14.33           5.295    2.74   3.97           78.39       3.82
names    12.95           0.216    2.97   2.74           3.21        2.76
user     78.14           0.08     12.11  7.41           185.22      3.36
title    38.88           0.186    2.93   2.38           2.68        2.21

Table 5. Time (sec) needed to match the whole text of "English.100MB"

5 Conclusion

We proposed a new regular expression matching method based on the Dual G-NFA and Look Ahead Matching. We have shown that the Dual G-NFA allows a look-ahead matching index to be constructed without additional space. Our experiments have shown the effectiveness of look-ahead matching in accelerating NFA simulation. From the experimental results, our method can be useful for regular expression matching in practical usage. G-NFAs are used in some bit-parallel methods, so we now plan to apply bit-parallel techniques to Dual G-NFA methods.

pattern  G-NFA  G-NFA with LA  Dual G-NFA  Dual G-NFA with LA
suffix   2.33   1.65           50.44       1.15
prefix   1.50   1.15           8.11        0.33
names    1.01   1.00           0.01        0.001
user     1.77   1.59           40.67       0.59
title    1.03   1.01           0.75        0.01

Table 6. Average number of active states

References

1. A. V. Aho and J. E. Hopcroft: The Design and Analysis of Computer Algorithms, Addison-Wesley, 1st ed., 1974.
2. G. Berry and R. Sethi: From regular expressions to deterministic automata. Theoretical Computer Science, 48, 1986, pp. 117–126.
3. N. Cascarano, P. Rolando, F. Risso, and R. Sisto: iNFAnt: NFA pattern matching on GPGPU devices. ACM SIGCOMM Computer Comm. Review, 40(5), 2010, pp. 20–26.
4. R. Cox: Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...). http://swtch.com/~rsc/regexp/regexp1.html, January 2007.
5. N. de Beijer: Stretching and jamming of automata. Master's thesis, Faculty of Computing Science, Eindhoven University of Technology, The Netherlands, 2004.
6. N. de Beijer, L. G. Cleophas, D. G. Kourie, and B. W. Watson: Improving automata efficiency by stretching and jamming, in Proceedings of the Fifteenth Prague Stringology Conference, September 2010, pp. 9–24.
7. P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, and H. Noyes: An efficient and scalable semiconductor architecture for parallel automata processing. IEEE Transactions on Parallel and Distributed Systems, 2013.
8. P. Ferragina and G. Navarro: The Pizza & Chili Corpus. http://pizzachili.dcc.uchile.cl/.
9. J. E. Hopcroft, R. Motwani, and J. D. Ullman: Introduction to Automata Theory, Languages, and Computation, Third Edition, Addison Wesley, 2006.
10. Y. Kaneta, S. Yoshizawa, S. Minato, H. Arimura, and Y. Miyanaga: Dynamic reconfigurable bit-parallel architecture for large-scale regular expression matching, in Proc. FPT'10, Dec 2010, pp. 21–28.
11. G. Navarro: NR-grep: a fast and flexible pattern-matching tool. Software: Practice and Experience, 31(13), 2001, pp. 1265–1312.
12. G. Navarro and M. Raffinot: Flexible pattern matching in strings — practical on-line search algorithms for texts and biological sequences. Cambridge University Press, 2002.
13. D. Perrin: Finite automata, in Handbook of Theoretical Computer Science, Vol. B, Chap. 1, J. van Leeuwen, ed., 1990, pp. 1–57.
14. RE2, an efficient, principled regular expression library: https://code.google.com/p/re2/.
15. Y. Wakaba, M. Inagi, S. Wakabayashi, and S. Nagayama: An efficient hardware matching engine for regular expression with nested Kleene operators, in Proc. FPL'11, 2011, pp. 157–161.
16. B. W. Watson: A taxonomy of finite automata construction algorithms, tech. rep., Faculty of Computing Science, Eindhoven University of Technology, The Netherlands, 1993.
17. B. W. Watson: The design of the FIRE engine: A C++ toolkit for FInite automata and Regular Expressions, tech. rep., Faculty of Computing Science, Eindhoven University of Technology, The Netherlands, 1994.
18. B. W. Watson: Taxonomies and Toolkits of Regular Language Algorithms, PhD thesis, Faculty of Computing Science, Eindhoven University of Technology, The Netherlands, September 1995.

A Process-Oriented Implementation of Brzozowski's DFA Construction Algorithm

Tinus Strauss1, Derrick G. Kourie2, Bruce W. Watson2, and Loek Cleophas1

1 FASTAR Research group, University of Pretoria, South Africa
{tinus,loek}@fastar.org
2 FASTAR Research group, University of Stellenbosch, South Africa
{derrick,bruce}@fastar.org

Abstract. A process-algebraic description of Brzozowski's deterministic finite automaton construction algorithm, slightly adapted from a previous version, shows how the algorithm can be structured as a set of communicating processes. This description was used to guide a process-oriented implementation of the algorithm. The performance of the process-oriented algorithm is then compared against the sequential version for a statistically significant number of randomly generated regular expressions. It is shown that the concurrent version of the algorithm outperforms the sequential version both on a multi-processor machine as well as on a single-processor multi-core machine. This is despite the fact that processor allocation and process scheduling cannot be user-optimised but are, instead, determined by the operating system.

Keywords: automaton construction, concurrency, CSP, regular expressions

1 Introduction

Although contemporary computers commonly have multicores, the processor allocation and scheduling is not in the hands of the application software developer but, instead, determined by the operating system. This fact raises numerous questions. What are the implications of parallelising algorithms that have traditionally been expressed sequentially? The strengths and weaknesses of the sequential algorithms are generally well-known, and often a lot of effort has gone into sequential optimisations. Furthermore, for most software developers parallel thinking is unfamiliar, difficult and error-prone compared to sequential algorithmic thinking. Is parallel thinking inherently difficult for software developers or is the relative scarcity of parallel versions of sequential software simply a matter of inertia? Is it worth the effort to convert traditional sequential algorithms into parallel format when the fate of the software—the processor allocation and process scheduling—is largely out of the developer's control? Perhaps questions such as these explain, at least in part, why there has not been a mushrooming of parallel software algorithm development, notwithstanding more than a decade of hype about the future of computing being in parallelism. These observations apply not only to algorithmic software development in general, but also to the specific case of algorithmic software related to stringology.
This paper makes a start in assessing the practical implications of developing and implementing parallel algorithmic versions of well-known stringological sequential algorithms in contexts where we have no direct control over processor allocation and scheduling.


A process-algebraic description of Brzozowski's deterministic finite automaton construction algorithm, slightly adapted from a previous version [15], shows how the algorithm can be structured as a set of communicating processes in Hoare's CSP [8,7]. This description was used to guide a process-oriented implementation of the algorithm in Go [17], as Go's concurrency features (inspired by CSP) allowed us to easily map from CSP to Go. A scheme is described to randomly generate regular expressions within certain constraints. The performance of the process-oriented algorithm is then compared against the sequential version for a statistically significant number of randomly generated regular expressions. It is shown that the concurrent version of the algorithm outperforms the sequential version on a multi-processor machine, despite the fact that processor allocation and process scheduling cannot be user-optimised but are, instead, determined by the operating system.
Of course, [15] is one of several efforts at developing parallel algorithms for stringological problems. Some previous efforts include [4,18] for finite automaton construction, [12,3,9] for membership testing, and [16,10,14] for minimization. In [6] Watson and Hanneforth consider the parallelisation of finite automaton determinisation, and in [15] a high-level CSP-based specification of Brzozowski's DFA construction algorithm was proposed.
The next section discusses Brzozowski's classical sequential DFA construction algorithm. Section 3 then presents a process-oriented implementation of the algorithm, suitable for concurrent execution, and is followed by a performance comparison between the two approaches in Section 4. Section 5 presents some concluding remarks and ideas for future work.

2 Sequential algorithm

Brzozowski's DFA construction algorithm [2] employs the notion of derivatives of regular expressions to construct a DFA. The algorithm takes a regular expression E as input and constructs an automaton that accepts the language represented by E. The automaton is represented using the normal five-tuple notation (D, Σ, δ, S, F), where D is the set of states; Σ the alphabet; δ the transition relation mapping a state and an alphabet symbol to a state; and S, F ⊆ D are the start and final states, respectively. L is an overloaded function giving the language of a finite automaton or a regular expression.
Brzozowski's algorithm identifies each DFA state with a regular expression. Apart from the start state, this regular expression is the derivative of a parent state's associated regular expression.¹ Elements of D may therefore interchangeably be referred to either as regular expressions or as states, depending on the context of the discussion.
The well-known sequential version of the algorithm is given in Dijkstra's guarded command language [5] in Figure 1. The notation assumes that the set operations ensure 'uniqueness' of the elements at the level of similarity [2, Def 5.2], i.e. a ∈ A implies that there is no b ∈ A such that a and b are similar regular expressions.
The algorithm maintains two sets of regular expressions (or states): a set T ('to do') containing the regular expressions for which derivatives need to be calculated, and another set D ('done') containing the regular expressions for which derivatives have already been found.

¹ In fact, it can be shown that the language of each state's associated regular expression is also the right language of that state.

func Brz(E, Σ) →
    δ, S, F := ∅, {E}, ∅;
    D, T := ∅, S;
    do (T ≠ ∅) →
        let q be some state such that q ∈ T;
        D, T := D ∪ {q}, T \ {q};
        { build out-transitions from q on all alphabet symbols }
        for (a : Σ) →
            { find derivative of q with respect to a }
            d := a⁻¹q;
            if d ∉ (D ∪ T) → T := T ∪ {d}
            [] d ∈ (D ∪ T) → skip
            fi;
            { make a transition from q to d on a }
            δ(q, a) := d
        rof;
        if ε ∈ L(q) → F := F ∪ {q}
        [] ε ∉ L(q) → skip
        fi
    od;
    return (D, Σ, δ, S, F)
cnuf

Figure 1. Brzozowski's DFA construction algorithm

When the algorithm terminates, T is empty and D contains the states of the automaton which recognises L(E).
The algorithm iterates through all the elements q ∈ T, finding derivatives with respect to all the alphabet symbols and depositing these new states (regular expressions) into T in those cases where no similar regular expression has already been deposited into T ∪ D. Each q, once processed in this fashion, is then removed from T and added into D. In each iteration of the inner for loop (i.e. for each alphabet symbol), the δ relation is updated to contain the mapping from state q to its derivative with respect to the relevant alphabet symbol. Finally, if state q represents a regular expression whose language contains the empty string,² then that state is included in the set of final states F.
In the forthcoming section we present a process-oriented implementation of the algorithm in which we attempt to structure the algorithm around a number of communicating sequential processes which may benefit from concurrent execution.

² Such a regular expression is called "nullable".
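For readers who want to see the two core operations of the algorithm in code, the Go sketch below shows nullability (ε ∈ L(q)) and the derivative a⁻¹q with respect to a symbol. It is an illustration only, with its own minimal expression type; it is not the implementation discussed later in the paper, and similarity reduction and the worklist loop of Figure 1 are omitted.

// Re is a minimal regular-expression tree: exactly one of the cases applies.
type Re struct {
	kind        int // 0 = ∅, 1 = ε, 2 = symbol, 3 = union, 4 = concat, 5 = star
	sym         byte
	left, right *Re
}

// nullable reports whether ε ∈ L(e).
func nullable(e *Re) bool {
	switch e.kind {
	case 1, 5: // ε and r* are nullable
		return true
	case 3:
		return nullable(e.left) || nullable(e.right)
	case 4:
		return nullable(e.left) && nullable(e.right)
	default: // ∅ and single symbols are not
		return false
	}
}

// deriv returns the Brzozowski derivative a⁻¹e (without similarity reduction).
func deriv(e *Re, a byte) *Re {
	empty := &Re{kind: 0}
	eps := &Re{kind: 1}
	switch e.kind {
	case 2: // a⁻¹a = ε, and a⁻¹b = ∅ for b ≠ a
		if e.sym == a {
			return eps
		}
		return empty
	case 3: // a⁻¹(r|s) = a⁻¹r | a⁻¹s
		return &Re{kind: 3, left: deriv(e.left, a), right: deriv(e.right, a)}
	case 4: // a⁻¹(rs) = (a⁻¹r)s, plus a⁻¹s if r is nullable
		d := &Re{kind: 4, left: deriv(e.left, a), right: e.right}
		if nullable(e.left) {
			return &Re{kind: 3, left: d, right: deriv(e.right, a)}
		}
		return d
	case 5: // a⁻¹(r*) = (a⁻¹r)r*
		return &Re{kind: 4, left: deriv(e.left, a), right: e}
	default: // a⁻¹∅ = a⁻¹ε = ∅
		return empty
	}
}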

3 Concurrent description

We present here an approach to decompose the algorithm into communicating processes. These processes may then be executed concurrently, which may result in improved runtimes on multi-processor platforms.

Of the many process algebras that have been developed to concisely and accurately model concurrent systems, we have selected CSP [8,7] as a fairly simple and easy to use notation. It is arguably better known and more widely used than most other process algebras. Below, we provide a brief introduction to the CSP operators that are used in the subsequent process definitions.

3.1 Introductory Remarks
CSP is concerned with specifying a system of communicating sequential processes (hence the CSP acronym) in terms of sequences of events, called traces. Various operators are available to describe the sequence in which events may occur, as well as to connect processes. Table 1 briefly outlines the main operators used in this article.

a → P             event a then process P
a → P | b → Q     a then P choice b then Q
x : A → P(x)      choice of x from set A then P(x)
P ∥ Q             P in parallel with Q;
                  synchronize on common events in the alphabet of P and Q
b!e               on channel b output event e
b?x               from channel b input to variable x
P <| C |> Q       if C then process P else process Q
P ; Q             process P followed by process Q
P □ Q             process P choice process Q

Table 1. Selected CSP notation

Full details of the operator semantics and laws for their manipulation are available in [8,7]. Note that SKIP designates a special process that engages in no further event, but that simply terminates successfully. Parallel synchronisation of processes means that if A ∩ B ≠ ∅, then the process (x : A → P(x)) ∥ (y : B → Q(y)) engages in some nondeterministically chosen event z ∈ A ∩ B and then behaves as the process P(z) ∥ Q(z). However, if A ∩ B = ∅ then deadlock results. A special case of such parallel synchronisation is the process (b!e → P) ∥ (b?x → Q(x)). This should be viewed as a process that engages in the event b.e and thereafter behaves as the process P ∥ Q(e). This can also be interpreted as processes communicating over a channel b: the one process writes e onto channel b and the other process reads from channel b into variable x.
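As a small illustration of the channel reading of (b!e → P) ∥ (b?x → Q(x)), the Go fragment below (illustrative only, not taken from the implementation discussed later) lets one goroutine write e onto a channel b while another reads it into x; the two goroutines synchronise on the communication and then each continues independently.

package main

import (
	"fmt"
	"sync"
)

func main() {
	b := make(chan int) // the channel b; unbuffered, so send and receive synchronise
	e := 42
	var wg sync.WaitGroup
	wg.Add(2)

	// (b!e → P): output e on channel b, then continue as P.
	go func() {
		defer wg.Done()
		b <- e
		fmt.Println("P continues after the communication")
	}()

	// (b?x → Q(x)): input from channel b into x, then continue as Q(x).
	go func() {
		defer wg.Done()
		x := <-b
		fmt.Printf("Q received x = %d\n", x)
	}()

	wg.Wait()
}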

3.2 Process descriptions
The concurrent version of the algorithm can be modelled as a process BRZ which is composed of four concurrent processes, which may themselves be composed of further processes. The first of these processes is named OUTER and corresponds to the outer loop of the algorithm in Figure 1. It is responsible for maintaining the sets D, T, and F. The second process computes the derivatives of regular expressions and is called DERIV. A FANOUT process is responsible for distributing a regular expression to the components of DERIV. The final process UPDATE modifies the transition relation δ. The process definition for BRZ(T, D, F, δ) is thus:

BRZ(T, D, F, δ) = OUTER(T, D, F) ∥ FANOUT ∥ DERIV ∥ UPDATE(δ)

Process OUTER is modelled as a process which initialises its local state and then repeatedly performs actions to modify T , D, and F as well as its local state. This repetition is modelled as process LOOP.

OUTER(T, D, F) = init → LOOP(T, D, F)

LOOP first modifies the local state and then has a choice between the following behaviours. It may write q ∈ T to channel outNode and repeat, it may receive a new regular expression d from channel inNode and repeat, or it may terminate if the local state is such that no more states need to be processed. When LOOP sends q out, q is removed from T and added to D, and if ε ∈ L(q) then q is also added into F. In the case when a new state is received, it is added into T if no similar state is in T ∪ D.

DERIV is responsible for concurrently calculating the derivatives of a regular expression with respect to each alphabet symbol. This corresponds to the inner for loop in Figure 1. DERIV is thus modelled as the concurrent composition of Σ processes, each responsible for calculating the derivative with respect to a given i |Σ|. ∈ DERIV = DERIV ki:Σ i

Each DERIV i repeatedly reads a regular expression re from its input channel 1 dOuti, calculates the derivative and then sends the result as a triple re,i,i− re out on a shared channel derivChan. h i

DERIV_i = dOut_i?re → computeDeriv.re → derivChan!⟨re, i, i⁻¹re⟩ → DERIV_i

The FANOUT process connects OUTER and DERIV and is responsible for communicating the regular expressions from OUTER to each DERIV_i. It repeatedly reads a regular expression from its input channel outNode and concurrently replicates it to the |Σ| output channels dOut_i.

FANOUT = (outNode?re → ∥_{i:Σ} (dOut_i!re → SKIP)); FANOUT

In order to complete the DFA, we need to record all the state transitions in δ. This is the responsibility of UPDATE. It is modelled as a repeating process which reads a triple ⟨re, i, d⟩ from its input channel derivChan and records δ(re, i) = d. It also sends one element of the triple, d, on to OUTER via channel inNode. This d is potentially a new state from which transitions should be calculated and hence it should be added into T if a similar node has not been processed before.

UPDATE(δ) = derivChan?⟨re, i, d⟩ → inNode!d → UPDATE(δ ∪ ⟨re, i, d⟩)

Figure 2. The communications network of the BRZ process (OUTER, FANOUT, DERIV_1 ... DERIV_n, and UPDATE, connected by the channels outNode, dOut_1 ... dOut_n, derivChan, and inNode).

Figure 2 shows the communicating processes along with their associated input and output channels. Termination is not addressed completely in the above process models. Notably, processes which repeatedly read from an input channel live until their input channels are closed. Consequently they can be modelled as a choice between reading from the channel and SKIP. These choices were omitted above to simplify the presentation. In the following section we compare the performance of the concurrent implementation against the sequential algorithm.

4 Performance comparison

In the preceding section the Brzozowski DFA construction algorithm was decomposed into a network of communicating processes. The next step is then to implement the CSP descriptions as an executable program and compare the performance of the sequential and concurrent algorithms. It was decided to use the Go programming language [17] for the implementation. Go is a compiled language with concurrency features inspired by Hoare's CSP. Particularly relevant to the present context is the fact that Go has channels as first class members of the language. The CSP processes in the process network from Section 3 map to so-called go-routines and the communication channels map to Go channels. An alternative language that implements CSP-like channels of which we are aware is occam-π [1]. Go was chosen over occam-π since Go allows us to implement data structures more easily and documentation is also more readily available.
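As an illustration of this mapping (with channel and type names assumed for the example; the actual implementation is not reproduced in the paper), the following Go fragment mirrors the UPDATE process: it reads a triple from derivChan, records it in δ, and forwards the new state on inNode. Ranging over the channel also realises the live-until-the-input-channel-is-closed termination behaviour mentioned at the end of Section 3.

// triple carries ⟨re, i, d⟩: a source expression, an alphabet symbol, and its derivative.
type triple struct {
	re, d string // regular expressions, represented here simply as strings
	i     byte   // alphabet symbol
}

// update mirrors the UPDATE process: it records δ(re, i) = d and forwards d to OUTER.
// The goroutine terminates once derivChan has been closed and drained.
func update(derivChan <-chan triple, inNode chan<- string, delta map[[2]string]string) {
	for t := range derivChan {
		delta[[2]string{t.re, string(t.i)}] = t.d // δ := δ ∪ {⟨re, i, d⟩}
		inNode <- t.d                             // pass the (possibly new) state to OUTER
	}
	close(inNode) // no more states will be produced
}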

4.1 Experimental setup
The aim of the present experiment is simply to test the hypothesis that the concurrent implementation can construct DFAs faster than the sequential version. No attempt was made to investigate completely the performance characteristics of the process-oriented implementation.
In order to compare the performance, the following approach was followed. A regular expression was generated and both the sequential and concurrent algorithms were executed with this regular expression. The respective execution times were recorded. In order to reduce the effects of transient operating system events, each construction was executed 30 times and the minimum duration was used as the data point for that regular expression. Various regular expressions were used as input to observe the performance of the algorithms over a range of input.
Regular expressions were randomly generated via a simple recursive procedure gen(Σ, d). The procedure takes as input two parameters: an alphabet Σ and an integer d. If d = 0 the procedure returns a random symbol from Σ. If d > 0 then gen(Σ, d) randomly chooses a regular expression operator and then recursively generates the required operands for the operator by calling gen(Σ, d − 1). The size of the regular expression is thus controlled by d since d defines the depth of the expression tree for the regular expression. The upper bound for the number of operators in the tree is 2^d − 1 and for the number of symbols (leaves) it is 2^d. Many generated regular expressions will be smaller since some regular expression operators are unary operators, which result in a tree that is smaller than a complete binary tree.
We decided to consider the performance of the algorithms with regular expressions over both a small alphabet and a larger alphabet. The small alphabet contained 4 symbols and the larger alphabet 85 symbols. Regular expressions were generated with depths d = 5, 6, ..., 10. For each of the 12 elements in {4, 85} × {5, 6, ..., 10} we generated 50 regular expressions.
The implementations were compiled with Go version 1.2.2 and initially executed on Mac OS X 10.7.5 on a MacPro1,1 with two Dual-Core Intel Xeon 2.66GHz processors and 5GB RAM. The runtimes of the programs on this platform are analysed in the next section.
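A direct Go rendering of the gen(Σ, d) procedure described above might look as follows. This is an illustrative sketch rather than the generator actually used (which is not listed in the paper); it assumes the operator is chosen uniformly among union, concatenation, and Kleene star, and returns the expression as a string.

package main

import (
	"fmt"
	"math/rand"
)

// gen builds a random regular expression (as a string) over alphabet sigma
// with expression-tree depth d: at depth 0 a random symbol is returned,
// otherwise a random operator is chosen and its operands are generated
// recursively with depth d-1.
func gen(sigma []byte, d int) string {
	if d == 0 {
		return string(sigma[rand.Intn(len(sigma))])
	}
	switch rand.Intn(3) {
	case 0: // union
		return "(" + gen(sigma, d-1) + "|" + gen(sigma, d-1) + ")"
	case 1: // concatenation
		return "(" + gen(sigma, d-1) + gen(sigma, d-1) + ")"
	default: // Kleene star (unary, hence a smaller tree)
		return "(" + gen(sigma, d-1) + ")*"
	}
}

func main() {
	sigma := []byte{'a', 'b', 'c', 'd'} // a small 4-symbol alphabet
	for d := 5; d <= 10; d++ {
		fmt.Println(d, gen(sigma, d))
	}
}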

4.2 Observations

Let us now consider the results of the experiment. The results will be communicated mainly through graphical plots and a few statistical calculations. These plots and statistical calculations were produced using the statistical system R version 3.1.0 [13]. In most cases the two alphabet cases were considered separately.

Figure 3. Sizes of automata generated. (Two panels, |Σ| = 4 and |Σ| = 85, plotting the number of states on a logarithmic scale against depth d = 5, . . . , 10.)

First we consider the sizes of the automata constructed by the algorithms. The plots in Figure 3 show the number of states (on a logarithmic scale) in the resultant automata. The plots in the figure provide a visualisation of the distribution of the 50 values for each of the depths. As expected, the regular expressions with larger depths generally yielded larger automata. It should also be noted that for the small alphabet case a few very large automata were constructed. The data confirm that the input regular expressions did indeed vary significantly and hence the algorithms were executed with a variety of input. In future work a more sophisticated approach to obtain input should be considered so that one may better control the nature of the input regular expressions.

Figure 4. Construction times against problem size. (Four panels: sequential and concurrent construction times in nanoseconds, on logarithmic axes, against depth d = 5, . . . , 10, for |Σ| = 4 and |Σ| = 85.)

Let us now consider the construction times. Figure 4 shows the construction times for the sequential algorithm and the concurrent algorithm in both of the alphabet cases for the various regular expression sizes. Note that in each plot the y-axis is logarithmic. Outlying observations were omitted from the plots to make them clearer.
From the plots it is clear that in all cases the time increased as the regular expression grew. It can also be seen – although less clearly due to the logarithmic scale – that the construction time for the concurrent algorithm tends to be lower than that of the sequential algorithm. The construction time difference is less pronounced in the large alphabet case. This could be due to the larger overhead involved in this case. For example, when one considers the process FANOUT from Figure 2 it will be seen that it creates a process for each alphabet symbol. As the alphabet grows, this creation and scheduling overhead will also increase.
The data from the plots suggest that the concurrent algorithm may indeed be faster. To confirm this we performed the Wilcoxon signed rank test for paired observations. The test evaluates the null hypothesis that the median difference between the pairs of construction times is zero against the alternative that the median difference is greater than zero:

H0 : median difference between runtimes is 0

Ha : median difference between runtimes is greater than 0

The results of the tests are as follows.

              N    Test statistic   p < 0.01
   |Σ| = 4   300       44482          Yes
   |Σ| = 85  300       44141          Yes

In both the small alphabet and large alphabet cases the null hypothesis can be rejected at the 99% level of confidence. Our data thus provides evidence in support of our hypothesis that the concurrent algorithm can outperform the sequential one.
To explore further the relationship between the sequential and concurrent construction times, scatter plots were constructed. These can be found in Figure 5. Each point represents a pair of sequential and concurrent runtimes for a given regular expression. The x-coordinate of the point is the sequential runtime and the y-coordinate is the concurrent runtime. Each plot contains 50 × 6 = 300 points.

Figure 5. Scatter plots of sequential time against concurrent time. (Two panels, |Σ| = 4 and |Σ| = 85; axes: sequential time (ns) against concurrent time (ns).)

From the plots it can be seen that there appears to be a linear relationship between the sequential and concurrent runtimes. Linear regression lines were fitted to the data and also plotted on the graphs as solid lines. The dotted line in each plot is simply a line through the origin with slope 1. From the graphs it can be seen that the slope of the regression line for the small alphabet case is less steep than that of the regression line for the large alphabet case.

If we let T_c and T_s be the construction times for the concurrent and sequential algorithms respectively, then the regression lines that were fitted are as follows.

    T_c = 39.6 ms + 0.47 · T_s   for |Σ| = 4
    T_c = 65.7 ms + 0.74 · T_s   for |Σ| = 85

The fact that the slopes are less than one is consistent with the fact that the concurrent construction times are smaller than the sequential times. From the slope terms in the equations above it is clear that the performance increase for the small alphabet case was greater than for the large alphabet case. As mentioned earlier this can be explained by the greater amount of overhead present in the large alphabet case.
Speedup and efficiency are well-known metrics for characterising parallel algorithmic performance [11]. Speedup is defined as the execution time of the sequential program divided by the execution time of the parallel program. Efficiency is defined as the speedup divided by the number of processes. Table 2 contains the observed speedup and efficiency for our experiment. Each entry shows the median for the relevant subset of the data.

                Speedup                 Efficiency
   Depth   |Σ| = 4   |Σ| = 85     |Σ| = 4   |Σ| = 85
   All      1.72      1.09         0.43      0.27
   5        1.15      1.21         0.29      0.30
   6        1.84      1.45         0.46      0.36
   7        1.82      1.43         0.46      0.36
   8        1.80      1.06         0.45      0.27
   9        1.71      1.09         0.43      0.27
   10       1.83      1.21         0.46      0.30

Table 2. Speedup and efficiency overall and for different problem sizes.
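In symbols, writing T_s and T_c for the sequential and concurrent construction times and p for the number of cores used (four on this platform), the quantities reported in Table 2 are

    S = T_s / T_c   and   E = S / p,

so that, for example, the overall median for |Σ| = 4 gives E = 1.72 / 4 = 0.43.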

Ideal speedup in a p-processor environment is p, which corresponds to an efficiency of 1. The median speedup for the small alphabet case is 1.72 and for the large alphabet case it is 1.09. Recall that the total number of cores in our experimental platform was four. We have clearly not achieved optimal speedup, but we have a median speedup greater than one. From this and from the relatively low efficiency numbers it is clear that the process-oriented approach is promising, especially if we can reduce the amount of overhead.
Finally, let us consider whether the problem size influences the speedup. Figure 6 shows the plots for speedup against the depths used to generate the regular expressions. The plots show that the speedup for smaller regular expressions is sometimes less than 1. This implies that the concurrent version is sometimes slower than the sequential version for these smaller regular expressions. This effect is more pronounced in the large alphabet case. In the small alphabet case, the effect is seen at depths 5 and 6, but in the large alphabet case a speedup of less than 1 is also found at depth 7. For the smaller regular expressions the overhead entirely cancels the performance gain of the concurrent processes. In larger expressions the net gain is still positive.

4.3 A second experiment

In order to gain insight into whether or not the foregoing results are reasonably robust across different platforms, we repeated the experiment on a newer machine with a slightly different configuration.

Figure 6. Speedup against problem size. (Two panels, |Σ| = 4 and |Σ| = 85, plotting speedup against depth d = 5, . . . , 10.)

This machine is a MacBookPro11,3. It has a single four-core Intel i7 processor running at 2.3GHz, whereas the earlier machine had two Dual-Core Intel Xeon processors, each running at 2.66GHz. The newer machine ran Mac OS X 10.9.3 as opposed to the earlier machine which ran Mac OS X 10.7.5. Finally, the newer machine had considerably more RAM—16GB compared to the earlier machine's 5GB.
The data produced by this experiment results in figures that are broadly similar to those in Figures 3, 4 and 5, but on a somewhat different scale—i.e. there is a general increase in speedup and efficiency. The best speedup attained was 2.2, representing an efficiency of 0.55. This was for the large alphabet when the parameter d for the depth of regular expression trees was set to 5. Rather than providing all the raw values, Table 3 shows the increases in speedup and efficiency as a percentage of the corresponding data in Table 2.
Note that the improvement is largest for the large alphabet case, where overall speedup and efficiency increases of approximately 40% are attained. The table also illustrates that these gains tend to diminish as the problem size increases. However, there is no obvious relationship between problem size and the extent to which the gains diminish. For the largest problem size, the speedup and efficiency increases on the large alphabet were around 20%. By way of comparison, there was a mere 2% to 3% increase in the case of the small alphabet.
It has already been pointed out that in the concurrent design that has been implemented, an increase in alphabet size results in an increase in process creation and scheduling. These results suggest that creating and scheduling processes over four cores on the same CPU is somewhat more efficient than achieving the same task across two dual-core CPUs. It has been left, however, to future research to carry out a more fine-grained analysis to determine the contribution of other factors to improved performance, such as the increase in RAM size, the small increase in clock speed and the more recent version of the operating system.

            Speedup Increase          Efficiency Increase
   Depth   |Σ| = 4   |Σ| = 85       |Σ| = 4   |Σ| = 85
   All      12.8%     40.4%          14.0%     40.7%
   5        22.6%     54.5%          20.7%     56.7%
   6         2.7%     51.7%           2.2%     52.8%
   7        15.9%     44.8%          15.2%     44.4%
   8        15.6%     13.2%          15.6%     11.1%
   9        18.1%     16.5%          16.3%     14.8%
   10        3.3%     18.2%           2.2%     20.0%

Table 3. Speedup and efficiency increase on the four-core machine.

5 Conclusion

We set out to test whether a process-oriented implementation of Brzozowski's DFA construction algorithm could outperform the sequential implementation in a multi-processor environment. The results of our experiment, carried out on two different (but similar) platforms, confirm that it is indeed possible. In neither case was the speedup close to ideal. Nevertheless, there were instances where double the speed of the sequential algorithm was reached. Even though this represents an efficiency of about 50%, the results are a substantial improvement over the sequential algorithm's runtime in light of the typical under-utilization of the multi-core capabilities of present-day CPUs.
Inefficiencies in the process-oriented implementation are, no doubt, part of the reason why efficiency measurements are not higher. The FANOUT process, in particular, could be enhanced to be more efficient by creating fewer processes. Future work would include improving on implementation efficiency and exploring further algorithms to implement in a process-oriented manner. It will also be interesting to verify results to date on a wider variety of platforms.
In the first experimental setup we used a machine with two CPUs, each with two cores. We conjecture that the operating system may schedule the threads of the executable to run only on one of the two processors—effectively utilising only two cores for the execution. If this turns out to be true, the observed speedup is, especially in the small alphabet case, rather closer to the ideal. This matter should be investigated further. As a simple first step, we repeated the experiment on a machine with a single processor with four cores and compared the results, obtaining somewhat improved efficiencies. More sophisticated profiling tools are, however, needed to examine the behaviour of the running processes in finer detail.
This uncertainty regarding the operating system's scheduling behaviour raises the theme of control over scheduling of tasks. A number of questions immediately come to mind. Would greater speedups be possible if such control were readily available? What (if any) are the disadvantages of granting greater control to software developers? Could such control mechanisms not be built into the operating systems, accompanied by appropriate escape measures in case the user abuses these mechanisms? These questions open up various avenues for further research.

References

1. F. Barnes and P. Welch: occam-pi: blending the best of CSP and the pi-calculus. http://www.cs.kent.ac.uk/projects/ofa/kroc/.
2. J. A. Brzozowski: Derivatives of regular expressions. Journal of the ACM, 11(4) 1964, pp. 481–494.
3. B. Burgstaller, Y.-S. Han, M. Jung, and Y. Ko: On the parallelization of DFA membership tests, Technical Report TR-0003, Department of Computer Science, Yonsei University, Seoul 120–749, Korea, http://elc.yonsei.ac.kr/PDFA.html, 2011.
4. H. Choi and B. Burgstaller: Non-blocking parallel subset construction on shared-memory multicore architectures, in Proceedings of the Eleventh Australasian Symposium on Parallel and Distributed Computing – Volume 140, Australian Computer Society, Inc., 2013, pp. 13–20.
5. E. W. Dijkstra: A Discipline of Programming, Prentice Hall, 1976.
6. T. Hanneforth and B. W. Watson: An efficient parallel determinisation algorithm for finite-state automata, in Stringology, J. Holub and J. Žďárek, eds., Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2012, pp. 42–52.
7. C. A. R. Hoare: Communicating sequential processes. Communications of the ACM, 26(1) 1983, pp. 100–106.
8. C. A. R. Hoare: Communicating sequential processes (electronic version), 2004, http://www.usingcsp.com/cspbook.pdf.
9. J. Holub and S. Štekr: On parallel implementations of deterministic finite automata, in Implementation and Application of Automata, S. Maneth, ed., vol. 5642 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2009, pp. 54–64.
10. J. JáJá and K. W. Ryu: An optimal randomized parallel algorithm for the single function coarsest partition problem. Parallel Processing Letters, 6(2) 1996, pp. 187–193.
11. A. H. Karp and H. P. Flatt: Measuring parallel processor performance. Commun. ACM, 33(5) May 1990, pp. 539–543.
12. Y. Ko, M. Jung, Y.-S. Han, and B. Burgstaller: A speculative parallel DFA membership test for multicore, SIMD and cloud computing environments. International Journal of Parallel Programming, 2012, pp. 1–34.
13. R Core Team: R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2014.
14. B. Ravikumar and X. Xiong: A parallel algorithm for minimization of finite automata, in IPPS, IEEE Computer Society, 1996, pp. 187–191.
15. T. Strauss, D. G. Kourie, and B. W. Watson: A concurrent specification of Brzozowski's DFA construction algorithm. Int. J. Found. Comput. Sci., 19(1) 2008, pp. 125–135.
16. A. Tewari, U. Srivastava, and P. Gupta: A parallel DFA minimization algorithm, in HiPC, S. Sahni, V. K. Prasanna, and U. Shukla, eds., vol. 2552 of Lecture Notes in Computer Science, Springer, 2002, pp. 34–40.
17. The Go Authors: The Go programming language. http://golang.org/.
18. D. Ziadi and J.-M. Champarnaud: An optimal parallel algorithm to convert a regular expression into its Glushkov automaton. Theoretical Computer Science, 215(1-2) February 1999, pp. 69–87.

Efficient Online Abelian Pattern Matching in Strings by Simulating Reactive Multi-Automata

Domenico Cantone and Simone Faro

Università di Catania, Dipartimento di Matematica e Informatica
Viale Andrea Doria 6, I-95125 Catania, Italy
{cantone,faro}@dmi.unict.it

Abstract. The abelian pattern matching problem consists in finding all substrings of a text which are permutations of a given pattern. This problem finds application in many areas and can be solved in linear time by a naïve sliding window approach. In this paper we introduce a new approach to the problem which makes use of a reactive multi-automaton modeled after the pattern, and provides an efficient nonstandard simulation of the automaton based on bit-parallelism.

Keywords: string permutations, nonstandard pattern matching, combinatorial algorithms on words, bit-parallelism, reactive multi-automata

1 Introduction

Given a pattern p and a text t, the abelian pattern matching problem [11] (also known as jumbled matching [8,7]) consists in finding all substrings of the text t whose characters have the same multiplicities as in p, so that they could be converted into the input pattern just by permuting their characters. It is a special case of the approximate string matching problem and naturally finds applications in many areas, such as string alignment [4], SNP discovery [5], and also in the interpretation of mass spectrometry data [6].
In the field of text processing and in computational biology, algorithms for abelian pattern matching are used as a filtering technique [3], usually referred to as counting filter, to speed up complex combinatorial searching problems. For instance, the counting filter technique has been used in the solution to the k-mismatches [15] and k-differences [17] problems. More recently, it has also been used in a solution to the approximate string matching problem allowing for inversions [9] and translocations [14]. A detailed analysis of the abelian pattern matching problem and of its solutions is presented in [11].
In this paper we are interested in the online version of the problem, whose worst-case time complexity is well known to be O(n), which assumes that the input pattern and text are given together for a single instant query, so that no preprocessing is possible.
Specifically, after introducing in Section 2 the relevant notations and describing in Section 3 the related literature, we present in Section 4 a new solution to the online abelian pattern matching problem in strings, based on a generalization of reactive automata [10,13] in which multiple links are allowed. In addition, we propose a nonstandard simulation of the automaton based on bit-parallelism. Despite its quadratic worst-case complexity, the resulting algorithm performs very well in practice, better than existing solutions in most practical cases (especially when the alphabet is large), as can be inferred from the experimental results reported in Section 5. Finally, the paper is closed with some concluding remarks in Section 6.



2 Notations and Definitions

We represent a string p of length |p| = m > 0 as a finite array p[0 .. m−1] of characters from a finite alphabet Σ of size σ. Thus, p[i] will denote the (i+1)-st character of p, for 0 ≤ i < m.

The Parikh vector [1,18] of p (denoted by pv_p and also known as compomer [6], permutation pattern [12], and abelian pattern [11]) is the vector of the multiplicities of the characters in p. More precisely, for each c ∈ Σ, we have pv_p[c] := |{ i : 0 ≤ i < m and p[i] = c }|. For instance, for p = acca over the DNA alphabet Σ = {a, c, g, t}, we have pv_p = ⟨2, 2, 0, 0⟩.

3 Previous Results

For a pattern p of length m and a text t of length n over an alphabet Σ of size σ, the online abelian pattern matching problem can be solved in O(n) time and O(σ) space by using a naïve prefix-based approach [11], which slides a window of size m over the text while updating in constant time the corresponding Parikh vector. Indeed, for each position s = 0, 1, . . . , n−m−1 and character c ∈ Σ, we have

    pv_{t(s+1,m)}[c] = pv_{t(s,m)}[c] − |{c} ∩ {t[s]}| + |{c} ∩ {t[s+m]}|,

so that the vector pv_{t(s+1,m)} can be computed from pv_{t(s,m)} by incrementing the value of pv_{t(s,m)}[t[s+m]] and by decrementing the value of pv_{t(s,m)}[t[s]]. Thus, the test "pv_{t(s+1,m)} = pv_p" can be easily performed in constant time.
A more efficient prefix-based approach, which uses fewer branch conditions, has been recently proposed in [14]. Specifically, for each position 0 ≤ s ≤ n−m, a function G_s : Σ → Z is defined by putting G_s(c) := pv_p[c] − pv_{t(s,m)}[c], for c ∈ Σ. Also, a distance value δ_s can be defined as δ_s := Σ_{c∈Σ} |G_s(c)|. Then the set Γ_{p,t} takes on the form Γ_{p,t} = { s : 0 ≤ s ≤ n−m and δ_s = 0 }. Observe that G_{s+1}(c) and δ_{s+1} can be computed in constant time from G_s(c) and from δ_s, respectively. Hence, it follows that all values δ_s, for s = 0, . . . , n−m, can be computed in O(n) time.
A suffix-based approach to the problem has been presented in [11], as an adaptation of the Horspool strategy [16]. Rather than reading the characters of the window from left to right, characters are read from right to left. As soon as a frequency overflow occurs, the reading phase is stopped and a new alignment is attempted by sliding the window to the right. The resulting algorithm has an O(nm) worst-case time complexity but performs well in practical cases.
Experimental results show that the prefix-based algorithm outperforms the suffix-based algorithm only for abelian patterns over small alphabets (as, for instance, in the case of binary data or DNA sequences) and for patterns whose characters have a frequency distribution similar to that of the input text. In all other cases, the suffix-based approach achieves better results than the prefix-based approach. The gap becomes more significant in the case of very large alphabets as is the case, for instance, in natural language texts.
In [11], a parameterized suffix-based approach has been presented in which the current frequency vector is reset only if the number of the characters read before an overflow does not exceed εm, where ε is a user-defined parameter. The worst-case time complexity of the resulting algorithm is O(n/(1−ε)). However, experimental results show that the algorithm never outperforms the prefix- and the suffix-based algorithms.
For the sake of completeness, we notice that recently the problem has also been solved in its offline form, where one has to search for several patterns in the same text, so that it makes sense to perform in advance a suitable preprocessing of the text. We mention also a solution presented in [7,8], in which a useful data structure over the input text is constructed beforehand in O(n) time and space. As a result, each query can be answered in O(n) worst-case time complexity, though with a sublinear expected time complexity.
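To make the prefix-based sliding-window bookkeeping concrete, the following Go sketch maintains the Parikh vector of the current window together with a count of characters whose multiplicity differs from that in p, so that each shift and each equality test costs constant time. The function and variable names are ours and the byte alphabet is an illustrative assumption.

package main

import "fmt"

// abelianMatchNaive returns all positions s such that t[s..s+m-1] is a
// permutation of p, using the prefix-based sliding-window approach above.
func abelianMatchNaive(p, t []byte) []int {
	m, n := len(p), len(t)
	if m == 0 || m > n {
		return nil
	}
	var pvP, pvW [256]int
	for _, c := range p {
		pvP[c]++
	}
	mismatches := 0 // number of characters c with pvW[c] != pvP[c]
	for c := 0; c < 256; c++ {
		if pvP[c] != 0 {
			mismatches++
		}
	}
	add := func(c byte, d int) { // update the window's Parikh vector by d at c
		if pvW[c] == pvP[c] {
			mismatches++
		}
		pvW[c] += d
		if pvW[c] == pvP[c] {
			mismatches--
		}
	}
	var res []int
	for i := 0; i < m; i++ {
		add(t[i], 1)
	}
	if mismatches == 0 {
		res = append(res, 0)
	}
	for s := 1; s+m <= n; s++ {
		add(t[s-1], -1)  // character leaving the window
		add(t[s+m-1], 1) // character entering the window
		if mismatches == 0 {
			res = append(res, s)
		}
	}
	return res
}

func main() {
	fmt.Println(abelianMatchNaive([]byte("acca"), []byte("gaccaacat"))) // prints [1 2 3]
}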

4 A New Algorithm Based on Reactive Multi-Automata

Reactive automata, introduced in [10,13], are ordinary automata (deterministic or nondeterministic) augmented with a switching mechanism to turn links on or off during computation. Thanks to the switching mechanism, the number of states in an ordinary automaton can be dramatically reduced.
For our purposes, we will need to slightly generalize the notion of reactive automata as given in [10,13], by also allowing multiple links labeled by a same character between any two states.² We will therefore provide in Section 4.1 a formal definition of reactive multi-automata and of the related acceptance notion. Then, in Section 4.2, we show how to construct a compact reactive multi-automaton which recognizes all abelian occurrences of a given input pattern, and prove its correctness in Section 4.3. Subsequently, in Section 4.4, we present an algorithm for the online abelian pattern matching problem which makes use of such an automaton and, finally, in Section 4.5 we describe how to efficiently simulate it using bit-parallelism.

4.1 Reactive Multi-Automata

A reactive automaton is an ordinary automaton extended with reactive links between its (ordinary) links. These can be of two types, namely activation and deactivation reactive links. At any step of the computation of a reactive automaton on a given input string S, states and links are distinguished as active and non-active. At start (step 0), the initial state is the only active state and all links of a given initial transition relation are active.³ Active states at step h + 1 are all states which are reachable by a direct active link (at step h), labeled by the character S[h], from any active state (at step h). Active links at step h + 1 are all links which are active at step h and are not deactivated in the transition from step h to step h + 1, plus all links which are activated in the transition from step h to step h + 1. A link is activated in the transition from step h to step h + 1 if it is the endpoint of an activation reactive link from an active (ordinary) link at step h labeled by the character S[h]. A link is deactivated in the transition from step h to step h + 1 if it is the endpoint of a deactivation reactive link from an active (ordinary) link at step h labeled by the character S[h] and it does not get activated at the same time (in other words, we stipulate that when a link is both activated and deactivated, activation prevails).
Reactive multi-automata extend reactive automata in that they allow the presence of multiple links labeled by a same character between any two states. We choose to represent multiplicity by means of multiplicity labels drawn from a finite set of labels L. Thus, a link in a multi-automaton is a quadruple (q, c, ℓ, q′), where q, q′ are states, c is an alphabet character, and ℓ is a multiplicity label. From an operational point of view, two links differing only on their multiplicity label are regarded just the same.
Let us be more formal. Let Q, Σ, L be finite sets of states, of characters, and of labels, respectively, and let D := Q × Σ × L × Q denote the collection of all possible labeled links on Q, Σ, and L. Also, let T⁺, T⁻ ⊆ D × D be two collections of activation and deactivation reactive links, respectively. Given a set ψ ⊆ D of links (which are supposed to be the active links at a certain step h) and a subset φ ⊆ ψ (of the links in ψ from active states and labeled by the input word character which is being read at step h), the set of active links (at the subsequent step h + 1) relative to φ and to the collections T⁺, T⁻ of reactive links, denoted by ψ^(φ,T⁺,T⁻), is

    ψ^(φ,T⁺,T⁻) := (ψ \ { γ | ∃τ ∈ φ such that (τ, γ) ∈ T⁻ }) ∪ { γ | ∃τ ∈ φ such that (τ, γ) ∈ T⁺ }.

² In fact, we will only need multiple self-loops.
³ As we will see, the initial transition is a subset of the transition relation of the underlying automaton.

The map ψ ↦ ψ^(φ,T⁺,T⁻) just defined is the switch reactive transformation relative to T⁺, T⁻. We are now ready to give a precise definition of reactive multi-automata and of their nondeterministic runs.

Definition 2 (Reactive multi-automata). Let Q, Σ, L be finite sets of states, of characters, and of labels, respectively. A reactive multi-automaton is a nonuple R = (Q, Σ, L, q₀, δ, δ̇, T⁺, T⁻, F), where

– (Q, Σ, L, q₀, δ, F) is a multi-automaton (called the multi-automaton underlying R), with q₀ ∈ Q (initial state), F ⊆ Q (set of final states), and δ ⊆ Q × Σ × L × Q (transition relation);
– T⁺, T⁻ ⊆ δ × δ are the sets of activation and deactivation reactive links;
– δ̇ ⊆ δ is the set of initially active links (initial transition relation).

Definition 3 (Nondeterministic runs). Let R = (Q, Σ, L, q₀, δ, δ̇, T⁺, T⁻, F) be a reactive multi-automaton and let S = s₀s₁···s_{n−1} be a word on the alphabet Σ. The nondeterministic run of R over S is a sequence of pairs (Q_h, δ_h), for h = 0, . . . , n, where Q_h ⊆ Q and δ_h ⊆ δ are respectively the set of active states and the set of active transitions at step h, where, for h = 0,

    (Q₀, δ₀) := ({q₀}, δ̇)

and, recursively, for 0 < h ≤ n,

    Q_h := { q | (r, s_{h−1}, ℓ, q) ∈ δ_{h−1}, for some r ∈ Q_{h−1}, ℓ ∈ L },
    δ_h := δ_{h−1}^(φ_{h−1},T⁺,T⁻),

where φ_{h−1} := { (r, s_{h−1}, ℓ, q) | (r, s_{h−1}, ℓ, q) ∈ δ_{h−1} and r ∈ Q_{h−1} } and δ_{h−1}^(φ_{h−1},T⁺,T⁻) is the result of a switch reactive transformation applied to δ_{h−1}, relative to φ_{h−1}, T⁺, T⁻.
We say that the word S is accepted by R provided that the nondeterministic run (Q₀, δ₀), (Q₁, δ₁), . . . , (Q_n, δ_n) of R over S is such that Q_n ∩ F ≠ ∅.

Remark 4. The above definitions of switch reactive transformation, reactive multi-automaton, and nondeterministic run can be easily extended to the case in which ε-transitions are present, at least when no reactive link is allowed to have an ε-transition as its first component, which is what we will assume in the rest of the paper. In the context of multi-automata, ε-transitions take the form (q, ε, ℓ, q′), where q, q′ are states and ℓ is a label. In the nondeterministic run over a word S, if at a certain step h the ε-transitions

    (q, ε, ℓ, q′), (q′, ε, ℓ′, q″), . . . , (q^(r−1), ε, ℓ^(r−1), q^(r))

are active and the states q, q′, . . . , q^(r−1) are also active, then the state q^(r) will become active at step h + 1, independently of the (h + 1)-st character of S. In view of the above observation, it is not hard to extend formally Definitions 2 and 3 to the case in which ε-transitions are allowed.
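As a concrete reading of Definitions 2 and 3, the following Go sketch simulates a nondeterministic run of a reactive multi-automaton. For brevity it omits ε-transitions and treats multiplicity labels as plain data; the types, field names, and the toy automaton in main are our own illustrative choices, not part of the formal construction.

package main

import "fmt"

// Link is a labeled link (q, c, ℓ, q′) of a multi-automaton.
type Link struct {
	From, To int
	Ch       rune
	Lab      int
}

// ReactiveMA is a reactive multi-automaton in the spirit of Definition 2.
// Act and Deact map a link index to the indices of the links it activates or
// deactivates when it is traversed.
type ReactiveMA struct {
	Links      []Link
	Init       map[int]bool // indices of initially active links
	Act, Deact map[int][]int
	Start      int
	Final      map[int]bool
}

// Accepts performs the nondeterministic run of Definition 3 over s.
func (r *ReactiveMA) Accepts(s string) bool {
	active := map[int]bool{r.Start: true} // active states
	live := map[int]bool{}                // active links
	for i, on := range r.Init {
		if on {
			live[i] = true
		}
	}
	for _, c := range s {
		var phi []int // live links leaving an active state, labeled by c
		next := map[int]bool{}
		for i, ln := range r.Links {
			if live[i] && active[ln.From] && ln.Ch == c {
				phi = append(phi, i)
				next[ln.To] = true
			}
		}
		// Switch reactive transformation: deactivate first, then activate,
		// so that activation prevails when a link is subject to both.
		for _, i := range phi {
			for _, j := range r.Deact[i] {
				delete(live, j)
			}
		}
		for _, i := range phi {
			for _, j := range r.Act[i] {
				live[j] = true
			}
		}
		active = next
	}
	for q := range active {
		if r.Final[q] {
			return true
		}
	}
	return false
}

func main() {
	// Toy automaton over {a}: a self-loop on the start state deactivates itself
	// and activates a link to the final state, so exactly "aa" is accepted.
	r := &ReactiveMA{
		Links: []Link{
			{From: 0, To: 0, Ch: 'a', Lab: 0}, // link 0, initially active
			{From: 0, To: 1, Ch: 'a', Lab: 1}, // link 1, initially inactive
		},
		Init:  map[int]bool{0: true},
		Act:   map[int][]int{0: {1}},
		Deact: map[int][]int{0: {0}},
		Start: 0,
		Final: map[int]bool{1: true},
	}
	fmt.Println(r.Accepts("a"), r.Accepts("aa"), r.Accepts("aaa")) // false true false
}

The self-loop that deactivates itself and activates its successor is, in miniature, the counting mechanism that the monads of Section 4.2 use on a per-character basis.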

4.2 The Abelian Reactive Multi-Automaton

Next we define the abelian reactive multi-automaton for a given pattern p of length m over an alphabet Σ, which accepts all and only the m! / Π_{c∈Σ} (pv_p[c])! distinct permutations of p, where pv_p is the Parikh vector of p.

Definition 5 (Abelian Reactive Multi-Automaton). Let p be a pattern of length m over an alphabet Σ and let ⟨b₀, b₁, . . . , b_{k−1}⟩ be the sequence of the distinct characters occurring in p, ordered by their first occurrence. The abelian reactive multi-automaton (ARMA) for p is the reactive multi-automaton with ε-transitions

    R = (Q, Σ, L, q₀, δ, δ̇, T⁺, T⁻, F)

such that

– Q = {q₀, q₁, . . . , q_k, ω} is the set of states, where q₀ is the initial state and ω is a special state called the overflow state;
– F = {q_k} is the set of final states;
– L = {ℓ₀, ℓ₁, . . . , ℓ_{m−1}} is a set of labels of size m;
– the transition relation δ of R and its subset δ̇ ⊆ δ of the links initially active (initial transition relation) are defined as follows

    δ := { (q_i, ε, ℓ₀, q_{i+1}) | 0 ≤ i < k }   (ε-transitions)
       ∪ { (q₀, p[i], ℓ_i, q₀) | 0 ≤ i < m }   (self-loops)

    T⁺ := { ((q₀, p[ρ(b_i)], ℓ_{ρ(b_i)}, q₀), (q_i, ε, ℓ₀, q_{i+1})) | 0 ≤ i < k }
        ∪ { ((q₀, p[ρ(b_i)], ℓ_{ρ(b_i)}, q₀), (q₀, p[ρ(b_i)], ℓ_{ρ(b_i)}, ω)) | 0 ≤ i < k }
        ∪ { ((q₀, p[i], ℓ_i, q₀), (q₀, p[ν(i)], ℓ_{ν(i)}, q₀)) | 0 ≤ i < m and ν(i) is defined }

Property 6. The abelian reactive automaton for a pattern p of length m, with k ≤ m distinct characters, over an alphabet of size σ has size O(m + σ). Specifically, it has k + 2 states, k + m + 2σ transitions, and 2m + k reactive links. In addition, it can be constructed and initialized in O(m + σ) time and space.

Figure 1. A portion of the general structure of an abelian reactive automaton. Standard transitions in δ are represented with solid and dashed lines, reactive links in T⁺ are represented with dotted lines, while reactive links in T⁻ are not represented. Non-active links are represented in gray.

Given a pattern p of length m with k distinct characters b₀, b₁, . . . , b_{k−1} (ordered by their first occurrence in p), the abelian reactive multi-automaton for p contains k + 1 'ordinary' states q₀, q₁, . . . , q_k and a path of k consecutive ε-transitions (q_i, ε, ℓ₀, q_{i+1}), for i = 0, 1, . . . , k−1, starting from the initial state q₀ and ending on its final state q_k. For i = 0, 1, . . . , k−1, we will refer to (q_i, ε, ℓ₀, q_{i+1}) as the ε-transition of the automaton for the character b_i.⁴ Initially, all such transitions are non-active.
An additional state ω, named overflow state, is used to detect when the number of occurrences of a character in the current text window exceeds its multiplicity in the pattern. For each character c in the alphabet, the automaton contains a transition labeled by c from the initial state to the overflow state, called the overflow transition for c, and from the overflow state to itself, called the overflow self-loop for c. Initially, all overflow self-loops and all overflow transitions for the characters not occurring in the pattern are active,⁵ whereas the overflow transitions for the characters occurring in the pattern are non-active.
For each character c occurring in the pattern p with multiplicity m_c (and, specifically, at positions 0 ≤ h₀ < h₁ < ··· < h_{m_c−1} < m), the automaton contains also a set M_c := { (q₀, c, ℓ_{h_i}, q₀) | i = 0, 1, . . . , m_c − 1 } (called the monad of c) of m_c self-loops labeled by c, from state q₀ into itself. Initially, only the first self-loop in M_c, corresponding to the leftmost occurrence of c in the pattern, is active, whereas the remaining ones are all non-active. Each self-loop in M_c has a deactivation reactive link pointing to itself. In addition, each of the first m_c − 1 self-loops in M_c has an activation reactive link pointing to the next self-loop in the monad, whereas the last self-loop in M_c has two activation reactive links, one pointing to the overflow transition for c and one pointing to the ε-transition relative to c.

⁴ As we will see, the final state becomes reachable only when all such ε-transitions have been activated during the recognition process.
⁵ In fact, all overflow self-loops and all overflow transitions for the characters not occurring in the pattern remain active during the whole recognition process.

Figure 2. The complete abelian reactive automaton for the pattern P = acca over the DNA alphabet Σ = {a, c, g, t}. Standard transitions are represented with solid lines, while reactive links in T⁺ are represented with dashed lines. Reactive links in T⁻ are not represented. Non-active transitions are represented in gray.

4.3 Correctness

Next we show that the language accepted by the abelian reactive multi-automaton for a pattern p is exactly the set of all the permutations of p. As in the previous section, let p be a pattern of length m with k distinct characters b₀, b₁, . . . , b_{k−1} (ordered by their first occurrence in p) and let

    R = (Q, Σ, L, q₀, δ, δ̇, T⁺, T⁻, {q_k})

be the ARMA for p, with Q = {q₀, q₁, . . . , q_k, ω} and L = {ℓ₀, ℓ₁, . . . , ℓ_{m−1}}. In addition, let s be an input string to be recognized by R.
To begin with, we observe that as soon as an overflow transition is followed (in which case we say that an overflow condition has occurred), the computation gets trapped in the overflow state ω, so that q₀ is no longer active and the final state q_k cannot be reached anymore. As we will soon see, this happens when it is detected that s contains some character whose multiplicity in s exceeds that in the pattern p.
As long as the initial state q₀ is active, the transitions in the monad M_b of b, for each character b in p, allow one to count the number of occurrences of b which have been read so far from the string s, when this number does not exceed the multiplicity m_b of b in p. Specifically, as a result of the interplay of the deactivation and activation reactive links on each of the first m_b − 1 transitions of the monad, when exactly i occurrences of b have been read from s, with 0 ≤ i < m_b, the only active transition in M_b is the (i+1)-st one.

Since the overflow transition for any character not occurring in p is active for the whole recognition process, as soon as an occurrence of a character not in p is found in s, the overflow transition associated to it is followed, leading again to the overflow state ω. This corresponds to having an empty monad for each character not occurring in p.
The above considerations allow us to conclude that if the string s contains any character whose multiplicity in s exceeds that in p, then the recognition process gets trapped in the overflow state ω, so that s is correctly rejected by the automaton R. On the other hand, if the multiplicity of no character in s exceeds that in p, then the state q₀ remains active until the end of the recognition process by R. However, at termination, the accepting state q_k is active only if all the ε-transitions from q₀ to q_k (initially non-active) have been activated. As seen above, since q₀ is always active, this happens only if the string s and the pattern p contain the same characters and each of them occurs in s and in p with the same multiplicity; in other words, only if s is a permutation of p.
In conclusion, the language accepted by the ARMA R for p is the set of all the permutations of p.

4.4 The Algorithm

The algorithm that we present in this section makes use of the abelian reactive multi-automaton defined above for locating all occurrences of the permutations of a given pattern p of length m in a text t of length n.
In the preprocessing phase, the algorithm computes in O(m + σ) time and space the Parikh vector pv_p of the pattern and constructs the corresponding reactive multi-automaton.⁶
The algorithm works by sliding a window of size m over the text. At start, the left ends of the window and of the text are aligned. An attempt consists in checking whether the current window is a permutation of the pattern. This is done by executing the ARMA for the pattern over the window text. When the whole window has been read (or as soon as an overflow condition occurs) the window is shifted to the right of the last character examined. The attempts take place in sequence, until the right end of the window goes past the right end of the text.
Let us consider a generic attempt at position s of the text t, so that the current window is the substring t[s..s+m−1]. At the beginning of the attempt, the automaton is initialized in time O(m). Then, during the attempt, the algorithm scans the window from right to left, while executing the corresponding automaton transitions. If the whole text window has been scanned and no overflow condition has occurred, an occurrence of a permutation of the pattern is reported at position s. In this case the window is advanced by one position to the right. On the other hand, if an overflow condition occurs while reading the character at position j in the text, with s ≤ j ≤ s+m−1, the window is shifted to the right so that it starts just past position j.

BAM(p, m, t, n, Σ)
 1.  for each c ∈ Σ do M[c] ← pv_p[c] ← 0
 2.  I ← F ← sh ← 0
 3.  for i ← 0 to m−1 do pv_p[p[i]] ← pv_p[p[i]] + 1
 4.  for each c ∈ Σ do
 5.      if pv_p[c] > 0 then
 6.          M[c] ← M[c] | (1 ≪ sh)
 7.          I ← I | (((1 ≪ log m) − pv_p[c] − 1) ≪ sh)
 8.          F ← F | (1 ≪ (sh + log m))
 9.          sh ← sh + log m + 1
10.  F ← F | (1 ≪ sh)
11.  for each c ∈ Σ do
12.      if pv_p[c] = 0 then M[c] ← M[c] | (1 ≪ sh)
13.  s ← 0
14.  while s ≤ n − m do
15.      D ← I; j ← s + m − 1
16.      while j ≥ s do
17.          D ← D + M[t[j]]
18.          if (D & F) then break
19.          j ← j − 1
20.      if j < s then
21.          output(s)
22.          s ← s + 1
23.      else s ← j + 1

Figure 3. The Bit-Parallel Abelian Matcher for the abelian pattern matching problem (BAM).

4.5 An Efficient Bit-Parallel Simulation

In this section we show how to simulate efficiently the abelian reactive multi-automaton for an input pattern p (cf. Definition 5), by using the bit-parallelism technique [2]. Let again b₀, b₁, . . . , b_{k−1} be the distinct characters in p.
The underlying idea is to associate a counter to each distinct character in p, plus a single 1-bit counter for the remaining characters of the alphabet which do not occur in p, maintaining them in the same computer word. In particular, the counter associated to the character b_i in p, for i = 0, 1, . . . , k−1, will be represented by a group of l_i bits, where l_i := ⌈log(pv_p[b_i])⌉ + 1. These are just enough to allocate the multiplicity pv_p[b_i] of b_i in p, plus an extra bit called the i-th overflow bit. Whenever an occurrence of the character b_i is read in the current text window (which, as before, is scanned backwards), its counter is incremented. Initially, the counter for b_i is set to the value 2^{l_i} − pv_p[b_i] − 1, so that its overflow bit is 0 and it remains so for up to pv_p[b_i] increments. Hence, the overflow bit gets set only when the (pv_p[b_i] + 1)-st occurrence of b_i is encountered in the text window, if it exists, at which point it becomes clear that the text window cannot be a permutation of the pattern p.
Likewise, the 1-bit counter reserved for all the characters not occurring in p is initially null and it gets set as soon as any character not in p is encountered in the text window, at which point, again, it becomes clear that the text window cannot be a permutation of the pattern p.
By suitably masking the computer word allocating all the counters, it is possible to check in a single pass whether the character of the text window that has just been read has caused any of the k + 1 overflow bits to be set. If this is the case, the window is advanced just past the last character read. Otherwise, when the current text window has been scanned completely and no overflow bit has been set, a matching is reported and the window is advanced one position to the right.
The resulting algorithm, named Bit-Parallel Abelian Matcher (BAM), is shown in Fig. 3. It works in a similar way as the ARMA algorithm.

During the preprocessing phase (lines 4–12), for each distinct character b_i occurring in p, a bit mask M[b_i] of l + 1 bits is computed, where

    l := Σ_{i=0}^{k−1} l_i   and   M[b_i] := 1 ≪ (Σ_{j=0}^{i−1} l_j).

The bit mask M[b_i] is then used in line 17 to increment the counter in D associated to the character b_i. Two additional bit masks of l + 1 bits are used: the bit mask I, which contains the initial values for each counter, and the bit mask F, whose set bits are exactly the overflow bits. These are defined by

    I := Σ_{i=0}^{k−1} [ (2^{l_i} − pv_p[b_i] − 1) ≪ (Σ_{j=0}^{i−1} l_j) ]   and   F := Σ_{i=0}^{k−1} [ 1 ≪ ((Σ_{j=0}^{i} l_j) − 1) ].

Let us consider a generic attempt at position s of the text (lines 14–23), so that the current text window is the substring t[s..s+m−1]. At the beginning of each attempt, a bit mask D of l + 1 bits (intended to represent the Parikh vector of the text window) is initialized to I (line 15). Then, during the attempt, the window is read character by character, proceeding from right to left (lines 16–19). When reading the character t[j] of the text, the bit mask D is updated accordingly by setting it to D + M[t[j]] (line 17). The attempt stops when the left end of the window is reached or when an overflow bit in D is set. In the first case, an occurrence is reported at position s and the window is advanced to the right by one position (lines 20–22). In the second case, i.e., when the counter update for a character t[j] has set an overflow bit in D (and therefore D & F ≠ 0 holds), the substring t[j..s+m−1] cannot be involved in any match, as it contains too many occurrences of the character t[j], and therefore it is safe to shift the window to the right by j − s + 1 positions (line 23).
As in the case of the ARMA algorithm, each attempt takes O(m) worst-case time and at most n attempts take place during the whole execution. Thus the worst-case time complexity of the BAM algorithm is O(nm), whereas the space requirement for maintaining a bit mask for each character of the alphabet is O(σ).
So far we have implicitly assumed that l + 1 ≤ w, where w is the size of a computer word, so that each of the vectors D, I, F, and M[b_i], for b_i in p, fits in a single computer word. When l + 1 > w, we must content ourselves with maintaining the counters only for a proper selection Σ′_p of the set of characters occurring in p. In this case, when a match relative to the characters in Σ′_p is reported, an additional verification phase must be run, in order to discard possible false positives.
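The following Go sketch implements the same backward-scanning, packed-counter idea for 8-bit alphabets. It uses a uniform counter width (⌈log₂(m+1)⌉ value bits plus one overflow bit per counter) rather than the variable widths l_i above, and all identifiers are ours; it is an illustration of the technique, not the authors' implementation.

package main

import (
	"fmt"
	"math/bits"
)

// bamSketch reports every position s such that t[s..s+m-1] is a permutation of p.
func bamSketch(p, t []byte) []int {
	m, n := len(p), len(t)
	if m == 0 || m > n {
		return nil
	}
	var pv [256]int
	k := 0 // number of distinct characters in p
	for _, c := range p {
		if pv[c] == 0 {
			k++
		}
		pv[c]++
	}
	width := bits.Len(uint(m)) + 1 // value bits + 1 overflow bit per counter
	if k*width+1 > 64 {
		return nil // counters do not fit in one word; a verification fallback would be needed
	}

	var M [256]uint64 // per-character increment masks
	var I, F uint64   // initial counter values and overflow-bit mask
	sh := 0
	for c := 0; c < 256; c++ {
		if pv[c] > 0 {
			M[c] = 1 << sh
			I |= (uint64(1)<<(width-1) - uint64(pv[c]) - 1) << sh
			F |= 1 << (sh + width - 1)
			sh += width
		}
	}
	F |= 1 << sh // shared overflow bit for characters not occurring in p
	for c := 0; c < 256; c++ {
		if pv[c] == 0 {
			M[c] = 1 << sh
		}
	}

	var res []int
	for s := 0; s+m <= n; {
		D := I
		j := s + m - 1
		for j >= s {
			D += M[t[j]]
			if D&F != 0 {
				break // t[j] occurs too often in t[j..s+m-1]
			}
			j--
		}
		if j < s {
			res = append(res, s) // the whole window was scanned without overflow
			s++
		} else {
			s = j + 1 // shift the window just past the offending character
		}
	}
	return res
}

func main() {
	fmt.Println(bamSketch([]byte("acca"), []byte("gaccaacat"))) // prints [1 2 3]
}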

5 Experimental Results

In this section we evaluate the performance of the bit-parallel simulation BAM described in the previous section and compare it with some standard solutions known in the literature. In particular we compare the performances of the following three algorithms: the prefix-based algorithm due to Grabowski et al. (GFG) [14], the algorithm using the suffix-based approach (SBA) [11], and the Bit-Parallel Abelian Matcher (BAM) described in Section 4.5.

   Genome sequence                     Protein sequence
   m     GFG     SBA     BAM           m     GFG     SBA     BAM
   2     23.56   39.20   27.03         2     23.08   18.07   12.51
   4     23.56   33.27   23.17         4     23.00   15.39   10.36
   8     23.54   27.54   19.01         8     22.96   13.67    9.40
   16    23.49   24.05   16.21         16    23.03   11.91    8.44
   32    23.52   23.78   15.63         32    23.04    9.58    7.16
   64    23.50   25.33   16.12         64    23.01    8.46    6.64
   128   23.57   28.74   17.69         128   22.97    7.82    6.49∗
   256   23.53   33.14   19.63         256   22.96    7.84    7.69∗

Table 1. Experimental results on a genome sequence (on the left) and on a protein sequence (on the right). An asterisk symbol (∗) indicates those runs where false positives have been detected. All best results have been boldfaced.

All algorithms have been implemented in C and compiled with the GNU C Compiler 4.2.1, using the optimization option -O3. The experiments have been executed locally on a MacBook Pro with 4 Cores, a 2 GHz Intel Core i7 processor, 4 GB RAM 1333 MHz DDR3. The three algorithms have been compared in terms of their running times, including any preprocessing time, measured with a hardware cycle counter, available on modern CPUs.
In our tests we used a genome sequence, with an alphabet of size σ = 4 (Table 1, on the left) and a protein sequence, with an alphabet of size σ = 20 (Table 1, on the right), both of 4MB length.⁷ For each input file, we have generated sets of 500 patterns of fixed length m randomly extracted from the text, where m ∈ {2, 4, 8, 16, 32, 64, 128, 256}, and reported the mean time over the 500 runs, expressed in milliseconds.
From the experimental results it turns out that the GFG algorithm has a linear behavior in practice and is almost insensitive to the size of the pattern, whereas the algorithms based on a backward approach, such as SBA and BAM, show a sublinear behavior although their theoretical worst-case time complexity is quadratic. In all cases, when the pattern is longer than 4 characters, the BAM algorithm outperforms the other two algorithms.
The SBA and the BAM algorithms improve their performances in the case of larger alphabets and turn out to be the best solutions when searching for a protein sequence. In this case their performances are up to 3 times faster than the GFG algorithm.
In the case of the protein sequence, we observed some cases where false positive occurrences were detected by an additional verification. Such events are indicated in Table 1 with an asterisk (∗). However, in all cases the average number of additional verification runs, for each text position, turned out to be less than 10⁻⁴.

6 Conclusions

We have presented a new approach to solve the abelian pattern matching problem for strings which is based on a reactive multi-automaton with only O(k) states and O(m) transitions. We have also proposed an efficient simulation of such an automaton using bit-parallelism. Our solution is based on a backward approach and, despite its quadratic worst-case time complexity, shows a sublinear behavior in practical cases.

⁷ The text buffers are described and available for download at the Smart web page (http://www.dmi.unict.it/~faro/smart/).

References

1. A. Amir, A. Apostolico, G. M. Landau, G. Satta: Efficient Text Fingerprinting Via Parikh Mapping. Journal of Discrete Algorithms, 1(5–6) 2003, pp. 409–421.
2. R. A. Baeza-Yates, G. H. Gonnet: A new approach to text searching. Commun. ACM, 35(10) 1992, pp. 74–82.
3. R. A. Baeza-Yates, G. Navarro: New and faster filters for multiple approximate string matching. Random Struct. Algorithms 20(1) 2002, pp. 23–49.
4. G. Benson: Composition alignment. In: WABI 2003, pp. 447–461.
5. S. Böcker: Simulating multiplexed SNP discovery rates using base-specific cleavage and mass spectrometry. Bioinformatics 23(2) 2007, pp. 5–12. http://dx.doi.org/10.1093/bioinformatics/btl291
6. S. Böcker: Sequencing from compomers: Using mass spectrometry for DNA de novo sequencing of 200+ nt. Journal of Computational Biology 11(6) 2004, pp. 1110–1134.
7. P. Burcsi, F. Cicalese, G. Fici, Zs. Lipták: Algorithms for jumbled pattern matching in strings. Int. J. Found. Comput. Sci. 23(2) 2012, pp. 357–374.
8. P. Burcsi, F. Cicalese, G. Fici, Zs. Lipták: On approximate jumbled pattern matching in strings. Theory Comput. Syst. 50(1) 2012, pp. 35–51.
9. D. Cantone, S. Cristofaro, S. Faro: Efficient matching of biological sequences allowing for non-overlapping inversions. In: CPM 2011, pp. 364–375.
10. M. Crochemore, D. M. Gabbay: Reactive automata. Inf. Comput., 209(4) 2011, pp. 692–704.
11. E. Ejaz: Abelian pattern matching in strings. Ph.D. Thesis, Dortmund University of Technology (2010), http://d-nb.info/1007019956.
12. R. Eres, G. M. Landau, L. Parida: Permutation Pattern Discovery in Biosequences. Journal of Computational Biology, 11(6) 2004, pp. 1050–1060.
13. D. M. Gabbay: Pillars of computer science. Springer-Verlag 2008. Ch. Introducing reactive Kripke semantics and arc accessibility, pp. 292–341.
14. S. Grabowski, S. Faro, E. Giaquinta: String matching with inversions and translocations in linear average time (most of the time). Inf. Process. Lett. 111(11) 2011, pp. 516–520.
15. R. Grossi, F. Luccio: Simple and efficient string matching with k mismatches. Inf. Process. Lett. 33(3) 1989, pp. 113–120.
16. R. N. Horspool: Practical fast searching in strings. Software – Practice & Experience 10(6) 1980, pp. 501–506.
17. P. Jokinen, J. Tarhio, E. Ukkonen: A comparison of approximate string matching algorithms. Softw. Pract. Exp. 26(12) 1996, pp. 1439–1458.
18. A. Salomaa: Counting (scattered) subwords. Bulletin of the EATCS 81 (2003), pp. 165–179.

Computing Abelian Covers and Abelian Runs

Shohei Matsuda, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda

Department of Informatics, Kyushu University, Japan
{shohei.matsuda,inenaga,bannai,takeda}@inf.kyushu-u.ac.jp

Abstract. Two strings u and v are said to be Abelian equivalent if u is a permutation of the characters of v. We introduce two new regularities on strings w.r.t. Abelian equivalence, called Abelian covers and Abelian runs, which are generalizations of covers and runs of strings, respectively. We show how to determine in O(n) time whether or not a given string w of length n has an Abelian cover. Also, we show how to compute an O(n²)-size representation of (possibly exponentially many) Abelian covers of w in O(n²) time. Moreover, we present how to compute all Abelian runs in w in O(n²) time, and state that the maximum number of all Abelian runs in a string of length n is Ω(n²).

Keywords: Abelian equivalence on strings, Parikh vectors, Abelian repetitions, covers of strings, string algorithms

1 Introduction

The study of Abelian equivalence of strings dates back to at least the early 60's, as seen in the paper by Erdős [6]. Two strings u, v are said to be Abelian equivalent if u is a permutation of the characters appearing in v. For instance, strings aabba and baaba are Abelian equivalent. Abelian equivalence of strings has attracted much attention and has been studied extensively in several contexts.
A variant of the pattern matching problem called the jumbled pattern matching problem is to determine whether there is a substring of an input text string w that is Abelian equivalent to a given pattern string p. There is a folklore algorithm to solve this problem in O(n + m + σ) time using O(σ) space, where n is the length of w, m is the length of p, and σ is the alphabet size. Assuming m ≤ n and all characters appear in w, the algorithm runs in O(n) time and O(σ) space. The indexed version of the jumbled pattern matching problem is more challenging, where the task is to preprocess an input text string w so that, given a query pattern string p, we can quickly determine whether or not there is a substring of w that is Abelian equivalent to p. For binary strings, there exists a data structure which occupies O(n) space and answers the above query in O(1) time. Burcsi et al. [2] and Moosa and Rahman [16] independently developed an O(n²/log n)-time algorithm to construct this data structure, and later Gagie et al. [9] showed an improved O(n²/log² n)-time algorithm. Very recently, Hermelin et al. [10] proposed an n²/2^{Ω(log n/log log n)^{1/2}}-time solution to the problem for binary strings. For any constant-size alphabets, Kociumaka et al. [12] showed an algorithm that requires O(n² log² log n/log n) preprocessing time and O((log n/log log n)^{2σ−1}) query time. Amir et al. [1] showed lower bounds on the indexing version of the jumbled pattern matching problem under a 3SUM-hardness assumption.
Abelian periodicity of strings has also been extensively studied in string algorithmics. A string w is said to have a full Abelian period if w is a concatenation w₁···w_k of k Abelian equivalent strings w₁, . . . , w_k with k ≥ 2, and the length of w₁ is called a full Abelian period of w. A string w is said to have an Abelian period if w = yz,


where y has some full Abelian period d, and z is a non-empty string shorter than d such that the number of each character a contained in z is no more than that contained in the prefix y[1..d] of length d of y. A string w is said to have a weak Abelian period d if w = xy, where y has some Abelian period d, and x is a non-empty string shorter than d such that the number of each character a contained in x is no more than that contained in the prefix y[1..d] of length d of y. Fici et al. [7] proposed an O(n log log n)-time algorithm to compute all full Abelian periods and an O(n²)-time algorithm to compute all Abelian periods for a given string of length n. Recently, Kociumaka et al. [13] showed an optimal O(n)-time algorithm to compute all full Abelian periods, and an improved O(n(log log n + log σ))-time algorithm to compute all Abelian periods, where σ is the alphabet size. Fici et al. [8] presented an O(n²σ)-time algorithm to compute all weak Abelian periods, and later Crochemore et al. [3] gave an improved O(n²)-time solution to the problem.
In the field of word combinatorics, Erdős [6] posed a question whether there exists an infinitely long string which contains no Abelian squares. A substring s of a string w is called an Abelian square if s = s₁s₂ such that s₁ and s₂ are Abelian equivalent. Entringer et al. [5] proved that any infinite word over a binary alphabet contains arbitrarily long Abelian squares. On the other hand, Pleasants [18] showed a construction of an infinitely long string which contains no Abelian squares over an alphabet of size 5, and later, Keränen [11] showed a construction over an alphabet of size 4.
An interesting question in the field of string algorithmics is how efficiently we can compute the Abelian repetitions that occur in a given string of finite length n. Cummings and Smyth [4] presented an algorithm to compute all Abelian squares in O(n²) time. They also showed that there exist Ω(n²) Abelian squares in a string of length n. Crochemore et al. [3] showed another O(n²)-time algorithm to compute all squares in a string of length n.
In this paper, we introduce two new regularities on strings w.r.t. Abelian equivalence, called Abelian covers and Abelian runs, which are generalizations of covers [15] and runs [14] of strings, respectively, and we propose non-trivial algorithms to compute these new string regularities. A set C of intervals is called an Abelian cover of a string w if the substrings corresponding to the intervals in C are all Abelian equivalent, and every position in w is contained in at least one interval in C. We show that, given a string w of length n, we can determine whether or not w has an Abelian cover in optimal O(n) time. Also, we present an O(n²)-time algorithm to compute an O(n²)-size representation of all (possibly exponentially many) Abelian covers of w. A substring s of w is said to be an Abelian run of w if s is a maximal substring which has a weak Abelian period. As a direct consequence from the result by Cummings and Smyth [4], it is shown that the maximum number of all Abelian runs in a string of length n is Ω(n²). Then, we propose an O(n²)-time algorithm to compute all Abelian runs in a given string of length n.

2 Preliminaries

Let Σ = {c_1,...,c_σ} be an ordered alphabet. We assume that for each c_i ∈ Σ, its rank i in Σ is already known and can be computed in constant time. An element of Σ* is called a string. The length of a string w is denoted by |w|. The empty string ε is the string of length 0, namely, |ε| = 0. For a string w = xyz, strings x, y, and z are called a prefix, substring, and suffix of w, respectively. The i-th character of a string w of length n is denoted by w[i] for 1 ≤ i ≤ n. For 1 ≤ i ≤ j ≤ n, let w[i..j] = w[i] ⋯ w[j],


Figure 1. String aabbaabababa over a binary alphabet Σ = {a, b} has an Abelian cover {[1, 3], [4, 6], [6, 8], [8, 10], [10, 12]} of length 3 with Parikh vector ⟨2, 1⟩, an Abelian cover {[1, 4], [4, 7], [7, 10], [9, 12]} of length 4 with Parikh vector ⟨2, 2⟩, an Abelian cover {[1, 5], [4, 8], [8, 12]} of length 5 with Parikh vector ⟨3, 2⟩, and an Abelian cover {[1, 11], [2, 12]} of length 11 with Parikh vector ⟨6, 5⟩. We remark that this string has other Abelian covers than the above ones.

i.e., w[i..j] is the substring of w starting at position i and ending at position j in w. For convenience, let w[i..j] = ε if j < i.


Figure 2. String aabbacbbaabbbb over a ternary alphabet Σ = {a, b, c} has Abelian runs (0, 1, 2, 1, 0), (0, 3, 4, 1, 0), (0, 7, 8, 1, 0), (0, 9, 10, 1, 0), and (0, 11, 14, 1, 0) of period 1, Abelian runs (1, 2, 5, 2, 0), (1, 8, 11, 2, 1), and (0, 11, 14, 2, 0) of period 2, and an Abelian run (0, 7, 12, 3, 2) of period 3.

A substring w[i−h..j+h′] of a string w is called an Abelian run of w if w[i..j] is a maximal Abelian repetition with period d of w and h, h′ ≥ 0 are the largest integers satisfying P_{w[i−h..i−1]} ≺ P_{w[i..i+d]} and P_{w[j+1..j+h′]} ≺ P_{w[j−d..j]}, respectively. Each Abelian run w[i−h..j+h′] of period d in w is represented by a 5-tuple (h, i, j, d, h′), where h and h′ are called the left hand and the right hand of the Abelian run, respectively. See Figure 2 for examples of Abelian runs in a string.
In this paper, we consider the following problems.

Problem 1 (Abelian cover existence). Given a string w, determine whether or not w has an Abelian cover.

Problem 2 (All Abelian covers). Given a string w, compute all Abelian covers of w.

Problem 3 (All Abelian runs). Given a string w, compute all Abelian runs in w.

3 Algorithms

In this section, we present our algorithms to solve the problems stated in the previous section. For simplicity, we assume that all characters in Σ appear in a given string w of length n, which implies σ ≤ n.

3.1 Abelian cover existence

In this subsection, we consider Problem 1 of determining whether there exists an Abelian cover of a given string w of length n. Note that there exists an infinite sequence of strings over a binary alphabet which have no Abelian covers (e.g., a^{n−1}b has no Abelian covers), and therefore Problem 1 is of interest. The following lemma is a key to our solution to the problem.

Lemma 4. If there exists an Abelian cover C = {[b_1, e_1],...,[b_{|C|}, e_{|C|}]} of arbitrary size for a string w, then there exists an Abelian cover C′ of size exactly 2 for w.

Proof. It is clear when |C| = 2. Consider the case where |C| ≥ 3. Since C is an Abelian cover of w, P_{w[b_1..e_1]} = P_{w[b_{|C|}..e_{|C|}]}. This implies that P_{w[b_1..e_1]} ⊕ P_{w[e_1+1..b_{|C|}−1]} = P_{w[e_1+1..b_{|C|}−1]} ⊕ P_{w[b_{|C|}..e_{|C|}]}. Therefore, C′ = {[b_1..b_{|C|}−1], [e_1+1, e_{|C|}]} = {[1..b_{|C|}−1], [e_1+1, |w|]} is an Abelian cover of size 2 of w. (See also Figure 3.) ⊓⊔

Using the above lemma, we obtain the following.

Theorem 5. Given a string w of length n, we can determine whether w has an Abelian cover or not in O(n) time with O(σ) working space.


Figure 3. Illustration for Lemma 4. If a string w has a cover C, then w always has a cover C′ of size 2.

Proof. By Lemma 4, Problem 1 of deciding whether there exists an Abelian cover of a given string w reduces to finding an Abelian cover of size 2 of w. Therefore, it suffices to find a prefix and a suffix of the same length ℓ such that P_{w[1..ℓ]} = P_{w[n−ℓ+1..n]}. To find such a prefix and a suffix, for each 1 ≤ j ≤ ⌊n/2⌋ in increasing order, we maintain an invariant d_j which represents the number of entries of P_{w[1..j]} and P_{w[n−j+1..n]} whose values differ, i.e.,

d_j = |{ k | P_{w[1..j]}[k] ≠ P_{w[n−j+1..n]}[k], 1 ≤ k ≤ σ }|.

Clearly w has an Abelian cover of size 2 iff d_j = 0 for some j. For any 1 ≤ j ≤ ⌊n/2⌋, let w[j] = c_s and w[n−j+1] = c_t. We can update P_{w[1..j−1]} (resp. P_{w[n−j+2..n]}) to P_{w[1..j]} (resp. P_{w[n−j+1..n]}) in O(1) time, increasing the value stored in the s-th entry (resp. the t-th entry) by 1. Also, d_j can be computed in O(1) time from d_{j−1}, P_{w[1..j−1]}[s], P_{w[1..j]}[s], P_{w[n−j+2..n]}[t], and P_{w[n−j+1..n]}[t]. Hence, the algorithm runs in a total of O(n) time. The extra working space of the algorithm is O(σ), due to the two Parikh vectors we maintain. ⊓⊔

The following corollary is immediate from Lemma 4 and Theorem 5:

Corollary 6. We can compute the longest Abelian cover of a given string of length n in O(n) time with O(σ) working space, if it exists.
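To make the procedure behind Theorem 5 concrete, the following is a minimal C sketch (our own naming, not the authors' code): it maintains the two Parikh vectors together with the mismatch counter d_j, and, by Lemma 4, only looks for a cover of size 2. Symbols are assumed to be integers in 0,...,σ−1 with σ ≤ 256.

#include <stdbool.h>

/* Decide whether w[0..n-1] has an Abelian cover, i.e., whether some prefix
   and suffix of equal length j <= n/2 have the same Parikh vector.
   O(n) time, O(sigma) extra space.                                        */
bool has_abelian_cover(const int *w, int n) {
    int pre[256] = {0}, suf[256] = {0};   /* Parikh vectors of prefix/suffix */
    int d = 0;                            /* number of differing entries     */
    for (int j = 1; j <= n / 2; j++) {
        int s = w[j - 1];                 /* extend the prefix by w[j]       */
        if (pre[s] == suf[s]) d++; else if (pre[s] + 1 == suf[s]) d--;
        pre[s]++;
        int t = w[n - j];                 /* extend the suffix by w[n-j+1]   */
        if (suf[t] == pre[t]) d++; else if (suf[t] + 1 == pre[t]) d--;
        suf[t]++;
        if (d == 0) return true;          /* prefix and suffix of length j match */
    }
    return false;
}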

3.2 All Abelian covers

In this subsection, we consider Problem 2 of computing all Abelian covers of a given string w of length n. Note that the number of all Abelian covers of a string can be exponentially large w.r.t. n. For instance, string a^n has ∑_{k=⌈n/2⌉}^{n−1} 2^{n−k−1} Abelian covers of length at least ⌈n/2⌉. This is because, for any k ≥ ⌈n/2⌉, the union of {[1, k], [n−k+1, n]} and any subset of {[2, k+1], [3, k+2],...,[n−k, n−1]} is an Abelian cover of length k for a^n. Therefore, we consider computing a "compact" representation of all Abelian covers of a given string.

Theorem 7. Given a string w of length n, we can compute an O(n^2)-size representation of all Abelian covers of w in O(n^2) time and O(n) working space. Given a set I of s intervals sorted by the beginning positions of the intervals, the representation allows us to check if I is an Abelian cover of w in O(s) time.

Proof. For each 1 ≤ ℓ ≤ n−1, we compute a subset S_ℓ of positions in w such that S_ℓ = { i | P_{w[i..i+ℓ−1]} = P_{w[1..ℓ]}, 1 ≤ i ≤ n−ℓ+1 }. Then, there exists an Abelian cover

of length ℓ for w iff the distance between any two adjacent positions in S_ℓ is at most ℓ. If S_ℓ satisfies the above condition, then we represent S_ℓ as a bit vector B_ℓ of length n such that B_ℓ[i] = 1 if i ∈ S_ℓ, and B_ℓ[i] = 0 otherwise. If S_ℓ does not satisfy the above condition, then we discard it. Now, given a set I of s intervals sorted by the beginning positions of the intervals, we first check if I is a cover of w and if each interval is of equal length ℓ in a total of O(s) time. If I satisfies both conditions, then we can check if I is a subset of S_ℓ in O(s) time, using the bit vector B_ℓ. Using a similar method to Theorem 5, for each 1 ≤ ℓ ≤ n−1, S_ℓ and its corresponding bit vector B_ℓ can be computed in O(n) time. Hence, the overall time complexity of the algorithm is O(n^2). The working space (excluding the output) is O(n), since |S_ℓ| = O(n) for any ℓ and σ = O(n). ⊓⊔

Given a set I of s intervals, a naïve algorithm to check whether I is an Abelian cover of length ℓ requires O(sℓ) time. Therefore, the solution of Theorem 7 with O(s) query time is more efficient than the naïve method.
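The following sketch (our naming, with a byte-per-position array instead of a packed bit vector) illustrates how one S_ℓ and its vector B_ℓ from the proof of Theorem 7 can be computed for a fixed ℓ with a sliding-window Parikh vector. It discards the row when two adjacent positions of S_ℓ are more than ℓ apart, and additionally checks that the last marked window reaches the end of the string, a detail left implicit in the proof. Symbols are integers in 0,...,255.

#include <stdbool.h>
#include <string.h>

/* Mark B[i] = 1 (0-based i) iff w[i..i+ell-1] is Abelian equivalent to
   w[0..ell-1].  Returns true iff the marked positions witness an Abelian
   cover of length ell.                                                     */
bool build_row(const int *w, int n, int ell, unsigned char *B) {
    int ref[256] = {0}, win[256] = {0};
    int diff = 0, last = 0;
    for (int i = 0; i < ell; i++) { ref[w[i]]++; win[w[i]]++; }
    memset(B, 0, (size_t)n);
    B[0] = 1;                                      /* the reference window   */
    for (int i = 1; i + ell <= n; i++) {
        int out = w[i - 1], in = w[i + ell - 1];   /* slide the window by one */
        if (win[out] == ref[out]) diff++; else if (win[out] - 1 == ref[out]) diff--;
        win[out]--;
        if (win[in] == ref[in]) diff++; else if (win[in] + 1 == ref[in]) diff--;
        win[in]++;
        if (diff == 0) {
            if (i - last > ell) return false;      /* gap too large: no cover */
            B[i] = 1;
            last = i;
        }
    }
    return last == n - ell;                        /* last window ends at position n */
}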

3.3 All Abelian runs

In this subsection, we consider Problem 3 of computing all Abelian runs in a given string w of length n. We follow and extend the results by Cummings and Smyth [4] on the maximum number of all maximal Abelian repetitions in a string, and an algorithm to compute them. We firstly consider a lower bound on the maximum number of Abelian runs in a string.

Lemma 8 ([4]). String (aababbaba)^n of length 8n has Θ(n^2) maximal Abelian repetitions (in fact maximal Abelian squares).

Since the number of Abelian runs in a string is equal to that of maximal Abelian repetitions in that string, the following theorem is immediate:

Theorem 9. The maximum number of Abelian runs in a string w of length n is Ω(n^2).

Next, we show how to compute all Abelian runs in a given string.

Theorem 10. Given a string w of length n, we can compute all Abelian runs in w in O(n^2) time and space.

Proof. We firstly compute all Abelian squares in w using the algorithm proposed by Cummings and Smyth [4]. For each 1 ≤ i ≤ n, we compute a set L_i of integers such that

L_i = { j | P_{w[i−j..i]} = P_{w[i+1..i+j+1]}, 0 ≤ j ≤ min{i, n−i} }.

Note that substring w[i−ℓ..i+ℓ+1] is an Abelian square centered at position i iff ℓ ∈ L_i. After computing all L_i's, we store them in a two-dimensional array L of size ⌊n/2⌋ × (n−1) such that L[ℓ, i] = 1 if ℓ ∈ L_i and L[ℓ, i] = 0 otherwise. All entries of L are initialized unmarked. Then, for each 1 ≤ ℓ ≤ n−1, all maximal Abelian repetitions of period ℓ can be computed in O(n) time, as follows. We scan the ℓ-th row of L from left to right for increasing i = 1,...,n−1, and if we encounter an unmarked entry (ℓ, i) such that L[ℓ, i] = 1, then we compute the largest non-negative integer k such that L[ℓ, i+pℓ+1] = 1 for all 1 ≤ p ≤ k in O(k) time, by skipping every ℓ−1 entries in between.

         i:  1  2  3  4  5  6  7  8  9  10
    ℓ = 1:   0  1  1  0  0  0  0  0  0  0
    ℓ = 2:   0  0  0  0  1  1  1  1  0  0
    ℓ = 3:   0  0  0  0  1  0  0  0  0  0
    ℓ = 4:   0  0  0  0  0  0  0  0  0  0
    ℓ = 5:   0  0  0  0  0  0  0  0  0  0

Figure 4. The two-dimensional array L for string caaabababac. The maximal Abelian repetition aaa of period 1 starting at position 2 is found by concatenating two Abelian squares represented by L[1, 2] and L[1, 3]. The maximal Abelian repetition ababab of period 2 starting at position 4 is found by concatenating two Abelian squares represented by L[2, 5] and L[2, 7]. The maximal Abelian repetition bababa of period 2 starting at position 5 is found by concatenating two Abelian squares represented by L[2, 6] and L[2, 8]. Finally, the maximal Abelian repetition aababa of period 3 starting at position 3 is found from L[3, 5] (this is not extensible to the right). Every concatenation procedure (represented by an arrow) starts from an unmarked entry, and once an entry is involved in the computation of a maximal Abelian repetition, it gets marked. This way the algorithm runs in time linear in the size of L, which is O(n^2).

This gives us a maximal Abelian repetition with period ℓ starting at position i−ℓ+1 and ending at position i+(k+1)ℓ. After computing the largest integer k, we mark the entries L[ℓ, i+pℓ+1] for all 1 ≤ p ≤ k in O(k) time. Since each unmarked entry of the ℓ-th row is marked at most once and is accessed a constant number of times, and since the above procedure starts only from unmarked entries, it takes a total of O(n) time for each ℓ. Therefore, this takes a total of O(n^2) time for all 1 ≤ ℓ ≤ ⌊n/2⌋. (See also Figure 4 for a concrete example of the two-dimensional array L and how to compute all maximal Abelian repetitions from L.)
What remains is how to compute the left and right hands of each Abelian run. If we compute the left and right hands naïvely for all the maximal Abelian repetitions, then it takes a total of O(n^3) time due to Theorem 9. To compute the left and right hands in a total of O(n^2) time, we use the following property of Abelian repetitions: For each 1 ≤ i ≤ n, let P_i be the set of positive integers such that for each ℓ ∈ P_i there exists a maximal Abelian repetition whose period is ℓ and beginning position is i−ℓ+1. For any 1 ≤ j ≤ |P_i|, let ℓ_j denote the j-th smallest element of P_i. We process ℓ_j in increasing order of j = 1,...,|P_i|. Let h_j denote the left hand of the Abelian run that is computed from the maximal Abelian repetition whose period is ℓ_j and beginning position is i−ℓ_j+1. For any 1 ≤ j < |P_i|, assume that we have computed the length of the left hand h_{j−1} of the maximal Abelian repetition beginning at position i−ℓ_{j−1}+1. We are now computing the left hand h_j of the next Abelian run. There are two cases to consider:

1. If j = 1 or ℓ_{j−1} + h_{j−1} ≤ ℓ_j, then we compute the left hand h_j of the maximal Abelian repetition beginning at position i−ℓ_j+1, by comparing the Parikh vector P_{w[i−ℓ_j−k..i−ℓ_j]} for increasing k from 0 up to h_j+1, with the Parikh vector P_{w[i−ℓ_j+1..i]}. This takes O(h_j) time. (See also Figure 5.)
2. If ℓ_{j−1} + h_{j−1} > ℓ_j, then

P_{w[i−ℓ_{j−1}−h_{j−1}+1..i−ℓ_j]} ≺ P_{w[i−ℓ_{j−1}−h_{j−1}+1..i−ℓ_{j−1}]} ≺ P_{w[i−ℓ_{j−1}+1..i]} ≺ P_{w[i−ℓ_j+1..i]}.

This implies that h_j ≥ ℓ_{j−1} + h_{j−1} − ℓ_j. We can compute P_{w[i−ℓ_{j−1}−h_{j−1}+1..i−ℓ_j]} from P_{w[i−ℓ_{j−1}−h_{j−1}+1..i−ℓ_{j−1}]} in O(ℓ_j − ℓ_{j−1}) time. Then, we compute the left hand h_j


Figure 5. Illustration for Case 1 (j = 1 or ℓ_{j−1} + h_{j−1} ≤ ℓ_j) of Theorem 10. We can compute the left hand h_j in O(h_j) time by extending the substring to the left from position i−ℓ_j.


Figure 6. Illustration for Case 2 (ℓ_{j−1} + h_{j−1} > ℓ_j) of Theorem 10. In this case, we know that h_j is at least ℓ_{j−1} + h_{j−1} − ℓ_j. The Parikh vector of w[i−ℓ_{j−1}−h_{j−1}+1..i−ℓ_j] can be computed in O(ℓ_j − ℓ_{j−1}) time by a scan of substring w[i−ℓ_j+1..i−ℓ_{j−1}] (dashed arrow). Then, we can compute the left hand h_j in a total of O(h_j + ℓ_j − h_{j−1} − ℓ_{j−1}) time by extending the substring to the left from position i−ℓ_{j−1}−h_{j−1} (solid arrow).

by comparing the Parikh vector P_{w[i−ℓ_{j−1}−h_{j−1}+1−k..i−ℓ_j]} for increasing k from 0 up to h_j + ℓ_j − h_{j−1} − ℓ_{j−1} + 1. This takes O(h_j + ℓ_j − h_{j−1} − ℓ_{j−1}) time. (See also Figure 6.)

Let J_i^1 and J_i^2 be the disjoint subsets of [1, |P_i|] such that j ∈ J_i^1 if ℓ_j ∈ P_i corresponds to Case 1, and j ∈ J_i^2 if ℓ_j ∈ P_i corresponds to Case 2. Then, the summations ∑_{j∈J_i^1} (h_j), ∑_{j∈J_i^2} (ℓ_j − ℓ_{j−1}), and ∑_{j∈J_i^2} (h_j + ℓ_j − ℓ_{j−1} − h_{j−1}) corresponding to the time costs for Cases 1 and 2 are all bounded by O(n). Therefore, it takes a total of O(n) time to compute the left hands of all Abelian runs that correspond to P_i, and the right hands can be computed similarly. Hence, it takes a total of O(n^2) time to compute all Abelian runs in w. The working space of the algorithm is dominated by the two-dimensional array L, which takes O(n^2) space. ⊓⊔
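For illustration, the row-scanning step of the proof of Theorem 10 can be sketched as follows (our own code, not the authors'; it follows the indexing convention of Figure 4, where L[ℓ, i] = 1 marks an Abelian square of period ℓ centered at position i, and two such squares whose centers are ℓ positions apart concatenate into a longer repetition). The computation of the hands is omitted; indices are 1-based, with index 0 of the arrays unused.

#include <stdio.h>
#include <stdbool.h>

/* Report all maximal Abelian repetitions of period ell in a string of
   length n+1 centers, given row[i] = 1 iff an Abelian square of period ell
   is centered at i (Figure 4 convention: the square occupies w[i-ell+1..i+ell]).
   Each unmarked 1-entry starts a chain of entries ell apart; marking keeps
   the total work linear in the row length.                                 */
void repetitions_of_period(const unsigned char *row, bool *marked, int n, int ell) {
    for (int i = 1; i <= n; i++) {
        if (row[i] && !marked[i]) {
            int k = 0;
            while (i + (k + 1) * ell <= n && row[i + (k + 1) * ell]) {
                k++;
                marked[i + k * ell] = true;      /* never restart a chain here */
            }
            /* a repetition of k+2 blocks of length ell                        */
            printf("period %d: w[%d..%d]\n", ell, i - ell + 1, i + (k + 1) * ell);
        }
    }
}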

4 Conclusions and future work

Abelian regularities on strings were initiated by Erdős [6] in the early 60's, and since then they have been extensively studied in stringology. In this paper, we introduced new regularities on strings with respect to Abelian equivalence on strings, which we call Abelian covers and Abelian runs. Firstly, we showed an optimal O(n)-time, O(σ)-space algorithm to determine whether or not a given string w of length n over an alphabet of size σ has an Abelian cover. As a consequence of this, we can compute the longest Abelian cover of w in O(n) time. Secondly, we showed an O(n^2)-time algorithm to compute an O(n^2)-space representation of all (possibly exponentially many) Abelian covers of a string of length n. Thirdly, we presented an O(n^2)-time algorithm to compute all Abelian runs in a string of length n. We also remarked that the maximum number of Abelian runs in a string of length n is Ω(n^2). Our future work includes the following:
– The algorithm of Theorem 7 allows us to compute a shortest Abelian cover of a given string of length n in O(n^2) time. Can we compute a shortest Abelian cover in o(n^2) time?
– The algorithm of Theorem 10 requires Θ(n^2) time to compute all Abelian runs of a given string w of length n. This is due to the two-dimensional array L of Θ(n^2) space. Can we compute all Abelian runs in w in optimal O(n + r) time, where r is the number of Abelian runs in w?

References

1. A. Amir, T. M. Chan, M. Lewenstein, and N. Lewenstein: On hardness of jumbled indexing, in Proc. ICALP 2014 (to appear), 2014, preprint available at http://arxiv.org/abs/1405.0189.
2. P. Burcsi, F. Cicalese, G. Fici, and Zs. Lipták: On table arrangements, scrabble freaks, and jumbled pattern matching, in Proc. FUN 2010, 2010, pp. 89–101.
3. M. Crochemore, C. S. Iliopoulos, T. Kociumaka, M. Kubica, J. Pachocki, J. Radoszewski, W. Rytter, W. Tyczyński, and T. Waleń: A note on efficient computation of all Abelian periods in a string. Inf. Process. Lett., 113(3) 2013, pp. 74–77.
4. L. J. Cummings and W. F. Smyth: Weak repetitions in strings. J. Combinatorial Mathematics and Combinatorial Computing, 24 1997, pp. 33–48.
5. R. C. Entringer and D. E. Jackson: On nonrepetitive sequences. J. Comb. Theory, Ser. A, 16(2) 1974, pp. 159–164.
6. P. Erdős: Some unsolved problems. Hungarian Academy of Sciences Mat. Kutató Intézet Közl., 6 1961, pp. 221–254.
7. G. Fici, T. Lecroq, A. Lefebvre, É. Prieur-Gaston, and W. F. Smyth: Quasi-linear time computation of the Abelian periods of a word, in Proc. PSC 2012, 2012, pp. 103–110.
8. G. Fici, T. Lecroq, A. Lefebvre, and E. Prieur-Gaston: Computing Abelian periods in words, in Proc. PSC 2011, 2011, pp. 184–196.
9. T. Gagie, D. Hermelin, G. M. Landau, and O. Weimann: Binary jumbled pattern matching on trees and tree-like structures, in Proc. ESA 2013, 2013, pp. 517–528.
10. D. Hermelin, G. M. Landau, Y. Rabinovich, and O. Weimann: Binary jumbled pattern matching via all-pairs shortest paths. CoRR, abs/1401.2065, 2014.
11. V. Keränen: Abelian squares are avoidable on 4 letters, in Proc. ICALP 1992, 1992, pp. 41–52.
12. T. Kociumaka, J. Radoszewski, and W. Rytter: Efficient indexes for jumbled pattern matching with constant-sized alphabet, in Proc. ESA 2013, 2013, pp. 625–636.
13. T. Kociumaka, J. Radoszewski, and W. Rytter: Fast algorithms for Abelian periods in words and greatest common divisor queries, in Proc. STACS 2013, 2013, pp. 245–256.
14. R. Kolpakov and G. Kucherov: Finding maximal repetitions in a word in linear time, in Proc. FOCS 1999, 1999, pp. 596–604.
15. Y. Li and W. F. Smyth: Computing the cover array in linear time. Algorithmica, 32(1) 2002, pp. 95–106.
16. T. M. Moosa and M. S. Rahman: Indexing permutations for binary strings. Inf. Process. Lett., 110(18–19) 2010, pp. 795–798.
17. R. Parikh: On context-free languages. J. ACM, 13(4) 1966, pp. 570–581.
18. P. A. B. Pleasants: Non-repetitive sequences. Mathematical Proceedings of the Cambridge Philosophical Society, 68 1970, pp. 267–274.

Two Squares Canonical Factorization⋆

Haoyue Bai1, Frantisek Franek1, and William F. Smyth1,2

1 Department of Computing and Software, McMaster University, Hamilton, Ontario, Canada
{baih3,franek,smyth}@mcmaster.ca
2 School of Computer Science & Software Engineering, University of Western Australia

Abstract. We present a new combinatorial structure in a string: a canonical factorization for any two squares that occur at the same position and satisfy some size restrictions. We believe that this canonical factorization will have application to related problems such as the New Periodicity Lemma, the Crochemore-Rytter Three Squares Lemma, and ultimately the maximum-number-of-runs conjecture.

Keywords: string, primitive string, square, double square, factorization

1 Introduction

In 1995 Crochemore and Rytter [2] considered three distinct squares, all prefixes of a given string x, and proved the Three Squares Lemma stating that, subject to certain restrictions, the length of the largest of the three was at least the sum of the lengths of the other two. In 2006 Fan et al. [4] considered a special case of two such square prefixes of x with a third square possibly offset some distance to the right; they proved a New Periodicity Lemma describing conditions under which the third square could not exist. Since that time there has been considerable work done [1,5,6,8] in an effort to specify more precisely the combinatorial structure of the string in the neighbourhood of such two squares.
In this paper we present a unique canonical factorization into primitive strings of what we call double squares – i.e. two squares starting at the same position and satisfying some size restrictions. The notion of double squares and their unique factorization can be traced to Lam [7]. A version of the factorization for more specific double squares was presented in [3]. Here we present it in full generality. In conclusion we indicate how this result can be applied to the proof of the New Periodicity Lemma.

2 Preliminaries

In this section we develop the basic combinatorial tools that will be used to determine a canonical factorization for a double square. Chief among these are the Synchronization Principle (see Lemma 2) and the Common Factor Lemma (see Lemma 3), which lead to the main result, the Two Squares Factorization Lemma (see Lemma 6).
A string x is a finite sequence of symbols, called letters, drawn from a (finite or infinite) set Σ, called the alphabet. The length of the sequence is called the length of x, denoted |x|. Sometimes for convenience we represent a string x of length n as an array x[1..n]. The string of length zero is called the empty string, denoted ε. If a string x = uvw, where u, v, w are strings, then u (respectively, v, w) is said to

⋆ This work was supported by the Natural Sciences and Engineering Research Council of Canada


be a prefix (respectively, substring, suffix) of x; a proper prefix (respectively, proper substring, proper suffix) if |u| < |x| (respectively, |v| < |x|, |w| < |x|). A substring is also called a factor. Given strings u and v, lcp(u, v) (respectively, lcs(u, v)) is the longest common prefix (respectively, longest common suffix) of u and v.
If x is a concatenation of k ≥ 2 copies of a nonempty string u, we write x = u^k and say that x is a repetition; if k = 2, we say that x = u^2 is a square; if there exist no such integer k and no such u, we say that x is primitive. If x = v^2 has a proper prefix u^2 with |u| < |v| < 2|u|, we say that x is a double square and write x = DS(u, v). A square u^2 such that u has no square prefix is said to be regular.
For x = x[1..n], 1 ≤ i < j ≤ j+k ≤ n, the string x[i+k..j+k] is a right cyclic shift by k positions of x[i..j] if x[i] = x[j+1], ..., x[i+k−1] = x[j+k]. Equivalently, we can say that x[i..j] is a left cyclic shift by k positions of x[i+k..j+k]. When it is clear from the context, we may leave out the number of positions and just speak of a cyclic shift. Strings uv and vu are conjugates, written uv ∼ vu. We also say that vu is the |u|-th rotation of x, written R_{|u|}(x), or the −|v|-th rotation of x, written R_{−|v|}(x), while R_0(x) = R_{−|x|}(x) = x is a primitive rotation. Similarly as for the cyclic shift, when it is clear from the context, we may leave out the number of rotations and just speak of a rotation. Note that all cyclic shifts are conjugates, but not the other way around.
In the following lemma, the symbol | denotes divisibility, i.e. a | b means that a is divisible by b.

Lemma 1 [9, Lemma 1.4.2] Let x be a string of length n and minimum period π ≤ n, and let j = 1,...,n−1 be an integer. Then R_j(x) = x if and only if x is not primitive and π | j.

Lemma 2 (Synchronization Principle) Let x be a primitive string and p ≥ 1 an integer. Then the only occurrences of x in x^p are those at positions 1, |x|+1, ..., (p−1)|x|+1.

Proof. From Lemma 1, a rotation R_j(x) of x can equal x only if x is not primitive. Since here x is primitive, the only occurrences of x are exactly those determined by x^p. ⊓⊔

Lemma 3 (Common Factor Lemma) Suppose that x and y are primitive strings, where x_1 (respectively, y_1) is a proper prefix and x_2 (respectively, y_2) a proper suffix of x (respectively, y). If for nonnegative integers p and q, x_2 x^p x_1 and y_2 y^q y_1 have a common factor of length |x| + |y|, then x ∼ y.

Proof. First consider the special case x_1 = x_2 = y_1 = y_2 = ε, where x^p, y^q have a common prefix f of length |x| + |y|. We show that in this case x = y.
Observe that f has prefixes x and y, so that if |x| = |y|, then x = y, as required. Therefore suppose WLOG that |x| < |y|. Note that y ≠ x^k for any integer k ≥ 2, since otherwise y would not be primitive, contradicting the hypothesis of the lemma.

Hence there exists k ≥ 1 such that k|x| < |y| and (k+1)|x| > |y|. But since f = yx, it follows that

R_{|y|−k|x|}(x) = x,

again by Lemma 1 contrary to the assumption that x is primitive. We conclude that |x| ≮ |y|, hence that |x| = |y| and x = y, as required.
Now consider the general case, where f of length |x| + |y| is a common factor of x_2 x^p x_1 and y_2 y^q y_1. Then x_2 x^p x_1 = u f u′ for some u and u′. If |u| ≥ |x|, then f is a factor of x_1 x^{p−1} x_2, and so we can assume WLOG that |u| < |x|. Setting x̃ = R_{|u|}(x), we see that f is a prefix of x̃^p. Similarly, by setting y_2 y^q y_1 = v f v′, we can assume that |v| < |y|, hence that f is also a prefix of ỹ^q for ỹ = R_{|v|}(y). But this is just the special case considered above, for which x̃ = ỹ. Since x ∼ x̃ and y ∼ ỹ, the result follows. ⊓⊔

Note that Lemma 3 could be equivalently stated in a more general form:

Lemma 4 Suppose that x and y are strings where x_1 (respectively, y_1) is a proper prefix and x_2 (respectively, y_2) a proper suffix of x (respectively, y). If for nonnegative integers p and q, x_2 x^p x_1 and y_2 y^q y_1 have a common factor of length |x| + |y|, then the primitive root of x and the primitive root of y are conjugates.

The Common Factor Lemma gives rise to the following useful corollary:

Lemma 5 Suppose that x and y are primitive strings, and that p and q are positive integers.
(a) If x^p = y^q, then x = y and p = q.
(b) If x_1 (respectively, y_1) is a proper prefix of x (respectively, y) and x^p x_1 = y^q y_1 for p ≥ 2, q ≥ 2, then x = y, x_1 = y_1 and p = q.

Proof. For (a), first consider p = 1, thus x = y^q. Since x is primitive, therefore q = 1 and x = y, as required. Similarly for q = 1. Suppose then that p, q ≥ 2. This means that x^p and y^q = x^p have a common factor of length p|x| = q|y| ≥ |x| + |y|, so that by Lemma 3 x ∼ y. Hence |x| = |y| and so x = y.
For (b), since again p ≥ 2, q ≥ 2, it follows as in (a) that x^p x_1 = y^q y_1 has a common factor of length at least |x| + |y|, hence the result. ⊓⊔

Note that in Lemma 5(b) the requirement p ≥ 2, q ≥ 2 is essential. For instance, x = aabb, x_1 = aa and p = 2 yields x^p x_1 = aabbaabbaa, identical to y^q y_1 produced by y = aabbaabba, y_1 = a and q = 1 — but of course x ≠ y.

3 Main Result – Two Squares Factorization Lemma

The next lemma specifies the structure imposed by the occurrence of two squares at the same position in a string. This structure has been described before, see [3,4,5,6,7], but not as precisely and with more assumptions required; above all, Lemma 6 establishes the uniqueness of the breakdown.

Lemma 6 (Two Squares Factorization Lemma) For a double square DS(u, v), there exists a unique primitive string u_1 such that u = u_1^{e_1} u_2 and v = u_1^{e_1} u_2 u_1^{e_2}, where u_2 is a possibly empty proper prefix of u_1 and e_1, e_2 are integers such that e_1 ≥ e_2 ≥ 1. Moreover,

(a) if |u_2| = 0, then e_1 > e_2 ≥ 1;
(b) if |u_2| > 0, then v is primitive, and if in addition e_1 ≥ 2, then u also is primitive.
In both cases, the factorization is unique.

Proof. If we have u^k, k ≥ 2, we refer to the first copy of u as u_[1], to the second copy of u as u_[2], etc.
Let z be the nonempty proper prefix of u_[2] that is in addition a suffix of v_[1]. But then z is also a prefix of v_[1], hence of v_[2]; thus if |u| ≥ 2|z|, it follows that z^2 is a prefix of u. In general, there exists an integer k = ⌊|u|/|z|⌋ ≥ 1 such that u = z^k z′ for some proper suffix z′ of z. Let u_1 be the primitive root of z, so that z = u_1^{e_2} for some integer e_2 ≥ 1. Therefore, for some e_1 ≥ e_2 k and some prefix u_2 of u_1, u = u_1^{e_1} u_2 and v = u z = u_1^{e_1} u_2 u_1^{e_2}, as required.
To prove uniqueness we consider two cases:
(i) |u_2| = 0
Here u = u_1^{e_1} and v = u_1^{e_1+e_2}, so that x = u_1^{2(e_1+e_2)}. Since |v| < 2|u| and e_1 ≥ e_2, it follows that e_1 > e_2. The uniqueness of u_1 is a consequence of Lemma 5(a).
(ii) |u_2| > 0
Suppose the choice of u_1 is not unique. Then there exists some primitive string w_1 with proper prefix w_2, together with integers f_1 ≥ f_2 ≥ 1, such that u = w_1^{f_1} w_2 and v = w_1^{f_1} w_2 w_1^{f_2}. If both e_1 ≥ 2 and f_1 ≥ 2, it follows from Lemma 5(b) that u_1 = w_1 and e_1 = f_1. If e_1 = f_1 = 1, we observe that v = u u_1 = u w_1, so that again u_1 = w_1. In the only remaining case, exactly one of e_1, f_1 equals 1: therefore suppose WLOG that f_1 > e_1 = 1. Then u = u_1 u_2 = w_1^{f_1} w_2 and v = u_1 u_2 u_1 = w_1^{f_1} w_2 w_1^{f_2}, so that u_1 = w_1^{f_2}. But since u_1 is primitive, this forces f_2 = 1 and u_1 = w_1, which, since u_1 u_2 = w_1^{f_1} w_2 = u_1^{f_1} w_2, implies that f_1 = 1, a contradiction. Thus all cases have been considered, and u_1 is unique.
We now show that v is primitive. Suppose the contrary, so there exists some primitive w and an integer k ≥ 2 such that v = w^k. It follows that |w| ≤ |v|/2 ≤ |u_1^{e_1}| + |u_2|. Note that

w^{2k} = v^2 = u_1^{e_1} u_2 u_1^{e_1+e_2} u_2 u_1^{e_2},   (1)

so that w^{2k} and u_1^{e_1+e_2} u_2 have a common factor u_1^{e_1+e_2} u_2 of length (|u_1^{e_1}| + |u_2|) + |u_1^{e_2}| ≥ |w| + |u_1|. Thus we can apply Common Factor Lemma 3 to conclude that w ∼ u_1, thus by (1) that w = u_1. But (1) then requires that the primitive string u_1 = u_2 ū_2 aligns with ū_2 u_2, and so u_2 ū_2 = ū_2 u_2, in contradiction to Lemma 1. We conclude that v is primitive.

Now suppose in addition that e_1 ≥ 2, but that u is not primitive. Then there exists some primitive w and some integer k ≥ 2 such that u = w^k. Hence |w| ≤ |u|/2 = (|u_1^{e_1}| + |u_2|)/2 < |u_1^{e_1−1}| + |u_2|, since e_1 ≥ 2 and |u_2| > 0. Therefore, since u_1^{e_1} u_2 is a prefix of u^2 = w^{2k}, and since e_2 ≥ 1 by Lemma 6, w^{2k} and u_1^{e_1+e_2} have a common prefix u_1^{e_1} u_2. Note that |u_1^{e_1} u_2| ≥ |w| + |u_1|, so that again applying Common Factor Lemma 3, we conclude that u_1 = w. This in turn implies u = u_1^{e_1} u_2 = u_1^k, impossible since 0 < |u_2| < |u_1|. Therefore u is primitive, as required.
Finally we remark that since u_1 is a uniquely determined primitive string, therefore u_2, e_1 and e_2 are also uniquely determined. ⊓⊔
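The constructive part of the proof of Lemma 6 translates directly into a small procedure. The following C sketch (our own, with hypothetical helper names; not the authors' code) computes (u_1, u_2, e_1, e_2) for a double square given by v and the length of its prefix u: it takes z as the suffix of v of length |v|−|u|, lets u_1 be the primitive root of z found via the border array, and reads off the exponents. It assumes the input already satisfies |u| < |v| < 2|u|.

#include <stdio.h>

/* Smallest period of s[0..n-1] via the border (failure) table; if it does
   not divide n the string is primitive and its primitive root is itself.
   For this sketch we assume n < 256.                                       */
static int primitive_root_length(const char *s, int n) {
    int border[256] = {0};
    for (int i = 1, b = 0; i < n; i++) {
        while (b > 0 && s[i] != s[b]) b = border[b - 1];
        if (s[i] == s[b]) b++;
        border[i] = b;
    }
    int p = n - border[n - 1];
    return (n % p == 0) ? p : n;
}

/* Given v = v[0..vlen-1] and its prefix u of length ulen (ulen < vlen < 2*ulen),
   print the canonical factorization v = u1^e1 u2 u1^e2, u = u1^e1 u2.       */
static void canonical_factorization(const char *v, int vlen, int ulen) {
    int zlen = vlen - ulen;               /* z = suffix of v of length |v|-|u| */
    const char *z = v + ulen;
    int p = primitive_root_length(z, zlen);
    int e2 = zlen / p;                    /* z  = u1^e2                        */
    int e1 = ulen / p;                    /* u  = u1^e1 u2                     */
    int u2len = ulen % p;                 /* u2 = prefix of u1, also of v      */
    printf("u1 = %.*s, u2 = %.*s, e1 = %d, e2 = %d\n", p, z, u2len, v, e1, e2);
}

int main(void) {
    /* Example (a) below: u = aba, v = abaab gives u1 = ab, u2 = a, e1 = e2 = 1. */
    canonical_factorization("abaab", 5, 3);
    return 0;
}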

The following examples show that the statement of the lemma is sharp:

(a) The second part of Lemma 6(b) requires that e_1 ≥ 2. To see that this condition is not necessary, consider v^2 = abaababaab, where u = (ab)a, v = (ab)a(ab), so that u_1 = ab, u_2 = a, e_1 = e_2 = 1, but u is primitive.
(b) On the other hand, consider v^2 = abaabaabaababaabaabaab, where u = (aba)^2 = (abaab)a, v = (abaab)a(abaab), so that u_1 = abaab, u_2 = a, e_1 = e_2 = 1, where now u is not primitive.

Lemma 6 gives credence to the following definition of terminology and notation:

Definition 7 For a double square DS(u, v) we call the unique factorization v^2 = u_1^{e_1} u_2 u_1^{e_1+e_2} u_2 u_1^{e_2} guaranteed by Lemma 6 the canonical factorization of DS(u, v) and denote it by DS(u, v) = (u_1, u_2, e_1, e_2). The symbol ū_2 denotes the suffix of u_1 such that u_1 = u_2 ū_2.

Lemma 6 also gives rise to a number of important observations:

Observation 8 In Lemma 6, |u_2| > 0 if any one of the following conditions holds:
(a) v is primitive;
(b) u is primitive;
(c) there is no other occurrence of u^2 farther to the right in v^2 (u^2 is rightmost);
(d) u^2 is regular.
Moreover:

(e) |u_2| > 0 if and only if v is primitive;
(f) If u^2 is regular, then e_1 = e_2 = 1 and u_1 is regular.

Proof.
(a) |u_2| = 0 implies v not primitive.
(b) |u_2| = 0 implies u not primitive.
(c) |u_2| = 0 implies u^2 = u_1^{2e_1}, which occurs twice in v^2 = u_1^{2(e_1+e_2)}, in particular as a suffix.
(d) Since u^2 is regular, therefore u is primitive, so that by (b), |u_2| > 0.
(e) By (a), primitive v implies |u_2| > 0; by Lemma 6, |u_2| > 0 implies that v is primitive.
(f) By (d), regular u^2 implies |u_2| > 0, so that u = u_1^{e_1} u_2, which is regular only if e_1 = e_2 = 1 and u_1 is regular. ⊓⊔

In the context of Observation 8(f), consider the double square DS(u, v) where u = aabaa, v = aabaaaab. In this case, we find u_1 = aab, u_2 = aa, e_1 = e_2 = 1, but observe that u has prefix a^2, so u^2 is not regular. Thus the condition e_1 = 1 is more general than the requirement that u^2 be regular.

Now, following [3], consider the case |u_2| > 0 of Lemma 6 and set u_1 = u_2 ū_2. Thus v^2 becomes

v^2 = (u_2 ū_2)^{e_1} u_2 (u_2 ū_2)^{e_1+e_2} u_2 (u_2 ū_2)^{e_2}
    = (u_2 ū_2)^{e_1−1} u_2 (IF) (u_2 ū_2)^{e_1+e_2−2} u_2 (IF) (u_2 ū_2)^{e_2−1}   (2)

where IF = ū_2 u_2 u_2 ū_2 = R_{|u_2|}(u_1) u_1 is called the inversion factor.

Lemma 9 Consider a double square DS(u, v) = (u_1, u_2, e_1, e_2) with a non-empty u_2. Then the inversion factor IF has exactly two occurrences in v^2, exactly a distance of |v| apart, as shown in (2).

Proof. If IF occurs elsewhere in v^2, by the Synchronization Principle its subfactor u_2 ū_2 must align with an occurrence of u_2 ū_2, as it is primitive. Thus, its subfactor ū_2 u_2 must align with u_2 ū_2, contradicting the primitiveness of u_2 ū_2, see Lemma 1. ⊓⊔

The quantity lcs(u_2 ū_2, ū_2 u_2) gives the maximal number of positions the structures (u_2 ū_2)^{e_1+e_2} and (u_2 ū_2)^{e_2} can be cyclically shifted to the left in v^2, while lcp(u_2 ū_2, ū_2 u_2) gives the maximal number of positions the structures (u_2 ū_2)^{e_1} and (u_2 ū_2)^{e_1+e_2} can be cyclically shifted to the right. In [3], the following lemma limiting the size of lcs(u_2 ū_2, ū_2 u_2) + lcp(u_2 ū_2, ū_2 u_2) was given.

Lemma 10 ([3]) Considering u_1^{e_1} u_2 u_1^{e_1+e_2} u_2 u_1^{e_2}, where u_1 is primitive and u_2 is a non-empty proper prefix of u_1, e_1 ≥ e_2 ≥ 1, and ū_2 a suffix of u_1 so that u_1 = u_2 ū_2, then lcs(u_2 ū_2, ū_2 u_2) + lcp(u_2 ū_2, ū_2 u_2) ≤ |u_1| − 2.

In fact, in [3] the inversion factor is defined more generally as any factor w̄ w w w̄ of v^2 such that |w| = |u_2| and |w̄| = |ū_2|, and a stronger result is given (re-phrased in the terminology of this paper):

Lemma 11 ([3]) Consider a double square DS(u, v) = (u_1, u_2, e_1, e_2) with a non-empty u_2 and let p = lcp(u_2 ū_2, ū_2 u_2) and s = lcs(u_2 ū_2, ū_2 u_2). Then any inversion factor in v^2 is either R_i(IF) or R_{−j}(IF) for some i ∈ 0,...,p or some j ∈ 0,...,s. Moreover, every R_i(IF) or R_{−j}(IF) appears exactly twice in v^2, exactly a distance |v| apart, for every i ∈ 0,...,p and every j ∈ 0,...,s.

4 Possible application to New Periodicity Lemma

Some years ago a New Periodicity Lemma was published [4], showing that the occurrence of two special squares at a position i in a string necessarily precludes the occurrence of other squares of specific period in a specific neighbourhood of i. The proof of this lemma was complex, breaking down into 14 subcases, and required a very strong condition that the shorter of the two squares be regular.

Lemma 12 ([4], New Periodicity Lemma) Let x = DS(u, v), where we require that u^2 be regular and that v be primitive. Then for all integers k and w such that 0 ≤ k < |v| − |u| and |v| − |u| < w

Consider DS(u, v) = (u_1, u_2, 1, 1). Let ū_2 be a suffix of u_1 such that u_1 = u_2 ū_2. The canonical factorization thus has the form (u_2 ū_2) u_2 (u_2 ū_2)(u_2 ū_2) u_2 (u_2 ū_2).
Let us consider a square w^2 such that |u_1| < |w| < |v| and |w| ≠ |u|. We want to show that this is not possible.

If for instance w^2 starts in the first ū_2 and ends in the fourth ū_2, then w contains fully the IF, so the second w has to as well, and so |w| ≥ |v|, a contradiction.
If w^2 ends in the second ū_2 we cannot argue using IF, but still knowing that u_2 ū_2 is primitive and also that all its rotations are primitive, the Synchronization Principle can be applied to obtain a contradiction.
Almost all possible cases for w^2 except two can be easily shown impossible using only the properties of the canonical factorization. Thus, we believe, and it is our immediate goal for future research, that the canonical factorization will not only provide us with a significantly simplified proof of the New Periodicity Lemma, but will also allow us to significantly reduce the condition on u^2 from u^2 being regular to u just being primitive. We also believe that the canonical factorization in the same way will not only provide a simpler proof of the Crochemore-Rytter Three Squares Lemma, but will extend the applicability of the lemma to three squares when any of the squares is primitive (the original lemma requires that the smallest square be primitive).

5 Conclusion and future work

We presented a unique factorization of a double square, i.e. a configuration of two squares u^2 and v^2 starting at the same position and satisfying |u| < |v| < 2|u|. We call this factorization the canonical factorization. It has very strong combinatorial properties as it is an almost periodic repetition of a primitive string. We indicated that we would like to use this new insight into the structure of double squares to improve the New Periodicity Lemma [4] and the Crochemore-Rytter Three Squares Lemma [2] and to simplify their proofs. As of preparing this final version of the Prague Stringology Conference 2014 proceedings, we are happy to report that the canonical factorization presented here indeed greatly simplified and generalized both. The follow-up work will focus on presenting these results in the near future.

References

1. W. Bland and W. F. Smyth: Overlapping squares: the general case characterized & applications, submitted for publication, 2014.
2. M. Crochemore and W. Rytter: Squares, cubes, and time-space efficient string searching. Algorithmica, 13 1995, pp. 405–425.
3. A. Deza, F. Franek, and A. Thierry: How many double squares can a string contain?, submitted for publication, 2013.
4. K. Fan, S. Puglisi, W. F. Smyth, and A. Turpin: A new periodicity lemma. SIAM Journal on Discrete Mathematics, 20 2006, pp. 656–668.
5. F. Franek, R. C. G. Fuller, J. Simpson, and W. F. Smyth: More results on overlapping squares. Journal of Discrete Algorithms, 17 2012, pp. 2–8.
6. E. Kopylova and W. F. Smyth: The three squares lemma revisited. Journal of Discrete Algorithms, 11 2012, pp. 3–14.
7. N. H. Lam: On the number of squares in a string. AdvOL-Report 2013/2, McMaster University, 2013.
8. J. Simpson: Intersecting periodic words. Theoretical Computer Science, 374 2007, pp. 58–65.
9. B. Smyth: Computing Patterns in Strings, Pearson Addison-Wesley, 2003.

Multiple Pattern Matching Revisited

Robert Susik1, Szymon Grabowski1, and Kimmo Fredriksson2

1 Lodz University of Technology, Institute of Applied Computer Science, Al. Politechniki 11, 90–924 Łódź, Poland
{rsusik|sgrabow}@kis.p.lodz.pl
2 School of Computing, University of Eastern Finland, P.O.B. 1627, FI-70211 Kuopio, Finland
[email protected]

Abstract. We consider the classical exact multiple string matching problem. Our solution is based on q-grams combined with pattern superimposition, bit-parallelism and alphabet size reduction. We discuss the pros and cons of the various alternatives of how to achieve the best combination. Our method is closely related to previous work by Salmela et al. (2006). The experimental results show that our method performs well on different alphabet sizes and that it scales to large pattern sets.

Keywords: combinatorial problems, string algorithms, q-grams, word-level parallelism

1 Introduction

Multiple pattern matching is a classic problem, with about 40 years of history, with applications in intrusion detection, anti-virus software and bioinformatics, to name a few. The problem can be stated as follows: Given a text T of length n and a pattern set 𝒫 = {P_1,...,P_r}, in which each pattern is of length m, and all considered sequences are over a common alphabet Σ of size σ, find all pattern occurrences in T. The equal pattern length requirement may be removed. The multiple pattern matching problem is a straightforward generalization of single pattern matching and it is no surprise that many techniques worked out for a single pattern are borrowed in efficient algorithms for multiple patterns.

1.1 Related work

The classical algorithms for the present problem can be roughly divided into three different categories: (i) prefix searching, (ii) suffix searching and (iii) factor searching. Another way to classify the solutions is to say that they are based on character comparisons, hashing, or bit-parallelism. Yet another view is to say that they are based on filtering, aiming for good average case complexity, or on some kind of "direct search" with good worst case complexity guarantees. These different categorizations are of course not mutually exclusive, and many solutions are hybrids that borrow ideas from several techniques. For a good overview of the classical solutions we refer the reader e.g. to [21,16,9]. We briefly review some of them in the following.
Perhaps the most famous solution to the multiple pattern matching problem is the Aho–Corasick (AC) [1] algorithm, which works in linear time (prefix-based approach). It builds a pattern trie with extra (failure) links and actually generalizes the Knuth–Morris–Pratt algorithm [18] for a single pattern. More precisely, AC total time is O(M + n + z), where M, the sum of the pattern lengths, is the preprocessing cost, and z is the total number of pattern occurrences in T.



Recently Fredriksson and Grabowski [15] showed an average-optimal filtering variant of the classic AC algorithm. They built the AC automaton over superimposed subpatterns, which allows sampling the text characters at regular distances without missing any match (i.e., any verification). This algorithm is based on the same ideas as the current work.
Another classic algorithm is Commentz–Walter [7], which generalizes the ideas of the Boyer–Moore (BM) algorithm [4] for a single pattern to solve the multiple pattern matching problem (suffix-based approach). Set Horspool [12,21] may be considered its more practical simplification, exactly in the way that Boyer–Moore–Horspool (BMH) [17] is a simplification of the original BM. Set Horspool makes use of a generalized bad character function. The Horspool technique was used in a different way in an earlier algorithm by Wu and Manber [24]. These methods are based on backward matching over a sliding text window, which is shifted based on some rule, and the hope is that many text characters can be skipped altogether.
The first factor based algorithms were DAWG-match [8] and MultiBDM [10]. Like Commentz–Walter and Set Horspool they are based on backward matching. However, instead of recognizing the pattern suffixes, they recognize the factors, which effectively means that they work more per window, but in return they are able to make longer shifts of the sliding window, and in fact they obtain optimal average case complexity. At the same time they are linear in the worst case. The drawback is that these algorithms are reasonably complex and not very efficient in practice. A more practical approach is the Set Backward Oracle Matching (SBOM) algorithm [2], which is based on the same idea as MultiBDM, but uses simpler data structures and is very efficient in practice. Yet another variant is the Succinct Backward DAWG Matching algorithm (SBDM) [14], which is practical for huge pattern sets due to replacing the suffix automaton with a succinct index. The factor based algorithms usually lead to the average-optimal [19] complexity O(n log_σ(rm)/m).
Bit-parallelism can be used to replace the various automata in the previous methods to obtain very simple and very efficient variants of many classical algorithms. The classic method for a single pattern is Shift-Or [3]. The idea is to encode a (non-deterministic) automaton as a bitvector, i.e. a small integer value, and simulate all the states in parallel using Boolean logic and arithmetic. The result is often the most practical method for the problem, but the drawback is that the scalability is limited by the number of bits in a computer word, although there exist ways to alleviate this problem somewhat, see [22,6]. Another way that is applicable to huge pattern sets is to combine bit-parallelism with q-grams; our method is also based on this, and we review the idea and related previous work in detail in the next section.
Some recent work also recognizes the neglected power of the SIMD instructions, which have been available on commodity computers for well over a decade. For example, Faro and Külekci [11] make use of the Intel Streaming SIMD Extensions (SSE) technology, which gives wide registers and many special purpose instructions to work with. They develop (among other things) a wsfp (word-size fingerprint) operation, based on a hardware opcode for computing CRC32 checksums, which computes an α-bit fingerprint from a w-bit register handled as a block of α characters.
Similar values are obtained for all α-sized factors of all the patterns in the preprocessing, and wsfp can therefore be used as a simple yet efficient hash function to identify text blocks that may contain a matching pattern.
The paper is organized as follows. Section 2 describes and discusses the two key concepts underlying our work, q-grams and pattern superimposition. Section 3 presents the description of our algorithm, together with its complexity analysis. Section 4 contains (preliminary) experimental results. The last section concludes and points out some avenues for pursuing further research.

2 On q-grams and superimposition

A q-gram is (usually) a contiguous substring (factor) of q characters of a string, although non-contiguous q-grams have been considered [5]. In what follows, q can be considered a small constant, 2,...,6 in practice, although we may analyze the optimal value for a given problem instance. We note that q-grams have been widely used in approximate (single and multiple) string matching, where they can be used to obtain fast filtering algorithms based on exact matching of a set of q-grams. Obviously these algorithms work for the exact case as well, as a special case, but they are not interesting from our point of view. Another use (which is not relevant in our case) is to speed up exact matching of a single pattern by treating the q-grams as a superalphabet, see [13].
In our case q-grams are interesting as combined with a technique called superimposition. Consider a set of patterns 𝒫 = {P_1,...,P_r}. We form a single pattern P where each position P[i] is no longer a single character, but a set of characters, i.e.

P[i] ⊆ Σ. More precisely, P[i] = ⋃_j P_j[i]. Now P can be used as a filter: we search for candidate text substrings that might contain an occurrence of any of the patterns in 𝒫. That is, if T[i+j] ∈ P[j] for all j ∈ 1,...,m, then T[i...i+m−1] may match with some pattern in 𝒫.
For example, if 𝒫 = {abba, bbac}, the superimposed pattern will be P = {a,b}{b}{a,b}{a,c}, and there are a total of 8 different strings of length 4 that can match with P (and trigger verification). Therefore we immediately notice one of the problems with this approach, i.e. the probability that some text character t matches a pattern character p is no longer 1/σ (assuming uniform random distribution), it can be up to r/σ. This gets quickly out of hand when the number of patterns r grows.
To make the technique more useful, we first generate a new set of patterns, and then superimpose. The new patterns have the q-grams as the alphabet, which means the new alphabet has size σ^q, and the probability of a false positive candidate will be considerably lower. There are two main approaches: overlapping and non-overlapping q-grams.
Consider first the overlapping q-grams. For each P_i we generate a new pattern such that P_i′[j] = P_i[j...j+q−1], for j ∈ 1,...,m−q+1, that is, each q-gram P_i[j...j+q−1] is treated as a single "super character" in P_i′. Note also that the pattern lengths are decreased from m to m−q+1. Taking the previous example, if 𝒫 = {abba, bbac} and now q = 2, the new pattern set is 𝒫′ = {[ab][bb][ba], [bb][ba][ac]}, where we use the brackets to denote the q-grams. The corresponding superimposed pattern is then P′ = {[ab],[bb]}{[bb],[ba]}{[ba],[ac]}. To be able to search for P′, the text must be factored in exactly the same way.
The other possibility is to use non-overlapping q-grams. In this case we have P_i′[j] = P_i[(j−1)q+1...jq], for j ∈ 1,...,⌊m/q⌋, and for our running example we get P′ = {[ab],[bb]}{[ba],[ac]}. Again, the text must be factored similarly. But the problem now is that only every q-th text position is considered, and to solve this problem we must consider all q possible shifts of the original patterns. That is, given a pattern P_i, we generate a set P̂_i = {P_i[1...m], P_i[2...m],...,P_i[q−1...m]}, and then generate P̂_i′, and finally superimpose them.
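As a toy illustration of superimposition (for q = 1 and symbols already mapped to small integers; this is not the FAOSO-based filter of Section 3, and the names are ours), the character classes and the candidate test can be coded as follows:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* classes[i] is the set of symbols allowed at position i, stored as a
   bitmask.  Symbols are assumed to be integers in 0..sigma-1, sigma <= 64. */
void superimpose(const int *const *patterns, int r, int m, uint64_t *classes) {
    memset(classes, 0, m * sizeof *classes);
    for (int j = 0; j < r; j++)
        for (int i = 0; i < m; i++)
            classes[i] |= (uint64_t)1 << patterns[j][i];
}

/* A text window is a candidate iff every symbol lies in its position class;
   a candidate still has to be verified against the actual pattern set.     */
bool candidate(const int *window, int m, const uint64_t *classes) {
    for (int i = 0; i < m; i++)
        if (((classes[i] >> window[i]) & 1) == 0) return false;
    return true;
}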

The above two alternatives both have some benefits and drawbacks. For overlap- ping q-grams we have:

– pattern length is large (m−q+1), which means less verifications
– text length is practically unaffected (n−q+1)

– pattern length is short (m/q), which means potentially more verifications, but bit-parallelism works for bigger m – text is shorter too (n/q) – more patterns to superimpose (factor of q)

In the end, the benefits and drawbacks between the two approaches mostly cancel out each other, except bit-parallelism remains more applicable to non-overlapping q-grams. To illustrate the power of this technique, let us have, for example, a random text over an alphabet of size σ = 16 and patterns generated according to the same probability distribution; q-grams are not used yet (i.e., we assume q = 1). If r = 16, then the expected size of a character class in the superimposed pattern is about 10.3, which means that a match probability for a single character position is about 64%. Even if high, this value may yet be feasible for long enough patterns, but if we increase r to 64, the character class expected size grows to over 15.7 and the corresponding probability to over 98%. This implies that match verifications are likely to be invoked for most positions of the text. Using q-grams has the effect of artificially growing the 2 alphabet. In our example, if we use q = 2 and thus σ′ = 16 = 256, the corresponding probabilities for r = 16 and r = 64 become about 6% and 22%, respectively, so they are significantly lower. The main problem that remains is to decide between the two choices, properly choose a suitable q, and finally find a good algorithm to search the superimposed pattern. To this end, Salmela et al. [23] presented three algorithms combining the known mechanisms: Shift-Or, BNDM [20] and BMH, with overlapping q-grams; the former two of these algorithms are bit-parallel ones. The resulting algorithm were called SOG, BG and HG, respectively. In general larger q means better filtering, but on the other hand the size of the data structures (tables) that the algorithms use is O(σq), which can be prohibitive. BGqus [25] tries to solve the problem by combining BG with hashing. In general, not many classic algorithms can be generalized to handle superimposed patterns (character classes) efficiently, but bit-parallel methods generalize trivially. In the next section we describe our choice, FAOSO [15].

3 Our algorithm

In [15] a general technique of how to skip text characters, with any (linear time) string matching algorithm that can search for multiple patterns simultaneously was presented, alongside with several applications to known algorithms. In the following we review the idea, and for the moment assume that we already have done all factoring to q-grams, and that we have only a single pattern. R. Susik, S. Grabowski, K. Fredriksson: Multiple Pattern Matching Revisited 63

3.1 Average-optimal character skipping The method takes a parameter k, and from the original pattern generates a set of 0 k 1 K k new patterns = P ,...,P − , each of length m′ = m/k , as follows: K { } ⌊ ⌋ P j[i]= P [j + ik], j =0,...,k 1, i =0,..., m/k 1. − ⌊ ⌋− In other words, k different alignments of the original pattern P is generated, each alignment containing only every kth character. The total length of the patterns P j is k m/k m. Assume⌊ ⌋≤ now that P occurs at T [i...i+m 1]. From the definition of P j it directly follows that −

j P [h]= T [i + j + hk], j = i mod k, h =0,...,m′ 1. − This means that the set can be used as a filter for the pattern P , and that the filter needs only to scan everyK kth character of T . Fig. 1 serves as an illustration.

P a b c d e f

i p T x x a b c d e f x x x x

P 0 a d

P 1 b e

P 2 c f

P ∗ a d b e c f

Figure 1. An example. Assume that P = abcdef occurs at text position T [i...i + m 1], and that k = 3. The current text position is p = 10, and T [p]= b. The next character the algorithm− reads is T [p + k]= T [13] = e. This triggers a match of P p mod k = P 1, and the text area T [p 1 ...p 1+ m 1] = T [i...i + m 1] is verified. − − − − The occurrences of the patterns in can be searched for simultaneously using any multiple string matching algorithm.K Assuming that the selected string matching algorithm runs generally in O(n) time, then the filtering time becomes O(n/k), as only every kth symbol of T is read. The filter searches for the exact matches of k patterns, each of length m/k . Assuming that each character occurs with probability 1/σ, the probability that⌊ P j occurs⌋ (triggering a verification) in a given text position m/k is (1/σ)⌊ ⌋. A brute force verification cost is in the worst case O(m). To keep the total time at most O(n/k) on average, we select k so that nm/σm/k = O(n/k). This is satisfied for k = m/(2logσ(m)), where the verification cost becomes O(n/m) and filtering cost O(n logσ(m)/m). The total average time is then dominated by the filtering time, i.e. O(n logσ(m)/m), which is optimal [26].

3.2 Multiple matching with q-grams To apply the previous idea to multiple matching, we just assume that the (single) input pattern (for the filter) is the non-overlapping q-gram factored and superimposed 64 Proceedings of the Prague Stringology Conference 2014 pattern set. The verification phase just needs to be aware that there are possibly more than one pattern to verify. The analysis remains essentially the same: now the text length is n/q, pattern lengths are m/q, there are r patterns to verify, and the probability of a match is p instead of 1/σ, where p = O(1 (1 (1/σq))qr) = O((qr)/σq). That is, the filtering time is O(qn/(kq)) = O(n/k),− verification− cost is m/(kq) O(rqm), and its probability is O(p⌊ ⌋) for each of the n/q text positions. However, now we have two parameters to optimize, k and q, and the optimal value of one depends on the other. In practice we want to choose q first, such that the verification probability is as low as possible. This means maximizing q, but the preprocessing cost (and space) grows as O(σq), and we do not want this to exceed O(rm) (or the filtering cost for that matter). So we select q = logσ(rm), and then choose k as large as possible. Repeating the above analysis gives then

m log 1/ρ k = O σ , log (rm) · log (rm)+log 1/ρ  σ σ σ  where ρ = logσ(rm)/m. We note that this is not average-optimal anymore, although we are still able to skip text characters. To actually search the superimposed pattern, we use FAOSO [15], which is based on Shift-Or. The fact that the pattern consists of character classes is not a problem for bit-parallel algorithms, since it only affects the initial preprocessing of a single table. For details see [15]. The filter implemented with FAOSO runs in O(n/k (m/q)/w ) time in our case, where w is the number of bits in computer word (typically·⌈ 64). ⌉ We note that Salmela et al. [23] have tried a similar approach, but dismissed it early because it did not look promising for short patterns in their tests.

Implementation. In the algorithms’ point of view the q-gram, i.e. the super char- acter, must have some suitable representation, and the convenient way is to compute q q i 1 a numerical value in the range 0,...,σ 1, which is done as i=1 S[i] σ − for a q-gram S[1 ...q]. This is computed using− Horner’s method to avoid the exponen· tia- tion. We have experimented with two different variants. The firstP encodes the whole text prior to starting the actual search algorithm, which is then more streamlined. This also means that the total complexity is Ω(n), the time to encode the text. We call the resulting algorithm SMAG (short of Simple Multi AOSO on q-Grams). The other alternative is to keep the text intact, and compute the numerical representa- tion of the q-gram requested on the fly. This adds just constant overhead to the total complexity. We call this variant MAG (short of Multi AOSO on q-Grams). We have verified experimentally that MAG is generally better than SMAG.

3.3 Alphabet mapping

If the alphabet is large, then selecting a suitable q may become a problem. The reason is that some value q′ may be too small to facilitate good filtering capability, yet using q = q′ + 1 can be problematic, as the preprocessing time and space grow with σ^q (note that q must be an integer). Viewing length-q strings as super characters from another angle, we may say that our characters have q·log₂ σ bits, and we want to have more control over how many bits we use. One way to achieve this is to reduce the original alphabet size σ.

We note that in theory this method cannot achieve much, as reducing the alphabet size generally only worsens the filtering capability and therefore forces a larger q, but in practice it allows better fine tuning of the parameters. What we do is select some σ′ < σ, compute a mapping μ : Σ → {0,...,σ′−1}, and use μ(c) whenever the (filtering) algorithm needs to access some character c from the text or the pattern set. Verifications still obviously use the original alphabet.

A simple method to achieve this is to compute the histogram of the character distribution of the pattern set, assign code 0 to the most frequent character, 1 to the second most frequent, and so on, and put the σ′−1,...,σ−1 most frequent characters into the last bin, i.e. give them code σ′−1. Text characters not appearing in the patterns also get code σ′−1.

A better strategy is to try to distribute the original characters into σ′ bins so that each bin has (approximately) equal weight, i.e. each μ(c) ∈ {0,...,σ′−1} will have (approximately) equal probability of appearance. This is an NP-hard optimization problem, so we use a simple greedy heuristic.
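The simple frequency-based variant can be sketched in a few lines of C (all names below are ours; the equal-weight bin-packing heuristic is not shown):

#include <stdlib.h>
#include <string.h>

/* Map the sigma_prime-1 most frequent pattern-set characters to codes
   0..sigma_prime-2; every remaining character (including text characters
   absent from the patterns) shares the last code sigma_prime-1. */
struct charfreq { unsigned long count; int ch; };

static int by_count_desc(const void *a, const void *b)
{
    const struct charfreq *x = a, *y = b;
    if (x->count != y->count) return (x->count < y->count) ? 1 : -1;
    return x->ch - y->ch;
}

static void build_simple_mapping(const unsigned char **pat, const size_t *len,
                                 int r, int sigma_prime, unsigned char mu[256])
{
    struct charfreq f[256];
    for (int c = 0; c < 256; c++) { f[c].count = 0; f[c].ch = c; }
    for (int j = 0; j < r; j++)
        for (size_t i = 0; i < len[j]; i++)
            f[pat[j][i]].count++;
    qsort(f, 256, sizeof f[0], by_count_desc);

    memset(mu, sigma_prime - 1, 256);              /* default: the last bin */
    for (int rank = 0; rank < sigma_prime - 1 && rank < 256; rank++)
        mu[f[rank].ch] = (unsigned char)rank;      /* most frequent -> code 0, ... */
}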

Alphabet mapping on the q-grams. We note that the above method can also be applied on the q-gram alphabet. This allows precise control of the table size, and combined with hashing, it can accommodate very large q as well. That is, we want to:
1. choose some (possibly very large) q;
2. compute the q-gram frequencies on the pattern set (using e.g. hashing to avoid possibly large tables);
3. choose some suitable σ′, the size of the mapped q-gram alphabet;
4. use a method of choice (e.g. bin-packing) to reduce the number of q-grams, i.e. map the q-grams to the range 0,...,σ′−1;
5. use hashing to store the mapping, along with the corresponding bitvectors needed by FAOSO.

Combined alphabet mapping and q-gram generation. Yet another method to reduce the alphabet is to combine the q-gram computation with some bit magic. The benefit is that the mapping tables need not be preprocessed, and this allows further optimizations as we will see shortly. The drawback is that the quality of the mapping is worse than what is achieved with approaches like bin-packing.

Consider a (text sub-)string S[1...q] over an alphabet Σ of size σ. A simple way to reduce the alphabet is to consider only the ℓ low-order bits of each S[i], where ℓ < log₂ σ. We can then compute a (qℓ)-bit q-gram s simply as

s = (S[1] & b) + ((S[2] & b) << ℓ) + ((S[3] & b) << 2ℓ) + ··· + ((S[q] & b) << (q−1)ℓ),

where b = (1 << ℓ) − 1, << denotes the left shift and & the bitwise and. The main benefit of this approach is that a sequence of shifts and adds can often be replaced by a multiplication (which can be seen as an algorithm performing just that).

As an illustrative example, consider the case ℓ = 2 and hence b = 3 (which fits DNA nicely). As an implementation detail, assume that the text is 8-bit ASCII text, and it is possible to address the text, a sequence of characters, as a sequence of 32-bit integers (which is easy e.g. in C). Then to compute an 8-bit 4-gram s we can simply do

s = (((x >> 1) & 0x03030303) * 0x40100401) >> 24,

where x is the 32-bit integer containing the 4 chars S[1...4]. Assuming the 4-letter DNA alphabet, the right shift (by 1) and the (parallel) masking generate 2-bit unique (and case insensitive) codes for all 4 characters. If the alphabet is larger (some DNA sequences have rare additional symbols), those will be mapped into the same range, 0,...,3. The multiplication then shifts and adds all those codes into an 8-bit quantity, and the final shift moves the 4-gram to the low order bits. Larger q-grams can be obtained by repeating the code. We leave the implementation to future work.
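A self-contained C version of the 4-gram one-liner above (the function name and the little-endian, unaligned-load assumptions are ours):

#include <stdint.h>
#include <string.h>

/* Pack the four characters T[p..p+3] into an 8-bit 4-gram code using the
   multiplication trick (l = 2, b = 3): a, c, g, t get unique 2-bit codes in
   either case; other symbols also fall into 0..3. Assumes a little-endian CPU. */
static uint8_t dna_4gram(const unsigned char *T, size_t p)
{
    uint32_t x;
    memcpy(&x, T + p, sizeof x);                /* read 4 chars as one 32-bit word */
    x = (x >> 1) & 0x03030303u;                 /* one 2-bit code per byte */
    return (uint8_t)((x * 0x40100401u) >> 24);  /* gather the four codes into one byte */
}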

4 Experimental results

In order to evaluate the performance of our approach, we ran a few experiments, using the 200MB versions of selected datasets (dna, english and proteins) from the widely used Pizza & Chili corpus (http://pizzachili.dcc.uchile.cl/). We test the following algorithms:
– BNDM on q-grams (BG) [23],
– Shift-Or on q-grams (SOG) [23],
– BMH on q-grams (HG) [23],
– Rabin-Karp algorithm combined with binary search and two-level hashing (RK) [23],
– Multibom and Multibsom, variants of the Set Backward Oracle Matching algorithm [2],
– Succinct Backward DAWG Matching algorithm (SBDM) [14],
– Multi AOSO on q-Grams (MAG) (this work).
All codes were obtained from the original authors. Our MAG was implemented in C++ and compiled with g++ version 4.8.1 with -O3 optimization. The experiments were run on a desktop PC with an Intel i3-2100 CPU clocked at 3.1GHz with 128KB L1, 512KB L2 and 3MB L3 cache. The test machine was equipped with 4GB of 1333MHz DDR3 RAM and running a 64-bit Ubuntu OS with kernel 3.11.0-17.

In Fig. 2 we show the results of all the listed algorithms on english, with a fixed pattern length m and growing number of patterns r. The used pattern lengths (one for each plot) are {8, 16, 32, 64}. Note that some algorithms (or rather their available implementations) cannot handle longer patterns (m = 64). Our algorithm, MAG, depends on several parameters: k, q and U. The first two were explained earlier, and U serves for an unrolling technique which reduces the number of executed conditionals in the search code (for more details, see [15]). We use two settings for MAG. In one of them we chose the best configurations of k, q and U for each dataset and each value of r and m separately; this variant is presented as MAG-tuned. It dominates for longer patterns (32, 64) and its performance is mixed for m = 8 and m = 16. As expected, for all algorithms the search speed deteriorates with the number of patterns, and for r = 10,000 and relatively long patterns (m = 32) only MAG slightly exceeds 100MB/s (the worst ones here, SOG and RK, are 10 times slower). Although the "optimal" MAG settings may be found in the construction phase, assuming the patterns are randomly taken from the text, this approach is rather inelegant (and the tuning phase may be time-consuming). Therefore, we ran another test in which the parameters U and k (yet not q) are set for a particular dataset, m and r according to the following simple rules found experimentally: if the "best" value of q is greater than 5, we set U = 8 and k = 1, otherwise we set U = 4 and k = 2. The case of english and m = 8 is an exception, where U = 8 and k = 1 was set regardless of the value of q.

[Figure 2 contains four plots (english.200MB with m = 8, 16, 32 and 64): search speed in MB/s versus the number of patterns r, for the algorithms bg, hg, multibom, multibsom, mag, mag-tuned, rk, sbdm and sog.]
Figure 2. english, search speeds (MB/s) for varying number of patterns r. MAG is the same as MAG-tuned for m = 8, hence the "mag-tuned" points for this case are not presented.

These results are presented on the plots as MAG. As expected, MAG is slower than MAG-tuned, but the differences are not huge. In Fig. 3 the number of patterns r is fixed (1000), but m grows. MAG usually wins on english and proteins (except for the shortest patterns), yet is dominated by a few algorithms on dna. Overall, in the experiments the toughest competitor to MAG was SBDM, but in some cases the winner was SOG.

We also show how MAG performance changes with growing q (Fig. 4). As expected, a larger q makes sense for large r, but a too large value slows down the search, presumably due to too many cache misses. The MAG variant used is MAG-tuned, and the alphabet is quantized for all datasets. The new alphabet size, σ′, was found (separately for each case) from the set {4, 5, 13, 14, 22}. Note that due to the quantization the original alphabet size does not (significantly) affect the choice of q.

5 Conclusions and future work

Multiple string matching is one of the most exploited problems in stringology. It is hard to find really novel ideas for this problem, and our work can also be seen as a new and quite successful combination of known building bricks. The presented algorithm, MAG, usually beats its competitors on the three test datasets (english, proteins, dna). One of the key successful ideas was alphabet quantization (binning), which is performed in a greedy manner, after sorting the original alphabet by frequency. In the future, we are going to try other quantization techniques, also for the quantization of the alphabet built on q-grams.

[Figure 3 contains three plots (dna.200MB, english.200MB and proteins.200MB, each with r = 1000): search speed in MB/s versus the pattern length m, for the algorithms bg, hg, multibom, multibsom, mag-tuned, mag, rk, sbdm and sog.]
Figure 3. Search speeds for the number of patterns r = 1000 and varying pattern length m.

[Figure 4 contains three plots (dna.200MB, english.200MB and proteins.200MB, each with m = 32): search speed in MB/s versus q, with one curve per number of patterns r ∈ {10, 100, 1000, 10000}.]
Figure 4. Search speeds for the pattern length m = 32 and varying q.

This could give further improvement in the algorithm performance and savings in memory consumption. Apart from the mentioned issue, there are a number of interesting questions that we can pose here. We analytically showed that the presented approach is sublinear on average, yet not average-optimal. Therefore, is it possible to choose the algorithm's parameters in order to reach average optimality (for m = O(w))?

Real computers nowadays have a hierarchy of caches in their CPU-related architecture, and it could be interesting to apply the I/O model (or cache-oblivious model) to the multiple pattern matching problem. The cache efficiency issue may be crucial for very large pattern sets.

The underexplored power of SIMD instructions also seems to offer great opportunities, especially for bit-parallel algorithms.

It was reported that dense codes (e.g., ETDC) for words or q-grams not only serve for compressing data (texts), but also enable faster pattern searches. Multiple pattern searching over such compressed data seems unexplored yet, and it would be interesting to apply our algorithm to this scenario (our preliminary results are rather promising).

References

1. A. V. Aho and M. J. Corasick: Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6) 1975, pp. 333–340.
2. C. Allauzen and M. Raffinot: Factor oracle of a set of words, Technical Report 99-11, Institut Gaspard-Monge, Université de Marne-la-Vallée, 1999.
3. R. A. Baeza-Yates and G. H. Gonnet: A new approach to text searching. Communications of the ACM, 35(10) 1992, pp. 74–82.
4. R. S. Boyer and J. S. Moore: A fast string searching algorithm. Commun. ACM, 20(10) 1977, pp. 762–772.
5. S. Burkhardt and J. Kärkkäinen: Better filtering with gapped q-grams. Fundam. Inform., 56(1-2) 2003, pp. 51–70.
6. D. Cantone, S. Faro, and E. Giaquinta: A compact representation of nondeterministic (suffix) automata for the bit-parallel approach. Inf. Comput., 213 2012, pp. 3–12.
7. B. Commentz-Walter: A string matching algorithm fast on the average, in Proceedings of the 6th International Colloquium on Automata, Languages and Programming, H. A. Maurer, ed., no. 71 in Lecture Notes in Computer Science, Graz, Austria, 1979, Springer, pp. 118–132.
8. M. Crochemore, A. Czumaj, L. Gasieniec, T. Lecroq, W. Plandowski, and W. Rytter: Fast practical multi-pattern matching. Inf. Process. Lett., 71(3-4) 1999, pp. 107–113.
9. M. Crochemore, C. Hancart, and T. Lecroq: Algorithms on Strings, Cambridge University Press, New York, USA, 2007.
10. M. Crochemore and W. Rytter: Text algorithms, Oxford University Press, 1994.
11. S. Faro and M. O. Külekci: Fast multiple string matching using streaming SIMD extensions technology, in SPIRE, L. Calderón-Benavides, C. N. González-Caro, E. Chávez, and N. Ziviani, eds., vol. 7608 of Lecture Notes in Computer Science, Springer, 2012, pp. 217–228.
12. M. Fisk and G. Varghese: Fast content-based packet handling for intrusion detection, tech. rep., DTIC Document, 2001.
13. K. Fredriksson: Shift–or string matching with super-alphabets. Information Processing Letters, 87(1) 2003, pp. 201–204.
14. K. Fredriksson: Succinct backward-DAWG-matching. J. Exp. Algorithmics, 13 2009, pp. 1.8–1.26.
15. K. Fredriksson and S. Grabowski: Average-optimal string matching. J. Discrete Algorithms, 7(4) 2009, pp. 579–594.
16. D. Gusfield: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
17. R. N. Horspool: Practical fast searching in strings. Softw., Pract. Exper., 10(6) 1980, pp. 501–506.

18. D. E. Knuth, J. H. Morris, and V. R. Pratt: Fast pattern matching in strings. SIAM Journal on Computing, 6(1) 1977, pp. 323–350.
19. G. Navarro and K. Fredriksson: Average complexity of exact and approximate multiple string matching. Theoretical Computer Science A, 321(2–3) 2004, pp. 283–290.
20. G. Navarro and M. Raffinot: Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA), 5 2000, article 4, 36 pages. http://www.jea.acm.org/2000/NavarroString.
21. G. Navarro and M. Raffinot: Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences, Cambridge University Press, 2002, ISBN 0-521-81307-7. 280 pages.
22. H. Peltola and J. Tarhio: Alternative algorithms for bit-parallel string matching, in Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003), LNCS 2857, Springer–Verlag, 2003, pp. 80–94.
23. L. Salmela, J. Tarhio, and J. Kytöjoki: Multipattern string matching with q-grams. ACM Journal of Experimental Algorithmics, 11 2006.
24. S. Wu and U. Manber: A fast algorithm for multi-pattern searching, Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.
25. P. Yang, L. Liu, H. Fan, and Q. Huang: Fast multi-pattern string matching algorithms based on q-grams bit-parallelism filter and hash, in Proceedings of the 2012 International Conference on Information Technology and Software Engineering, vol. 211 of LNEE, Springer, 2013, pp. 487–495.
26. A. C. Yao: The complexity of pattern matching for a random string. SIAM Journal on Computing, 8(3) 1979, pp. 368–387.

Improved Two-Way Bit-parallel Search⋆

Branislav Ďurian1, Tamanna Chhabra2, Sukhpal Singh Ghuman2, Tommi Hirvola2, Hannu Peltola2, and Jorma Tarhio2

1 S&T Slovakia s.r.o., Priemyselná 2, SK-010 01 Žilina, Slovakia
[email protected]
2 Department of Computer Science and Engineering, Aalto University
P.O.B. 15400, FI-00076 Aalto, Finland
{Tamanna.Chhabra, Suhkpal.Ghuman, Tommi.Hirvola}@aalto.fi, [email protected], [email protected]

⋆ Supported by the Academy of Finland (grant 134287).

Abstract. New bit-parallel algorithms for exact and approximate string matching are introduced. TSO is a two-way Shift-Or algorithm, TSA is a two-way Shift-And algo- rithm, and TSAdd is a two-way Shift-Add algorithm. Tuned Shift-Add is a minimalist improvement to the original Shift-Add algorithm. TSO and TSA are for exact string matching, while TSAdd and tuned Shift-Add are for approximate string matching with k mismatches. TSO and TSA are shown to be linear in the worst case and sublinear in the average case. Practical experiments show that the new algorithms are competitive with earlier algorithms.

1 Introduction

String matching can be classified broadly as exact string matching and approximate string matching. In this paper, we consider both types. Let T = t_1 t_2 ··· t_n and P = p_1 p_2 ··· p_m be the text and the pattern, respectively, over a finite alphabet Σ of size σ. The task of exact string matching is to find all occurrences of the pattern P in the text T, i.e. all positions i such that t_i t_{i+1} ··· t_{i+m−1} = p_1 p_2 ··· p_m. Approximate string matching [14] has several variations. In this paper, we consider only the k mismatches variation, where the task is to find all the occurrences of P with at most k mismatches, where 0 ≤ k < m. In the pseudocodes we use the symbols '&', '|', '<<', and '>>'. These represent the bitwise operations and, or, left shift, and right shift, respectively. Parentheses and extra space have been used to clarify the correct evaluation order in the pseudocodes. Let w be the register width (or word size, informally speaking) of a processor, typically 32 or 64.

2 Previous algorithms

This section describes the previous solutions for exact and approximate string matching. First, we illustrate previous algorithms for exact matching, which include Shift-Or and its variants like BNDM (Backward Nondeterministic DAWG Matching),

TNDM (Two-way Nondeterministic DAWG Matching), LNDM (Linear Nondeterministic DAWG Matching), FSO (Fast Shift-Or) and FAOSO (Fast Average Optimal Shift-Or). Then the algorithms for approximate string matching are presented, which cover Shift-Add and AOSA (Average Optimal Shift-Add).

2.1 Shift-Or and its variations

The Shift-Or algorithm [3] was the first string matching algorithm applying bit-parallelism. Processing of the algorithm can be interpreted as simulation of an automaton. The update operations to all states are identical. Operands in the algorithm are bit-vectors, and the essential bit-vector containing the state of the automaton is called the state vector. The state vector is updated with the bit-shift and or operations. FSO (Fast Shift-Or) [7] is a fast variation of the Shift-Or algorithm, and FAOSO (Fast Average Optimal Shift-Or) [7] is a sublinear variation of that algorithm.

BNDM [15] (Backward Nondeterministic DAWG Matching) is the bit-parallel simulation of an earlier algorithm called BDM (Backward DAWG Matching). BDM scans the alignment window from right to left and skips characters using a suffix automaton, which is made deterministic during preprocessing. BNDM, instead, simulates the nondeterministic automaton using bit-parallelism. BNDM applies the Shift-And method [19], which utilizes the bit-shift and and operations.

TNDM (Two-way Nondeterministic DAWG Matching) [17] is a variation of BNDM applying two-way scanning. Our new algorithms are related to the Wide-Window algorithm [11] and its bit-parallel variations [4,11,10]. The LNDM (Linear Nondeterministic DAWG Matching) algorithm [10] is a two-way Shift-And algorithm with sequential symmetric scanning. The pseudocode of LNDM is given as Alg. 1. The precomputed occurrence vector table B associates each character of the alphabet with a bit mask expressing the occurrences of that character in the pattern P. We use table B for this purpose in all algorithms presented in this paper. In LNDM, the alignment window is shifted with fixed steps of m. Starting from the mth character of the window, the text characters are examined moving leftwards. The bitvector L becomes zero when a mismatch is detected or (m shifts have been made while) m characters have

Algorithm 1 LNDM(P = p_1 p_2 ··· p_m, T = t_1 t_2 ··· t_n)
Require: m ≤ w
/* Preprocessing */
1: for all c ∈ Σ do B[c] ← 0
2: for i ← 1 to m do
3:   B[p_i] ← B[p_i] | (1 << (i−1))                 /* 0^(m−i) 1 0^(i−1) */
/* Searching */
4: for i ← m step m while i ≤ n do
5:   l ← 0; r ← 0; L ← (∼0) >> (w−m); R ← 0         /* L ← 1^m; R ← 0^m */
6:   while L ≠ 0 do
7:     L ← L & B[t_{i−l}]
8:     l ← l + 1
9:     (LR) ← (LR) >> 1
10:  R ← R >> (m − l)
11:  while R ≠ 0 do
12:    r ← r + 1
13:    if R & (1 << (m−1)) ≠ 0 then report occurrence
14:    R ← (R << 1) & B[t_{i+r}]

been examined. The notation (LR) means the bitvector concatenated from the two m bits long bitvectors L and R¹. Next, examination continues rightwards from the (m+1)th character of the window. Simultaneously it is easy to notice the matches. In our two-way algorithms, these two scans are combined (into one scan). The characteristic feature of two-way algorithms is that the first characters bring plenty of information to the state vector, but the last ones quite little.
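For reference, a minimal C sketch of the classical one-way Shift-Or search described above (0-indexed; the function name and reporting via printf are our own choices):

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Plain Shift-Or [3]: one state vector D, updated with a shift and an or per
   text character; bit m-1 of D is 0 whenever an occurrence ends at the current
   position. Assumes m <= 64. */
static void shift_or(const unsigned char *P, int m, const unsigned char *T, size_t n)
{
    uint64_t B[256];
    for (int c = 0; c < 256; c++) B[c] = ~0ULL;
    for (int i = 0; i < m; i++) B[P[i]] &= ~(1ULL << i);   /* 0 where P has the char */

    uint64_t D = ~0ULL;                                    /* state vector */
    for (size_t i = 0; i < n; i++) {
        D = (D << 1) | B[T[i]];
        if ((D & (1ULL << (m - 1))) == 0)
            printf("occurrence ending at %zu\n", i);
    }
}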

2.2 Algorithms for the k-mismatches problem

Shift-Add [3, Fig. 8] is a bit-parallel algorithm for the k-mismatches problem. A vector of m states is used to represent the state of the search. A field of L bits is used for presenting each of the m states. The minimum value of L is ⌈log₂(k+1)⌉ + 1. In the original Shift-Add the state i denotes the state of the search between the positions 1,...,i of the pattern and the positions j − i + 1,...,j of the text, where j is the current position in the text. A slightly more efficient variation of Shift-Add (in the average case only) is AOSA (Average Optimal Shift-Add) [7].

Galil and Giancarlo [9] presented a method for solving the k mismatches string matching problem in O(nk) time with constant time longest common extension (LCE) queries between P and T. Abrahamson [1] improved this for the case √(m log m) < k by giving an O(n√(m log m)) time algorithm based on convolutions. The asymptotically fastest algorithm known to date is given by Amir et al. [2], which achieves the worst-case time complexity of O(n√(k log k)). These algorithms are interesting in a theoretical sense, but in practice they perform worse than the trivial algorithm for reasonable values of m and k due to the heavy LCE and convolution operations. Hence we have the need for developing fast practical algorithms for string matching with k mismatches.

3 TSO and TSA

3.1 TSO

At first we introduce a new Two-way Shift-Or algorithm, TSO for short. The pseudocode of TSO is given as Alg. 2. TSO uses the same occurrence vectors B for characters as the original Shift-Or. The outer loop traverses the text with a fixed step of m characters. At each step i, an alignment window t_{i−m+1},...,t_{i+m−1} is inspected. The positions t_i,...,t_{i+m−1} correspond to the end positions of possible matches and, at the same time, to the positions of the state vector D. Inspection starts at the character t_i, and it proceeds with a pair of characters t_{i−j} and t_{i+j} until the corresponding bits in D become 1^m or j = m holds. Note that the two consecutive loops of LNDM are combined in TSO into one loop (lines 8–10 of Alg. 2).

When the actually used bits in the bit-vectors are seated in the highest-order bits, as in TSO, the testing of the state vector D is slightly faster than elsewhere. Moreover, one bit in D stays zero for each occurrence of the pattern in the inner loop on lines 8–10. The zero bits are switched to set bits on line 12. The count of set

¹ So in the right shift the lowest bit of L becomes the highest bit of R. Note that generally this is different from how e.g. the gcc compiler handles variables of uint64_t type on the x86 architecture in 32-bit mode. See also [12, p. 35].

Algorithm 2 TSO = Two-way Shift-Or(P = p_1 p_2 ··· p_m, T = t_1 t_2 ··· t_n)
Require: m ≤ w
/* Preprocessing */
1: mask ← (∼0) << (w − m)                             /* = 1^m 0^(w−m) */
2: for all c ∈ Σ do B[c] ← mask
3: for i ← 1 to m do                                  /* Lowest bits remain 0 */
4:   B[p_i] ← B[p_i] & ∼(1 << (w − m + i − 1))        /* & 1^(m−i) 0 1^(w−m+i−2) */
/* Searching */
5: matches ← 0
6: for i ← m step m while i ≤ n do
7:   D ← B[t_i]; j ← 1
8:   while D < mask and j < m do
9:     D ← D | (B[t_{i−j}] << j) | (B[t_{i+j}] >> j)  /* no need for additional masking */
10:    j ← j + 1
11:  if D < mask then                                 /* Garbage is in the lowest bits */
12:    E ← (∼D) & mask
13:    matches ← matches + popcount(E)

bits is then calculated with the popcount² function [12] on line 13. An easy realization of popcount is the following:

while E > 0 do matches ← matches + 1; E ← (E − 1) & E

This requires O(s) time in total, where s is the number of occurrences. If the locations of the occurrences need to be printed out, O(m) time is needed for every alignment window holding at least one match.

Alg. 2 works correctly when n mod m = m − 1 holds. If access to t_{n+1},... is allowed and some character, e.g. 255, does not appear in P, assignment of the stopper t_{n+1} ← 255 makes the algorithm work also for other values of n. Another easy way of handling the end of the text is to use the Shift-Or algorithm, because the same occurrence vectors are at our disposal. In Figure 1, there is an example of the execution of TSO for P = abcab and T = ···xabcabcabx···.

3.2 TSA

Shift-And is a dual method of Shift-Or. Therefore it is fairly straightforward to modify TSO to a Two-way Shift-And algorithm, TSA for short. The pseudocode of TSA is given as Alg. 3. In TSA, B[t_{i−j}] and B[t_{i+j}] are brought to the state vector on line 8. For example, let B[t_{i−2}] and B[t_{i+2}] be 1010 and 1011, respectively. (In this example and in the subsequent examples all numbers are binary numbers.) Then the corresponding padded bit strings are ((1010+1) << 2) − 1 = 101011 and (1011 >> 2) | (1111 << 2) = 111110.

Original Shift-Or/Shift-And examines every text character once. Therefore its practical performance is extremely insensitive to the input data. Two-way algorithms check the text in alignment windows of m consecutive text positions. A mismatch can be detected immediately based on the first examined text character. In the best case the performance can be Θ(n/m). On the other hand, if a match is in any position in

² Population count, popcount, counts the number of 1-bits in a register or word. On many computers it is a machine instruction; e.g. in Sparc, and in x86-64 processors in AMD's SSE4a extension and in Intel's SSE4.2 instruction set extension.

P = abcab
B[a] = 10110   B[b] = 01101   B[c] = 11011   B[x] = 11111

T = ···xabcabcabx···

              a: 10110      D = 10110   j = 1
c: 11011      b: 01101      D = 10110   j = 2
b: 01101      c: 11011      D = 10110   j = 3
a: 10110      a: 10110      D = 10110   j = 4
x: 11111      b: 01101      D = 10110
                            E = 01001   (two set bits, i.e. 2 matches)

Figure 1. Example of work made in the inner loop of TSO.

Algorithm 3 TSA = Two-way Shift-And(P = p_1 p_2 ··· p_m, T = t_1 t_2 ··· t_n)
Require: m ≤ w
/* Preprocessing */
1: for all c ∈ Σ do B[c] ← 0
2: for i ← 1 to m do
3:   B[p_i] ← B[p_i] | (1 << (m − i))                          /* 0^(i−1) 1 0^(m−i) */
/* Searching */
4: matches ← 0
5: for i ← m step m while i ≤ n do
6:   D ← B[t_i]; j ← 1
7:   while (D > 0) and (j < m) do
8:     D ← D & ((B[t_{i−j}] << j) | ((1 << j) − 1))
             & ((B[t_{i+j}] >> j) | (((∼0) >> (w − m)) << (m − j)))   /* (1^m << (m − j)) */
9:     j ← j + 1
10:  if D > 0 then                                             /* alternatively D ≠ 0 */
11:    matches ← matches + popcount(D)

the window, or if the mismatch is detected based on the two last examined characters, then 2m − 1 characters need to be examined. So in the worst case all text characters except the last characters in each alignment window are examined twice.
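Before turning to the practical optimizations, here is a minimal C sketch of the plain TSO search loop of Alg. 2 (0-indexed, w = 64; the function name, the popcount builtin and the padding assumption are ours):

#include <stdint.h>
#include <stddef.h>

/* TSO (Alg. 2) as a counting search. Assumes m <= 64 and that T is padded on
   the right with at least m-1 sentinel bytes that do not occur in P
   (cf. the stopper discussion after Alg. 2). */
static unsigned long tso_count(const unsigned char *P, int m,
                               const unsigned char *T, size_t n)
{
    uint64_t B[256], mask = (~0ULL) << (64 - m);         /* 1^m 0^(64-m) */
    for (int c = 0; c < 256; c++) B[c] = mask;
    for (int i = 0; i < m; i++)
        B[P[i]] &= ~(1ULL << (64 - m + i));              /* clear the bit of position i */

    unsigned long matches = 0;
    for (size_t i = (size_t)m - 1; i < n; i += (size_t)m) {   /* window centre t_i */
        uint64_t D = B[T[i]];
        int j = 1;
        while (D < mask && j < m) {                      /* read t_{i-j} and t_{i+j} */
            D |= (B[T[i - (size_t)j]] << j) | (B[T[i + (size_t)j]] >> j);
            j++;
        }
        if (D < mask)                                    /* zero bits mark occurrences */
            matches += (unsigned long)__builtin_popcountll((~D) & mask);
    }
    return matches;
}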

3.3 Practical optimizations

In modern processors, loop unrolling often improves the speed of bit-parallel searching algorithms [5]. In the case of TSO and TSA, it means that 3, 5, 7, or 9 characters are read in the beginning of the inner loop instead of a single character. We denote these versions by TSOx and TSAx, where x is the number of characters read in the beginning; x is odd. Line 7 of TSO3 is the following:

7: D ← (B[t_{i−1}] << 1) | B[t_i] | (B[t_{i+1}] >> 1); j ← 2

Moreover, the shifted values B[a] << 1 and B[a] >> 1 can be stored in precomputed arrays in order to speed up access. Many string searching algorithms apply a so-called skip loop, which is used for fast scanning before entering the matching phase. The skip loop can be called greedy if it handles two alignment windows at the same time [18]. Let us denote

(B[t_{i−1}] << 1) | B[t_i] | (B[t_{i+1}] >> 1)

in TSO3 by f(3, i). If the programming language has the short-circuit and operator, then we can use the following greedy skip loop enabling steps of 2m in TSO3:

while f(3, i) = mask && f(3, i + m) = mask do i ← i + 2·m

Because && is the short-circuit and, the second condition is evaluated only if the first condition holds. The resulting version of TSO3 is denoted by GTSO3. (The initial G comes from greedy. GTSA3 is formed in a corresponding way.)

3.4 Analysis

We will show that TSO is linear in the worst case and sublinear in the average case. For simplicity we assume in the analysis that m ≤ w holds and w is divisible by m. The outer loop of TSO is executed n/m times. In each round, the inner loop is executed at most m − 1 times. The most trivial implementation of popcount requires O(m) time. So the total time in the worst case is O(nm/m) = O(n).

When analyzing the average case complexity of TSO, we assume that the characters in P and T are statistically independent of each other and the distribution of characters is discrete uniform. We consider the time complexity as the number of read characters. In each window, TSO reads 1 + 2k characters, 0 ≤ k ≤ m − 1, where k depends on the window. Let us consider algorithms TSOr, r = 1, 2, 3,..., such that TSOr reads an r-gram in the window before entering the inner loop. For odd r, TSOr was described in the previous section. For even r, TSOr is modified from TSO(r−1) by reading t_{i−r/2} before entering the inner loop. It is clear that TSOr2 reads at least as many characters as TSOr1 if r2 > r1 holds. Let us consider TSOr as a filtering algorithm. The reading of an r-gram and computing D for it belong to filtration, and the rest of the computation is considered as verification. The verification probability is (m − r + 1)/σ^r. The verification cost is in the worst case O(m), but only O(1) on average. The total number of read characters in filtration is rn/m. When we select r to be log_σ m, TSOr is sublinear. Because TSOr never reads fewer characters than TSO1 = TSO, we conclude that also TSO is sublinear. In other words, the time complexity of TSO is optimal O(n log_σ m / m) with a proper choice of r for m = O(w), and O(n log_σ m / w) for larger m.

The time complexity of the preprocessing of TSO is O(m + σ). Because of the similarity of TSO and TSA, TSA has the same time complexities as TSO. The space requirement of both algorithms is O(σ).

4 Variations of Shift-Add

4.1 Two-way Shift-Add

The basic idea in the Shift-Add algorithm is to simultaneously evaluate the number of mismatches in each field using L bits. The highest bit in each field is an overflow bit, which is used to prevent the error count rolling over to the next field. The original Shift-Add algorithm actually used two state vectors, State and Overflow, which were shifted L bits forward. In contrast, the two-way approach in exact matching is successful due to a simple (one statement) analogy to the one-way algorithm (Shift-Or, Shift-And). Such an improved (one statement) Shift-Add is introduced in the next section.

The core problem is the addition; there can be up to m mismatches. When k errors have been reached in some position, we should stop adding to it. In the occurrence vector array B[ ], only the lowest bit in each field may be set. The key trick is to use the overflow bits in the state vector D. We take the logical and operation between the applied occurrence vector and the by L − 1 bits right-shifted complemented state vector D. Then the complemented overflow bits and the possibly set bits in the occurrence vector are aligned, and the addition happens only when there is no overflow. This idea is applied in the Two-way Shift-Addq. The limitation of Two-way Shift-Add on error level k = 0 is that each field needs 2 bits.
When the bit-vectors are aligned to the lowest order bits, the unessential bits in the right-shifted occurrence vector fall off immediately, and in the right-shifted ones they do not disturb because the bit-vectors are unsigned. On line 12 the shown form is required with character classes [16, p. 78]; otherwise also subtraction works. The form of line 14 depends on q as before. Notice that larger overflows can happen, but as long as k ≤ q it does not matter; otherwise we need a larger value for L. Then the minimum value of L is ⌈log₂(q+1)⌉ + 1.

Algorithm 4 Two-way Shift-Addq(P = p_1 p_2 ··· p_m, T = t_1 t_2 ··· t_n, k)
Require: m·L ≤ w and L ≥ max{2, ⌈log₂(max{k, q} + 1)⌉ + 1} and m > (q + 1) div 2
/* Preprocessing */
1: mask ← 0
2: for i ← 1 to m do
3:   mask ← (mask << L) | ((1 << (L−1)) − k)
4: for all c ∈ Σ do BW[c] ← mask
5: mask ← 0
6: for i ← 1 to m do
7:   mask ← (mask << L) | 1
8: for all c ∈ Σ do B[c] ← mask                       /* mask = (0^(L−1) 1)^m */
9: mask ← mask << (L − 1)                             /* mask = (1 0^(L−1))^m */
10: for i ← 1 to m do
11:   BW[p_i] ← BW[p_i] − (1 << L·(i−1))
12:   B[p_i] ← B[p_i] & ∼(1 << L·(i−1))               /* − (1 << L·(i−1)) also works normally */
/* Searching */
13: for i ← m step m while i ≤ n do
14:   D ← BW[t_i] + (B[t_{i−1}] << L) + (B[t_{i+1}] >> L)   /* this one is for q = 3 */
15:   j ← (q + 1) div 2                               /* integer division – values of q are odd */
16:   while j < m do
17:     D ← D + ((∼(D >> (L−1))) & (B[t_{i−j}] << (L·j)))
              + ((∼(D >> (L−1))) & (B[t_{i+j}] >> (L·j)))
18:     j ← j + 1
19:   E ← (∼D) & mask
20:   while E ≠ 0 do
21:     report an occurrence                          /* shifting of E is not needed */
22:     E ← E & (E − 1)                               /* turning off rightmost 1-bit */

Figure 2 shows an example of how Two-way Shift-Add finds a match. Irrelevant bits are not shown; they are all zeros. In each field (of L bits) of D the highest bit is an overflow bit, which indicates that there is no match at the corresponding text position. Vertical lines limit the computing area having the interesting bit fields.

T = abadacadc···   P = bacac   k = 1   L = 3   (one bit unnecessarily large)

B[a]  = 001 000 001 000 001    Shown order of bit fields corresponds to P backwards
B[b]  = 001 001 001 001 000    Occurrences = 000
B[c]  = 000 001 000 001 001
B[d]  = 001 001 001 001 001    As all other characters that do not appear in P
BW[a] = 011 010 011 010 011    Again P backwards
BW[b] = 011 011 011 011 010    011 minus number of errors still allowed
BW[c] = 010 011 010 011 011
BW[d] = 011 011 011 011 011

BW[t5] = BW[a]       = 011 010 011 010 011   Starting to check next m positions
+B[t4] = B[d] << 3   = 001 001 001 001 001
+B[t6] = B[c] >> 3   = 000 001 000 001 001   Starting with q = 3 characters
= D                  = 100 011 101 011 100   Note that overflow depends on q
+B[t3] = B[a] << 6   = 000 001 000 001       Only lowest bits in fields may be set
  & ∼D >> (L−1)      = 0 1 0 1 0             So only the overflow bit is relevant on each field
+B[t7] = B[a] >> 6   = 001 000 001 000
  & ∼D >> (L−1)      = 0 1 0 1 0
= D                  = 100 011 101 011 100   Second and fourth position look promising
+B[t2] = B[b] << 9   = 001 001 000
  & ∼D >> (L−1)      = 0 1 0 1 0
+B[t8] = B[d] >> 9   = 001 001 001
  & ∼D >> (L−1)      = 0 1 0 1 0
= D                  = 100 011 101 100 100   Overflow also in fourth position
+B[t1] = B[b] << 12  = 001 000
  & ∼D >> (L−1)      = 0 1 0 0 0
+B[t9] = B[c] >> 12  = 000 001               Last characters give only little information
  & ∼D >> (L−1)      = 0 1 0 0 0
= D                  = 100 011 101 100 100
E                    = 0 1 0 0 0             Match in second position

Figure 2. Example of checking m positions in Two-way Shift-Add.

4.2 Analysis

The worst case analysis is similar to the analysis of TSO/TSA given in subsection 3.4. For simplicity we assume in the analysis that m ≤ w holds and w is divisible by m. The outer loop of TSAddq is executed n/m times, and in each iteration O(m) text characters are read and O(m) occurrences are reported. Thus, the total time complexity is O(n/m) · O(m + m) = O(n) for the worst case.

On the average case TSAddq is sublinear. This can be seen from the test results, where the search time decreases when m gets larger.

4.3 Tuned Shift-Add

Algorithm 5 is Tuned Shift-Add. It is a minimalist version of the Shift-Add algorithm. If the bitvectors fit into a computer register, the worst- and average-case complexity of the original Shift-Add algorithm is O(n) [3, p. 75]; also Tuned Shift-Add is linear. The original Shift-Add algorithm uses an overflow vector in addition to the state vector (here D). The essential difference between the original Shift-Add algorithm and Tuned Shift-Add is the state update. Using the same variable naming as in Tuned Shift-Add, line 11 of Tuned Shift-Add was in the original Shift-Add as follows. (Overflow bits are in the ovmask; only the highest bit in each bit field is set.)

D ← ((D << L) + BW[t_i]) & mask2
overflow ← ((overflow << L) | (D & ovmask)) & mask2
D ← D & ∼ovmask      /* clears overflow bits */

Algorithm 5 Tuned Shift-Add(P = p_1 p_2 ··· p_m, T = t_1 t_2 ··· t_n, k)
Require: m·L ≤ w and L ≥ max{2, ⌈log₂(k+1)⌉ + 1}
/* Preprocessing */
1: mask ← 0
2: for i ← 1 to m do
3:   mask ← (mask << L) | 1
4: for all c ∈ Σ do B[c] ← mask
5: for i ← 1 to m do
6:   B[p_i] ← B[p_i] & ∼(1 << L·(i−1))                /* − (1 << L·(i−1)) also works normally */
7: mask ← 1 << (L·m − 1)
8: Xmask ← (1 << (L−1)) − (k + 1)
/* Searching */
9: D ← ∼0                                             /* = 1^w */
10: for i ← 1 to n do
11:   D ← ((D << L) | Xmask) + (B[t_i] & (∼(D << 1)))
12:   if (D & mask) = 0 then
13:     report an occurrence ending at i
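A compact C sketch of Tuned Shift-Add (Alg. 5), 0-indexed and for 64-bit words; the function name, the way L is computed and the reporting are our own choices:

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Tuned Shift-Add for the k-mismatches problem. Each pattern position gets an
   L-bit counter field; B has 1 in a field's low bit on a mismatch. Requires
   m*L <= 64. */
static void tuned_shift_add(const unsigned char *P, int m,
                            const unsigned char *T, size_t n, int k)
{
    int L = 2;
    while ((1 << (L - 1)) < k + 1) L++;       /* L >= max(2, ceil(log2(k+1)) + 1) */
    if (m * L > 64) { fprintf(stderr, "m*L exceeds the word size\n"); return; }

    uint64_t B[256], low = 0;
    for (int i = 0; i < m; i++) low = (low << L) | 1;   /* (0^(L-1) 1)^m */
    for (int c = 0; c < 256; c++) B[c] = low;           /* 1 per field = mismatch */
    for (int i = 0; i < m; i++)
        B[P[i]] &= ~(1ULL << (L * i));                  /* 0 = match at position i */

    uint64_t mask  = 1ULL << (L * m - 1);               /* overflow bit of the last field */
    uint64_t Xmask = (1ULL << (L - 1)) - (uint64_t)(k + 1);
    uint64_t D = ~0ULL;

    for (size_t i = 0; i < n; i++) {
        D = ((D << L) | Xmask) + (B[T[i]] & ~(D << 1));
        if ((D & mask) == 0)
            printf("occurrence with <= %d mismatches ending at %zu\n", k, i);
    }
}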

5 Experiments

The tests were run on an Intel Core i7-860 2.8GHz, 4 cores, with 16GiB memory; the L2 cache is 256KiB per core and the L3 cache 8MiB. The computer runs Ubuntu 12.04 LTS and has the gcc 4.6.3 C compiler. Programs were written in the C programming language and compiled with the gcc compiler using the -O3 optimization level. All the algorithms were implemented and tested in the testing framework³ of Hume and Sunday [13].

New algorithms were compared with the following earlier algorithms: Shift-Or⁴ (SO) [3], FSO [7], FAOSO [7], BNDM [15], and LNDM [10]. The given run times of FAOSO are based on the best possible parameter combination for each text and pattern length. We have only a 32-bit version of FAOSO, but all other tested algorithms were using 64-bit bit-vectors. For patterns longer than roughly 20 characters there are algorithms [6] which are faster than the ones used here. The results for such pattern lengths are shown to demonstrate the behavior of the new algorithms. We did not test the variations [4] of the Wide-Window algorithm [11], because according to the original experiments [4], these algorithms are only slightly better than BNDM. In addition, they require m ≤ w/2.

In the test runs we used three texts: binary, DNA, and English, the size of each being 2 MB. The English text is the prefix of the KJV Bible. The binary text is a random text over an alphabet of two characters. The DNA text is from the genome of the fruit fly (Drosophila melanogaster). Sets of patterns of various lengths were randomly taken from each text. Each set contains 200 patterns.

³ The Hume and Sunday test framework allows direct and precise measurement of preprocessing times. The test pattern can be selected as considered appropriate. This kind of testing method, where each algorithm is coded separately, ensures that the tested algorithms cannot affect each other through the placement of data structures in memory and the data cache. We have tested the search speed of e.g. Sunday's algorithm and various Boyer–Moore variations with implementations made by others. Thus we believe that the implementations enclosed in the Hume and Sunday test framework are very efficient. This kind of comparison also makes it possible to learn the coding of efficient implementations. We encourage everybody to make comparisons with different implementations of the same and similar algorithms.

Data: Binary. Pattern length m:

Algorithm    2     4     8    12    16    20    30    40    50    60
SO         465   465   465   465   465   465   466   469   466   465
FSO       1406   707   268   241   234   234   235   236   235     —
FAOSO     3522  1728   859   745   695   469   372   239   263   239
BNDM      1892  1579  1059   723   554   452   316   246   201   171
LNDM      2814  2291  1573  1166   925   767   544   421   346   294
TSA       1999  1501   927   641   491   399   276   215   177   152
TSO       1565  1129   673   455   344   279   188   142   114  96.4
TSO3      1429  1158   718   502   385   316   219   172   142   122
TSO5      1704   911   632   462   359   297   207   161   135   116
TSO9      1881   771   473   342   272   229   172   141   121   109
GTSO3     1409  1165   719   499   381   313   217   169   139   121
GTSA3     1529  1281   819   571   441   362   252   195   163   141

Table 1. Search time of algorithms (in milliseconds) for binary data

Tables 1–3 show the search times in milliseconds for these data sets. Before measuring the CPU time usage, the text and the pattern set were loaded into the main memory, so the execution times do not contain I/O time. The results were obtained as an average of 100 runs. During repeated tests, the variation in timings was about 1 percent. The best execution times have been put in boxes. Overall, TSO9, TSO5 and GTSO3 appear to be the fastest for binary, DNA and English data respectively.

Table 1 presents run times for binary data. SO is the winner for m ≤ 4, FSO for 8 ≤ m ≤ 16, TSO9 for 20 ≤ m ≤ 50, and TSO9 for m ≥ 60. Table 2 presents run times for DNA data. FAOSO is the winner for m = 2, FSO for m = 4, TSO5 for 8 ≤ m ≤ 40, and TSO9 for m ≥ 50. Table 3 presents run times for English data. FSO is the winner for m ≤ 4, GTSO3 for 8 ≤ m ≤ 16, GTSA3 for m = 20, and TSO5 for m ≥ 30.

⁴ The performance of the Shift-Or algorithm is insensitive to the pattern length (when m ≤ w) and also to the input data as long as the number of matches is relatively moderate. The relative speed of some algorithm compared to the speed of Shift-Or on given data and pattern length is suitable for comparing tests with similar data. This relative speed is useful for roughly comparing the performance of exact string matching algorithms with different text lengths and processors, even in different papers.

Data: DNA. Pattern length m:

Algorithm    2     4     8    12    16    20    30    40    50    60
SO         464   465   465   464   465   465   465   469   465   465
FSO        709   272   235   235   234   235   235   234   235     —
FAOSO      372   639   524   331   311   212   185   216   217   213
BNDM      1496   984   548   385   302   248   175   134   111  93.4
LNDM      2255  1438   843   609   481   398   281   219   181   154
TSA       1481   869   498   355   285   241   179   143   119   103
TSO       1353   757   364   243   192   161   117  90.9  74.6  62.7
TSO3       758   491   295   225   189   164   128   106  89.3  79.8
TSO5       992   401   215   153   121   102  78.1  65.6  60.9  56.1
TSO9      1217   465   242   168   131   109  81.1  66.4  57.6  52.3
GTSO3      753   474   289   223   191   167   132   107  92.4  74.7
GTSA3      747   486   296   228   193   169   135   111  96.9  85.3

Table 2. Search time of algorithms (in milliseconds) for DNA data

Data: English. Pattern length m:

Algorithm    2     4     8    12    16    20    30    40    50    60
SO         465   465   465   465   464   465   464   464   465   465
FSO        328   246   235   234   234   234   234   234   232     —
FAOSO     1165   307   167   156   142   141   198   199   195   198
BNDM       651   505   342   252   198   164   115  93.3  78.0  68.3
LNDM      1398   903   561   412   326   272   194   154   126   109
TSA       1243   652   348   245   195   168   121  99.7  87.1  78.4
TSO        701   518   328   231   176   141  92.3  69.5  56.6  48.9
TSO3       485   274   159   121   104  89.1  72.9  64.8  59.1  56.1
TSO5       701   341   184   132   105  88.9  67.1  58.9  54.6  49.6
TSO9       924   448   235   165   128   107  79.3  64.6  57.7  52.4
GTSO3      449   249   149   115  96.5  86.6  71.8  63.4  57.6  52.8
GTSA3      441   252   151   116  97.4  86.4  72.1  65.1  58.2  53.7

Table 3. Search time of algorithms (in milliseconds) for English data

5.1 Experiments for k-mismatches problem

For the k-mismatches problem we tested the following algorithms: Shift-Add (SAdd), Two-way Shift-Add with q-values 1, 3, and 5 (TSAdd-1, TSAdd-3, TSAdd-5), Tuned Shift-Add (TuSAdd), Average Optimal Shift-Add (AOSA), and CMFN. CMFN is a sublinear multi-pattern algorithm by Fredriksson and Navarro [8], and it is also suitable for the approximate circular pattern matching problem. The text files are the same as before. The binary pattern set for m = 5 contains only 32 patterns, all different. To make the results comparable with the other pattern sets containing 200 patterns, the timings have been multiplied by 200/32. The results were obtained as an average of 300 runs.

Programs were written in the C programming language and compiled with the gcc compiler using the -O2 optimization level. During preliminary tests we noticed a performance decrease in AOSA, which seems to be related to the optimization level in the gcc compiler. For example, on error level k = 1 and optimization -O2 the search speed was 22%–52% faster than with the -O3 used here.

Tables 4–6 present the results for the k-mismatches problem. In our tests Tuned Shift-Add was faster than the original Shift-Add. Both seem to suffer from a relatively large number of occurrences. For k = 1, TSAdd-3 showed the best performance on all data sets except on 5-nucleotide-long DNA patterns. (This

           m   TSAdd-1  TSAdd-3  TuSAdd  SAdd  AOSA  CMFN
English    5       177      137     149   231   229   880
          10        98       77     145   228   115   270
          20        53       43     145   228    51   113
          30        37       30     145   228    38    93
DNA        5       226      225     165   246   267  2770
          10       136      114     145   228   164  1420
          20        69       58     145   228    92  1810
          30        47       39     145   227    62  3083
Bin        5       333      167     625   937   937  1062
          10       167       77     603   966   966   440
          20        83       39     600   947   467   240
          30        57       30     593   943   317   140

Table 4. Search times of algorithms (in milliseconds) for k = 1.

           m   TSAdd-1  TSAdd-3  TSAdd-5  TuSAdd  SAdd  AOSA  CMFN
English    5       238      201      186     161   245   253  2807
          10       124      107      101     145   230   137   533
          20        65       56       51     147   216    73   223
DNA        5       322      280      268     255   339   497  4203
          10       176      158      151     147   239   225  3183
          20        88       79       69     146   214   113  3563
Bin        5       354      146      270     625   958   937  5688
          10       167       73      127     642   962   947   800
          20        82       46       67     611   941   470   350

Table 5. Search times of algorithms (in milliseconds) for k = 2.

           m   TSAdd-1  TSAdd-3  TSAdd-5  TuSAdd  SAdd  AOSA  CMFN
English    5       299      259      247     209   291   377  3936
          10       155      137      133     145   236   297  1128
          20        78       70       67     145   217   107   292
DNA        5       357      316      310     447   536  1073  4290
          10       215      196      194     151   241   238  4900
          20       108       99       98     148   215   128  5293
Bin        5       333      146      250     604   937   917  5524
          10       160       77      120     580   910   893   808
          20        83       42       61     580   917   450   300

Table 6. Search times of algorithms (in milliseconds) for k = 3.

test was rerun, but the results remained about the same.) TSAdd-3 was the best in all tests using the binary text. On the English and DNA texts for k = 2 and k = 3, TSAdd and TuSAdd were the best. To our surprise, CMFN was not competitive in these tests. The macro bitvector was defined as unsigned long long, but we suspect that some other compilation parameter was suboptimal.

6 Concluding remarks

We have presented two new bit-parallel algorithms based on the Shift-Or/Shift-And and Shift-Add techniques for exact string matching. The compact form of these algorithms is an outcome of a long series of experimentation on bit-parallelism. The new algorithms and their tuned versions are efficient both in theory and practice. They run in linear time in the worst case and in sublinear time in the average case. Our experiments show that the best of the new algorithms are in most cases faster than the previous algorithms of the same type.

References

1. K. Abrahamson: Generalized string matching. SIAM Journal on Computing, 16(6) 1987, pp. 1039–1051.
2. A. Amir, M. Lewenstein, and E. Porat: Faster algorithms for string matching with k mismatches. Journal of Algorithms, 50(2) 2004, pp. 257–275.
3. R. Baeza-Yates and G. Gonnet: A new approach to text searching. Communications of the ACM, 35(10) 1992, pp. 74–82.
4. D. Cantone, S. Faro, and E. Giaquinta: Bit-(parallelism)²: Getting to the next level of parallelism, in Fun with Algorithms, 5th International Conference, FUN 2010, June 2-4, 2010. Proceedings, P. Boldi and L. Gargano, eds., vol. 6099 of LNCS, Springer, 2010, pp. 166–177.
5. B. Ďurian, J. Holub, H. Peltola, and J. Tarhio: Improving practical exact string matching. Information Processing Letters, 110(4) 2010, pp. 148–152.
6. B. Ďurian, H. Peltola, L. Salmela, and J. Tarhio: Bit-parallel search algorithms for long patterns, in Experimental Algorithms, 9th International Symposium, SEA 2010, May 20-22, 2010. Proceedings, P. Festa, ed., vol. 6049 of LNCS, Springer, 2010, pp. 129–140.
7. K. Fredriksson and S. Grabowski: Practical and optimal string matching, in International Symposium on String Processing and Information Retrieval, SPIRE, LNCS, vol. 12, 2005.
8. K. Fredriksson and G. Navarro: Average-optimal single and multiple approximate string matching. ACM Journal of Experimental Algorithmics, 9 2004.
9. Z. Galil and R. Giancarlo: Improved string matching with k mismatches. SIGACT NEWS 62, 17(4) 1986, pp. 52–54.
10. L. He and B. Fang: Linear nondeterministic dawg string matching algorithm, in String Processing and Information Retrieval, 11th International Conference, SPIRE 2004, October 5-8, 2004, Proceedings, A. Apostolico and M. Melucci, eds., vol. 3246 of LNCS, Springer, 2004, pp. 70–71.
11. L. He, B. Fang, and J. Sui: The wide window string matching algorithm. Theoretical Computer Science, 332 2005.
12. H. S. Warren, Jr.: Hacker's Delight, Addison-Wesley, 2003.
13. A. Hume and D. Sunday: Fast string searching. Software—Practice and Experience, 21(11) 1991, pp. 1221–1248.
14. G. Navarro: A guided tour to approximate string matching. ACM Computing Surveys, 33(1) 2001, pp. 31–88.
15. G. Navarro and M. Raffinot: Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics, 5(4) 2000.
16. G. Navarro and M. Raffinot: Flexible pattern matching in strings – practical on-line search algorithms for texts and biological sequences, Cambridge University Press, 2002.
17. H. Peltola and J. Tarhio: Alternative algorithms for bit-parallel string matching, in International Symposium on String Processing and Information Retrieval, SPIRE, LNCS, vol. 10, 2003, pp. 80–93.
18. H. Peltola and J. Tarhio: String matching with lookahead. Discrete Applied Mathematics, 163(3) 2014, pp. 352–360.
19. S. Wu and U. Manber: Fast text searching allowing errors. Communications of the ACM, 35(10) 1992, p. 83.

Using Correctness-by-Construction to Derive Dead-zone Algorithms

Bruce W. Watson1, Loek Cleophas2, and Derrick G. Kourie1

1 FASTAR Research Group, Department of Information Science, Stellenbosch University, Private Bag X1, 7602 Matieland, Republic of South Africa
2 Department of Computer Science, University of Pretoria, Private Bag X20, 0028 Hatfield, Pretoria, Republic of South Africa
{bruce,loek,derrick}@fastar.org

Abstract. We give a derivation, in the form of a stepwise (refinement-oriented) presentation, of a family of algorithms for single keyword pattern matching, all based on the so-called dead-zone algorithm-style, in which input text parts are tracked as either unprocessed ('live') or processed ('dead'). Such algorithms allow for Boyer-Moore-style shifting in the input in two directions (left and right) instead of one, and have shown promising results in practice. The algorithms are the more interesting because of their potential for concurrency (multithreading). The focus of our algorithm family presentation is on correctness arguments (proofs) accompanying each step, and on the resulting elegance and efficiency. Several new algorithms are described as part of this algorithm family, including ones amenable to using concurrency.

Keywords: correctness-by-construction, algorithm derivation, keyword pattern matching, Boyer-Moore, concurrency

1 Introduction

In this paper, we give a stepwise derivation of a family of algorithms for single keyword pattern matching. The focus of the derivation is on clarity and confidence in the correctness of each step, which lead to efficiency and elegance. Because of the correctness arguments associated with each step, the presentation forms the essence of a derivation of the various algorithms. As such, the presentation forms a case study for the Correctness-by-Construction (CbC) approach to software or algorithm construction: we start with an abstract problem specification, in the form of pre- and postcondition, and iteratively refine these to obtain more refined (concrete) specifications [15]. In its strict form, the use of CbC requires that in each refinement step, rules are applied that guarantee and prove correctness of the resulting refined specification. For presentation clarity, and since many of the proof obligations involved are trivial, we will leave out many formal details in the present context. A CbC approach, and the derivations resulting from it, give a clear understanding of the algorithms and concepts involved, and give confidence in correctness, even if applied somewhat informally. Furthermore, new algorithms may arise from the process.

The main result of our presentation is the derivation of a substantial and partially new algorithm family, of which significant parts have not been explicitly presented before. In some sense, the algorithms in the family are reminiscent of the Boyer-Moore style of algorithms, in which shift functions are used to potentially skip reading parts of the subject string by moving to the right in the input text by more than one position. This has earned such algorithms the term sublinear.

The algorithm family we consider uses quite different algorithm skeletons from the Boyer-Moore style ones, although the latter form degenerate cases within our


algorithm family. It is based on characterizing positions in the text as either (a)live and part of a live-zone, or dead and part of a dead-zone. A position or range of positions in the text is dead, i.e. in a dead-zone, if (based on information obtained by an algorithm so far) it has been deemed to either not match the pattern/keyword, or has been deemed to match the pattern and such a match has been reported or registered. Otherwise, it is live, i.e. in a live-zone. Initially, the entire input text is live, and upon termination, the entire input text is dead. So-called dead-zone style algorithms such as those in the algorithm family presented here can do better than Boyer-Moore style algorithms can: based on a single alignment of text to pattern, shifting (i.e. "killing" positions) can occur both to the right and to the left, i.e. yielding two remaining live text parts (and hence next alignments) to consider. Furthermore, these two parts can be processed independently, offering ample opportunity for concurrency; this fundamentally distinguishes the dead-zone algorithm family from classical Boyer-Moore and most more recent single keyword pattern matching algorithms, which typically use a single sliding alignment window on the text. Dead-zone style algorithms can be used with any of the various Boyer-Moore style shift functions known from the literature, and they can even use left and right shift functions that are completely different (above and beyond the obvious mirroring required).

This algorithm family can be seen in the context of earlier work by the authors and their collaborators, based on the use of live-zones and dead-zones for keyword pattern matching. At the Second Prague Stringology Club Workshop (1997), Watson and Watson presented a paper on a new family of string pattern matching algorithms [19], with a later version appearing in 2003 in the South African Computer Journal [20]. In those papers, the focus is on representing liveness and deadness on a per-position basis. The use of ranges to represent liveness and deadness is only mentioned in passing, leading to a recursive range-based algorithm. Furthermore, this early work is not explicit about the use of Boyer-Moore style shift functions in the algorithms. At the International Workshop on Combinatorial Algorithms (IWOCA) 2012, Watson, Kourie and Strauss presented a slightly different recursive dead-zone-based algorithm with explicit use of Boyer-Moore style shift functions, as well as a C++ implementation and benchmarking [18]. This work was further extended by Mauch et al. [13], who present the recursive algorithm with some variations, including iterative ones obtained by (tail) recursion elimination. The focus there is on implementing and benchmarking these variants (in C and C++).

In this paper, we present and derive a family of iterative, range-based dead-zone algorithms, including different representation choices for the live-zones in the algorithms. As we will see, the family includes an alternative derivation of the recursive implicit-stack based algorithm used in [18,13]. Furthermore, various representation choices are considered, and various dead-zone-style algorithms using concurrency are presented and sketched, some of them new. The presentation is in the spirit of the CbC style pioneered by Dijkstra, Gries, and others over four decades ago [6,8,14,12].
For this reason, we assume some awareness of this style, and we use the well-known Guarded Command Language (GCL), designed by Dijkstra to support reasoning about correctness. The presentation starts out from a single abstract algorithm whose correctness is easy to argue based on the formulation of pre- and post-conditions and invariant. Using the CbC approach, this algorithm is then iteratively refined, leading to the derivation and presentation of the family of algorithms.

2 Problem Statement and Initial Solution Sketch

The single keyword pattern matching problem can be stated as follows:

Given two strings x, y ∈ Σ* over an alphabet Σ, respectively called the keyword or pattern, and the (input or search) text, find all occurrences of x as a contiguous substring of y.

We register any such occurrences or matches by keeping track of the indices in the text at which they occur, i.e. at which they start, in a match set variable, called MS (a set of integers)¹. To make this more precise, we define a helper predicate

    Match(x, y, j) ≡ (x = y[j, j + |x|)).

Note that here and throughout this derivation, we use [a, b) style ranges (inclusive at the left end, exclusive at the right end) to avoid numerous +1 and −1 terms in indexing a string.

Using predicate Match, the postcondition to be established by any algorithm solving the single keyword pattern matching problem therefore is:

    MS = (⋃ j : j ∈ [0, |y|) ∧ Match(x, y, j) : {j})

Note that indexing in y starts from 0, in keeping with many programming languages as well as typical use in correctness-by-construction styles. Also note that the above does not place any restrictions on x, y; in particular, y may be shorter than x (in which case there are no matches), and one or both could be of length 0 (and in case x does, it trivially matches at every position of y). Many algorithm presentations in the literature needlessly give such restrictions as part of the problem statement, thereby cluttering the resulting algorithm(s).

As in the earlier derivations and presentations of dead-zone style algorithms mentioned in the introduction, our algorithms will keep track of all indices in y (i.e. members of the set [0, |y|)), categorizing each such index k into one of a number of disjoint sets. Here, we will use three (disjoint) sets, although one of them will not be represented explicitly:

1. MS, the set of indices where a match has already been found.
2. Live Todo, the set of indices about which we know nothing (i.e. there may or may not be a match at such an index), i.e. that are still live.
3. ¬(MS ∪ Live Todo), the set of indices at which we know no match occurs.

The indices in category 2 are called live indices, while those in categories 1 and 3 are called dead. In earlier dead-zone-style algorithms as presented in [19,20], two disjoint sets were used, representing the live (or to do) and dead (or done) indices respectively, with dead indices at which a match occurs being reported as soon as they are found. The aim of every algorithm in this paper's algorithm family is to start with Live Todo = [0, |y|) and reduce this to Live Todo = ∅, meaning that all indices have been checked and are in category 1 or 3; in other words, each such algorithm starts with the entire input text y being live, and ends with the entire input being dead. This is expressed by the predicates and algorithm skeleton given below.

¹ Of course, in practical implementations, we may print a message to the screen or otherwise report a match instead of accumulating the variable MS.

Algorithm 1 (Abstract Live and Dead Indices Matcher)
    Live Todo := [0, |y|); MS := ∅;
    { invariant: (∀ j : j ∈ MS : Match(x, y, j))
                 ∧ (∀ j : j ∉ (MS ∪ Live Todo) : ¬Match(x, y, j)) }
    { variant: |Live Todo| }
    S
    { invariant ∧ |Live Todo| = 0 }
    { post }

Clearly, given an appropriate statement S satisfying the above pre- and postconditions for S, when |Live Todo| = 0, Live Todo = ∅ and hence the previously mentioned postcondition holds, i.e.

    MS = (⋃ j : j ∈ [0, |y|) ∧ Match(x, y, j) : {j})
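As a concrete (if naive) illustration of what this skeleton computes, the following Python sketch instantiates S as a loop that checks one live index per iteration; the names are ours and not from the paper, and no attempt at efficiency is made.

def match_at(x: str, y: str, j: int) -> bool:
    """Match(x, y, j): does x equal y[j, j + |x|)?"""
    return y[j:j + len(x)] == x

def abstract_matcher(x: str, y: str) -> set:
    """Abstract live/dead indices matcher: kill one live index per iteration."""
    live_todo = set(range(len(y)))   # initially the whole text is live
    ms = set()                       # the match set MS
    while live_todo:                 # variant: |live_todo| strictly decreases
        j = live_todo.pop()
        if match_at(x, y, j):
            ms.add(j)
    return ms

# Example: abstract_matcher("ss", "compressors") == {6}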

3 From Position-based to Range-based Iterative Dead-zone

We can immediately make a practical improvement to Algorithm 1 by representing Live Todo as a set of (live) ranges [l, h) (with l, h integers). At first glance this may seem less efficient, but we will benefit from this shortly. As a further conjunct to the original invariant, we also insist that the ranges in Live Todo are pairwise disjoint, i.e. none of the ranges overlap with each other². Furthermore, the ranges are assumed to be maximal, to keep Live Todo small (although ranges may be empty). With this minor change of representation, we need to be clear about what |Live Todo| in the variant means: we still use it to refer to the total number of indices represented in Live Todo, not the number of ranges in it. The resulting new algorithm skeleton is given below.

Algorithm 2 (Abstract Live and Dead Ranges Matcher)
    Live Todo := { [0, |y|) }; MS := ∅;
    { invariant: (∀ j : j ∈ MS : Match(x, y, j))
                 ∧ (∀ j : j ∉ (MS ∪ Live Todo) : ¬Match(x, y, j)) }
    { variant: |Live Todo| }
    do Live Todo ≠ ∅ →
        Extract some [l, h) from Live Todo;
        S0
    od
    { invariant ∧ |Live Todo| = 0 }
    { post }

The question now becomes what needs to be done in S0 to re-establish the invariant. Clearly some or all of the indices in the range [l,h) need to be checked to gain information about matches found or found to be impossible. Rather than check all indices in the range, we check for a match at the midpoint of the range, i.e. at

² Not doing so would lead to inefficiency by considering a live position more than once, or to extra bookkeeping to eliminate positions which have already been considered.

m = ⌊(l + h)/2⌋, and split the range in two, inserting the remaining portions into Live Todo, i.e. adding [l, m) and [m + 1, h) to Live Todo. Note that, as indicated above, this does not make variant |Live Todo| increase. It should also be noted that other choices than the midpoint of the range could be made, including choices that have the algorithm degenerate to e.g. Boyer-Moore [19], but we will not explore these here.

Either or both of [l, m) and [m + 1, h) may be empty ranges. We can detect this before insertion and not insert the empty range(s) into Live Todo. Here, we elect to do such range checking upon extraction of a range from Live Todo, cutting the amount of pseudo-code. However, whenever an empty range is processed in the loop, the earlier variant |Live Todo| does not strictly decrease but stays the same. Because of this, we change the variant to the pair ⟨|Live Todo|, E⟩, with E the number of empty ranges in Live Todo. Using the standard lexicographical ordering on such pairs, this is once more a correct variant: it decreases not only when |Live Todo| decreases, but also when |Live Todo| stays the same yet an empty range is extracted from Live Todo (since in that case E decreases).

Finally, we note that the last |x| − 1 indices of input text y cannot contain a match, and we change the initialization of Live Todo accordingly. (Note that this could already have been done before, i.e. in Algorithms 1 and 2.) The above refinements give us Algorithm 3.

Algorithm 3 (Live and Dead Ranges Matcher)
    Live Todo := { [0, |y| − |x|) }; MS := ∅;
    { invariant: (∀ j : j ∈ MS : Match(x, y, j))
                 ∧ (∀ j : j ∉ (MS ∪ Live Todo) : ¬Match(x, y, j)) }
    { variant: ⟨|Live Todo|, E⟩ }
    do Live Todo ≠ ∅ →
        Extract [l, h) from Live Todo;
        if l ≥ h → { empty range } skip
        [] l < h →
            m := ⌊(l + h)/2⌋;
            if Match(x, y, m) → MS := MS ∪ {m}
            [] ¬Match(x, y, m) → skip
            fi;
            Live Todo := Live Todo ∪ { [l, m) } ∪ { [m + 1, h) }
        fi
    od
    { invariant ∧ |Live Todo| = 0 }
    { post }

4 Improvements to Range-based Iterative Dead-zone

The preceding algorithm repeatedly performs match attempts, i.e. tests predicate Match. Testing this predicate boils down to a loop testing the symbols of x one-for-one against y[m, m + |x|). Such match attempts can be done letter-by-letter from right to left or vice versa, but also in parallel, or using a different order. The particular order used has typically been called the match order, with extensive discussions of different orders by Hume & Sunday [11] and elsewhere [16,3].

As with all Boyer-Moore style algorithms, we can eliminate more than one index at a time, i.e. not just index m but consecutive indices next to it as well, by cleverly using information gathered during the match attempt. Shifting more than one index at a time after this match attempt is usually done using a precomputed shift function. We do not consider the details of such “shifters” here. They are extensively covered in the literature [2,9,11], including with correctness arguments in [21,5]. Given that any such shift function is usable with the above algorithm skeleton, the resulting skeleton (not explicitly given here) in fact represents an extensive family of algorithms, even without considering the preceding abstract algorithms or the ones that follow in Section 5. Since match attempts are made near the middle of a selected range however, our algorithm skeleton, in contrast to classical Boyer-Moore and all its variations, can use two shifters at the same time: one to shift to the right (as per Boyer-Moore and variants), increasing the left bound of range [m +1,h); and a dual one to the left, decreasing the upper bound of range [l,m). Although such left shifters have not been described in detail in the literature, they are in fact straightforwardly computed “duals” of the right shifters, as was pointed out in [18]. It is important to note that it is not even necessary to use a shifter and its dual shifter in the other direction, but that a shifter and the dual of a completely different Boyer-Moore style shifter can be used—thus, the algorithm family is even more extensive than it may seem at first glance. For example, a family member could use Horspool’s shifter for its right shifts, and the dual of Sunday’s shifter for its left shift, etc. If we assume two such shift functions, returning values shl and shr for shift left and shift right, the update of Live Todo near the bottom of Algorithm 3 can be replaced by

    l′, h′ := m − shl, m + shr;
    Live Todo := Live Todo ∪ { [l, l′) } ∪ { [h′, h) };
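For concreteness, the following Python sketch (ours, not code from the paper or from [18]) shows one way such a pair of shift functions could be precomputed: a classical Horspool bad-character table for the right shift, and the same construction applied to the reversed pattern as its left "dual", following the mirroring argument above.

def horspool_right_shift(x: str):
    """Horspool bad-character table; looked up with the rightmost window character."""
    m = len(x)
    table = {x[i]: m - 1 - i for i in range(m - 1)}   # last write wins: rightmost occurrence
    return lambda c: table.get(c, m)

def horspool_left_shift(x: str):
    """Mirrored dual table; looked up with the leftmost window character."""
    return horspool_right_shift(x[::-1])

# Example: for x = "abc", horspool_right_shift(x)("a") == 2 and
# horspool_left_shift(x)("c") == 2; characters absent from x give shift |x|.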

We can do a relatively simple running-time analysis, which relies on prior knowledge about the shift functions: they are both bounded above by |x| in most cases³, meaning we could make as few as ⌈|y| / (2|x|)⌉ match attempts in the best case. This contrasts to ⌈|y| / |x|⌉ match attempts for Boyer-Moore and variants. Note that this does not violate any information-theoretical bounds; in each match attempt, character comparisons from both the left and the right end of the current alignment are made, and following each match attempt, our family of algorithms simply shifts in two directions instead of one. In the best case, just like for the Boyer-Moore algorithm and variants, we have ⌈|y| / |x|⌉ character comparisons. The worst case for this family, as with all keyword pattern matching variants, is |y| match attempts and |y| ∗ |x| character comparisons.

³ The exception is with shift functions as used in e.g. the Berry-Ravindran algorithm [1], which uses characters next to x's current alignment in y as well. In such cases, the shift function is bounded by |x| + c for some constant c.

5 Different Live-zone Representations

Until now, Live Todo has been a set, meaning there is a measure of nondeterminism in the algorithms presented. This can lead to very poor performance in practice, in cases where the algorithm needs read access intermittently all across y. As an alternative to a set representation, Live Todo can easily be represented using a queue or using a stack. Predictably, these lead to breadth- respectively depth-wise traversals of the ranges in y.

As noted before, the ranges in Live Todo are pairwise disjoint. This also means we can deal with them entirely independently and in parallel. Indeed, each iteration could give rise to new threads, leading to various concurrent versions of the algorithms presented. Two recent papers, [10] and [7], discuss approaches that have some similarities to this; they also process multiple text segments at the same time, but this is limited to two segments in the former, and a fixed number of segments in the latter. In [7], the fixed number of segments is strongly reflected in the structure of the source code, with some coupling between the segments' processing, and no obvious way of making the processing multi-threaded.

In the worst case, a queue can grow to |y| elements (ranges), and in the best case to ⌈|y| / |x|⌉. Because of this worst case behaviour, we do not further consider this representation choice in the current paper. A stack representation is much more efficient than a queue one, giving a maximum stack size of ⌈log₂ |y|⌉. The resulting algorithm is a relatively straightforward refinement of (the bidirectional shifting-based refinement of) Algorithm 3, and is given below.

Algorithm 4 (Live and Dead Ranges Matcher using Stack)
    Live Todo := ⟨ [0, |y| − |x|) ⟩; MS := ∅;
    { invariant: (∀ j : j ∈ MS : Match(x, y, j))
                 ∧ (∀ j : j ∉ (MS ∪ Live Todo) : ¬Match(x, y, j)) }
    { variant: ⟨|Live Todo|, E⟩ }
    do Live Todo ≠ ∅ →
        Pop [l, h) from Live Todo;
        if l ≥ h → { empty range } skip
        [] l < h →
            m := ⌊(l + h)/2⌋;
            if Match(x, y, m) → MS := MS ∪ {m}
            [] ¬Match(x, y, m) → skip
            fi;
            l′, h′ := m − shl, m + shr;
            Push [l, l′) onto Live Todo;
            Push [h′, h) onto Live Todo
        fi
    od
    { invariant ∧ |Live Todo| = 0 }
    { post }

This algorithm is in fact an encoding, with an explicit stack, of a recursive dead-zone-style algorithm, similar to the ones presented in [18,13].
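To make the skeleton concrete, here is a small, self-contained Python sketch (ours, not the authors' implementation) of an iterative, stack-based dead-zone matcher in the spirit of Algorithm 4, using a Horspool-style right shift and its left dual.

def dead_zone_match(x: str, y: str) -> set:
    """Return the set MS of all start positions of x in y."""
    m = len(x)
    if m == 0 or m > len(y):
        return set()
    # bad-character tables: classical Horspool (right) and its mirrored dual (left)
    right = {x[i]: m - 1 - i for i in range(m - 1)}
    left = {x[i]: i for i in range(m - 1, 0, -1)}
    ms = set()
    live = [(0, len(y) - m + 1)]            # live ranges of candidate start positions
    while live:                             # a stack gives a depth-first traversal
        l, h = live.pop()
        if l >= h:                          # empty range
            continue
        mid = (l + h) // 2
        if y[mid:mid + m] == x:             # match attempt in the middle of the range
            ms.add(mid)
        shr = right.get(y[mid + m - 1], m)  # shift to the right from this alignment
        shl = left.get(y[mid], m)           # dual shift to the left
        live.append((l, mid - shl + 1))     # remaining live part to the left
        live.append((mid + shr, h))         # remaining live part to the right
    return ms

# Example: dead_zone_match("ab", "abab") == {0, 2}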

Algorithm 4 can be further refined to yield a version with information sharing; that is, given that zones may overlap, and since the algorithm considers zones from left to right, we can track a known dead-zone [0, z), with z chosen such that every index i with 0 ≤ i < z is already known to be dead.

Algorithm 5 (Live and Dead Ranges Matcher using Stack and Sharing)
    Live Todo := ⟨ [0, |y| − |x|) ⟩; MS := ∅;
    z := 0;
    { invariant: (∀ j : j ∈ MS : Match(x, y, j))
                 ∧ (∀ j : j ∉ (MS ∪ Live Todo) : ¬Match(x, y, j))
                 ∧ (∀ i : 0 ≤ i < z : i is dead) }
    ...

6 Using Concurrency for Match Attempts

The use of concurrency as suggested in the preceding section is but one possible use of concurrency in dead-zone-style algorithms. We consider a different, previously undescribed, use of concurrency below. The core observation that leads to this use of concurrency is that for certain shift functions, very little match attempt information is used to determine the shifts to be made (while often still providing competitive performance compared to other shift functions). This means that shifting and match attempt verification can be decoupled to some degree.

For example, Horspool's variant of Boyer-Moore uses shifts based only on the rightmost symbol of the input at the given keyword alignment (and in our context, dually uses the leftmost symbol for the left shifter). Similarly, Sunday's Quicksearch uses shifts based solely on the symbol to the right of that rightmost symbol; and in our context, dually, on the symbol to the left of the leftmost symbol of the alignment for the shift to the left. Thus, it is not necessary to attempt a complete match, i.e. to fully evaluate predicate Match(x, y, m), before such shifts can be calculated.

For a given alignment of the pattern to the input text, an alternative algorithm could read the two symbols needed for the respective right and left shifts (one for the right, one for the left shift), and immediately compute the shifts and make appropriate updates to Live Todo. The current index, at which a match attempt still has to be performed (i.e. for which predicate Match has to be evaluated), can be added to a separate set variable, say Attempt. The elements of this set can be processed after the main loop of the algorithm has terminated, yielding a different but still sequential dead-zone-style algorithm.

Instead of such a sequential algorithm, a version employing concurrency could be used: one or more separate "matcher" threads could repeatedly extract an element from Attempt and perform the match attempt for the alignment indicated by the element. Alternatively, each thread t can have its own thread-private set Attempt_t from which to extract elements, with the main algorithm loop adding positions to the various threads' Attempt_t. We present the latter algorithm, including the procedure Matcher to be executed by each of the matcher threads t ∈ MThreads. Note that the second conjunct of the invariant has been changed to cover the Attempt_t variables, reflecting the fact that removal of an element from Live Todo does not necessarily mean it represents a non-match, as long as the match attempt for the element still needs to be executed (i.e. the element is in one of the sets Attempt_t).

Algorithm 6 (Concurrent Live and Dead Ranges Matcher using Stack)
    Live Todo := ⟨ [0, |y| − |x|) ⟩; MS := ∅;
    { invariant: (∀ j : j ∈ MS : Match(x, y, j))
                 ∧ (∀ j : j ∉ (MS ∪ Live Todo ∪ ⋃_{t ∈ MThreads} Attempt_t) : ¬Match(x, y, j)) }
    { variant: ⟨|Live Todo|, E⟩ }
    do Live Todo ≠ ∅ →
        Pop [l, h) from Live Todo;
        if l ≥ h → { empty range } skip
        [] l < h → ...

    proc Matcher ≡
        do Attempt_t ≠ ∅ →
            Extract m from Attempt_t;
            if Match(x, y, m) → MS := MS ∪ {m}
            [] ¬Match(x, y, m) → skip
            fi
        od
    corp

This algorithm assumes each matcher thread has write access to the shared match set variable MS. In practice, it is likely that an implementation would instead have a thread-local match set variable, and these would be gathered in shared memory after thread termination.
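The following Python sketch (ours) illustrates this decoupling: the main loop computes the shifts from the two boundary symbols only and defers the full match attempts to per-thread queues playing the role of the Attempt_t sets; the thread handling and names are our assumptions for illustration.

import queue
import threading

def concurrent_dead_zone(x: str, y: str, n_threads: int = 2) -> set:
    m = len(x)
    if m == 0 or m > len(y):
        return set()
    right = {x[i]: m - 1 - i for i in range(m - 1)}       # Horspool right shift table
    left = {x[i]: i for i in range(m - 1, 0, -1)}         # mirrored left shift table
    attempts = [queue.Queue() for _ in range(n_threads)]  # the Attempt_t sets
    ms, ms_lock = set(), threading.Lock()

    def matcher(t):                          # procedure Matcher for thread t
        while True:
            j = attempts[t].get()
            if j is None:                    # sentinel: no more work
                return
            if y[j:j + m] == x:
                with ms_lock:
                    ms.add(j)

    workers = [threading.Thread(target=matcher, args=(t,)) for t in range(n_threads)]
    for w in workers:
        w.start()
    live, k = [(0, len(y) - m + 1)], 0
    while live:
        l, h = live.pop()
        if l >= h:
            continue
        mid = (l + h) // 2
        attempts[k % n_threads].put(mid)     # defer the full match attempt
        k += 1
        shr = right.get(y[mid + m - 1], m)   # shifts depend only on boundary symbols
        shl = left.get(y[mid], m)
        live.append((l, mid - shl + 1))
        live.append((mid + shr, h))
    for q in attempts:                       # signal termination to every matcher
        q.put(None)
    for w in workers:
        w.join()
    return ms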

7 Concluding Remarks

We have given a derivation, in the form of a stepwise presentation, of an extensive family of single keyword pattern matching algorithms. Our exposition serves as a case study for the Correctness-by-Construction style of algorithm development, emphasizing correctness and clarity, which in turn lead to elegant and presumably efficient algorithms. Our presentation included several new algorithm variants, a result typically seen when using the correctness-by-construction style of algorithm derivation [17,4,15].

The algorithms in the family are all based on tracking input text parts as being dead-zones or live-zones, and using shifts to both the left and the right after an alignment and match attempt in the middle of a remaining text part to be considered. This approach offers ample opportunity for concurrency and fundamentally distinguishes the dead-zone algorithm family from classical Boyer-Moore and most more recent single keyword pattern matching algorithms, which typically use a single sliding alignment window on the text.

Previous publications [13,18] and ongoing benchmarking have already shown promising results for some of the members of the algorithm family as well as closely related dead-zone-based algorithms. Figure 1 shows some recent results. The details are not of particular interest here, but Figures 1a and 1b show the differences between each of 10 algorithms (8 dead-zone variants, followed by plain non-dead-zone Horspool and QuickSearch at the right of each figure) when run on Bible and Ecoli data respectively. The box plots reflect results over keyword lengths 2^i for i ∈ [1, 16], with 100 pseudo-randomly generated keywords of each length; thus, each figure has 10 ∗ 16 = 160 box plots over 100 values. The results are relative (in percentage terms) to the first (leftmost) algorithm, an iterative dead-zone algorithm using Horspool's shifter (and dual). The results show current dead-zone implementations to already be close to competitive against plain non-dead-zone Horspool and QuickSearch, depending on the data set and pattern length under consideration. Since the new dead-zone algorithms derived and discussed in this paper include ones amenable to using concurrency (multithreading) in various ways, they are all the more interesting to explore further as future work.


(a) On Bible data. (b) On Ecoli data. (Data sources: i7 / Wall plug / Sequential / * / * / Bible resp. Ecoli / Machine time.)
Figure 1: Comparisons of relative performance of various dead-zone algorithms and plain Horspool and QuickSearch.

References

1. T. Berry and S. Ravindran: A fast string matching algorithm and experimental results, in Proceedings of the Prague Stringology Club Workshop '99, J. Holub and M. Šimánek, eds., Czech Technical University, Prague, Czech Republic, 1999, pp. 16–26, Collaborative Report DC-99-05.
2. R. S. Boyer and J. S. Moore: A fast string searching algorithm. Communications of the ACM, 20(10) 1977, pp. 62–72.
3. D. Cantone and S. Faro: Improved and self-tuned occurrence heuristics, in Proceedings of the Prague Stringology Conference 2013, J. Holub and J. Žďárek, eds., Czech Technical University in Prague, Czech Republic, 2013, pp. 92–106.
4. L. Cleophas: Tree Algorithms: Two Taxonomies and a Toolkit, PhD thesis, Eindhoven University of Technology, the Netherlands, Apr. 2008.
5. L. Cleophas, B. W. Watson, and G. Zwaan: A new taxonomy of sublinear right-to-left scanning keyword pattern matching algorithms. Science of Computer Programming, 75 2010, pp. 1095–1112.
6. E. W. Dijkstra: A Discipline of Programming, Prentice Hall, 1976.
7. S. Faro and T. Lecroq: A multiple sliding windows approach to speed up string matching algorithms, in SEA, R. Klasing, ed., vol. 7276 of Lecture Notes in Computer Science, Springer, 2012, pp. 172–183.
8. D. Gries: The Science of Computer Programming, Springer-Verlag, second ed., 1980.
9. R. N. Horspool: Practical fast searching in strings. Software – Practice & Experience, 10(6) 1980, pp. 501–506.
10. A. Hudaib, R. Al-Khalid, D. Suleiman, M. Itriq, and A. Al-Anani: A fast pattern matching algorithm with two sliding windows (TSW). Journal of Computer Science, 4(5) 2008, pp. 393–401.
11. A. Hume and D. Sunday: Fast string searching. Software – Practice & Experience, 21(11) 1991, pp. 1221–1248.
12. D. G. Kourie and B. W. Watson: The Correctness-by-Construction Approach to Programming, Springer Verlag, 2012.
13. M. Mauch, D. G. Kourie, B. W. Watson, and T. Strauss: Performance assessment of dead-zone single keyword pattern matching, in SAICSIT Conf., J. H. Kroeze and R. de Villiers, eds., ACM, 2012, pp. 59–68.
14. C. C. Morgan: Programming from Specifications, 2nd Edition, Prentice Hall International series in computer science, Prentice Hall, 1994.

15. B. Watson, D. Kourie, and L. Cleophas: Experience with correctness-by-construction. Science of Computer Programming, 2013, in press.
16. B. W. Watson: Taxonomies and Toolkits of Regular Language Algorithms, PhD thesis, Faculty of Computing Science, Eindhoven University of Technology, the Netherlands, Sept. 1995.
17. B. W. Watson: Algorithms for Constructing Minimal Acyclic Deterministic Finite Automata, PhD thesis, Department of Computer Science, University of Pretoria, South Africa, 2010.
18. B. W. Watson, D. G. Kourie, and T. Strauss: A sequential recursive implementation of dead-zone single keyword pattern matching, in IWOCA, S. Arumugam and W. F. Smyth, eds., vol. 7643 of Lecture Notes in Computer Science, Springer, 2012, pp. 236–248.
19. B. W. Watson and R. E. Watson: A new family of string pattern matching algorithms, in Proceedings of the Second Prague Stringology Workshop, J. Holub, ed., Prague, Czech Republic, July 1997, Czech Technical University, pp. 12–23.
20. B. W. Watson and R. E. Watson: A new family of string pattern matching algorithms. South African Computer Journal, 30 June 2003, pp. 34–41.
21. B. W. Watson and G. Zwaan: A taxonomy of sublinear multiple keyword pattern matching algorithms. Science of Computer Programming, 27(2) 1996, pp. 85–118.

Random Access to Fibonacci Codes

Shmuel T. Klein¹ and Dana Shapira²,³

¹ Computer Science Department, Bar Ilan University, Israel
² Computer Science Department, Ashkelon Academic College, Israel
³ Department of Computer Science and Mathematics, Ariel University, Israel
[email protected], [email protected]

Abstract. A Wavelet tree allows direct access to the underlying file, so that the compressed file itself is no longer needed. In this paper we adapt the Wavelet tree to Fibonacci codes, so that in addition to supporting direct access to the Fibonacci encoded file, we also increase the compression savings when compared to the original Fibonacci compressed file.

1 Introduction and previous work

Variable length codes, such as Huffman and Fibonacci codes, were suggested long ago as alternatives to fixed length codes, since they might improve the compression performance. However, random access to the ith codeword of a file encoded by a variable length code is no longer trivial, since the beginning position of the ith element depends on the lengths of all the preceding ones. A possible solution to allow random access is to divide the encoded file into blocks of b codewords, and to use an auxiliary bit vector to indicate the beginning of each block. The time complexity of random access becomes O(b), as we can begin from the sampled bit address of the ⌊i/b⌋th block to retrieve the ith codeword. This method thus suggests a processing time vs. memory storage tradeoff, since direct access requires decoding i − ⌊i/b⌋ · b codewords, i.e., fewer than b.

Another line of investigation applies efficiently implemented rank and select operations on bit-vectors [23,20] to develop a data structure called a Wavelet tree, suggested by Grossi et al. [11], which allows direct access to any codeword, and in fact recodes the compressed file into an alternative form. The root of the Wavelet tree holds the bitmap obtained by concatenating the first bit of each of the sequence of codewords in the order they appear in the compressed text. The left and right children of the root hold, respectively, the bitmaps obtained by concatenating, again in the given order, the second bit of each of the codewords starting with 0, respectively with 1. This process is repeated recursively with the children.

In this paper, we study the properties of Wavelet trees when applied to Fibonacci codes, and show how to improve the compression beyond the savings achieved by Wavelet trees for general prefix free codes. It should be noted that a Wavelet tree for general prefix free codes requires a small amount of additional memory storage as compared to the memory usage of the compressed file itself. However, since it enables efficient direct access, it is a price we are willing to pay. Wavelet trees, which are different implementations of compressed suffix arrays, yield a tradeoff between search time and memory storage. Given a string T of length n and an alphabet Σ, one of the implementations requires space nH_h + O(n log log n / log_{|Σ|} n) bits, where H_h denotes the hth-order empirical entropy of the text, which is bounded by log |Σ|, and processing time of just O(m log |Σ| + polylog(n)) for searching any pattern sequence of length m.
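Returning to the simple block-sampling scheme mentioned at the beginning of this section, the following Python sketch (ours) shows the bookkeeping involved; decode_one is a hypothetical helper standing in for any sequential variable length decoder.

def build_samples(codeword_lengths, b):
    """Bit offsets of codewords 0, b, 2b, ... in the encoded file."""
    samples, offset = [], 0
    for i, length in enumerate(codeword_lengths):
        if i % b == 0:
            samples.append(offset)
        offset += length
    return samples

def access(encoded_bits, samples, b, i, decode_one):
    """Return the i-th codeword; decode_one(bits, pos) -> (codeword, next_pos)."""
    pos = samples[i // b]                # jump to the start of the block
    for _ in range(i % b):               # decode i - floor(i/b)*b codewords, i.e. fewer than b
        _, pos = decode_one(encoded_bits, pos)
    codeword, _ = decode_one(encoded_bits, pos)
    return codeword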



Grossi and Ottaviano introduce the Wavelet trie, a compressed indexed sequence of strings in which the shape of the tree is induced from the structure of the Patricia trie [19]. This enables efficient prefix computations (e.g. counting the number of strings up to a given index having a given prefix) and supports dynamic changes to the alphabet.

Brisaboa et al. [4] use a variant of a Wavelet tree on Byte-Codes, which encodes the sequence and provides direct access. The root of the Wavelet tree contains the first byte, rather than the first bit, of all the codewords, in the same order as they appear in the original text. The root has as many children as the number of different bytes (e.g., 128 for ETDC). The second level nodes store the second byte of those codewords whose first byte corresponds to that child (in the same order as they appear in the text), and so on. The reordering of the compressed text bits becomes an implicit index representation of the text, which is empirically shown to be better than explicit main memory inverted indexes, built on the same collection of words, when using the same amount of space. We use, in this paper, a binary Wavelet tree rather than a 256-ary one for byte-codes, using less space.

In another work, Brisaboa et al. [6] introduced directly accessible codes (DACs) by integrating rank dictionaries into byte aligned codes. Their method is based on Vbyte coding [25], in which the codewords represent integers. The Vbyte code splits the ⌊log x_i⌋ + 1 bits needed to represent an integer x_i in its binary form into blocks of b bits and prepends each block with a flag-bit as follows. The highest bit is 0 in the extended block holding the most significant bits of x_i, and 1 in the others. Thus, the 0 bit acts as a comma between codewords. For example, if b = 3 and x_i = 25, the binary representation of x_i, 11001, is split into two blocks, and after adding the flags to each block, the codeword is 0011 1001. In the worst case, the Vbyte code loses one bit per b bits of x_i plus b bits for an almost empty leading block, which is worse than δ-Elias encoding. DACs can be regarded as a reorganization of the bits of Vbyte, plus extra space for the rank structures, that enables direct access to it. First, all the least significant blocks of all codewords are concatenated, then the second least significant blocks of all codewords having at least two blocks, and so on. Then the rank data structure is applied on the comma bits, attaining log(M)/b processing time, where M is the maximum integer to be encoded. In the current work, not only do we use the Fibonacci encoding, which is better than δ-Elias encoding in terms of memory space, we even eliminate some of the bits of the original Fibonacci encoding, while still allowing direct access with better processing time.

Recently, Külekci [18] suggested the usage of Wavelet trees and the rank and select data structures for Elias and Rice variable length codes. This method is based on handling the unary and binary parts of the codewords separately, in different strings, so that random access is supported in constant time by two select queries. As an alternative, the usage of a Wavelet tree over the lengths of the unary section of each Elias or Rice codeword is proposed, while storing their binary section, allowing direct access in time log r, where r is the number of distinct unary lengths in the file. It should also be noted that better compression can obviously be obtained by the optimal Huffman codes.
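The Vbyte splitting just described can be sketched as follows (our illustration; the block size b is given in bits, as in the example).

def vbyte_encode(x: int, b: int = 3) -> str:
    """Split the binary form of x into b-bit blocks; flag 0 marks the block
    holding the most significant bits, flag 1 the remaining blocks."""
    bits = bin(x)[2:]                               # floor(log x) + 1 bits
    bits = "0" * ((-len(bits)) % b) + bits          # pad the leading block
    blocks = [bits[i:i + b] for i in range(0, len(bits), b)]
    flagged = ["0" + blocks[0]] + ["1" + blk for blk in blocks[1:]]
    return " ".join(flagged)

# Example: vbyte_encode(25, 3) == "0011 1001", as in the text above.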
The application field of the current work is thus restricted to those instances in which static codes are preferred, for various reasons, to Huffman codes. These static codes include, among others, the different Elias codes, dense codes like ETDC and SCDC [5], and Fibonacci codes.

The rest of the paper is organized as follows. Section 2 brings some technical details on rank and select, as well as on Fibonacci codes. Section 3 deals with random access to Fibonacci encoded files, first suggesting the use of an auxiliary index, then showing how to apply Wavelet trees especially adapted to Fibonacci compressed files. Section 4 further improves the self-indexing data structure by pruning the Wavelet tree, and Section 5 brings experimental results.

2 Preliminaries

We bring here some technical details on the rank and select operations, as well as on Fibonacci codes, which will be useful for the understanding of the ideas below.

2.1 Rank and Select

Given a bit vector B and a bit b ∈ {0, 1}, rank_b(B, i) returns the number of occurrences of b up to and including position i, and select_b(B, i) returns the position of the ith occurrence of b in B. Note that rank_{1−b}(B, i) = i − rank_b(B, i); thus, only one of the two, say rank_0(B, i), needs to be computed. Jacobson [14] showed that rank, on a bit-vector of length n, can be computed in O(1) time using n + O(n log log n / log n) = n + o(n) bits. His solution is based on storing rank answers every log² n bits of B, using log n bits per sample, then storing rank answers relative to the last sample every (log n)/2 bits (requiring log(log² n) = 2 log log n bits per sub-sample), and using a universal table to complete the answer to a rank query within a subtable.

Raman et al. [23] partition the input bitmap B into blocks of length t = ⌈(log n)/2⌉. These are assigned to classes: a block with k 1s belongs to class k. Class k contains (t choose k) elements, so ⌈log (t choose k)⌉ bits are used to refer to an element of it. A block is identified with a pair (k, r), where k is its class (0 ≤ k ≤ t), using ⌈log(t + 1)⌉ bits, and r is the index of the block within the class, using ⌈log (t choose k)⌉ bits. A global universal table for each class k translates in O(1) time any index r into the corresponding t bits. The sizes of these tables add up to t · 2^t bits. The sequence of ⌈n/t⌉ class identifiers k is stored in one array, K, and that of the indexes r is stored in another, R. The blocks are grouped into superblocks of length s = ⌊log n⌋ blocks. Each superblock stores the rank up to its beginning, and a pointer to R where its indexes start. The size of R is upper bounded by nH₀(B), and the main space overhead comes from K, which uses ⌈n/t⌉ ⌈log(t + 1)⌉ bits.

To solve rank_b(B, i), the superblock of i is first computed, and its rank value up to its beginning is obtained. Second, the classes from the start of the superblock are scanned, and their k values are accumulated. The pointer to R is obtained in parallel by taking the pointer value from the start of the superblock and adding ⌈log (t choose k)⌉ bits for each class k which is processed. This scanning continues up to the block i belongs to, whose index is extracted from R, and its bits are recovered from the universal table.

The select_b(B, i) operation can be done by applying binary search in B on the index j so that rank_b(B, j) = i and rank_b(B, j − 1) = i − 1. Using the O(1) data structure of Jacobson or that of Raman et al. for rank implies an O(log n) time solution for select. As for the constant time solution for select, the bitmap B is partitioned into blocks, similar to the solution for the rank operation. For simplicity, let us assume that b = 1; the case b = 0 is dealt with symmetrically. In more detail, B is partitioned into blocks of two kinds, each containing exactly log² n 1s. The first kind are the blocks that are long enough to store all their 1-positions within sublinear space. These positions are stored explicitly using an array, in which the answer is read from the desired entry i. The second kind of blocks are the "short" blocks, of size O(log^c n), where c is a constant. Recording the 1-positions inside them requires only O(log log n) bits by repartitioning these blocks, and storing their relative positions. The remaining blocks are short enough to be handled in constant time using a universal table. González et al. [10] give a practical solution for rank and select data structures.
Okanohara and Sadakane [21] introduce four practical rank and select data structures, with different tradeoffs between memory storage and processing time. The difference between the methods is based on the treatment of sparse sets and dense sets. Although their methods do not always guarantee constant time, experimental results show that these data structures support fast query results and their sizes are close to the zero order entropy. Barbay et al. [1] propose a data structure that supports rank in time O(log log |Σ|) and select in constant time, using nH₀(T) + o(n)(H₀(T) + 1) bits. Navarro and Providel [20] present two data structures for rank and select that improve the space overheads of previous work, one using the bitmap in plain form and the other using the compressed form. In particular, they concentrate on improving the select operation, since it is less trivial than rank and requires the computation of both select₀ and select₁, unlike the symmetrical nature of rank. The memory storage improvement is achieved by replacing the universal tables of the implementation of [23] by on-the-fly generation of their cells. In addition, they combine the rank and select samplings instead of solving each operation separately, so that each operation uses its own sampling, possibly using also that of the other operation.
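For experimentation, the following simplified Python sketch (ours) captures the flavour of sampled rank and binary-search select; it is functionally equivalent to, but far less space-efficient than, the succinct structures cited above.

class RankSelect:
    def __init__(self, bits: str, s: int = 64):
        self.bits, self.s = bits, s
        self.samples = [0]                    # rank_1 of the prefix of length k*s
        for k in range(1, len(bits) // s + 1):
            self.samples.append(self.samples[-1] + bits[(k - 1) * s:k * s].count("1"))

    def rank1(self, i: int) -> int:
        """Number of 1s in bits[0..i] (inclusive)."""
        k = (i + 1) // self.s
        return self.samples[k] + self.bits[k * self.s:i + 1].count("1")

    def select1(self, i: int) -> int:
        """Position of the i-th 1 (1 <= i <= total number of 1s), by binary search."""
        lo, hi = 0, len(self.bits) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid) < i:
                lo = mid + 1
            else:
                hi = mid
        return lo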

2.2 Fibonacci Codes

The Fibonacci code is a universal variable length encoding of the integers based on the Fibonacci sequence rather than on powers of 2. A finite prefix of this infinite sequence can be used as a fixed alternative to Huffman codes, giving obviously less compression, but adding simplicity (there is no need to generate a new code every time), robustness and speed [8,16]. The particular property of the binary Fibonacci encoding is that it contains no adjacent 1s, so that the string 11 can act like a comma between codewords. More precisely, the codeword set consists of all the binary strings for which the substring 11 appears exactly once, at the left end of the string.

The connection to the Fibonacci sequence can be seen as follows: just as any integer k has a standard binary representation, that is, can be uniquely represented as a sum of powers of 2, k = Σ_{i≥0} b_i 2^i, with b_i ∈ {0, 1}, there is another possible binary representation based on Fibonacci numbers, k = Σ_{i≥2} f_i F(i), with f_i ∈ {0, 1}, where it is convenient to define the Fibonacci sequence here by

    F(0) = 0, F(1) = 1 and F(i) = F(i − 1) + F(i − 2) for i ≥ 2.    (1)

This Fibonacci representation will be unique if, when encoding an integer, one repeatedly tries to fit in the largest possible Fibonacci number. For example, the largest Fibonacci number fitting into 19 is 13, for the remainder 6 one can use the Fibonacci number 5, and the remainder 1 is a Fibonacci number itself. So one would represent 19 as 19 = 13 + 5 + 1, yielding the binary string 101001. Note that the bit positions correspond to F(i) for increasing values of i from right to left, just as for the standard binary representation, in which 19 = 16 + 2 + 1 would be represented by 10011.

Each such Fibonacci representation has a leading 1, so by preceding it with an additional 1, one gets a sequence of uniquely decipherable codewords. Decoding, however, would not be instantaneous, because the set lacks the prefix property. For example, a first attempt to start the parsing of the encoded string 1101111111110 by 110 11 11 11 11 would fail, because the remaining suffix 10 is not the prefix of any codeword. So only after having read 5 codewords in this case (and the example can obviously be extended) would one know that the correct parsing is 1101 11 11 11 110. To overcome this problem, the Fibonacci code defined in [8] simply reverses each of the codewords. The adjacent 1s are then at the right instead of at the left end of each codeword, yielding the prefix code {11, 011, 0011, 1011, 00011, 10011, 01011, 000011, 100011, 010011, 001011, 101011, 0000011, ...}.

Since the set of Fibonacci codewords is fixed in advance, and the codewords are assigned by non-increasing frequency of the elements, but otherwise independently from the exact probabilities, the compression performance of the code depends on how close the given probability distribution is to one for which the Fibonacci codeword lengths would be optimal. The lengths are 2, 3, 4, 4, 5, 5, 5, 6, ..., so the optimal (infinite) probability distribution would be (1/4, 1/8, 1/16, 1/16, 1/32, 1/32, 1/32, 1/64, ...). For any finite probability distribution, the compression by a prefix of the Fibonacci code will always be inferior to what can be achieved by a Huffman code.
For a typical distribution of English characters, the excess of Fibonacci versus Huffman encoding is about 17% [8], and may be less, around 9%, on much larger alphabets [16]. On the other hand, Fibonacci coding may be significantly better than other static codes such as Elias coding, End-tagged dense codes (ETDC) and (s,c)-dense codes (SCDC) [16].
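The construction of the codewords themselves can be sketched as follows (our illustration): compute the Zeckendorf representation of k over F(2), F(3), ..., write it with the least significant position first, and append a final 1, so that every codeword ends with the pair 11.

def fib_encode(k: int) -> str:
    """Fibonacci codeword of the positive integer k (the reversed code described above)."""
    fibs = [1, 2]                               # F(2), F(3), ...
    while fibs[-1] + fibs[-2] <= k:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    for f in reversed(fibs):                    # greedy: largest Fibonacci number first
        if f <= k:
            bits.append("1")
            k -= f
        else:
            bits.append("0")
    rep = "".join(bits).lstrip("0")             # Zeckendorf string, most significant first
    return rep[::-1] + "1"                      # reverse and append the final 1

# Examples: fib_encode(1) == "11", fib_encode(2) == "011", fib_encode(19) == "1001011".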

3 Random Access to Fibonacci Encoded Files

3.1 Using an Auxiliary Index

A trivial solution for gaining random access is to use an additional auxiliary index constructed in the following way:

1. Compress the input file, T, using a Fibonacci code, resulting in the file E(T) of size u.
2. Generate a bitmap B of size u so that B[i] = 1 if and only if E(T)[i] is the first bit of a codeword.
3. Construct a rank and select succinct data structure for B.

Recall that u = |E(T)|, n = |T| is the size of the uncompressed text, and Σ denotes the alphabet. In the suggested solution, the space used to accomplish constant time rank and select operations (excluding the encoded file) is u + B(u, |Σ| + u) + o(u) + O(log log |Σ|), where B(x, y) = ⌈log (y choose x)⌉ (the information theoretic lower bound in bits for storing a set of x elements from a universe of size y), using Raman et al.'s implementation [23].

A better approach would be to omit the bitmap B of the first implementation and rather embed the index into the Fibonacci encoded file. This can be accomplished by treating two consecutive 1 bits in E(T) as a single 1-bit in B, and other bits in E(T) as a 0 in B. The memory storage is therefore reduced to B(u, |Σ| + u) + o(u) + O(log log |Σ|). Even better solutions are presented below.
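The embedded index relies only on the fact that every codeword ends with the bit pair 11; the following small Python sketch (ours) recovers the codeword boundaries directly from the encoded bit string, which is exactly the information the bitmap B represents.

def codeword_starts(encoded: str):
    """Start positions of the codewords in a Fibonacci encoded bit string."""
    starts, i, start = [], 0, 0
    while i < len(encoded):
        if i > start and encoded[i] == "1" and encoded[i - 1] == "1":
            starts.append(start)          # the codeword occupies positions [start, i]
            i += 1
            start = i
        else:
            i += 1
    return starts

# Example (the COMPRESSORS encoding of the next subsection):
# enc = "01011 0011 10011 00011 011 1011 11 11 0011 011 11".replace(" ", "")
# codeword_starts(enc) yields 11 start positions, one per symbol of the text.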

3.2 Using Wavelet Trees

We adjust the Wavelet tree to Fibonacci codes in the following way. Given are an alphabet Σ and a text T = t₁t₂···t_n of size n, where t_i ∈ Σ. Let E_fib(T) = f(t₁)···f(t_n) be the encoding of T using the first |Σ| codewords of the Fibonacci code. The Wavelet tree is in fact a set of annotations to the nodes of the binary tree corresponding to the given prefix code. These annotations are bitmaps, which together form the encoded text, though the bits are reorganized in a different way to enable the random access. The exact definition of the stored bitmaps has been given above in the introduction.

Recall that the binary tree T_C corresponding to a prefix code C is defined as follows: we imagine that every edge pointing to a left child is labeled 0 and every edge pointing to a right child is labeled 1; each node v is associated with the bitstring obtained by concatenating the labels on the edges on the path from the root to v; finally, T_C is defined as the binary tree for which the set of bitstrings associated with its leaves is the code C. Figure 1 is the tree corresponding to the first 7 elements of the Fibonacci code. Since the bitmaps used by the Wavelet tree algorithms use the tree T_C as underlying structure, we shall refer to this tree as the Wavelet tree, for ease of discourse.


Figure 1. Fibonacci Wavelet Tree for the text T = COMPRESSORS.

The bitmaps in the nodes of the Wavelet tree can be stored as a single bit stream by concatenating them in the order of any predetermined top-down tree traversal, such as depth-first or breadth-first. No delimiters between the individual bitmaps are required, since we can restore the tree topology along with the bitmap lengths at each node once the size u of the text is given in the header of the file. We shall henceforth refer to the Wavelet tree built for a Fibonacci code as the Fibonacci Wavelet tree (FWT). They are related, but not quite identical, to the trees defined by Knuth [17] as Fibonacci trees.

Consider, for example, the text T = COMPRESSORS over an alphabet {C, M, P, E, O, R, S} of size 7, whose elements appear {1, 1, 1, 1, 2, 2, 3} times, respectively. The Fibonacci encoded file of 39 bits is the following binary string, in which spaces have been added for clarity:

    E_fib(T) = 01011 0011 10011 00011 011 1011 11 11 0011 011 11

The corresponding FWT, including the annotating bitmaps, is given in Figure 1.
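As an illustration (ours) of how the node bitmaps arise, the following sketch collects, for every code prefix, the next code bit of each symbol in text order; the codeword assignment used is the one that can be read off E_fib(T) above.

from collections import defaultdict

def build_wavelet_bitmaps(text, codeword):
    """codeword maps each symbol to its (prefix-free) code string; the result
    maps every proper code prefix (an internal tree node) to its bitmap."""
    bitmaps = defaultdict(str)
    for ch in text:
        cw = codeword[ch]
        for d in range(len(cw)):          # node reached after reading cw[:d]
            bitmaps[cw[:d]] += cw[d]
    return dict(bitmaps)

# Codewords as they appear in E_fib(COMPRESSORS) above:
code = {'S': '11', 'O': '0011', 'R': '011', 'C': '01011',
        'E': '1011', 'M': '10011', 'P': '00011'}
bitmaps = build_wavelet_bitmaps("COMPRESSORS", code)
# bitmaps[''] == "00100111001" is the root bitmap: the first bit of every codeword.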

The Wavelet tree for E_fib(T) is a succinct data structure for T, as it takes space asymptotically equal to the Fibonacci encoding of T, and it enables accessing any symbol t_i in time O(|f(t_i)|), where f(x) is the Fibonacci encoding of x. The algorithm for extracting t_i from an FWT rooted at v_root is given in Figure 2, using the function call extract(v_root, i). B_v denotes the bit vector belonging to vertex v of the Wavelet tree, and · denotes concatenation. Computing the new index in the following bit vector is done by the rank operation, given in lines 3.3 and 4.3. As the Fibonacci code is a universal one, the decoding of code in line 5 is done by a fixed lookup table.

extract(v_root, i)
1   code ← ε
2   while v is not a leaf
3       if B_v[i] = 0
3.1         v ← left(v)
3.2         code ← code · 0
3.3         i ← rank₀(B_v, i)
4       else
4.1         v ← right(v)
4.2         code ← code · 1
4.3         i ← rank₁(B_v, i)
5   return D(code)

Figure 2. Extracting t_i from a Fibonacci Wavelet Tree rooted at v_root.

We extend the definition of select_b(B, i), which was defined on bitmaps, to be defined on the text T for general alphabets, in a symmetric way. More precisely, we use the notation select_x(T, i) for returning the position of the ith occurrence of x in T. Computing select_x(T, i) is done in the opposite direction: we start from the leaf, ℓ, representing the Fibonacci codeword f(x) of x, and work our way up to the root. The formal algorithm is given in Figure 3. The running time for select_x(T, i) is, again, O(|f(x)|).

select_x(T, i)
1   ℓ ← leaf corresponding to f(x)
2   v ← father of ℓ
3   while v ≠ v_root
3.1     if ℓ is a left child of v
3.1.1       i ← index of the ith 0 in B_v
3.2     else  // ℓ is a right child of v
3.2.1       i ← index of the ith 1 in B_v
3.3     v ← father of v
4   return i

Figure 3. select the ith occurrence of x in T.

4 Enhanced Wavelet Trees for Fibonacci codes

In this section, we suggest pruning the Wavelet tree, so that the resulting pruned Wavelet tree still achieves efficient rank and select operations, and even improves the processing time. The proposed compressed data structure not only provides efficient random access capability, but also improves the compression performance as compared to the original Wavelet tree.

4.1 Pruning the tree

The idea is based on the property of the Fibonacci code that all codewords, except the first one, 11, terminate with the suffix 011. The binary tree corresponding to the Fibonacci code is therefore not complete, as can be seen, e.g., in the example in Figure 1, and the nodes corresponding to this suffix, at least for the lowest levels of the tree, are redundant. We can therefore eliminate all nodes which are single children of their parent nodes. The bitmaps corresponding to the remaining internal nodes of the pruned tree are the only information needed in order to achieve constant random access. A similar idea to this collapsing strategy is applied on suffix or position trees in order to attain an efficient compacted suffix trie. It has also been applied to Huffman trees [15], producing a compact tree for efficient use, such as compressed pattern matching [24]. Applying this strategy on the FWT of Figure 1 results in the pruned Fibonacci Wavelet tree given in Figure 4.


Figure 4. Pruned Fibonacci Wavelet Tree for the text T = COMPRESSORS.

The select_x(T, i) algorithm for selecting the ith occurrence of x in T is the same as in Figure 3, gaining faster processing time since the lengths of the longer codewords were shortened. However, the algorithm for extracting t_i from a pruned FWT requires minor adjustments for concatenating the pruned parts. The following lines should be added instead of line 5 in the algorithm of Figure 2.

5   if suffix of code = 0
5.1     code ← code · 11
6   if suffix of code ≠ 11
6.1     code ← code · 1
7   return D(code)

Figure 5. Extracting t_i from the pruned Fibonacci Wavelet Tree.
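The completion step can be written compactly as follows (our sketch of the adjustment shown in Figure 5).

def complete_pruned_code(code: str) -> str:
    """Restore the pruned suffix of a codeword read off the pruned FWT."""
    if code.endswith("0"):
        code += "11"
    if not code.endswith("11"):
        code += "1"
    return code

# Examples: complete_pruned_code("010") == "01011" (the codeword of C above),
# complete_pruned_code("001") == "0011", and "11" is returned unchanged.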

The FWT of an alphabet of finite size is well defined and fixed. Therefore, only the size of the alphabet is needed for recovering the topological structure of the tree, as opposed to Huffman Wavelet trees. Recall that the Wavelet tree for general prefix free codes is a reorganization of the bits of the underlying encoded file. The suggested pruned Fibonacci Wavelet tree only uses a partial set of the bits of the encoded file. The main savings of pruned FWTs as compared to the original FWTs of Section 3 stem from the fact that the bitmaps corresponding to the nodes are not all necessary for gaining the ability of direct access. The non-pruned nodes, therefore, are in one-to-one correspondence with the bits of the encoded Fibonacci file. The bold bits of Figure 6 correspond to those bits that should be encoded; the others can be removed when we use the pruned FWT.

T          C      O     M      P      R    E     S   S   O     R    S
E_fib(T)   01011  0011  10011  00011  011  1011  11  11  0011  011  11

Figure 6. The Bitmap Encoding

4.2 Analysis

We now turn to evaluating the number of nodes in the original and pruned FWTs, from which the compression savings can be derived. Two parameters have to be considered: the number of nodes in the trees, which relates to the storage overhead of applying the Wavelet trees, and the cumulative size of the bitmaps stored in them, which is the size of the compressed file. A certain codeword may appear several times in the compressed file, but will be recorded only once in the FWT. Since we are interested in asymptotic values, we shall restrict our discussion here to prefixes of the Fibonacci code corresponding to full levels; that is, since the number of codewords of length h + 1 is a Fibonacci number F_h [16], we assume that if the given tree is of depth h + 1, then all the F_h codewords of length h + 1 are in the alphabet. This restricts the size n of the alphabet to belong to the sequence 1, 2, 4, 7, 12, 20, 33, etc., or generally n ∈ {F_h − 1 | h ≥ 3}. We defer the more involved calculations for general n to the full paper.

Figure 7. Recursive definition of a Fibonacci Wavelet Tree of depth h.

There are two ways to obtain the FWT of height h + 1 from that of height h. The first is to consider the defining inductive process, as given in Figure 7. The left subtree of the root is the FWT of height h, while the right subtree of the root consists itself of a root, with a left subtree being the FWT of height h − 1, and the right subtree being a single node. Denoting by N_h the number of nodes in the FWT of height h, we then have

    N_{h+1} = N_h + N_{h−1} + 3.    (2)

The second way is by adding the paths corresponding to the F_h longest codewords (of length h + 1) to the tree for height h. This is done by referring to the nodes on level h − 2 which have a single child, and there are again exactly F_h such nodes. The single child of these nodes corresponds to the bit 1, and their parent nodes are extended by adding trailing outgoing paths corresponding to the terminating string 011, turning each of them into a node with two children. For example, the grey nodes in Figure 8 are the FWT of height h = 4. The three darker nodes are those on level 2 which are internal nodes with only one child. In the passage to the FWT of height h + 1 = 5, the bold edges and nodes (representing the suffix 011) are appended to these nodes. This yields the recursion

    N_{h+1} = N_h + 3F_h.    (3)

Applying eq. (3) repeatedly gives

    N_{h+1} = N_{h−1} + 3(F_{h−1} + F_h) = N_{h−2} + 3(F_{h−2} + F_{h−1} + F_h),

and in general after k stages we get that

    N_{h+1} = N_{h−k} + 3 (Σ_{i=h−k}^{h} F_i).

When substituting h − k by 2 we get that

    N_{h+1} = N_2 + 3 (Σ_{i=2}^{h} F_i).

By induction it is easy to show that

    Σ_{i=2}^{h} F_i = F_{h+2} − 2.

Since N_2 = 3, we get that

    N_{h+1} = 3 + 3(F_{h+2} − 2) = 3F_{h+2} − 3.    (4)

This is also consistent with our first derivation, since the basis of the induction is obviously the same, and assuming the truth of eq. (2) for values up to h, we get by inserting eq. (4) for N_h and N_{h−1} that

    N_{h+1} = (3F_{h+1} − 3) + (3F_h − 3) + 3 = 3F_{h+2} − 3.

The pruned FWT corresponding to the FWT of height h + 1 is of height h − 1 and is obtained by pruning all single child nodes of the FWT: for each of the F_h leaves of the lowest level h + 1, two nodes are saved, and for each of the F_{h−1} leaves on level h, only a single node is erased.

Figure 8. Extending a Fibonacci Wavelet Tree

Denoting by S_h the number of nodes in a pruned FWT of height h, we get

    S_{h−1} = N_{h+1} − 2F_h − F_{h−1}.    (5)

But

    2F_h + F_{h−1} = F_{h+1} + F_h = F_{h+2},

so substituting the value for N_{h+1} from eq. (4), we get

    S_{h−1} = 3F_{h+2} − 3 − F_{h+2} = 2F_{h+2} − 3.

The ratio of the sizes of the pruned to the original FWTs is therefore

    S_{h−1} / N_{h+1} = (2F_{h+2} − 3) / (3F_{h+2} − 3)  →  2/3,

when the size of the tree grows to infinity (h → ∞), so that about one third of the nodes will be saved. Figure 9 plots the number of nodes in both original and pruned FWTs as a function of the trees' heights.
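The recurrences and the limiting ratio are easy to check numerically; the following short Python script (ours) prints N_{h+1}, S_{h−1} and their ratio, which indeed approaches 2/3.

def fib(n: int) -> int:
    """F(n) with F(0) = 0, F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

for h in range(3, 16):
    n_nodes = 3 * fib(h + 2) - 3          # N_{h+1}, eq. (4)
    s_nodes = 2 * fib(h + 2) - 3          # S_{h-1}, the pruned FWT
    print(h, n_nodes, s_nodes, round(s_nodes / n_nodes, 4))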


Figure 9. Number of nodes in original and pruned FWT as a function of height.

5 Experimental Results

While the number of nodes saved in the pruning process could be analytically derived in the previous section, the number of bits to be saved in the compressed file will depend on the distribution of the different encoded elements. It might be hard to define a “typical” distribution of probabilities, so we decided to calculate the savings for the distribution of characters in several real-life languages.

File         n    height  FWT   pruned  Huffman
English      26   8       4.90  4.43    4.19
Finnish      29   8       4.76  4.44    4.04
French       26   8       4.53  4.14    4.00
German       30   8       4.70  4.37    4.15
Hebrew       30   8       4.82  4.42    4.29
Italian      26   8       4.70  4.32    4.00
Portuguese   26   8       4.67  4.28    4.01
Spanish      26   8       4.71  4.30    4.05
Russian      32   8       5.13  4.76    4.47
English-2    378  14      8.78  8.56    7.44
Hebrew-2     743  15      9.13  8.97    8.04

Table 1. Compression Performance

The distribution of the 26 letters and the 371 letter pairs of English was taken from Heaps [12]; the distribution of the 29 letters of Finnish is from Pesonen [22]; the distribution for French (26 letters) has been computed from the database of the Trésor de la Langue Française (TLF) of about 112 million words (for details on TLF, see [3]); for German, the distribution of 30 letters (including blank and Umlaute) is given in Bauer & Goos [2]; for Hebrew (30 letters, including two kinds of apostrophes and blank, and 735 bigrams), the distribution has been computed using the database of The Responsa Retrieval Project (RRP) [7] of about 40 million Hebrew and Aramaic words; the distribution for Italian, Portuguese and Spanish (26 letters each) can be found in Gaines [9], and for Russian (32 letters) in Herdan [13].

Note that the input of our tests consists of published probability distributions, not of actual texts. There are therefore no available texts that could be compressed. We can only calculate the average codeword lengths, from which the expected size of the compressed form of some typical natural language text can be extrapolated. To still get some idea of the compression performance, we add as comparison the average codeword length of an optimal Huffman code.

The results are summarized in Table 1, the last two lines corresponding to the bigrams. The second column shows the size n of the alphabet. The column entitled height is the height of the original FWT for the given distribution, FWT shows the average codeword length for the original FWT, and pruned the corresponding value for the pruned FWT. As can be seen, there is a 7–10% gain for the smaller alphabets, and 2–3% for the larger ones. The reduced savings can be explained by the fact that though a third of the nodes has been eliminated, they correspond to the leaves with the lowest probabilities, so the expected savings are lower. The last column, entitled Huffman, gives the average codeword length of an optimal Huffman code. We see that the increase, relative to the Huffman encoded files, of the size of the

FWT compressed files can roughly be reduced to half by the pruning technique. For example, for English, the FWT compressed file is 17% larger than the Huffman compressed one, but the pruned FWT reduces this excess to 6%. All the numbers have been calculated for the given sizes n of the alphabets, and have not been approximated by trees with full levels.

References

1. J. Barbay, T. Gagie, G. Navarro, and Y. Nekrich: Alphabet partitioning for compressed rank/select and applications. Algorithms and Computation, Lecture Notes in Computer Science, 6507 2010, pp. 315–326.
2. F. Bauer and G. Goos: Informatik, Eine einführende Übersicht, Erster Teil, Springer Verlag, Berlin, 1973.
3. A. Bookstein, S. T. Klein, and D. A. Ziff: A systematic approach to compressing a full text retrieval system. Information Processing & Management, 28 1992, pp. 795–806.
4. N. R. Brisaboa, A. Fariña, G. Ladra, and G. Navarro: Reorganizing compressed text, in Proc. of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2008, pp. 139–146.
5. N. R. Brisaboa, A. Fariña, G. Navarro, and M. F. Esteller: (S,C)-dense coding: an optimized compression code for natural language text databases, in Proc. Symposium on String Processing and Information Retrieval SPIRE'03, LNCS, vol. 2857, Springer Verlag, 2003, pp. 122–136.
6. N. R. Brisaboa, S. Ladra, and G. Navarro: DACs: Bringing direct access to variable length codes. Information Processing and Management, 49(1) 2013, pp. 392–404.
7. A. S. Fraenkel: All about the Responsa Retrieval Project you always wanted to know but were afraid to ask, expanded summary. Jurimetrics J., 16 1976, pp. 149–156.
8. A. S. Fraenkel and S. T. Klein: Robust universal complete codes for transmission and compression. Discrete Applied Mathematics, 64 1996, pp. 31–55.
9. H. F. Gaines: Cryptanalysis, a study of ciphers and their solution. Dover Publ. Inc., New York, 1956.
10. R. González, S. Grabowski, V. Mäkinen, and G. Navarro: Practical implementation of rank and select queries, in Poster Proceedings of 4th Workshop on Efficient and Experimental Algorithms (WEA05), Greece, 2005, pp. 27–38.
11. R. Grossi, A. Gupta, and J. S. Vitter: High-order entropy-compressed text indexes, in Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA), 2003, pp. 841–850.
12. H. S. Heaps: Information Retrieval: Computational and Theoretical Aspects, Academic Press, New York, 1978.
13. G. Herdan: The Advanced Theory of Language as Choice and Chance, Springer-Verlag, New York, 1966.
14. G. Jacobson: Space efficient static trees and graphs, in Proceedings of FOCS, 1989, pp. 549–554.
15. S. T. Klein: Skeleton trees for the efficient decoding of Huffman encoded texts, in the Special issue on Compression and Efficiency in Information Retrieval of the Kluwer Journal of Information Retrieval, 3 2000, pp. 7–23.
16. S. T. Klein and M. Kopel Ben-Nissan: On the usefulness of Fibonacci compression codes. The Computer Journal, 53 2010, pp. 701–716.
17. D. Knuth: The Art of Computer Programming, Sorting and Searching, vol. III, Addison-Wesley, Reading, MA, 1973.
18. M. Külekci: Enhanced variable-length codes: Improved compression with efficient random access, in Proc. Data Compression Conference DCC–2014, Snowbird, Utah, 2014, pp. 362–371.
19. D. Morrison: PATRICIA – practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4) 1968, pp. 514–534.
20. G. Navarro and E. Providel: Fast, small, simple rank/select on bitmaps. Experimental Algorithms, LNCS, 7276 2012, pp. 295–306.

21. D. Okanohara and K. Sadakane: Practical entropy-compressed rank/select dictionary, in Proc. ALENEX, SIAM, 2007.
22. J. Pesonen: Word inflexions and their letter and syllable structure in Finnish newspaper text. Research Rep. 6, Dept. of Special Education, University of Jyväskylä, Finland (in Finnish, with English summary), 1971.
23. R. Raman, V. Raman, and S. Rao Satti: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. Transactions on Algorithms (TALG), 2007, pp. 233–242.
24. D. Shapira and A. Daptardar: Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts. Information Processing and Management, IP & M, 42(2) 2006, pp. 429–439.
25. H. Williams and J. Zobel: Compressing integers for fast file access. The Computer Journal, 42(30) 1999, pp. 192–201.

Speeding up Compressed Matching with SBNDM2

Kerttu Pollari-Malmi, Jussi Rautio, and Jorma Tarhio⋆

Department of Computer Science and Engineering, Aalto University, P.O. Box 15400, FI-00076 Aalto, Finland
{kerttu.pollari-malmi, jorma.tarhio}@aalto.fi, [email protected]

Abstract. We consider a compression scheme for natural-language texts and genetic data. The method encodes characters with variable-length codewords of k-bit base symbols. We present a new search algorithm, based on the SBNDM2 algorithm, for this encoding. The results of practical experiments show that the method supersedes the previous comparable methods in search speed.

Keywords: compressed matching

1 Introduction

With the amount of information available constantly growing, fast information retrieval is a key concept in many on-line applications. The string matching problem is defined as follows: given a pattern P = p_1 ··· p_m and a text T = t_1 ··· t_n in an alphabet Σ, find all the occurrences of P in T. Various good solutions [7] have been presented for this problem. The most efficient solutions in practice are based on the Boyer-Moore algorithm [3] or on the BNDM algorithm [21].

The compressed matching problem [1] has gained much attention. In this problem, string matching is done in a compressed text without decompressing it. Researchers have proposed several methods [2,19,20,22] based on Huffman coding [15] or the Ziv-Lempel family [28,29]. Alternatively, indexing methods [11,26] can be used to speed up string matching. However, in this paper we concentrate on traditional compressed matching.

One presented idea is to encode whole words by using end-tagged dense code or (s,c)-dense code [4]. In both approaches, codewords consist of one or more bytes. In end-tagged dense code, the first bit of each byte specifies whether the byte is the last byte of the codeword or not. Thus, there are 128 possible byte values which end the codeword, called stoppers, and 128 possible values for continuers, which are not the last byte of the codeword. In (s,c)-dense code, the proportion between stoppers and continuers can be chosen more freely to minimize the size of the compressed text. The bytes whose value is less than c are continuers whereas the bytes whose value is at least c but less than s + c (the number of possible byte values) are stoppers. String matching in the compressed text is performed by compressing the pattern and searching for it by using the Boyer-Moore-Horspool algorithm [14]. Brisaboa et al. [5,6] also present dynamic versions of end-tagged dense code and (s,c)-dense code allowing fast searching. All these methods use word-based encoding.

Culpepper and Moffat [8,9] present another word-based compression approach where the value of the first byte of a codeword always specifies the length of the

⋆ Work supported by the Academy of Finland (grant 134287).


codeword. The rest of the bytes in the codeword can obtain any of the possible values. The compressed search can be performed in several different ways, for example using a Knuth-Morris-Pratt [17] or Boyer-Moore-Horspool type algorithm. However, the search algorithm requires that a shift never places the byte-level alignment between codeword boundaries.

The word-based compression methods are poor in the case of highly inflectional languages like Finnish. In this paper, we will concentrate on character-based compression. Shibata et al. [25] present a method called BM-BPE which finds text patterns faster in a compressed text than Agrep [27] finds the same patterns in an uncompressed text. Their search method is based on the Boyer-Moore algorithm and they employ a restricted version of byte pair encoding (BPE) [13], which replaces common pairs of characters with unused characters.

Maruyama et al. [18] present a compression method related to BPE using recursive pairing. However, in their method they select a digram A_iB_i (A_i and B_i are either symbols from the alphabet Σ or variables presented earlier in the compression process) that occurs most frequently after a_i ∈ Σ and replace the string a_iA_iB_i with the string a_iX where X is a new variable. This is performed for every a_i ∈ Σ. Because the symbol preceding a digram is taken into account, the variable X can be used to replace different digrams after different symbols. They also present an efficient pattern matching algorithm for compressed text. The search algorithm is Knuth-Morris-Pratt [17]. The automaton is modified so that it memorizes the symbol read previously and its Jump and Output functions are also defined for variables occurring in the compressed text.

We present a different character-based method which is faster than BM-BPE. In our method, characters are encoded as variable-length codewords, which consist of base symbols containing a fixed number of bits. Our encoding approach is a generalization of that of de Moura et al. [19], where bytes are used as base symbols for coding words. We present several variants of our encoding scheme and give two methods for string matching in the compressed text.

Earlier we have presented a search algorithm [23,24] based on Tuned Boyer-Moore [16], which is a variation of the Boyer-Moore-Horspool algorithm [14]. The shift function is based on several base symbols in order to enable longer jumps than the ordinary occurrence heuristic. In this paper, we present a new search algorithm based on SBNDM2 [10], which is a variation of BNDM, the Backward Nondeterministic DAWG Matching algorithm [21]. SBNDM2 is a fast bit-parallel algorithm, which recognizes factors of the pattern by simulating a nondeterministic automaton of the reversed pattern. Fredriksson and Nikitin [12] have earlier used BNDM to search for patterns in a text that has been compressed by their own algorithm. However, in their tests BNDM on compressed text was slower than the Boyer-Moore-Horspool algorithm on uncompressed text, while in our experiments our SBNDM based search algorithm was faster.

We present test results of four variations of our algorithms together with two reference algorithms for compressed texts as well as two other reference algorithms for uncompressed texts. We are interested only in search speed. The tests were run on DNA and on English and Finnish texts.
Our SBNDM2 based search method was the fastest among all the tested algorithms for English patterns of at least 9 characters.

The paper is organized as follows. Section 2 provides an enhanced presentation of our coding scheme [24]. Our two search algorithms are described in Section 3.

Then Section 4 reports the results of our practical experiments before the conclusions in Section 5.

2 Stopper encoding

2.1 Stoppers and continuers

We apply a semi-static encoding scheme called stopper encoding for characters, where the codes are based on frequencies of characters in the text to be compressed. The frequencies of characters are gathered in the first pass of the text before the encoding in the second pass. Alternatively, fixed frequencies based on the language and the type of the text may be used.

A codeword is a variable-length sequence of base symbols which are represented as k bits, where k is a parameter of our scheme. The different variants of our scheme are denoted as SEk, where SE stands for stopper encoding. Because the length of a code varies, we need a mechanism to recognize where a new one starts. A simple solution is to reserve some of the base symbols as stoppers which can only be used as the last base symbol of a code. All other base symbols are continuers which can be used anywhere but at the end of a code. If u_1 ··· u_j is a code, then u_1, ..., u_{j-1} are continuers and u_j is a stopper.

De Moura et al. [19] use a scheme related to our approach. They apply 8-bit base symbols to encode words where one bit is used to describe whether the base symbol is a stopper or a continuer. Thus they have 128 stoppers and 128 continuers. Brisaboa et al. [4] use the optimal number of stoppers.
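To make the stopper/continuer mechanism concrete, here is a minimal decoding sketch (illustrative only, not taken from the paper; it stores one k-bit base symbol per array entry, assumes the stoppers are the s numerically smallest symbol values as chosen later in Section 2.3, and omits the mapping from codewords back to characters):

  #include <stdio.h>

  #define S 14   /* example: number of stoppers, values 0..S-1 end a codeword */

  static int is_stopper(unsigned sym) { return sym < S; }

  /* syms[] holds one base symbol per entry; emit() is called once per codeword
     with the base symbols collected so far (codewords are only a few symbols long). */
  static void decode(const unsigned *syms, int n,
                     void (*emit)(const unsigned *code, int len)) {
      unsigned code[16];
      int len = 0;
      for (int i = 0; i < n; i++) {
          code[len++] = syms[i];
          if (is_stopper(syms[i])) {   /* stopper seen => codeword is complete */
              emit(code, len);
              len = 0;
          }
      }
  }

  static void print_code(const unsigned *code, int len) {
      for (int i = 0; i < len; i++) printf("%x", code[i]);
      printf(" ");
  }

  int main(void) {
      unsigned syms[] = { 0xe, 0x1, 0x5 };   /* decodes into codewords "e1" and "5" */
      decode(syms, 3, print_code);
      printf("\n");
      return 0;
  }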

2.2 Number of stoppers

It is an optimization problem to choose the number of stoppers to achieve the best compression ratio (the size of the compressed file divided by that of the original file). The optimal number of stoppers depends on the number of different characters and the frequencies of the characters.

If there are s stoppers of k bits, there are c = 2^k − s continuers of k bits. A codeword of l base symbols consists of l − 1 continuers and one stopper. Thus, there are s·c^{l−1} valid codewords of length l. Let Σ = {a_1, a_2, ..., a_q} be the alphabet, and let f(a_i) be the frequency of a_i. For simplicity, we assume that the alphabet is ordered according to the frequency, i.e. f(a_i) ≥ f(a_j) for i < j. Now a_1, ..., a_s are encoded with one base symbol, a_{s+1}, ..., a_{s+cs} are encoded with two base symbols, a_{s+cs+1}, ..., a_{s+cs+c²s} are encoded with three base symbols, and so on. So the number of characters that can be encoded with l or fewer base symbols is

    Σ_{j=1}^{l} c^{j−1} s = s(c^l − 1) / (c − 1).

From this we can calculate the number of k-bit base symbols l(a_i, k, s) needed to encode the character a_i in the case of s stoppers. It must be possible to encode at least the i first characters by using l(a_i, k, s) or fewer base symbols and thus

    i ≤ s(c^{l(a_i,k,s)} − 1) / (c − 1).

By solving the inequality, we obtain

    l(a_i, k, s) ≥ log_c(i(c − 1)/s + 1).

Because l(a_i, k, s) must be an integer and we want to choose the smallest possible integer, l(a_i, k, s) = ⌈log_c(i(c − 1)/s + 1)⌉. Then the compression ratio C is

    C = ( Σ_{i=1}^{q} f(a_i) l(a_i, k, s) k ) / 8.     (1)

Let us consider 4-bit base symbols as an example. Let us assume that all the characters are equally frequent. Table 1 shows the optimal numbers of stoppers for different sizes of the alphabet.

# of characters   Stoppers   Base symb./char at end
1–31              15         1.55
31–43             14         1.70
43–53             13         1.77
53–61             12         1.82
61–67             11         1.85
67–71             10         1.87
71–73              9         1.89
73–587             8         2.87

Table 1. Optimal stopper selection.

To find the optimal number of stoppers, the frequencies of characters need to be calculated first. After that, formula (1) can be used for all reasonable numbers of stoppers (8–16 for 4-bit base symbols), and the number producing the lowest compression ratio can be picked. Another possibility is presented by Brisaboa et al. in [4], where binary search is used to find a minimum compression ratio calculated by using formula (1). They consider the curve which presents the compression ratio as a function of the number of stoppers. At each step, it is checked whether the current point is in the decreasing or increasing part of the curve and the search moves towards the decreasing direction. Using this strategy demands that the curve has a unique minimum, but in practice natural language texts usually have that property.
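The exhaustive variant of this search is straightforward to implement. The sketch below is an illustration only (not the authors' code); the relative-frequency array, the toy distribution and the candidate range of s are assumptions. It evaluates formula (1) for every candidate number of stoppers and keeps the best one:

  #include <math.h>
  #include <stdio.h>

  /* codeword length of the i-th most frequent character (1-based rank i),
     with s stoppers and c = 2^k - s continuers, per the formula above       */
  static int cw_len(int i, int s, int c) {
      if (c == 1)                    /* only one continuer: s codewords per length */
          return (i + s - 1) / s;
      return (int)ceil(log(i * (double)(c - 1) / s + 1.0) / log((double)c));
  }

  /* compression ratio of formula (1): f[] are relative frequencies summing to 1,
     sorted in non-increasing order; q characters; k bits per base symbol         */
  static double ratio(const double *f, int q, int k, int s) {
      int c = (1 << k) - s;
      double bits = 0.0;
      for (int i = 0; i < q; i++)
          bits += f[i] * cw_len(i + 1, s, c) * k;
      return bits / 8.0;             /* original characters take 8 bits each */
  }

  int main(void) {
      /* toy distribution: 30 equally frequent characters, 4-bit base symbols */
      double f[30];
      for (int i = 0; i < 30; i++) f[i] = 1.0 / 30.0;

      int best_s = 0; double best = 1e9;
      for (int s = 1; s <= 15; s++) {        /* keep at least one continuer or none */
          double r = ratio(f, 30, 4, s);
          if (r < best) { best = r; best_s = s; }
      }
      printf("best number of stoppers: %d (ratio %.3f)\n", best_s, best);
      return 0;
  }

With 30 equally frequent characters the exhaustive scan picks 15 stoppers, which agrees with the first row of Table 1.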

2.3 Building the encoding table

After the number of stoppers (and with it, the compression ratio) has been decided, an encoding table can be created. The average search time is smaller if the distribution of base symbols is as uniform as possible. We use a heuristic algorithm, which produces results comparable with an optimal solution. With this algorithm, the encoding can be decided directly from the order of the frequencies of characters. Base symbols of two and four bits are the easiest cases.

In this encoding with 2^k different base symbols and s stoppers, the base symbols 0, 1, ..., s−1 will act as stoppers and the base symbols s, ..., 2^k−1 as continuers. At first, the s most common characters are assigned their own stopper base symbol, in the inverse order of frequency (the most common character receiving the base symbol s−1 and so on). After that, starting from the (s+1)-st most common character, each character is assigned a two-symbol codeword in numerical order (first the ones with the smallest possible continuer, in the order starting from the smallest stopper). Exactly the same ordering is used with codewords of three or more base symbols. We call this technique folding, since the order of the most frequent characters is reversed until a certain folding point. The aim of folding is to equalize the distribution of base symbols in order to speed up searching. The resulting encoding of the characters in the KJV Bible is presented in Table 2.

ch       code   ch  code   ch  code   ch  code
(space)  d      w   e1     D   e9     N   fe1
e        c      y   f1     T   f9     P   ff1
t        b      c   e2     R   ea     C   ee2
h        a      g   f2     G   fa     x   ef2
a        9      b   e3     J   eb     q   fe2
o        8      p   f3     S   fb     Z   ff2
n        7      ↵   e4     B   ec     Y   ee3
s        6      v   f4     ?   fc     K   ef3
i        5      .   e5     H   ed     !   fe3
r        4      k   f5     M   fd     U   ff3
d        3      A   e6     E   ee0    (   ee4
l        2      I   f6     j   ef0    )   ef4
u        1      :   e7     W   fe0    V   fe4
f        0      ;   f7     F   ff0    -   ff4
m        e0     L   e8     ’   ee1    Q   ee5
,        f0     O   f8     z   ef1

Table 2. Encoding table for the King James Bible, 4-bit version, 14 stoppers. The characters have been sorted by the order of their relative frequencies.

bsym  freq     bsym  freq     bsym  freq     bsym  freq
0     4.85%    4     4.58%    8     5.11%    c      8.48%
1     4.31%    5     4.62%    9     5.56%    d     16.21%
2     4.62%    6     4.38%    a     5.96%    e      8.03%
3     4.80%    7     4.99%    b     6.51%    f      7.01%

Table 3. Relative frequencies of base symbols in the encoded text (average: 6.25%).

Table 3 shows the relative frequencies of the base symbols in our example. The frequencies of the first stoppers depend on the frequencies of the most common characters. E.g., the frequency of d depends on the frequency of the space symbol.

2.4 Variant: code splitting

Consider a normal, uncompressed text file, composed of natural-language text. In the file, most of the information (entropy) in every character is located in the 5–6 lowest bits. A search algorithm could exploit this by having the fast loop ignore the highest bits, and take the lower bits of several characters at once. The idea is not usable as such, because the time required for shift and AND operations is too much for a fast loop to perform efficiently. However, it is the basis for a technique called code splitting.

Applying code splitting to SE6 means that each 6-bit base symbol is split into two parts: the 4-bit low part and the 2-bit high part, which are stored separately. The search algorithm first works with low parts only, trying to find a match in them. Only after such a match is found, the high parts are checked. Depending on the alphabet, it may also be applicable to split the codes into a 4-bit high part and a 2-bit low part and to start the search with the high part. We denote a code-split variant as SE6,4 or SE6,2, where the smaller number is the number of low bits.

Code splitting can also be applied to SE8. SE8,4 is a special case. The characters of the original text are split as such into two 4-bit arrays. This speeds up search but does not involve compression. Code splitting is useful in evenly-distributed alphabets where no compression could be gained, or as an end-coding method for some compression method, including de Moura [19].
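A minimal sketch of the splitting step (illustrative only; storing one 6-bit base symbol per input byte and leaving the low parts unpacked are simplifications made here, not the paper's file format):

  #include <stdint.h>
  #include <stdio.h>

  /* Split an array of 6-bit base symbols (one per byte, values 0..63) into
     a 4-bit low-part array and a 2-bit high-part array, as in SE6,4.       */
  static void split_se6(const uint8_t *sym, int n, uint8_t *low, uint8_t *high) {
      for (int i = 0; i < n; i++) {
          low[i]  = sym[i] & 0x0f;         /* 4 lowest bits   */
          high[i] = (sym[i] >> 4) & 0x03;  /* 2 highest bits  */
      }
  }

  int main(void) {
      uint8_t sym[] = { 0x2a, 0x17, 0x3f };   /* three 6-bit base symbols */
      uint8_t low[3], high[3];
      split_se6(sym, 3, low, high);
      for (int i = 0; i < 3; i++)
          printf("sym %02x -> low %x, high %x\n", sym[i], low[i], high[i]);
      /* the search runs over low[] first; high[] is consulted only to verify matches */
      return 0;
  }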

3 String matching

We start with a description of our old searching algorithm [24]. This will help to understand the details of the new algorithm, which is more complicated.

3.1 Boyer-Moore based algorithm

Let us consider a text with less than 16 different characters. With the SE4 scheme, any character of this text can be represented with a codeword consisting of a single four-bit base symbol. For effective use, two consecutive base symbols are stored into each byte of the encoded text, producing an exact compression ratio of 50%.

Instead of searching the 4-bit array directly, which would require expensive shift and AND operations, we use the 8-bit bytes. There are two 8-bit alignments of each 4-bit pattern: one starting from the beginning of an 8-bit character and the other one from the middle of it, as presented in Table 4. For example, consider the encoded pattern (in hexadecimal) 618e05. Occurrences of both 61-8e-05 and *6-18-e0-5*¹ clearly need to be reported as matches.

The final algorithm works exactly in the same way: by finding an occurrence of the encoded pattern consisting of longer codewords in the text. One additional constraint exists: the base symbol directly preceding the presumed match must be a stopper symbol. Otherwise, the meaning of the first base symbol is altered. It is not the beginning of a codeword, but it belongs to another codeword which begins before the presumed match. Thus, the match is not complete.

It is also noteworthy that because positions in the original text are not mapped one-to-one into positions in the encoded text, an occurrence in the encoded text cannot be converted to a position in the original text. However, we get the context of the encoded pattern, and with fast on-line decoding, it can be expanded.

The search algorithm, based on Tuned Boyer-Moore (TBM) [16], is called Boyer-Moore for Stopper Encoding, or BM-SE. It is a direct extension of TBM allowing multiple patterns and classes of characters. The algorithm contains a fast loop designed to quickly skip over most of the candidate positions, and a slow loop to verify possible matches produced by the fast loop.

¹ The asterisk * denotes a wild card, i.e. any base symbol.

... | 61 | 8e | 05 | ...
... | *6 | 18 | e0 | 5* | ...

Table 4. An example of BM-SE, pattern 618e05.

In each alignment, the last byte not containing wild cards — the pointer byte — has a skip length of 0. The previous bytes have their skip lengths relative to their distances from the pointer byte. The skip lengths for the example pattern 618e05 are presented in Table 5. The fast loop operates by reading a character from the text and obtaining its skip length. If the skip length is 0, the position is examined with a slow loop; otherwise the algorithm moves forward a number of positions equal to the skip length. A half-byte (one containing a wild card) at the end of the pattern is ignored for the fast loop, whereas one in the beginning sets all 16 variations to have the desired jump length. This works according to the bad-character, or occurrence, heuristic of Boyer and Moore [3].

encoded ch   skip
05, e0       0
8e, 18       1
61, *6       2
**           3

Table 5. Skip lengths. The symbol * is a wild card.
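For readers unfamiliar with the underlying occurrence heuristic, here is a plain Boyer-Moore-Horspool skip loop on bytes (a generic sketch, not the BM-SE implementation: a single pattern, no wild cards and no stopper check):

  #include <stdio.h>
  #include <string.h>

  /* Boyer-Moore-Horspool: skip[] maps the text byte aligned with the last
     pattern position to the length of the next safe shift.                */
  static void search(const unsigned char *t, size_t n,
                     const unsigned char *p, size_t m) {
      size_t skip[256];
      for (int c = 0; c < 256; c++) skip[c] = m;
      for (size_t i = 0; i + 1 < m; i++) skip[p[i]] = m - 1 - i;

      size_t pos = 0;
      while (pos + m <= n) {
          if (t[pos + m - 1] == p[m - 1] && memcmp(t + pos, p, m - 1) == 0)
              printf("match at %zu\n", pos);
          pos += skip[t[pos + m - 1]];      /* occurrence-heuristic shift */
      }
  }

  int main(void) {
      const unsigned char *text = (const unsigned char *)"abracadabra";
      search(text, 11, (const unsigned char *)"abra", 4);   /* matches at 0 and 7 */
      return 0;
  }

BM-SE extends this idea by building the skip values over both byte alignments of the encoded pattern, as in Table 5, and by verifying the stopper condition in the slow loop.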

String matching with 2-bit base symbols works basically in the same way as with 4-bit ones. There are 4 parallel alignments instead of 2, and the number of characters enabled by wild-cards can vary from 4 to 64. Compression and string matching with 2-bit base symbols and its various extension possibilities are presented in [23].

3.2 SBNDM2 matching

Our new search solution, called SBNDM2-SE, uses the SBNDM2 algorithm [10] for string matching in SE-coded texts. SBNDM2 is a simplified version of BNDM [21]. The idea of SBNDM2 is to maintain a state vector D, which has a one in each position where a factor of the pattern starts such that the factor is a suffix of the processed text string. For example, if the processed text string is “abc” and the pattern is “pabcabdabc”, the value of D is 0100000100. To make maintaining D possible, a table B which associates each character a with a bit mask expressing its locations in the pattern is precomputed before searching. When a new alignment is checked, at first two characters of the text are read before testing the state vector D. We use a precomputed table F such that for each pair of characters, F[c_i c_j] = B[c_i] & (B[c_j] ≪ 1).

At each alignment i we check whether D = F[t_i, t_{i+1}] is zero. If it is, it means that the characters t_i and t_{i+1} do not appear successively in the pattern and we can proceed fast to the next alignment m − 1 positions forward. If D is not zero, we continue by calculating D ← B[t_{i−1}] & (D ≪ 1) and decrementing the value of i by 1 until D becomes zero. This happens either when the pattern has been found or when the processed substring does not belong to the pattern. The former case can be distinguished from the latter by the number of characters processed before D became zero.
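The following self-contained C sketch shows plain SBNDM2 on an uncompressed text. It is our illustration of the idea of [10], not the authors' SBNDM2-SE code: the pattern length is assumed to be between 2 and 63 so that the state fits into one 64-bit word, and candidate windows are verified explicitly with memcmp instead of the sentinel tricks used in tuned implementations.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Bit (m-1-k) of B[c] is set iff p[k] == c, so bit m-1 corresponds to the
     first pattern character and bit 0 to the last one.                      */
  static void sbndm2(const unsigned char *t, size_t n,
                     const unsigned char *p, size_t m) {
      uint64_t B[256] = {0};
      for (size_t k = 0; k < m; k++)
          B[p[k]] |= 1ULL << (m - 1 - k);

      size_t i = m - 1;                          /* current window ends at t[i] */
      while (i < n) {
          uint64_t D = B[t[i - 1]] & (B[t[i]] << 1);   /* read the last 2-gram  */
          if (D == 0) {                          /* 2-gram not in the pattern   */
              i += m - 1;
              continue;
          }
          size_t pos = i;
          while (i >= 2) {                       /* extend the factor leftwards */
              uint64_t D2 = (D << 1) & B[t[i - 2]];
              if (D2 == 0) break;
              D = D2;
              i--;
          }
          i += m - 2;
          if (i <= pos) {                        /* candidate window ends at pos */
              size_t start = pos - m + 1;
              if (memcmp(t + start, p, m) == 0)
                  printf("match at %zu\n", start);
              i = pos + 1;
          }
      }
  }

  int main(void) {
      const char *text = "gcagagagatatacagagag";
      const char *pat  = "gagaga";               /* one occurrence, at position 3 */
      sbndm2((const unsigned char *)text, strlen(text),
             (const unsigned char *)pat, strlen(pat));
      return 0;
  }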

When we apply the SBNDM2 algorithm to an SE-encoded text, we start by encoding the pattern with the same algorithm as the text was encoded. If SE4 is used, the length of the base symbol is 4 bits. This means that the pattern may start either from the beginning of an 8-bit character or from the middle of it. To find the occurrence in both cases, we search for two patterns simultaneously: one which starts from the beginning of the encoded pattern and the other which starts from the second half-byte of the encoded pattern. In SBNDM2, it is possible to search for two equal-length patterns by searching for the pattern which has been constructed by concatenating the two patterns. For example, if the encoded pattern (in hexadecimal) is 618e0, we search for both pattern 61-8e and pattern 18-e0 simultaneously by searching for the pattern 61-8e-18-e0. An extra bit vector (denoted by f in the pseudocode of the algorithm) is used to prevent finding matches which consist of the end of the first half and the beginning of the second half, like 8e-18 in the example case. If SBNDM2 finds the first or the second half of the longer pattern, it is checked whether this is a real occurrence of the pattern by matching the bytes of the encoded pattern and the bytes of the text one by one. It is also checked that the first or last half-byte is found in the text.

If the length of the encoded pattern is even, for example in the case of the pattern 618e05, the long pattern is constructed by concatenating two strings, where the first is formed by dropping the last byte of the encoded string and the latter by dropping the first and last half-byte of the encoded string. In the case of the example string, the pattern 61-8e-18-e0 would be searched for by SBNDM2. Again, if the first or the second half of the longer pattern is found, the match is checked by comparing the bytes of the text and the pattern one by one. The pseudocode is shown as Algorithm 1.

If SE8,4 encoding is used, the SBNDM2 algorithm is applied only to the file containing the low part (the 4 lowest bits) of each encoded byte. The search is performed similarly to the search in the case of SE4. If an occurrence is found, the high bits are checked one by one.
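A small sketch of how the doubled pattern could be built from the half-byte (nibble) sequence of the encoded pattern, following the rules just described (illustrative code, not the authors' implementation; the nibble-array representation is an assumption):

  #include <stdio.h>

  /* Build the doubled search pattern (as bytes) from the nibble sequence of the
     encoded pattern: first half = the byte alignment starting at nibble 0,
     second half = the byte alignment starting at nibble 1, both truncated so
     that only full bytes without wild cards remain.                            */
  static int build_p2(const unsigned char *nib, int h, unsigned char *p2) {
      int full = (h % 2 == 0) ? (h - 2) / 2 : (h - 1) / 2;
      for (int i = 0; i < full; i++)            /* alignment starting at nibble 0 */
          p2[i] = (nib[2*i] << 4) | nib[2*i + 1];
      for (int i = 0; i < full; i++)            /* alignment starting at nibble 1 */
          p2[full + i] = (nib[2*i + 1] << 4) | nib[2*i + 2];
      return 2 * full;                          /* length of the doubled pattern  */
  }

  int main(void) {
      unsigned char nib[] = { 0x6, 0x1, 0x8, 0xe, 0x0, 0x5 };  /* pattern 618e05 */
      unsigned char p2[8];
      int len = build_p2(nib, 6, p2);
      for (int i = 0; i < len; i++) printf("%02x ", p2[i]);    /* prints 61 8e 18 e0 */
      printf("\n");
      return 0;
  }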

4 Experiments

We tested the stopper encoding schemes against other similar algorithms and each other. As reference methods for uncompressed texts, we use TBM (uf fwd md2 from Hume and Sunday's widely known test bench [16]) and SBNDM2 [10]. As reference methods for compressed texts, we use BM-BPE [25] and the Knuth-Morris-Pratt search algorithm on recursive-pairing compressed text developed by Maruyama et al. [18]. Their compression algorithm is denoted by BPX and their search algorithm for compressed text by KMP-BPX. The implementation is by courtesy of Maruyama. For BM-BPE, we used the recommended version with a maximum of 3 original characters per encoded character. The implementation is by courtesy of Takeda.

As to our own variants, we tested TBM and SBNDM2 based algorithms for SE4 and SE8,4 encoded texts (denoted by BM-SE4, BM-SE8,4, SBNDM2-SE4, and SBNDM2-SE8,4, respectively). SE4 is a representative of our compression scheme, and SE8,4 represents code splitting. Note that SE8,4 offers no compression at all.

The most important properties of these algorithms are search speed and compression ratio, in that order. In this experiment, only these properties are examined. Encoding and decoding speeds were not considered. In search, we focus on pattern lengths of 5, 10, 20, and 30. In the case of English and Finnish, the pattern length of 30 was not used, because some of the encoded patterns were too long for the 32-bit

Algorithm 1 SBNDM2 for SE4-encoded text. (P2 = p_1 p_2 ··· p_{2m} is a string containing both the alignments of the encoded pattern, T = t_1 t_2 ··· t_n is the encoded text.)

/* Preprocessing */

 1: for all c ∈ Σ′ do B[c] ← 0          /* c is a byte in the compressed text. */
 2: for all c_i, c_j ∈ Σ′ do F[c_i c_j] ← 0
 3: for j ← 1 to 2m do
 4:     B[p_j] ← B[p_j] | (1 << (2m − j))
 5: for i ← 1 to 2m do
 6:     for j ← 1 to 2m do
 7:         F[p_i p_j] ← B[p_i] & (B[p_j] << 1)
 8: f ← a bit vector containing 0s in positions 0 (the least significant bit) and m and otherwise 1s.
/* Searching */
 9: j ← m
10: while (true) do
11:     while (D ← F[t_{j−1} t_j]) = 0 do
12:         j ← j + m − 1
13:     pos ← j
14:     while (D ← (D << 1) & f & B[t_{j−2}]) ≠ 0 do
15:         j ← j − 1
16:     j ← j + m − 2
17:     if j ≤ pos then
18:         if j > n then
19:             End of text reached, exit.
20:         A possible occurrence ending at pos found, check
21:         by matching the bytes one by one.
22:         j ← pos + 1

bitvector used by SBNDM2-SE4. (Because SBNDM2-SE4 searches for two patterns at the same time, one which starts from the beginning of the encoded pattern and the other starting from the second half-byte of the encoded pattern, the maximum length of the encoded pattern is 34 half-bytes when the 32-bit bitvector is used.) After the preliminary experiments, we added tests for pattern lengths of 6, 7, 8 and 9 for English texts and for pattern lengths of 11, 12, 13, 14 and 15 for Finnish texts to find out where SBNDM2-SE4 becomes faster than SBNDM2.

As a DNA text, we used a 5 MB long part of the DNA of the fruit fly. As an English text, we used the KJV Bible such that the beginning of the text was concatenated to the end to lengthen the text to 5 MB. As a Finnish text, we used the Finnish Bible (1938). Because the original size of the Finnish text was about 4 MB, we concatenated the beginning of the text to the end to lengthen the text to 5 MB. The compression ratios for various encoding algorithms are presented in Table 6. The ratios are given for the original (not extended) text files. For the compression of natural-language texts, the BPX algorithm is better and BPE slightly better than SE4. This is true also for DNA text, because SE4 always uses at least four bits for each symbol.

All experiments were run on a 2.40 GHz Lenovo ThinkPad T400s laptop with Debian Linux and the Intel Core 2 Duo CPU P9400 2.40 GHz processor. The computer had 3851 megabytes of main memory. There were no other users using the test computer at the same time. The Linux function sched_setaffinity was used to bind the process to only one core. All the programs were compiled using gcc [version 4.6.3] with the optimization flag -O3 and the flag -m64 to produce 64 bit code. The

          KJV Bible   Finnish Bible   DNA
BPX       28.0%       32.6%           27.8%
BPE       51.0%       52.1%           34.0%
SE8,4     100.0%      100.0%          100.0%
SE4       58.8%       58.2%           50.0%

Table 6. Compression ratios.

programs were modified to measure their own execution time by using the times() function similarly to Hume and Sunday's test bench. The clocked time includes everything except program argument processing and reading the file containing the encoded text from disk. In each test, the set of 200 patterns was searched for. The patterns were randomly picked from the corresponding text. The searches were performed sequentially one pattern at a time. The preprocessing was repeated 100 times and the searching 500 times for each pattern. The average times for the whole set are presented in Tables 7, 8, and 9, and graphically in Figures 1, 2, and 3.
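For reference, a minimal example of this timing approach with the POSIX times() call (illustrative only; the measured routine and the repetition count are placeholders, not the paper's harness):

  #include <stdio.h>
  #include <sys/times.h>
  #include <unistd.h>

  /* placeholder for the routine whose running time is measured */
  static void work_under_test(void) {
      volatile long s = 0;
      for (long i = 0; i < 1000000; i++) s += i;
  }

  int main(void) {
      struct tms before, after;
      long ticks_per_sec = sysconf(_SC_CLK_TCK);

      times(&before);
      for (int rep = 0; rep < 500; rep++)   /* repeat to accumulate measurable CPU time */
          work_under_test();
      times(&after);

      double user_ms = 1000.0 * (after.tms_utime - before.tms_utime) / ticks_per_sec;
      printf("user CPU time: %.1f ms\n", user_ms);
      return 0;
  }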

For short DNA patterns, KMP-BPX and SBNDM2-SE4 were the fastest. For longer DNA patterns, all our algorithms were clearly faster than KMP-BPX. SBNDM2-SE4 was the fastest for longer patterns.

For short English patterns, SBNDM2 was clearly faster than our algorithms, which were comparable to KMP-BPX. For longer patterns, our algorithms outperformed KMP-BPX and BM-BPE, but mostly they were comparable to SBNDM2. SBNDM2-SE4 was faster than SBNDM2 for longer patterns. To be more exact, SBNDM2-SE4 was still slower than SBNDM2 for the pattern length 8, but the fastest among the tested methods for the pattern length 9.

The test results for Finnish patterns were rather similar to the results for English patterns, but the search times of SBNDM2-SE4 and SBNDM2-SE8,4 were closer to each other. For some pattern lengths, SBNDM2-SE4 was slightly faster than SBNDM2-SE8,4; for some other pattern lengths it was the opposite. The smallest pattern length where SBNDM2-SE4 and SBNDM2-SE8,4 were faster than SBNDM2 and the fastest among the tested methods was 11. Interestingly, most algorithms were a bit faster with Finnish data than with English data.

algorithm ↓ \ pattern length →     5     10     20     30
TBM                             2336   1918   1779   1733
SBNDM2                          1609   1014    582    407
BM-BPE                          1387    728    424    336
KMP-BPX                          969    951    952    957
BM-SE4                          1548    751    491    416
BM-SE8,4                        1562    766    501    425
SBNDM2-SE4                       974    397    210    169
SBNDM2-SE8,4                    1067    475    277    235

Table 7. Search times of DNA patterns in milliseconds, text size 5 MB. The pattern set contained 200 patterns.

algorithm ↓ \ pattern length →     5     6     7     8     9    10    20
TBM                             1072   939   831   756   706   668   465
SBNDM2                           542   507   448   399   397   370   295
BM-BPE                          1275  1250   923   871   860   701   431
KMP-BPX                          960   959   960   959   960   962   966
BM-SE4                           925   749   626   543   489   440   239
BM-SE8,4                        1040   719   705   522   532   428   240
SBNDM2-SE4                       951   726   501   418   354   308   156
SBNDM2-SE8,4                     912   949   505   489   362   350   169

Table 8. Search times of English patterns in milliseconds, text size 5 MB. The pattern set contained 200 patterns.

algorithm ↓ \ pattern length →     5    10    11    12    13    14    15    20
TBM                             1007   630   593   567   540   527   504   439
SBNDM2                           473   308   295   286   277   272   260   238
BM-BPE                          1240   670   657   645   549   539   534   406
KMP-BPX                         1118  1123  1126  1125  1127  1127  1126  1129
BM-SE4                           914   428   389   360   335   314   289   226
BM-SE8,4                        1000   403   414   345   348   298   301   222
SBNDM2-SE4                       969   311   266   240   221   205   189   143
SBNDM2-SE8,4                     907   327   254   254   210   212   182   148

Table 9. Search times of Finnish patterns in milliseconds, text size 5 MB. The pattern set contained 200 patterns.

Figure 1. Search times of DNA patterns in milliseconds.

Figure 2. Search times of English patterns in milliseconds.

Figure 3. Search times of Finnish patterns in milliseconds.

5 Concluding remarks

Stopper encoding is a semi-static character-based compression method enabling fast searches. It is useful in applications where on-line updates and on-line decoding are needed. Stopper encoding resembles the semi-static Huffman encoding [15] and the whole-word de Moura encoding [19]. Every character of the original text is encoded with a variable-length codeword, which consists of fixed-length base symbols. The base symbols can have 2, 4, 6, or 8 bits. Base symbols of 6 or 8 bits allow code splitting to be applied, dividing the bits of each base symbol into two parts, which are stored, respectively, into two arrays in order to speed up searching.

We presented a new search algorithm for stopper encoding, based on the SBNDM2 algorithm [10]. Stopper encoding is restricted neither to exact matching nor to this search algorithm; it can be applied to string matching problems of other types as well. We tested our algorithm experimentally. The running time of the search algorithm was compared with two reference algorithms for compressed texts as well as two other reference algorithms for uncompressed texts. Our SBNDM2 based search method SBNDM2-SE4 was the fastest among all the tested algorithms for English patterns of at least 9 characters, and either SBNDM2-SE4 or SBNDM2-SE8,4 was the fastest for Finnish patterns of at least 11 characters. In particular, SBNDM2-SE4 was faster than the standard search algorithms on uncompressed texts for these patterns.

References

1. A. Amir and G. Benson: Efficient two-dimensional compressed matching, in Proc. of DCC, IEEE, 1992, pp. 279–288.
2. A. Amir, G. Benson, and M. Farach: Let sleeping files lie: Pattern matching in Z-compressed files. J. Comput. Syst. Sci., 52(2) 1996, pp. 299–307.
3. R. S. Boyer and J. S. Moore: A fast string searching algorithm. Commun. ACM, 20(10) 1977, pp. 762–772.
4. N. R. Brisaboa, A. Fariña, G. Navarro, and J. R. Paramá: Lightweight natural language text compression. Inf. Retr., 10(1) 2007, pp. 1–33.
5. N. R. Brisaboa, A. Fariña, G. Navarro, and J. R. Paramá: New adaptive compressors for natural language text. Softw: Pract. Exper., 38(13) 2008, pp. 1429–1450.
6. N. R. Brisaboa, A. Fariña, G. Navarro, and J. R. Paramá: Dynamic lightweight text compression. ACM Trans. Inf. Syst., 28(3) 2010.
7. M. Crochemore and W. Rytter: Jewels of Stringology, World Scientific, 2002.
8. J. S. Culpepper and A. Moffat: Enhanced byte codes with restricted prefix properties, in Proc. SPIRE, vol. 3772 of LNCS, Springer, 2005, pp. 1–12.
9. J. S. Culpepper and A. Moffat: Phrase-based pattern matching in compressed text, in Proc. SPIRE, vol. 4209 of LNCS, Springer, 2006, pp. 337–345.
10. B. Ďurian, J. Holub, H. Peltola, and J. Tarhio: Improving practical exact string matching. Inf. Process. Lett., 110(4) 2010, pp. 148–152.
11. P. Ferragina and G. Manzini: An experimental study of an opportunistic index, in Proc. SODA, ACM/SIAM, 2001, pp. 269–278.
12. K. Fredriksson and F. Nikitin: Simple compression code supporting random access and fast string matching, in Proc. WEA, vol. 4525 of LNCS, Springer, 2007, pp. 203–216.
13. P. Gage: A new algorithm for data compression. C Users J., 12(2) Feb. 1994, pp. 23–38.
14. R. N. Horspool: Practical fast searching in strings. Softw: Pract. Exper., 10(6) 1980, pp. 501–506.
15. D. Huffman: A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9) Sept 1952, pp. 1098–1101.
16. A. Hume and D. Sunday: Fast string searching. Softw: Pract. Exper., 21(11) 1991, pp. 1221–1248.
17. D. E. Knuth, J. H. Morris Jr., and V. R. Pratt: Fast pattern matching in strings. SIAM J. Comput., 6(2) 1977, pp. 323–350.
18. S. Maruyama, Y. Tanaka, H. Sakamoto, and M. Takeda: Context-sensitive grammar transform: Compression and pattern matching, in Proc. SPIRE, vol. 5280 of LNCS, Springer, 2008, pp. 27–38.
19. E. S. de Moura, G. Navarro, N. Ziviani, and R. A. Baeza-Yates: Fast and flexible word searching on compressed text. ACM Trans. Inf. Syst., 18(2) 2000, pp. 113–139.
20. G. Navarro, T. Kida, M. Takeda, A. Shinohara, and S. Arikawa: Faster approximate string matching over compressed text, in DCC, IEEE, 2001, pp. 459–468.
21. G. Navarro and M. Raffinot: Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics, 5 2000, p. 4.

22. G. Navarro and J. Tarhio: Boyer-Moore string matching over Ziv-Lempel compressed text, in Proc. CPM, vol. 1848 of LNCS, Springer, 2000, pp. 166–180.
23. J. Rautio: Context-dependent stopper encoding, in Proc. Stringology, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, 2005, pp. 143–152.
24. J. Rautio, J. Tanninen, and J. Tarhio: String matching with stopper encoding and code splitting, in Proc. CPM, vol. 2373 of LNCS, Springer, 2002, pp. 42–52.
25. Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa: A Boyer-Moore type algorithm for compressed pattern matching, in Proc. CPM, vol. 1848 of LNCS, Springer, 2000, pp. 181–194.
26. I. H. Witten, A. Moffat, and T. C. Bell: Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition, Morgan Kaufmann, 1999.
27. S. Wu and U. Manber: Agrep – a fast approximate pattern-matching tool, in Proc. USENIX Technical Conference, Winter 1992, pp. 153–162.
28. J. Ziv and A. Lempel: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3) 1977, pp. 337–343.
29. J. Ziv and A. Lempel: Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5) 1978, pp. 530–536.

Threshold Approximate Matching in Grammar-Compressed Strings

Alexander Tiskin⋆

Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK [email protected]

Abstract. A grammar-compressed (GC) string is a string generated by a context-free grammar. This compression model captures many practical applications, and includes LZ78 and LZW compression as a special case. We give an efficient algorithm for threshold approximate matching on a GC-text against a plain pattern. Our algorithm improves on existing algorithms whenever the pattern is sufficiently long. The algorithm employs the technique of fast unit-Monge matrix distance multiplication, as well as a new technique for implicit unit-Monge matrix searching, which we believe to be of independent interest.

1 Introduction

String compression is a standard approach to dealing with massive data sets. From an algorithmic viewpoint, it is natural to ask whether compressed strings can be processed efficiently without decompression. For a recent survey on the topic, see Lohrey [14]. Efficient algorithms for compressed strings can also be applied to achieve speedup over ordinary string processing algorithms for plain strings that are highly compressible. We consider the following general model of compression.

Definition 1. Let t be a string of length n (typically large). String t will be called a grammar-compressed string (GC-string), if it is generated by a context-free grammar, also called a straight-line program (SLP). An SLP of length n̄, n̄ ≤ n, is a sequence of n̄ statements. A statement numbered k, 1 ≤ k ≤ n̄, has one of the following forms: t_k = α, where α is an alphabet character, or t_k = t_i t_j, where 1 ≤ i, j < k.

⋆ Research supported by the Centre for Discrete Mathematics and its Applications (DIMAP), University of Warwick, EPSRC award EP/D063191/1.


given a threshold score h, asks for all substrings of text t that have alignment score ≥ h against pattern p.

The substrings asked for by Definition 2 will be called matching substrings. The precise definition of “alignment score with weights” will be given in Subsection 2.5. An important special case arises when the unweighted LCS score is chosen as the alignment score, and the pattern length m is chosen as the threshold h. This special case is known as the local subsequence recognition problem.

Approximate pattern matching on compressed text has been studied by Kärkkäinen et al. [11]. For a GC-text of length n̄, an uncompressed pattern of length m, and an edit distance threshold k, the (suitably generalised) algorithm of [11] solves the threshold approximate matching problem in time O(m n̄ k² + output). In the special case of LZ78 or LZW compression, the running time is reduced to O(m n̄ k + output). Bille et al. [5] gave an efficient general scheme for adapting an arbitrary threshold approximate matching algorithm to work on a GC-text. In particular, when the algorithms by Landau and Vishkin [12] and by Cole and Hariharan [8] are each plugged into their scheme, the resulting algorithm runs in time O(n̄ · min(mk, k⁴ + m) + n̄² + output). In the special case of LZ78 or LZW compression, Bille et al. [4] show that the running time can be reduced to O(n̄ · min(mk, k⁴ + m) + output).

For the special case of the local subsequence recognition problem on a GC-text, Cégielski et al. [6] gave an algorithm running in time O(m² log m · n̄ + output). In [21], we improved the running time to O(m log m · n̄ + output). Recently, Yamamoto et al. [23] improved the running time still further to O(m n̄ + output).

In this paper, we consider the threshold approximate matching problem on a GC-text against a plain pattern. We give an algorithm running in time O(m log m · n̄ + output), which improves on existing algorithms whenever the pattern is sufficiently long. The algorithm employs the technique of fast unit-Monge matrix distance multiplication [16], as well as a new technique for implicit unit-Monge matrix searching (Lemma 12).

This paper is a sequel to [21]; we encourage the reader to refer there for a warm-up to the result and the techniques presented in the current paper. Due to space constraints, some proofs and examples are omitted. They can be found in [17].

2 General techniques

We recall the framework developed in [18,19,20,16,21], and fully presented in [17].

2.1 Preliminaries

For indices, we will use either integers {..., −2, −1, 0, 1, 2, ...}, or half-integers {..., −5/2, −3/2, −1/2, 1/2, 3/2, 5/2, ...}. For ease of reading, half-integer variables will be indicated by hats (e.g. î, ĵ). Ordinary variable names (e.g. i, j, with possible subscripts or superscripts) will normally denote integer variables, but can sometimes denote a variable that may be either integer or half-integer.

It will be convenient to denote i⁻ = i − 1/2, i⁺ = i + 1/2 for any integer or half-integer i. The set of all half-integers can now be written as {..., (−3)⁺, (−2)⁺, (−1)⁺, 0⁺, 1⁺, 2⁺, ...}. We denote integer and half-integer intervals by [i : j] = {i, i+1, ..., j−1, j}, ⟨i : j⟩ = {i⁺, i + 3/2, ..., j − 3/2, j⁻}. In both cases, the interval is defined by integer endpoints.

Given two index ranges I, J, it will be convenient to denote their Cartesian product by (I | J). We extend this notation to Cartesian products of intervals:

    [i_0 : i_1 | j_0 : j_1] = ([i_0 : i_1] | [j_0 : j_1])        ⟨i_0 : i_1 | j_0 : j_1⟩ = (⟨i_0 : i_1⟩ | ⟨j_0 : j_1⟩)

Given index ranges I, J, a vector over I is indexed by i ∈ I, and a matrix over (I | J) is indexed by i ∈ I, j ∈ J. We will use the parenthesis notation for indexing matrices, e.g. A(i, j). We will also use straightforward notation for selecting subvectors and submatrices: for example, given a matrix A over [0 : n | 0 : n], we denote by A[i_0 : i_1 | j_0 : j_1] the submatrix defined by the given sub-intervals. A star ∗ will indicate that for a particular index, its whole range is being used, e.g. A[∗ | j_0 : j_1] = A[0 : n | j_0 : j_1]. In particular, A(∗, j) and A(i, ∗) will denote a full matrix column and row, respectively.

We recall the following definitions from [21,17]. We define two natural strict partial orders on points, called ≪-dominance and ≷-dominance:

    (i_0, j_0) ≪ (i_1, j_1)  if  i_0 < i_1 and j_0 < j_1
    (i_0, j_0) ≷ (i_1, j_1)  if  i_0 > i_1 and j_0 < j_1

When visualising points, we will deviate from the standard Cartesian visualisation of the coordinate axes, and will use instead the matrix indexing convention: the first coordinate in a pair increases downwards, and the second coordinate rightwards. Hence, ≪- and ≷-dominance correspond respectively to the “above-left” and “below-left” partial orders. The latter order corresponds to the standard visual convention for dominance in computational geometry.

Definition 3. Let D be a matrix over ⟨i_0 : i_1 | j_0 : j_1⟩. Its distribution matrix D^Σ over [i_0 : i_1 | j_0 : j_1] is defined by D^Σ(i, j) = Σ_{î ∈ ⟨i : i_1⟩, ĵ ∈ ⟨j_0 : j⟩} D(î, ĵ) for all i ∈ [i_0 : i_1], j ∈ [j_0 : j_1].

Definition 4. Let A be a matrix over [i_0 : i_1 | j_0 : j_1]. Its density matrix A^□ over ⟨i_0 : i_1 | j_0 : j_1⟩ is defined by A^□(î, ĵ) = A(î⁺, ĵ⁻) − A(î⁻, ĵ⁻) − A(î⁺, ĵ⁺) + A(î⁻, ĵ⁺) for all î ∈ ⟨i_0 : i_1⟩, ĵ ∈ ⟨j_0 : j_1⟩.
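A direct translation of Definition 3 into code (an illustration only; the dense array representation, the 0-based mapping of the half-integer grid, and the quadratic-per-entry computation are simplifying assumptions, not the efficient implicit representation used later): entry dist[i][j] holds D^Σ(i, j), i.e. the sum of the entries of D strictly below row index i and strictly to the left of column index j.

  #include <stdio.h>

  #define N 4   /* D is an N x N matrix over half-integer indices <0:N | 0:N> */

  /* dist[i][j] = sum of D[r][c] over r >= i and c < j, i.e. D^Sigma(i, j),
     with the half-integer point (r + 1/2, c + 1/2) stored in array cell [r][c]. */
  static void distribution(const int D[N][N], int dist[N + 1][N + 1]) {
      for (int i = 0; i <= N; i++)
          for (int j = 0; j <= N; j++) {
              int s = 0;
              for (int r = i; r < N; r++)
                  for (int c = 0; c < j; c++)
                      s += D[r][c];
              dist[i][j] = s;
          }
  }

  int main(void) {
      /* a 4x4 permutation matrix as an example input */
      int D[N][N] = { {0,1,0,0}, {0,0,0,1}, {1,0,0,0}, {0,0,1,0} };
      int dist[N + 1][N + 1];
      distribution(D, dist);
      for (int i = 0; i <= N; i++) {
          for (int j = 0; j <= N; j++) printf("%2d ", dist[i][j]);
          printf("\n");
      }
      return 0;
  }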

Definition 5. Matrix A over [i_0 : i_1 | j_0 : j_1] will be called simple, if A(i_1, j) = A(i, j_0) = 0 for all i, j. Equivalently, A is simple if A^□Σ = A.

Definition 6. Matrix A is called totally monotone, if A(i, j) > A(i, j′) ⇒ A(i′, j) > A(i′, j′) for all i ≤ i′, j ≤ j′.

Definition 7. Matrix A is called a Monge matrix, if A(i, j) + A(i′, j′) ≤ A(i, j′) + A(i′, j) for all i ≤ i′, j ≤ j′. Equivalently, matrix A is a Monge matrix, if A^□ is nonnegative. Matrix A is called an anti-Monge matrix, if −A is Monge.

Definition 8. A permutation (respectively, subpermutation) matrix is a zero-one matrix containing exactly one (respectively, at most one) nonzero in every row and every column.

Definition 9. Matrix A is called a unit-Monge (respectively, subunit-Monge) matrix, if A^□ is a permutation (respectively, subpermutation) matrix. Matrix A is called a unit-anti-Monge (respectively, subunit-anti-Monge) matrix, if −A is unit-Monge (respectively, subunit-Monge).

A permutation matrix P of size n can be regarded as an implicit representation of the simple unit-Monge matrix P^Σ. Geometrically, a value P^Σ(i, j) is the number of (half-integer) nonzeros in matrix P that are ≶-dominated by the (integer) point (i, j). Matrix P can be preprocessed to allow efficient element queries on P^Σ(i, j). Here, we consider incremental queries, which are given an element of an implicit simple (sub)unit-Monge matrix, and return the value of an adjacent element. This kind of query can be answered directly from the (sub)permutation matrix, without any preprocessing.

Theorem 10. Given a (sub)permutation matrix P of size n, and the value P^Σ(i, j), i, j ∈ [0 : n], the values P^Σ(i ± 1, j), P^Σ(i, j ± 1), where they exist, can be queried in time O(1).

Proof. Straightforward; see [17].
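The constant-time incremental step is easy to see in code. In the sketch below (our illustration of Theorem 10; the array names col[] and row[] are assumptions), the permutation matrix P is stored as the column of the nonzero in each row together with the inverse permutation, and P^Σ(i, j) counts rows r ≥ i whose nonzero lies in a column < j:

  #include <stdio.h>

  #define N 4

  /* P^Sigma(i, j) computed from scratch: rows r >= i with col[r] < j. */
  static int psigma(const int col[N], int i, int j) {
      int s = 0;
      for (int r = i; r < N; r++)
          if (col[r] < j) s++;
      return s;
  }

  int main(void) {
      int col[N] = { 1, 3, 0, 2 };   /* nonzero of row r is in column col[r] */
      int row[N];                     /* inverse permutation                  */
      for (int r = 0; r < N; r++) row[col[r]] = r;

      int i = 2, j = 3;
      int v = psigma(col, i, j);

      /* the four incremental queries of Theorem 10, each in O(1) */
      int up    = v + (col[i - 1] < j ? 1 : 0);   /* P^Sigma(i-1, j): row i-1 joins the range  */
      int down  = v - (col[i]     < j ? 1 : 0);   /* P^Sigma(i+1, j): row i leaves the range   */
      int right = v + (row[j]     >= i ? 1 : 0);  /* P^Sigma(i, j+1): column j joins the range */
      int left  = v - (row[j - 1] >= i ? 1 : 0);  /* P^Sigma(i, j-1): column j-1 leaves        */

      printf("P^Sigma(%d,%d)=%d  up=%d down=%d right=%d left=%d\n",
             i, j, v, up, down, right, left);
      return 0;
  }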

2.2 Implicit matrix searching

We recall a classical row minima searching algorithm by Aggarwal et al. [1], often nicknamed the “SMAWK algorithm”.

Lemma 11 ([1]). Let A be an n_1 × n_2 implicit totally monotone matrix, where each element can be queried in time q. The problem of finding the (say, leftmost) minimum element in every row of A can be solved in time O(qn), where n = max(n_1, n_2).

Proof (Lemma 11). Without loss of generality, let A be over [0 : n | 0 : n]. Let B be an implicit n/2 × n matrix over [0 : n/2 | 0 : n], obtained by taking every other row of A. Clearly, at most n/2 columns of B contain a leftmost row minimum. The key idea of the algorithm is to eliminate n/2 of the remaining columns in an efficient process, based on the total monotonicity property.

We call a matrix element marked (for elimination), if its column has not (yet) been eliminated, but the element is already known not to be a leftmost row minimum. A column gets eliminated when all its elements become marked. Initially, both the set of eliminated columns and the set of marked elements are empty. In the process of column elimination, marked elements may only be contained in the i leftmost uneliminated columns; the value of i is initially equal to 1, and gets either incremented or decremented in every step of the algorithm. The marked elements form a staircase: in the first, second, ..., i-th uneliminated column, respectively zero, one, ..., i − 1 topmost elements are marked. In every iteration of the algorithm, two outcomes are possible: either the staircase gets extended to the right to the (i + 1)-st uneliminated column, or the whole i-th uneliminated column gets eliminated, and therefore deleted from the staircase.

Let j, j′ denote respectively the indices of the i-th and (i + 1)-st uneliminated column in the original matrix (across both uneliminated and eliminated columns). The outcome of the current iteration depends on the comparison of element B(i, j), which is the topmost unmarked element in the i-th uneliminated column, against element B(i, j′), which is the next uneliminated (and unmarked) element immediately to its right. The outcomes of this comparison and the rest of the elimination procedure are given in Table 1. By storing indices of uneliminated columns in an appropriate dynamic data structure, such as a doubly-linked list, a single iteration of this procedure can be implemented to run in time O(q). The whole procedure runs in time O(qn), and eliminates n/2 columns.

i ← 0; j ← 0; j′ ← 1
while j′ ≤ n:
    case B(i, j) ≤ B(i, j′):
        case i < n/2: i ← i + 1; j ← j′
        case i = n/2: eliminate column j′
        j′ ← j′ + 1
    case B(i, j) > B(i, j′):
        eliminate column j
        case i = 0: j ← j′; j′ ← j′ + 1
        case i > 0: i ← i − 1; j ← max{k : k uneliminated and < j}

Let A′ be the n/2 × n/2 matrix obtained from B by deleting the eliminated columns. We call the algorithm recursively on A′. This recursive call returns the leftmost row minima of A′, and therefore also of B. It is now straightforward to fill in the leftmost minima in the remaining rows of A in time O(qn). Thus, the top level of recursion runs in time O(qn). The amount of work gets halved with every recursion level, therefore the overall running time is O(qn).

Let us now restrict our attention to implicit unit-Monge matrices. An element of such a matrix (represented by an appropriate data structure, see [17]) can be queried in time q = O(log² n). A more careful analysis of the elimination procedure of Lemma 11 shows that the required matrix elements can be obtained, instead of standalone element queries, by the more efficient incremental queries of Theorem 10.

Lemma 12. Let A be an implicit (sub)unit-Monge matrix over [0 : n_1 | 0 : n_2], represented by the (sub)permutation matrix P = A^□ and vectors b = A(n_1, ∗), c = A(∗, 0), such that A(i, j) = P^Σ(i, j) + b(j) + c(i) − b(0) for all i, j. The problem of finding the (say, leftmost) minimum element in every row of A can be solved in time O(n log log n), where n = max(n_1, n_2).

Proof (Lemma 12). First, observe that vector c has no effect on the positions (as opposed to the values) of any row minima. Therefore, we assume without loss of generality that c(i) = 0 for all i (and, in particular, b(0) = c(n_1) = 0). Further, suppose that some column P(∗, ĵ) is identically zero; then, depending on whether b(ĵ⁻) ≤ b(ĵ⁺) or b(ĵ⁻) > b(ĵ⁺), we may delete respectively column A(∗, ĵ⁺) or A(∗, ĵ⁻), as it does not contain any leftmost row minima. Also, suppose that some row P(î, ∗) is identically zero; then the minimum value in row A(î⁻, ∗) lies in the same column as the minimum value in row A(î⁺, ∗), hence we can delete one of these rows. Therefore, we assume without loss of generality that A is an implicit unit-Monge matrix over [0 : n | 0 : n], and hence P is a permutation matrix.

To find all the leftmost row minima, we adopt the column elimination procedure of Lemma 11 (see Table 1), with some modifications outlined below. Let B be an implicit n^{1/2} × n matrix, obtained by taking a subset of n^{1/2} rows of A at regular intervals of n^{1/2}. Clearly, at most n^{1/2} columns of B contain a leftmost row minimum. We need to eliminate n − n^{1/2} of the remaining columns. Let B be over [0 : n^{1/2} | 0 : n].

Throughout the elimination procedure, we maintain a vector d(i), i ∈ [0 : n^{1/2} − 1], initialised by zero values. In every iteration, given the current value of the index j′, each value d(i) gives the count of nonzeros P(s, t) = 1 within the rectangle s ∈ ⟨n^{1/2} i : n^{1/2}(i + 1)⟩, t ∈ ⟨0 : j′⟩.


Consider an iteration of the column elimination procedure of Lemma 11 with given values i, j, j′, operating on matrix elements B(i, j), B(i, j′). For the iteration that follows the current one, the following matrix elements may be required:

• B(i − 1, j′), B(i + 1, j′). These values can be obtained respectively as B(i, j′) + d(i − 1) and B(i, j′) − d(i).
• B(i, j′ + 1), B(i + 1, j′ + 1). These values can be obtained respectively from B(i, j′), B(i + 1, j′) by a rowwise incremental query of matrix P^Σ via Theorem 10, plus a single access to vector b.
• B(i − 1, max{k : k uneliminated and < j}). This element was already queried in the iteration at which its column was first added to the staircase. There is at most one such element per column, therefore each of them can be stored and subsequently queried in constant time.

At the end of the current iteration, index j′ may be incremented (i.e. the staircase may grow by one column). In this case, we also need to update vector d for the next iteration. Let s ∈ ⟨0 : n⟩ be such that P(s, j′ − 1/2) = 1. Let i = ⌊s/n^{1/2}⌋; we have s ∈ ⟨n^{1/2} i : n^{1/2}(i + 1)⟩. The update consists in incrementing d(i) by 1.

The total number of iterations in the elimination procedure is at most 2n. This is because, in total, at most n columns are added to the staircase, and at most n (in fact, exactly n − n^{1/2}) columns are eliminated. Therefore, the elimination procedure runs in time O(n).

Let A′ be the n^{1/2} × n^{1/2} matrix obtained from B by deleting the n − n^{1/2} eliminated columns. Using incremental queries to matrix P, it is straightforward to obtain matrix A′ explicitly in random-access memory in time O(n). We now call the algorithm of Lemma 11 to compute the row minima of A′, and therefore also of B, in time O(n).

We now need to fill in the remaining row minima of matrix A. The row minima of matrix A′ define a chain of n^{1/2} submatrices in A in which these remaining row minima may be located. More specifically, given two successive row minima of A′, all the n^{1/2} row minima that are located between the two corresponding rows in A must also be located between the two corresponding columns. Each of the resulting submatrices has n^{1/2} rows; the number of columns may vary from submatrix to submatrix. It is straightforward to eliminate from each submatrix all columns not containing any nonzero of matrix P; therefore, without loss of generality, we may assume that every submatrix is of size n^{1/2} × n^{1/2}.

We now call the algorithm recursively on each submatrix to fill in the remaining leftmost row minima. The amount of work remains O(n) in every recursion level. There are log log n recursion levels, therefore the overall running time of the algorithm is O(n log log n).

An even faster algorithm, running in optimal time O(n) on the RAM model, has been recently suggested by Gawrychowski [9]. The algorithm of Lemma 12, which is thus suboptimal but has weaker model requirements, will be sufficient for the purposes of this paper.

2.3 Semi-local LCS

We will consider strings of characters taken from an alphabet. Two alphabet characters α, β match if α = β, and mismatch otherwise. In addition to alphabet characters, we introduce two special extra characters: the guard character '$', which only matches itself and no other characters, and the wildcard character '?', which matches itself and all other characters.

It will be convenient to index strings by half-integer, rather than integer, indices, e.g. string a = α_{0^+} α_{1^+} ⋯ α_{m^-}. We will index strings as vectors, writing e.g. a(î) = α_î, a⟨i : j⟩ = α_{i^+} ⋯ α_{j^-}. Given strings a over ⟨i : j⟩ and b over ⟨i′ : j′⟩, we will distinguish between string right concatenation, which is over ⟨i : j + j′ − i′⟩ and preserves the indexing within a, and left concatenation, which is over ⟨i′ − j + i : j′⟩ and preserves the indexing within b. We extend this notation to concatenation of more than two strings, marking the string whose indexing is preserved; if no string is marked in the concatenation, then right concatenation is assumed by default.

Given a string, we distinguish between its contiguous substrings, and its not necessarily contiguous subsequences. Special cases of a substring are a prefix and a suffix of a string. Unless indicated otherwise, an algorithm's input is a string a of length m, and a string b of length n.

We recall the following definitions from [21,17].

Definition 13. Given strings a, b, the longest common subsequence (LCS) problem asks for the length of the longest string that is a subsequence of both a and b. We will call this length the LCS score of strings a, b.

Definition 14. Given strings a, b, the semi-local LCS problem asks for the LCS scores as follows: a against every substring of b (the string-substring LCS scores); every prefix of a against every suffix of b (the prefix-suffix LCS scores); symmetrically, the substring-string LCS scores and the suffix-prefix LCS scores, defined as above but with the roles of a and b exchanged. The first three (respectively, the last three) components, taken together, will also be called the extended string-substring (respectively, substring-string) LCS problem.

Definition 15. A grid-diagonal dag is a weighted dag, defined on the set of nodes v_{l,i}, l ∈ [0 : m], i ∈ [0 : n]. The edge and path weights are called scores. For all l ∈ [0 : m], l̂ ∈ ⟨0 : m⟩, i ∈ [0 : n], î ∈ ⟨0 : n⟩, the grid-diagonal dag contains:

• the horizontal edge v_{l,î^-} → v_{l,î^+} and the vertical edge v_{l̂^-,i} → v_{l̂^+,i}, both with score 0;

• the diagonal edge v_{l̂^-,î^-} → v_{l̂^+,î^+}, with score either 0 or 1.

Definition 16. An instance of the semi-local LCS problem on strings a, b corresponds to an m × n grid-diagonal dag G_{a,b}, called the alignment dag of a and b. A cell indexed by l̂ ∈ ⟨0 : m⟩, î ∈ ⟨0 : n⟩ is called a match cell if a(l̂) matches b(î), and a mismatch cell otherwise (recall that the strings may contain wildcard characters). The diagonal edges in match cells have score 1, and in mismatch cells score 0.

Definition 17. Given strings a, b, the corresponding semi-local score matrix is a matrix over [−m : n | 0 : m + n], defined by H_{a,b}(i, j) = max score(v_{0,i} ⇝ v_{m,j}), where i ∈ [−m : n], j ∈ [0 : m + n], and the maximum is taken across all paths between the given endpoints v_{0,i}, v_{m,j} in the m × (2m + n) padded alignment dag G_{a, ?^m b ?^m}. If i = j, we have H_{a,b}(i, j) = 0. By convention, if j < i, we set H_{a,b}(i, j) = j − i.

Definition 19. Given strings a, b, the semi-local seaweed matrix P_{a,b} is a permutation matrix over ⟨−m : n | 0 : m + n⟩, defined by Theorem 18.

When talking about semi-local score and seaweed matrices, we will sometimes omit the qualifier "semi-local", as long as it is clear from the context.
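To make Definitions 13 and 14 concrete, the following Python sketch (not part of the paper, and far slower than the seaweed-based methods discussed below) computes the string-substring component of the semi-local LCS problem by brute force, running a standard LCS dynamic program on every substring of b.

def lcs_score(a: str, b: str) -> int:
    """LCS score of a and b (Definition 13), by the textbook dynamic program."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], dp[i][j] + (x == y))
    return dp[-1][-1]

def string_substring_scores(a: str, b: str):
    """String-substring part of Definition 14: H[i, j] = LCS(a, b[i:j]) for 0 <= i <= j <= len(b)."""
    n = len(b)
    return {(i, j): lcs_score(a, b[i:j]) for i in range(n + 1) for j in range(i, n + 1)}

if __name__ == "__main__":
    a, b = "baabcbca", "baabcabcabaca"
    H = string_substring_scores(a, b)
    print(H[0, len(b)])                              # global LCS score of a against b
    print(max(H.items(), key=lambda kv: kv[1]))      # a best-matching substring window of b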

2.4 Seaweed submatrix notation

The four individual components of the semi-local LCS problem correspond to a partitioning of both the score matrix H_{a,b} and the seaweed matrix P_{a,b} into submatrices. It will be convenient to introduce a special notation for the resulting subranges of their respective index ranges [−m : n | 0 : m + n] and ⟨−m : n | 0 : m + n⟩. This notation will be used as matrix superscripts; e.g. the submatrix H_{a,b}[0 : n | 0 : n] denotes the matrix of all string-substring LCS scores for strings a, b. The subranges for the remaining components are as follows (rows first, then columns):

                         columns 0 : n        columns n : m + n
    rows −m : 0          suffix-prefix        substring-string
    rows  0 : n          string-substring     prefix-suffix

and analogously for P_{a,b}. Note that the four defined half-integer subranges of matrix P_{a,b} are disjoint, but the corresponding four integer subranges of matrix H_{a,b} overlap by one row/column at the boundaries.

Definition 20. Given strings a, b, the corresponding suffix-prefix, substring-string, string-substring and prefix-suffix score (respectively, seaweed) matrices are the submatrices of H_{a,b} (respectively, P_{a,b}) given by the four subranges above.

The defined seaweed submatrices are all disjoint; the defined score submatrices overlap by one row/column at the boundaries. In particular, the global LCS score H_{a,b}(0, n) belongs to all four score submatrices.

The nonzeros of each seaweed submatrix introduced in Definition 20 can be regarded as an implicit solution to the corresponding component of the semi-local LCS problem. Similarly, by considering only three out of the four submatrices, we can define an implicit solution to the extended string-substring (respectively, substring-string) LCS problem.

Definition 21. Given strings a, b, we define the extended string-substring (respectively, substring-string) seaweed matrix over ⟨−m : n | 0 : m + n⟩ as the matrix obtained from P_{a,b} by replacing its suffix-prefix (respectively, string-substring) seaweed submatrix by an all-zero block, and keeping the remaining three seaweed submatrices unchanged.

The extended string-substring seaweed matrix of a, b contains at least n and at most min(m + n, 2n) nonzeros. Note that for m ≥ n, the number of its nonzeros is at most 2n, which is convenient when m is large. Analogously, for m ≤ n, the number of nonzeros in the extended substring-string seaweed matrix is at most 2m, which is convenient when n is large.

Let string a of length m be a concatenation of two fixed strings: a = a′a′′, where a′, a′′ are nonempty strings of length m′, m′′ respectively, and m = m′ + m′′. A substring of the form a⟨i′ : i′′⟩ with i′ ∈ [0 : m′ − 1], i′′ ∈ [m′ + 1 : m] will be called a cross-substring. In other words, a cross-substring of a consists of a nonempty suffix

of a′ and a nonempty prefix of a′′. A cross-substring that is a prefix or a suffix of a will be called a cross-prefix and a cross-suffix, respectively. Given string b of length n that is a concatenation of two fixed strings, b = b′b′′, cross-substrings of b are defined analogously.

Definition 22. Given strings a = a′a′′ and b, the corresponding cross-semi-local score matrix is the submatrix H_{a′,a′′;b} = H_{a,b}[−m′ : n | 0 : m′′ + n]. Symmetrically, given strings a and b = b′b′′, the corresponding cross-semi-local score matrix is the submatrix H_{a;b′,b′′} = H_{a,b}[−m : n′ | n′ : m + n]. The cross-semi-local seaweed matrices are defined analogously: P_{a′,a′′;b} = P_{a,b}⟨−m′ : n | 0 : m′′ + n⟩, P_{a;b′,b′′} = P_{a,b}⟨−m : n′ | n′ : m + n⟩.

A cross-semi-local score matrix represents the solution of a restricted version of the semi-local LCS problem. In this version, instead of all substrings (prefixes, suffixes) of string a (respectively, b), we only consider cross-substrings (cross-prefixes, cross-suffixes). At the submatrix boundaries H_{a′,a′′;b}(*, m′′ + n) and H_{a′,a′′;b}(−m′, *), cross-substrings of string a degenerate to suffixes of a′ and prefixes of a′′; in particular, cross-prefixes and cross-suffixes of a degenerate respectively to the whole a′ and a′′. The submatrix boundaries H_{a;b′,b′′}(*, n′) and H_{a;b′,b′′}(n′, *) correspond to similar degenerate cross-substrings of string b.

As before, the cross-semi-local seaweed matrix Pa′,a′′;b (respectively, Pa;b′,b′′ ) gives an implicit representation for the corresponding score matrix Ha′,a′′;b (respectively,

H_{a;b′,b′′}). Occasionally, we will use cross-semi-local score and seaweed matrices in combination with the superscript subrange notation introduced earlier in this section. In such cases, the range of the resulting matrix will be determined by the intersection of the ranges implied by the superscript and the subscript. For example, the matrix H_{a,b}[0 : n′ | n′ : n] is the matrix of all LCS scores between string a and all cross-substrings of string b = b′b′′.

2.5 Weighted scores and edit distances

The concept of LCS score is generalised by that of (weighted) alignment score. An alignment of strings a, b is obtained by putting a subsequence of a into one-to-one correspondence with a (not necessarily identical) subsequence of b, character by character and respecting the index order. The corresponding pairs of characters, one from a and the other from b, are said to be aligned. A character that is not aligned against a character of the other string is said to be aligned against a gap in that string. Each of the resulting character alignments is given a real weight:

• a pair of aligned matching characters has weight w_m ≥ 0;
• a pair of aligned mismatching characters has weight w_x;
• a character aligned against a gap has weight w_g.

Definition 23. The (weighted) alignment score for strings a, b is the maximum total weight of character pairs in an alignment of a against b.

We define the semi-local (weighted) alignment score problem and its component (string-substring, etc.) subproblems by straightforward extension of Definition 14. The concepts of alignment dag and score matrix can be naturally generalised to the weighted case. To distinguish between the weighted and unweighted cases, we will use a script font in the corresponding notation. The weighted alignment of strings a, b corresponds to a weighted alignment dag 𝒢_{a,b}, where diagonal match edges, diagonal mismatch edges, and horizontal/vertical edges have weight w_m, w_x, w_g, respectively. A semi-local alignment score corresponds to a boundary-to-boundary highest-scoring path in 𝒢_{a,b}. The complete output of the semi-local alignment score problem is a semi-local (weighted) score matrix ℋ_{a,b}. This matrix is anti-Monge; however, in contrast with the unweighted case, it is not necessarily unit-anti-Monge.

Given an arbitrary set of alignment weights, it is often convenient to normalise them so that 0 = w_g ≤ w_x < w_m = 1. To obtain such a normalisation, observe that, given a pair of strings a, b and an arbitrary set of weights, the weights can be shifted and rescaled as in the following definition; this achieves the desired normalisation. (A similar method is used e.g. by Rice et al. [15].)

Definition 24. Given original weights w_m, w_x, w_g, the corresponding normalised weights are w*_m = 1, w*_x = (w_x − 2 w_g) / (w_m − 2 w_g), w*_g = 0. We call the corresponding alignment score the normalised score. The original alignment score h can be restored from the normalised score h* by reversing the normalisation: h = h* · (w_m − 2 w_g) + (m + n) · w_g.

Thus, for fixed string lengths m and n, maximising the normalised global alignment score h* is equivalent to maximising the original score h. However, more care is needed when maximising the alignment score across variable strings of different lengths, e.g. in the context of semi-local alignment. In such cases, an explicit conversion from normalised weights to original weights will be necessary prior to the maximisation.

Definition 25. A set of character alignment weights will be called rational, if all the weights are rational numbers.

Given a rational set of normalised weights, the semi-local alignment score problem on strings a, b can be reduced to the semi-local LCS problem by the following blow-up procedure. Let w_x = μ/ν < 1, where μ, ν are positive natural numbers. We transform input strings a, b of lengths m, n into new blown-up strings ã, b̃ of lengths m̃ = νm, ñ = νn. The transformation consists in replacing every character γ in each of the strings by a substring $^μ γ^{ν−μ} of length ν (recall that $ is a special guard character, not present in the original strings). We have

    ℋ_{a,b}(i, j) = (1/ν) · H_{ã,b̃}(νi, νj)

for all i ∈ [−m : n], j ∈ [0 : m + n], where the matrix ℋ_{a,b} is defined by the normalised weights on the original strings a, b, and the matrix H_{ã,b̃} by the LCS weights on the blown-up strings ã, b̃. Therefore, all the techniques of the previous sections apply to the rational-weighted semi-local alignment score problem, assuming that ν is a constant.

An important special case of weighted string alignment is the edit distance problem. Here, the characters are assumed to match "by default": w_m = 0. The mismatches and gaps are penalised: 2 w_g ≤ w_x < 0. The resulting score is always nonpositive. Equivalently, we regard string a as being transformed into string b by a sequence of weighted character edits:

• character insertion or deletion (indel) has weight −w_g > 0;
• character substitution has weight −w_x > 0.

Definition 26. The (weighted) edit distance between strings a, b is the minimum total weight of a sequence of character edits transforming a into b. Equivalently, it is the (nonnegative) absolute value of the corresponding (nonpositive) alignment score.

In the rest of this work, the edit distance problem will be treated as a special case of the weighted alignment problem. In particular, all the techniques of the previous sections apply to the semi-local edit distance problem, as long as the character edit weights are rational.
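The normalisation and blow-up reduction of Definitions 24 and 25 can be checked numerically. The following Python sketch (not from the paper; the strings and weights are arbitrary test data) computes a global alignment score directly under normalised weights w_m = 1, w_x = μ/ν, w_g = 0, and compares it with the LCS score of the blown-up strings divided by ν.

from fractions import Fraction

def lcs(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], dp[i][j] + (x == y))
    return dp[-1][-1]

def alignment_score(a, b, wm, wx, wg):
    """Global (weighted) alignment score with match/mismatch/gap weights wm, wx, wg."""
    m, n = len(a), len(b)
    dp = [[wg * (i + j) if i == 0 or j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = wm if a[i - 1] == b[j - 1] else wx
            dp[i][j] = max(dp[i - 1][j] + wg, dp[i][j - 1] + wg, dp[i - 1][j - 1] + diag)
    return dp[m][n]

def blow_up(s, mu, nu):
    """Definition 25: replace every character g by '$' * mu followed by g * (nu - mu)."""
    return "".join("$" * mu + g * (nu - mu) for g in s)

if __name__ == "__main__":
    a, b = "baabcbca", "abdcabaa"
    mu, nu = 1, 2                                   # normalised weights: wm = 1, wx = 1/2, wg = 0
    direct = alignment_score(a, b, Fraction(1), Fraction(mu, nu), Fraction(0))
    blown = Fraction(lcs(blow_up(a, mu, nu), blow_up(b, mu, nu)), nu)
    print(direct, blown)                            # the blow-up relation predicts equal values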

3 Threshold approximate matching in compressed strings

Given a matrix A and a threshold h, it will be convenient to denote the subset of entries above the threshold by τ_h(A) = { (i, j) : A(i, j) ≥ h }. The threshold approximate matching problem (Definition 2) corresponds to all points in the set τ_h(ℋ_{p,t}).

Using the techniques of the previous sections, we now show how the threshold approximate matching problem on a GC-text can be solved more efficiently, assuming a sufficiently high value of the edit distance threshold k. The algorithm extends Algorithms 1 and 2 of [21], and assumes an arbitrary rational-weighted alignment score. As in Algorithm 2 of [21], we assume for simplicity constant-time index arithmetic, keeping the index remapping implicit. The details of index remapping used to lift this assumption can be found in Algorithm 1 of [21].

Algorithm 1 (Threshold approximate matching).
Parameters: character alignment weights w_m, w_x, w_g, assumed to be constant rationals.
Input: plain pattern string p of length m; SLP of length n̄, generating text string t of length n; score threshold h.
Output: locations (or count) of matching substrings in t.
Description.
First phase. Recursion on the input SLP generating t.
To reduce the problem to an unweighted LCS score, we apply the blow-up technique described in Subsection 2.5. Consider the normalised weights (Definition 24), and define the corresponding blown-up strings p̃, t̃ of lengths m̃ = νm, ñ = νn, respectively.

Recursion base: n = n̄ = 1, ñ = ν. The extended substring-string seaweed matrix P_{p̃,t̃} can be computed by the seaweed algorithm [20,17] in ν linear sweeps of string p̃. This matrix can be used to query the LCS score H_{p̃,t̃}(0, ν) between p̃ and t̃. String t is matching if and only if the corresponding weighted alignment score ℋ_{p,t}(0, 1) is at least h.

Recursive step: n ≥ n̄ > 1, ñ = νn. Let t = t′t′′ be the SLP statement defining string t. We have t̃ = t̃′t̃′′ for the corresponding blown-up strings.

As in Algorithm 2 of [21], we obtain recursively the extended substring-string seaweed matrix P_{p̃,t̃} and the cross-semi-local seaweed matrix P_{p̃;t̃′,t̃′′} by an application of the fast algorithm for distance multiplication of unit-Monge matrices [16]. These two subpermutation matrices are typically very sparse: P_{p̃,t̃} contains at most 2m̃ = 2νm nonzeros, and P_{p̃;t̃′,t̃′′} exactly m̃ = νm nonzeros.

In contrast to Algorithm 2 of [21], it is no longer sufficient to consider just the ≷-maximal nonzeros of P_{p̃;t̃′,t̃′′}; we now have to consider all its m̃ nonzeros. Let us denote the indices of these nonzeros, in increasing order independently for each dimension, by

    î_{0^+} < î_{1^+} < ⋯ < î_{m̃^-},        ĵ_{0^+} < ĵ_{1^+} < ⋯ < ĵ_{m̃^-}        (1)

These two index sequences define an m̃ × m̃ non-contiguous permutation submatrix of P_{p̃;t̃′,t̃′′}:

    P(ŝ, t̂) = P_{p̃;t̃′,t̃′′}(î_ŝ, ĵ_t̂)        (2)

for all ŝ, t̂ ∈ ⟨0 : m̃⟩.

Index sequence î_ŝ (respectively, ĵ_t̂) partitions the range [−m̃ : ñ′] (respectively, [ñ′ : m̃ + ñ]) into m̃ + 1 disjoint non-empty intervals of varying lengths. Therefore, we have a partitioning of the cross-semi-local score matrix H_{p̃;t̃′,t̃′′} into (m̃ + 1)^2 disjoint non-empty rectangular H-blocks of varying dimensions. Consider an arbitrary H-block

    H_{p̃;t̃′,t̃′′}[î^{+}_{u^-} : î^{-}_{u^+} | ĵ^{+}_{v^-} : ĵ^{-}_{v^+}]        (3)

where u, v ∈ [0 : m̃]. For the boundary H-blocks, some of the above bounds are not defined. In these cases, we let î^{+}_{0^-} = −m̃, î^{-}_{m̃^+} = ñ′, ĵ^{+}_{0^-} = ñ′, ĵ^{-}_{m̃^+} = m̃ + ñ.

Since the boundaries of an H-block (3) are given by adjacent pairs of indices from (1), we have P_{p̃;t̃′,t̃′′}(î, ĵ) = 0 for all î ∈ ⟨î_{u^-} : î_{u^+}⟩, u ∈ [0 : m̃], and arbitrary ĵ, as well as for arbitrary î and all ĵ ∈ ⟨ĵ_{v^-} : ĵ_{v^+}⟩, v ∈ [0 : m̃]. Therefore, given a fixed H-block, all its points are ≷-dominated by some fixed set of nonzeros in P_{p̃;t̃′,t̃′′} (and hence also in P_{p̃,t̃}). The number of nonzeros in this set is

    d = P^{TΣT}_{p̃;t̃′,t̃′′}(î^{-}_{u^+}, ĵ^{+}_{v^-}) = P^{TΣT}_{p̃,t̃}(î^{-}_{u^+}, ĵ^{+}_{v^-})        (4)

where the H-block's bottom-left (≷-minimal) point (î^{-}_{u^+}, ĵ^{+}_{v^-}) is chosen arbitrarily to be its reference point. Since the value of d is constant across the H-block, all its entries have identical value: we have

    H_{p̃;t̃′,t̃′′}(i, j) = m̃ − d        (5)

for all i ∈ [î^{+}_{u^-} : î^{-}_{u^+}], j ∈ [ĵ^{+}_{v^-} : ĵ^{-}_{v^+}].

We now switch our focus from the blown-up strings p̃, t̃′, t̃′′ back to the original strings p, t′, t′′. The partitioning of the LCS score matrix H_{p̃;t̃′,t̃′′} into H-blocks induces a partitioning of the alignment score matrix ℋ_{p;t′,t′′} into (m̃ + 1)^2 disjoint rectangular ℋ-blocks of varying dimensions. The ℋ-block corresponding to H-block (3) is

    ℋ_{p;t′,t′′}[î^{(+)}_{u^-} : î^{(-)}_{u^+} | ĵ^{(+)}_{v^-} : ĵ^{(-)}_{v^+}]        (6)

where we denote î^{(-)} = ⌊î/ν⌋, î^{(+)} = ⌈î/ν⌉, for any î. Note that, although an H-block (3) is by definition non-empty, the corresponding ℋ-block (6) may be empty: we have î^{(+)}_{u^-} > î^{(-)}_{u^+} (respectively, ĵ^{(+)}_{v^-} > ĵ^{(-)}_{v^+}), if the interval [î^{+}_{u^-} : î^{-}_{u^+}] (respectively, [ĵ^{+}_{v^-} : ĵ^{-}_{v^+}]) contains no multiples of ν.

Although all entries within an H-block (3) are constant, the entries within the corresponding ℋ-block (6) will typically vary. By (5) and Definition 24, we have

    ℋ_{p;t′,t′′}(i, j) = ((m̃ − d)/ν) · (w_m − 2 w_g) + (m + j − i) · w_g        (7)

where i ∈ [î^{(+)}_{u^-} : î^{(-)}_{u^+}], j ∈ [ĵ^{(+)}_{v^-} : ĵ^{(-)}_{v^+}]. Recall that w_g ≤ 0. Therefore, the score within an ℋ-block is maximised when j − i is minimised, so the maximum score is attained by the block's bottom-left (i.e. ≷-minimal) entry ℋ_{p;t′,t′′}(î^{(-)}_{u^+}, ĵ^{(+)}_{v^-}). If w_g < 0, then this maximum is strict; otherwise, we have w_g = 0, and all the entries in the ℋ-block have the identical value ((m̃ − d)/ν) · w_m.

Without loss of generality, let us now assume that all the ℋ-blocks (6) are non-empty. We are interested in the bottom-left entries attaining block maxima, taken across all the ℋ-blocks. The leftmost column and the bottom row of these entries (respectively ℋ_{p;t′,t′′}(î^{(-)}_{u^+}, n′) and ℋ_{p;t′,t′′}(n′, ĵ^{(+)}_{v^-}) for all u, v) lie on the boundary of matrix ℋ_{p;t′,t′′}, and correspond to comparing p against respectively suffixes of t′ and prefixes of t′′, rather than cross-substrings of t. After excluding such boundary entries, the remaining block maxima form an m̃ × m̃ non-contiguous submatrix H(u, v) = ℋ_{p;t′,t′′}(î^{(-)}_{u^+}, ĵ^{(+)}_{v^-}), where u ∈ [0 : m̃ − 1], v ∈ [1 : m̃].

Consider the string-substring submatrix of H (i.e. the submatrix of entries that correspond to the comparison of a string against a cross-substring, as opposed to a prefix against a cross-suffix, or a suffix against a cross-prefix), which we denote here by H*: it consists of the entries H(u, v) with (î^{(-)}_{u^+}, ĵ^{(+)}_{v^-}) ∈ [0 : n′ | n′ : n]. Since matrix ℋ_{p;t′,t′′} is anti-Monge, its submatrices H and H* are also anti-Monge.

We now need to obtain the row maxima of matrix H*. Let

    N(u, v) = (ν / (2 w_g − w_m)) · H*(u, v) = P^{TΣT}(u, v) − m̃ + ν (m + ĵ^{(+)}_{v^-} − î^{(-)}_{u^+}) · w_g / (2 w_g − w_m)

(by (7), (4), (2)).

Since 2 w_g − w_m < 0, the problem of finding row maxima of H* is equivalent to finding row minima of matrix N, or, equivalently, column minima of the transpose matrix N^T. This matrix (and therefore N itself) is subunit-Monge: we have N^T(v, u) = N(u, v) = P^{TΣ}(v, u) + b(u) + c(v), where b(u) = −ν î^{(-)}_{u^+} · w_g / (2 w_g − w_m), c(v) = −m̃ + ν (m + ĵ^{(+)}_{v^-}) · w_g / (2 w_g − w_m). Therefore, the column minima of N^T can be found by Lemma 12 (replacing row minima with column minima by symmetry). The set τ_h(H*) of all entries in H* scoring above the threshold h, and therefore the set of all matching cross-substrings of t, can now be obtained by a local search in the neighbourhoods of the row maxima. (End of recursive step.)

Second phase. For every SLP symbol, we now have the relative locations of its minimally matching cross-substrings. It is now straightforward to obtain their absolute locations and/or their count (as in Algorithm 2 of [21], substituting "matching" for "minimally matching").

Cost analysis. First phase. Each seaweed matrix multiplication runs in time O(m̃ log m̃) = O(m log m). The algorithm of Lemma 12 runs in time O(m̃ log log m̃) = O(m log log m).

Hence, the running time of a recursive step is O(m log m). There are n̄ recursive steps in total, therefore the whole recursion runs in time O(m log m · n̄).

Second phase. For every SLP symbol, there are at most m − 1 minimally matching cross-substrings. Given the output of the first phase, the absolute locations of all minimally matching substrings in t can be reported in time O(m n̄ + output).

Total. The overall running time is O(m log m · n̄ + output).

Algorithm 1 improves on the algorithm of [11], as long as k = ω((log m)^{1/2}) in the case of general GC-compression, and k = ω(log m) in the case of LZ78 or LZW compression. Algorithm 1 also improves on the algorithms of [4,5], as long as k = ω((m log m)^{1/4}), in the case of both general GC-compression and LZ78 or LZW compression.

4 Conclusions

We have obtained a new efficient algorithm for threshold approximate matching between a GC-text and a plain pattern. Our algorithm is of interest not only in its own right, but also as a natural application of fast unit-Monge matrix multiplication, developed in our previous works. We have also demonstrated a new technique for incremental searching in an implicit totally monotone matrix, which we believe to be of independent interest.

References

1. A. Aggarwal, M. M. Klawe, S. Moran, P. Shor, and R. Wilber: Geometric applications of a matrix-searching algorithm. Algorithmica, 2(1) 1987, pp. 195–208.
2. A. Apostolico and C. Guerra: The longest common subsequence problem revisited. Algorithmica, 2(1) 1987, pp. 315–336.
3. L. Bergroth, H. Hakonen, and T. Raita: A survey of longest common subsequence algorithms, in Proceedings of the 7th SPIRE, 2000, pp. 39–48.
4. P. Bille, R. Fagerberg, and I. L. Gørtz: Improved approximate string matching and regular expression matching on Ziv–Lempel compressed texts. ACM Transactions on Algorithms, 6(1) 2009, pp. 3:1–3:14.
5. P. Bille, G. M. Landau, R. Raman, K. Sadakane, S. R. Satti, and O. Weimann: Random access to grammar-compressed strings, in Proceedings of the 22nd ACM-SIAM SODA, 2011, pp. 373–389.
6. P. Cégielski, I. Guessarian, Y. Lifshits, and Y. Matiyasevich: Window subsequence problems for compressed texts, in Proceedings of CSR, vol. 3967 of Lecture Notes in Computer Science, 2006, pp. 127–136.
7. K.-M. Chao and L. Zhang: Sequence Comparison: Theory and Methods, vol. 7 of Computational Biology Series, Springer, 2009.
8. R. Cole and R. Hariharan: Approximate string matching: A simpler faster algorithm. SIAM Journal on Computing, 31(6) 2002, pp. 1761–1782.
9. P. Gawrychowski: Faster algorithm for computing the edit distance between SLP-compressed strings, in Proceedings of SPIRE, vol. 7608 of Lecture Notes in Computer Science, 2012, pp. 229–236.
10. D. Hermelin, G. M. Landau, S. Landau, and O. Weimann: A unified algorithm for accelerating edit-distance computation via text-compression, in Proceedings of the 26th STACS, 2009, pp. 529–540.
11. J. Kärkkäinen, G. Navarro, and E. Ukkonen: Approximate string matching on Ziv–Lempel compressed text. Journal of Discrete Algorithms, 1 2003, pp. 313–338.
12. G. M. Landau and U. Vishkin: Fast parallel and serial approximate string matching. Journal of Algorithms, 10(2) 1989, pp. 157–169.

13. V. Levenshtein: Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1 1965, pp. 8–17.
14. M. Lohrey: Algorithmics on SLP-compressed strings: a survey. Groups Complexity Cryptology, 4(2) 2012, pp. 241–299.
15. S. V. Rice, H. Bunke, and T. A. Nartker: Classes of cost functions for string edit distance. Algorithmica, 18 1997, pp. 271–280.
16. A. Tiskin: Fast distance multiplication of unit-Monge matrices. Algorithmica, to appear.
17. A. Tiskin: Semi-local string comparison: Algorithmic techniques and applications, Tech. Rep. 0707.3619, arXiv.
18. A. Tiskin: Semi-local longest common subsequences in subquadratic time. Journal of Discrete Algorithms, 6(4) 2008, pp. 570–581.
19. A. Tiskin: Semi-local string comparison: Algorithmic techniques and applications. Mathematics in Computer Science, 1(4) 2008, pp. 571–603.
20. A. Tiskin: Periodic string comparison, in Proceedings of CPM, vol. 5577 of Lecture Notes in Computer Science, 2009, pp. 193–206.
21. A. Tiskin: Towards approximate matching in compressed strings: Local subsequence recognition, in Proceedings of CSR, vol. 6651 of Lecture Notes in Computer Science, 2011, pp. 401–414.
22. T. A. Welch: A technique for high-performance data compression. Computer, 17(6) 1984, pp. 8–19.
23. T. Yamamoto, H. Bannai, S. Inenaga, and M. Takeda: Faster subsequence and don't-care pattern matching on compressed texts, in Proceedings of CPM, vol. 6661 of Lecture Notes in Computer Science, 2011, pp. 309–322.
24. J. Ziv and A. Lempel: Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24 1978, pp. 530–536.

Metric Preserving Dense SIFT Compression

Shmuel T. Klein1 and Dana Shapira2,3

1 Computer Science Department, Bar Ilan University, Israel 2 Computer Science Department, Ashkelon Academic College, Israel 3 Department of Computer Science and Mathematics, Ariel University, Israel [email protected], [email protected]

Abstract. The problem of compressing a large collection of feature vectors so that object identification can further be processed on the compressed form of the features is investigated. The idea is to perform matching against a query image in the compressed form of the feature descriptor vectors, retaining the metric. Given two SIFT feature vectors, in previous work we suggested compressing them using a lossless encoding for which the pairwise matching can be done directly on the compressed files, by means of a Fibonacci code. In this paper we extend our work to Dense SIFT, and in particular to PHOW features, which contain, for each image, about 300 times as many vectors as the original SIFT.

1 Introduction

The tremendous storage requirements and ever increasing resolutions of digital images necessitate automated analysis and compression tools for information processing and extraction. There are several methods for transforming an image into a set of feature vectors, such as SIFT (Scale Invariant Feature Transform) by Lowe [13], GLOH (Gradient Location and Orientation Histogram) [14], and SURF (Speeded Up Robust Features) [1], to mention only a few. Ideally, such descriptors are invariant to scaling, rotation, illumination changes and local geometric distortion. The main idea is to carefully choose a subset of the features so that this reduced set will be representative of the original image and will be processed instead. Obviously, there are applications in which working on a dense set of features, rather than the sparse subsets mentioned above, is much better, since a larger set of local image descriptors provides more information than the corresponding descriptors evaluated only at selected interest points. In this paper we suggest that instead of choosing a representative set of interest points, possibly reducing the object detection accuracy, one can apply metric preserving compression methods so that a larger number of key points can be processed using the same amount of memory storage.

SIFT vectors are computed for the extracted key points of objects from a set of reference images, which are then stored in a database. An object in a new image is identified after matching its features against this database using the Euclidean L2 distance. In the case of object category or scene classification, experimental evaluations show that better classification results are often obtained by computing the so-called Dense SIFT descriptors (or DSIFT for short) as opposed to SIFT on a sparse set of interest points [3]. The dense sets may contain about 300 times more vectors than the sparse sets.

Query feature compression can contribute to faster retrieval, for example, when the query data is transmitted over a network, as in the case when mobile visual applications are used for identifying products in comparison shopping.


Moreover, since the memory space on the mobile device is restricted, working directly on the compressed form of the data is sometimes required.

A feature descriptor encoder is presented in Chandrasekhar et al. [7]. They transfer the compressed features over the network and decompress them once the data is received, for further pairwise image matching. Chen et al. [8] perform tree-based retrieval, using a scalable vocabulary tree. Since the tree histogram suffices for accurate classification, the histogram is transmitted instead of individual feature descriptors. Also Chandrasekhar et al. [5] encode a set of feature descriptors jointly and use tree-based retrieval when the order in which data is transmitted does not matter, as in our case. Several other SIFT feature vector compressors were proposed, and we refer the reader to [4] for a comprehensive survey.

We propose a special encoding which is not only compact in its representation, but can also be processed directly without any decompression. That is, unlike traditional feature vector compression, which decompresses before applying pairwise matching, the current suggestion omits the decompression stage and performs pairwise matching directly on the compressed data. Similar work, using quantization, has been suggested by Chandrasekhar et al. [6]. We do not apply quantization, but rather use a lossless encoding. Working on a shorter representation and saving the decompression process may save processing time, as well as memory storage. By using a lossless compression and applying the same norm for performing the pairwise matching, we make sure not to hurt the true positive and false negative probabilities. Moreover, representing the same set of feature descriptors in less space can allow us to keep a larger set of representatives, which can result in a higher probability of object identification by reducing the number of mismatches.

The main idea is to perform the matching against the query image in the compressed form of the feature descriptor vectors so that the metric is retained, i.e., vectors are close in the original distance (e.g., Euclidean distance based on nearest neighbors according to the Best-Bin-First-Search algorithm in SIFT [2]) if and only if they are close in their compressed counterparts. This can be done either by using the same metric but requiring that the compression does not affect the metric, or by changing the distance so that the number of false matches and true mismatches does not increase under this new distance. In the present work, we stick to the first alternative and do not change the L2 metric used in SIFT.

For the formal description of the general case, let {f_1, f_2, ..., f_n} be a set of feature descriptor vectors generated using some feature based object detector, and let ‖·‖_M be a metric associated with the pairwise matching of this object detector. The Compressed Feature Based Matching Problem (CFBM) is to find a compression encoding of the vectors, denoted E(f_i), and an equivalent metric ‖·‖_m, so that for every ε > 0 there exists a δ > 0 such that for all i, j ∈ {1, ..., n}

    ‖f_i − f_j‖_M < ε  ⟺  ‖E(f_i) − E(f_j)‖_m < δ.

The rest of the paper is organized as follows.
Section 2 gives a description of our lossless encoding for DSIFT feature vectors; Section 3 suggests improving the compression for PHOW (Pyramid Histogram Of visual Words) vectors [3]; Section 4 presents the algorithm used for compressed pairwise matching of the feature vectors without decompression; Section 5 presents results on the compression performance of our lossless encoding for DSIFT descriptors and PHOW features, and the last section concludes.

2 Lossless Encoding for DSIFT

Given two SIFT feature vectors, we suggested in [12] to achieve our goal by compressing them using a lossless encoding so that the pairwise matching can be done directly on the compressed form of the file, by means of a Fibonacci code. It turns out that the Fibonacci code is also suitable for DSIFT and PHOW feature vectors.

The Fibonacci code is a universal variable length encoding of the integers based on the Fibonacci sequence rather than on powers of 2. A subset of these encodings can be used as a fixed alternative to Huffman codes, giving obviously less compression, but adding simplicity (there is no need to generate a new code every time), robustness and speed [9,10]. The particular property of the binary Fibonacci encoding is that there are no adjacent 1s, so that the string 11 can act like a comma between codewords. More precisely, the codeword set consists of all the binary strings for which the substring 11 appears exactly once, at the left end of the string.

The connection to the Fibonacci sequence can be seen as follows: just as any integer k has a standard binary representation, that is, it can be uniquely represented as a sum of powers of 2, k = Σ_{i≥0} b_i 2^i with b_i ∈ {0, 1}, there is another possible binary representation based on Fibonacci numbers, k = Σ_{i≥0} f_i F(i) with f_i ∈ {0, 1}, where it is convenient to define the Fibonacci sequence here by

    F(0) = 1,  F(1) = 2;  F(i) = F(i − 1) + F(i − 2)  for i ≥ 2.

This Fibonacci representation will be unique if, when encoding an integer, one repeatedly tries to fit in the largest possible Fibonacci number. For example, the largest Fibonacci number fitting into 19 is 13, for the remainder 6 one can use the Fibonacci number 5, and the remainder 1 is a Fibonacci number itself. So one would represent 19 as 19 = 13 + 5 + 1, yielding the binary string 101001. Note that the bit positions correspond to F(i) for increasing values of i from right to left, just as for the standard binary representation, in which 19 = 16 + 2 + 1 would be represented by 10011.

Each such Fibonacci representation starts with a 1, so by preceding it with an additional 1, one gets a sequence of uniquely decipherable codewords. Decoding, however, would not be instantaneous, because the set lacks the prefix property. For example, a first attempt to start the parsing of the encoded string 1101111111110 by 110 11 11 11 11 would fail, because the remaining suffix 10 is not the prefix of any codeword. So only after having read 5 codewords in this case (and the example can obviously be extended) would one know that the correct parsing is 1101 11 11 11 110.

To overcome this problem, the Fibonacci code defined in [9] simply reverses each of the codewords. The adjacent 1s are then at the right instead of at the left end of each codeword, thus yielding the prefix code {11, 011, 0011, 1011, 00011, 10011, 01011, 000011, 100011, 010011, 001011, 101011, 0000011, ...}.

A disadvantage of this reversing process is that the order preserving characteristic of the previous representation is lost, e.g., the codewords corresponding to 17 and 19 are 1010011 and 1001011, but if we compare them as if they were standard binary representations of integers, the first, with value 83, is larger than the second, with value 75. At first sight, this seems to be critical, because we want to compare numbers in order to subtract the smaller from the larger. But in fact, since we calculate the L2 norm, the square of the differences of the coordinates is needed.
It therefore does not matter whether we calculate x − y or y − x, and there is no problem dealing with negative numbers. The reversed representation can therefore be kept.
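A minimal Python sketch of the codeword construction just described (not part of the original paper): greedily Zeckendorf-decompose the integer over F(0) = 1, F(1) = 2, ..., output the bits from least to most significant, and append the terminating 1.

def fib_codeword(k: int) -> str:
    """Reversed Fibonacci codeword of the positive integer k, as in [9]."""
    assert k >= 1
    fibs = [1, 2]                              # F(0) = 1, F(1) = 2, F(i) = F(i-1) + F(i-2)
    while fibs[-1] + fibs[-2] <= k:
        fibs.append(fibs[-1] + fibs[-2])
    while fibs[-1] > k:                        # drop Fibonacci numbers larger than k
        fibs.pop()
    bits, rest = [], k
    for f in reversed(fibs):                   # greedy decomposition, most significant first
        bits.append("1" if f <= rest else "0")
        rest -= f if f <= rest else 0
    return "".join(reversed(bits)) + "1"       # reverse to LSB-first and add the terminator

if __name__ == "__main__":
    print(fib_codeword(19))                                 # expected 1001011, as in the text
    print(fib_codeword(17))                                 # expected 1010011
    print([fib_codeword(k) for k in range(1, 6)])           # 11, 011, 0011, 1011, 00011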

We wish to encode DSIFT and PHOW feature vectors, each consisting of exactly 128 coordinates. Thus, in addition to the ability of parsing an encoded feature vector into its constituent coordinates, separating adjacent vectors can simply be done by counting the number of codewords, which is easily done with a prefix code.

Empirically, DSIFT and PHOW vectors are characterized by having smaller integers appear with higher probability. To illustrate this, we considered the Lenna image (an almost standard compression benchmark) and applied the DSIFT and PHOW applications of the vlfeat Matlab toolbox (http://www.vlfeat.org/) on it, generating 253,009 and 237,182 feature vectors, respectively, of 128 coordinates each. The numbers (thousands of occurrences for values from 2 to 255) are plotted in Figure 1.

Figure 1. Value distribution in feature vectors (Lenna PHOW and Lenna DSIFT).

The usual approach for using a universal code, such as the Fibonacci code, is first sorting the probabilities of the source alphabet symbols in decreasing order and then assigning the universal codewords by increasing codeword lengths, so that high probability symbols are given the shorter codewords. In our case, in order to be able to perform compressed pairwise matching, we omit sorting the probabilities, as already suggested in [11] for Huffman coding. Figure 1 shows that the order is not strictly monotonic, but that the fluctuations are small. Indeed, experimental results show that encoding the numbers themselves instead of their indices has hardly any influence (less than 0.4% on our test images).

3 Further compression of PHOW

DSIFT feature vectors contain repeated zero-runs, as could be expected from the high number of zeros. However, we noticed a difference between the zero-runs of DSIFT and PHOW feature vectors, leading to the idea of representing a pair of adjacent 0s by a single codeword. That is, the pair 00 is assigned the first Fibonacci codeword 11, a single 0 is encoded by the second codeword 011, and generally, the integer k is represented by the Fibonacci codeword corresponding to the integer k + 2, for k ≥ 0.

This special codeword assignment was empirically shown to be useful for PHOW but not for general DSIFT. As an example, consider the 25th PHOW feature vector of Lenna's image, consisting of 128 coordinates, of which the first 20 are

8, 19, 3, 1, 5, 7, 0, 0, 0, 0, 1, 1, 32, 60, 0, 0, 0, 0, 0, 0,...

Instead of encoding this vector as

100011 0101011 1011 011 10011 000011 11 11 11 11 011 011 10101011 1001000011 11 11 11 11 11 11

as we would do for general DSIFT, we rather encode it as

010011 00000011 00011 0011 01011 100011 11 11 0011 0011 000000011 0101000011 11 11 11

reducing 75 bits to 71, rather than 160 bits for the first 20 elements of the original uncompressed PHOW vector using one byte per integer. Note that since all numbers are simply shifted by 1 for the case of DSIFT and by 2 for the case of PHOW, the difference between two Fibonacci encodings is preserved, which is an essential property for computing their distance in the compressed form.
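A sketch of the two codeword assignments in Python (not from the paper; the helper fib_codeword is the same construction as in Section 2, and the greedy pairing of zeros within a run is our assumption, since the example above only shows even-length zero-runs).

def fib_codeword(k: int) -> str:
    """Reversed Fibonacci codeword of the positive integer k (see Section 2)."""
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= k:
        fibs.append(fibs[-1] + fibs[-2])
    while fibs[-1] > k:
        fibs.pop()
    bits, rest = [], k
    for f in reversed(fibs):
        bits.append("1" if f <= rest else "0")
        rest -= f if f <= rest else 0
    return "".join(reversed(bits)) + "1"

def encode_dsift(vector):
    """DSIFT scheme: coordinate k >= 0 gets the codeword of k + 1."""
    return [fib_codeword(k + 1) for k in vector]

def encode_phow(vector):
    """PHOW scheme: a pair of adjacent zeros gets the codeword of 1 ('11'),
    a leftover single zero the codeword of 2 ('011'), and k >= 1 the codeword of k + 2."""
    out, i = [], 0
    while i < len(vector):
        if vector[i] == 0 and i + 1 < len(vector) and vector[i + 1] == 0:
            out.append(fib_codeword(1)); i += 2
        elif vector[i] == 0:
            out.append(fib_codeword(2)); i += 1
        else:
            out.append(fib_codeword(vector[i] + 2)); i += 1
    return out

if __name__ == "__main__":
    v = [8, 19, 3, 1, 5, 7, 0, 0, 0, 0, 1, 1, 32, 60, 0, 0, 0, 0, 0, 0]
    print(" ".join(encode_dsift(v)))
    print(" ".join(encode_phow(v)))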

4 Compressed pairwise matching

The algorithm for computing the subtraction of two Fibonacci encoded coordinates was presented in [12] and is given here for the sake of completeness. We start with a general algorithm, Sub(), for subtraction, which is used in both the DSIFT and the PHOW L2 norm computations. Given two encoded descriptors, one needs to compute their L2 norm: each component is subtracted from the corresponding component, then the squares of these differences are summed. The algorithm for computing the subtraction of two corresponding Fibonacci encoded coordinates A and B is given in Figure 2. We start by stripping the trailing 1s from both, and pad, if necessary, the shorter codeword with zeros at its right end so that both representations are of equal length. Note that the terms first, second and next refer to the order from right to left. At the end of the while loop, there are 2 unread bits left in B, which can be 00, 10 or 01, with values 0, 1 or 2 in the Fibonacci representation, but when read as standard binary numbers, the values are 0, 2 and 1. This is corrected in the commands after the while loop of the algorithm. The evaluation relies on the fact that a 1 in position i of the Fibonacci representation is equivalent to, and can thus be replaced by, 1s in positions i + 1 and i + 2. This allows us to iteratively process the subtraction, independently of the Fibonacci number corresponding to the leading bits of the given numbers. Processing is, therefore, done in time proportional to the size of the compressed file, without any decoding.

To calculate the L2 norm, the two Fibonacci encoded input vectors have to be scanned in parallel from left to right. In each iteration, the first codeword (identified as the shortest prefix ending in 11) is removed from each of the two input vectors, and each pair of coordinates is processed according to the procedure Sub(A, B) above.

Sub(A, B)
    // scan the bits of A and B from right to left
    a1 ← first bit of A
    a2 ← second bit of A
    while next bit of A exists {
        a3 ← next bit of A
        b1 ← next bit of B
        a1 ← a1 − b1
        a2 ← a1 + a2
        a3 ← a1 + a3
        a1 ← a2
        a2 ← a3
    }
    b ← value of last 2 bits of B
    if b ≠ 0 then b ← 2 − b
    return 2·a1 + a2 − b

Figure 2. Compressed differencing of DSIFT and PHOW encoded coordinates.

As opposed to the computation of the DSIFT L2 norm, which simply sums the squares of the differences of two corresponding Fibonacci encoded coordinates, the PHOW encoding works as follows. The codeword 11, representing two consecutive zeros, needs a special treatment only if the other codeword, say B, is not 11. In this case, 11 should be replaced by two codewords 011, each representing a single zero. We thus perform Sub(B, 011), and then concatenate the second 011 in front of the remaining input vector, to be processed in the following iteration. The details appear in Figure 3, where ‖ denotes concatenation and the variable SSQ, accumulating the sum of the squares of the differences, is initialized to 0. At each iteration, the result, S, of the subtraction of the two given Fibonacci encoded numbers is computed by the function Sub(); it is then squared and added to the accumulated value SSQ. By definition, the L2 norm is the square root of the sum of the squares.
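Since the point of Figures 2 and 3 is to avoid any decoding, a decode-based reference (not in the paper; useful only for testing a compressed-domain implementation) can be sketched in Python as follows. It splits a bit stream at each terminating 11, maps codewords back to coordinate values, and computes the L2 distance directly.

import math

def fib_decode(codeword: str) -> int:
    """Integer represented by a reversed Fibonacci codeword (the final '1' is the terminator)."""
    fibs = [1, 2]
    while len(fibs) < len(codeword):
        fibs.append(fibs[-1] + fibs[-2])
    return sum(f for bit, f in zip(codeword[:-1], fibs) if bit == "1")

def split_codewords(bits: str):
    """Split a concatenation of codewords at each occurrence of the terminating '11'."""
    out, start, i = [], 0, 1
    while i < len(bits):
        if bits[i - 1] == "1" and bits[i] == "1":
            out.append(bits[start:i + 1])
            start, i = i + 1, i + 2
        else:
            i += 1
    return out

def decode_phow(bits: str):
    """PHOW scheme: decoded 1 stands for a pair of zeros, k >= 2 for the value k - 2."""
    values = []
    for cw in split_codewords(bits):
        k = fib_decode(cw)
        values.extend([0, 0] if k == 1 else [k - 2])
    return values

def decode_dsift(bits: str):
    """DSIFT scheme: decoded k stands for the value k - 1."""
    return [fib_decode(cw) - 1 for cw in split_codewords(bits)]

def l2(xs, ys):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

if __name__ == "__main__":
    # first codewords of the PHOW example from Section 3
    stream = "010011" "00000011" "00011" "0011" "01011" "100011" "11" "11"
    print(decode_phow(stream))   # expected [8, 19, 3, 1, 5, 7, 0, 0, 0, 0]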

5 Compression performance

We considered three images for our experiments: Lenna, Peppers and House, which were taken from the SIPI (Signal and Image Processing Institute) Image Data Base (http://sipi.usc.edu/database/). We applied DSIFT and PHOW on all images, getting for each 253,009 and 237,182 feature vectors, respectively. For comparison, the number of interest points on which the original SIFT features were generated for the three test images was 737, 832 and 991, respectively.

Table 1 presents the compression performance of our Fibonacci encoding, applied to the coordinates of the DSIFT vectors and suitable for compressed matching, as compared to other compressors. The second column shows the original sizes in MB. The other figures are compression ratios, defined as the original size divided by the compressed size, so that larger numbers indicate better compression. The third column presents the compression performance when each number is represented by its Fibonacci encoding. To evaluate the compression loss due to omitting the sorting of the frequencies, we considered the compression where each symbol is encoded using the Fibonacci codeword assigned according to its position in the list of decreasing order of frequencies. These values appear in the fourth column, headed Ordered. For comparison, the compression achieved by a Huffman code is also included as an upper bound.


PHOWL2Norm(V1, V2)
    SSQ ← 0
    while V1 and V2 are not empty {
        remove first codeword from V1 and assign it to A
        remove first codeword from V2 and assign it to B
        if A ≠ B then
            if A = 11 then
                S ← Sub(B, 011)
                V1 ← 011 ‖ V1
            else if B = 11 then
                S ← Sub(A, 011)
                V2 ← 011 ‖ V2
            else
                S ← Sub(A, B)
            SSQ ← SSQ + S²
    }
    return √SSQ

Figure 3. PHOW compressed L2 norm computation.

Image     Original size   Fibonacci   Ordered   Huffman
Lenna     83.31           3.164       3.164     3.481
Peppers   83.88           3.113       3.114     3.455
House     82.80           3.184       3.185     3.486

Table 1. Compression efficiency of the proposed encodings for DSIFT features.

Table 2 presents the corresponding results for the PHOW vectors. The third column presents the compression performance in which each number is represented by its Fibonacci encoding, using the first codeword for encoding 00. The fourth column is the ordered Fibonacci code as defined for Table 1. Huffman encoding is given in the fifth column; this time it was applied to the alphabet including the symbol 00. As can be seen, encoding the numbers themselves instead of their indices induces a negligible compression loss. The high probability of small integers also reduces the gap between the performances of the Fibonacci and Huffman codes.

Image     Original size   Fibonacci   Ordered   Huffman
Lenna     70.664          3.854       3.859     4.046
Peppers   70.561          3.874       3.883     4.065
House     73.017          3.640       3.646     3.917

Table 2. Compression efficiency of the proposed encodings for PHOW features.

6 Conclusion

We have dealt with the problem of compressing a set of Dense SIFT feature vectors, in particular DSIFT and PHOW features, under the constraint of allowing processing of the data directly in its compressed form. Such an approach is advantageous not only for saving storage space, but also for the manipulation speed, and in fact improves the whole data handling from transmission to processing.

Our solution is based on encoding the vector elements by means of a Fibonacci code, which is generally inferior to Huffman coding from the compression point of view, but has several advantages, turning it into the preferred choice in our case: (a) simplicity – the code is fixed and need not be generated anew for different distributions; (b) the possibility to identify each individual codeword – avoiding the necessity of adding separators, and not requiring a sequential scan; (c) allowing to perform subtractions using the compressed form – and thereby calculating the L2 norm, whereas a Huffman code would have to use some translation table.

The experiments suggest that there is only a negligible loss in compression efficiency, of 1% and even less, relative to the ordered Fibonacci code, and only a small increase, of 4–10%, in the size of the compressed file relative to the size achieved by the optimal Huffman codes, which might be a price worth paying for the improved processing.

References

1. H. Bay, T. Tuytelaars, and L. Van Gool: SURF: Speeded Up Robust Features, in European Conference on Computer Vision (ECCV), 2006, pp. 404–417.
2. J. Beis and D. G. Lowe: Shape indexing using approximate nearest-neighbour search in high-dimensional spaces, in Conference on Computer Vision and Pattern Recognition, 1997, pp. 1000–1006.
3. A. Bosch, A. Zisserman, and X. Munoz: Image classification using random forests and ferns, in Proc. 11th International Conference on Computer Vision (ICCV'07), Rio de Janeiro, Brazil, 2007, pp. 1–8.
4. V. Chandrasekhar, M. Makar, G. Takacs, D. M. Chen, S. S. Tsai, N. M. Cheung, R. Grzeszczuk, Y. A. Reznik, and B. Girod: Survey of SIFT compression schemes, in Proc. Int. Workshop Mobile Multimedia Processing, 2010.
5. V. Chandrasekhar, Y. A. Reznik, G. Takacs, D. M. Chen, S. S. Tsai, R. Grzeszczuk, and B. Girod: Compressing feature sets with digital search trees, in ICCV Workshops, 2011, pp. 32–39.
6. V. Chandrasekhar, G. Takacs, D. M. Chen, S. S. Tsai, Y. A. Reznik, R. Grzeszczuk, and B. Girod: Compressed histogram of gradients: A low-bitrate descriptor. International Journal of Computer Vision, 96(3) 2012, pp. 384–399.
7. V. Chandrasekhar, G. Takacs, D. M. Chen, S. S. Tsai, J. Singh, and B. Girod: Transform coding of image feature descriptors, in SPIE Visual Communications and Image Processing (VCIP), 2009.
8. D. M. Chen, S. S. Tsai, V. Chandrasekhar, G. Takacs, J. P. Singh, and B. Girod: Tree histogram coding for mobile image matching, in Data Compression Conference, DCC–09, 2009, pp. 143–152.
9. A. S. Fraenkel and S. T. Klein: Robust universal complete codes for transmission and compression. Discrete Applied Mathematics, 64 1996, pp. 31–55.

10. S. T. Klein and M. Kopel Ben-Nissan: On the usefulness of Fibonacci compression codes. The Computer Journal, 53 2010, pp. 701–716.
11. S. T. Klein and D. Shapira: Huffman coding with non-sorted frequencies. Mathematics in Computer Science, 5(2) 2011, pp. 171–178.
12. S. T. Klein and D. Shapira: Compressed SIFT feature based matching, to appear in Proc. The Fourth International Conference on Advances in Information Mining and Management, IMMM–14, 2014.
13. D. G. Lowe: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2) 2004, pp. 91–110.
14. K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool: A comparison of affine region detectors. International Journal of Computer Vision, 65(1–2) 2005, pp. 43–72.

Approximation of Greedy Algorithms for Max-ATSP, Maximal Compression, Maximal Cycle Cover, and Shortest Cyclic Cover of Strings

Bastien Cazaux and Eric Rivals⋆

L.I.R.M.M., Université Montpellier II, CNRS U.M.R. 5506, 161 rue Ada, F-34392 Montpellier Cedex 5, France; {cazaux, rivals}@lirmm.fr

Abstract. Covering a directed graph by a Hamiltonian path or a set of words by a superstring belong to well studied optimisation problems that prove difficult to approximate. Indeed, the Maximum Asymmetric Travelling Salesman Problem (Max-ATSP), which asks for a Hamiltonian path of maximum weight covering a digraph, and the Shortest Superstring Problem (SSP), which, for a finite language P := {s_1, ..., s_p}, searches for a string of minimal length having each input word as a substring, are both Max-SNP hard. Finding a short superstring requires to choose a permutation of words and the associated overlaps to minimise the superstring length or to maximise the compression of P. Hence, a strong relation exists between Max-ATSP and SSP, since solving Max-ATSP on the Overlap Graph for P gives a shortest superstring. Numerous works have designed algorithms that improve the approximation ratio but are increasingly complex. Often, these rely on solving the pendant problems where the cover is made of cycles instead of a single path (Max-CC and SCCS). Finally, the greedy algorithm remains an attractive solution for its simplicity and ease of implementation. Its approximation ratios have been obtained by different approaches. In a seminal but complex proof, Tarhio and Ukkonen showed that it achieves a 1/2 compression ratio for Max-CC. Here, using the full power of subset systems, we provide a unified approach for proving simply the approximation ratio of a greedy algorithm for these four problems. Especially, our proof for Maximal Compression shows that the Monge property suffices to derive the 1/2 tight bound.

1 Introduction

Given a set of words P = {s_1, ..., s_p} over a finite alphabet, the Shortest Superstring Problem (SSP) or Maximal Compression (MC) problems ask for a shortest string u that contains each of the given words as a substring. It is a key problem in data compression and in bioinformatics, where it models the question of sequence assembly. Indeed, sequencing machines yield only short reads that need to be aggregated according to their overlaps to obtain the whole sequence of the target molecule [4]. Recent progress in sequencing technologies have permitted an exponential increase in throughput, making acute the need for simple and efficient assembly algorithms.

Two measures can be optimised for SSP: either the length of the superstring is minimised, or the compression is maximised (i.e., ‖S‖ − |u| := Σ_{s_i ∈ S} |s_i| − |u|). Unfortunately, even for a binary alphabet, SSP is NP-hard [3] and MAX-SNP-hard relative to both measures [2]. Among many approximation algorithms, the best known fixed ratios are 2 11/23 for the superstring [10] and 3/4 for the compression [11]. A famous conjecture

⋆ This work is supported by ANR Colib'read (ANR-12-BS02-0008) and Défi MASTODONS SePhHaDe from CNRS.


states that a simple, greedy agglomeration algorithm achieves a ratio 2 for the superstring measure, while it is known to approximate MC tightly with ratio 1/2, but the latter proofs are quite complex, involving many cases of overlaps [13,14]. Figure 2 and Example 7 on page 153 illustrate the difference between the two optimisation measures. The best approximation algorithms use the Shortest Cyclic Cover of Strings (SCCS) as a procedure, which asks for a set of cyclic strings of total minimum length that collectively contain the input words as substrings. The SCCS problem can be solved in polynomial time in ||S|| [12,2]. These problems on strings can be viewed as problems on the Overlap Graph, in which the input words are the nodes, and an arc represents the asymmetric maximum overlap between two words. Figure 1 on p. 150 displays an example of overlap graph. Covering the Overlap Graph with either a maximum weight Hamiltonian path or a maximum weight cyclic cover gives a solution for the problems of Maximal Compression or of Shortest Cyclic Cover of Strings, respectively. This expresses the relation between the Maximum Asymmetric Travelling Salesman Problem (Max-ATSP) and Maximal Compression on one hand, as well as between Maximum Cyclic Cover (Max-CC) and Shortest Cyclic Cover of Strings on the other. Both Max-ATSP and Max-CC have been extensively studied as essential computer science problems. Table 1 presents all these problems and their greedy approximation ratios.

Input: directed graph or set of strings (columns); type of cover: Hamiltonian path or set of cycles (lines).
– Hamiltonian path / directed graph: Maximum Asymmetric Travelling Salesman, ratio 1/3 y [5,14].
– Hamiltonian path / set of strings: Maximal Compression, ratio 1/2 y [13,14]; Shortest Superstring, ratio 7/2 [7].
– Set of cycles / directed graph: Maximum Cyclic Cover (Poly [12]), ratio 1/2 y, here.
– Set of cycles / set of strings: Shortest Cyclic Cover of Strings (Poly), ratio 1 [4].
Table 1: The approximation performance of the greedy algorithm on the five optimisation problems considered here. The input is either a directed graph or a set of strings (in columns), while the type of cover can be a Hamiltonian path or a set of cycles (in lines). For each problem, the best greedy approximation ratio, its tightness, and the bibliographic reference are shown. Highlighted in blue: the approximation bounds for which we provide a proof relying on subset systems. The bound for Maximum Cyclic Cover was open. "Poly" means that the problem is solvable in polynomial time. A "y" after the bound means that it is tight.

Our contributions: Subset systems were introduced recently to investigate the approximation performances of greedy algorithms in a unified framework [8]. As mentioned earlier, the ratios of greedy for the five problems considered (except Max-CC) have been shown with different proofs and using distinct combinatorial properties. With subset systems, we investigate the approximation achieved by greedy algorithms on four of these problems in a unified manner, and provide new and simple proofs of the results mentioned in Table 1. After introducing the required notation and concepts, we study the case of the Max-ATSP and Max-CC problems in Section 2, then we focus on the Maximal Compression problem in Section 3, and state the results regarding Shortest Cyclic Cover of Strings in Section 3.1, before concluding.

Figure 1: Example of an Overlap Graph for the input words P := {baaba, babaa, aabab, babba}.

1.1 Sets, strings, and overlaps.

We denote by #(Λ) the cardinality of any finite set Λ. An alphabet Σ is a finite set of letters. A linear word or string over Σ is a finite sequence of elements of Σ. The set of all finite words over Σ is denoted by Σ⋆, and ε denotes the empty word. For a word x, |x| denotes the length of x. Given two words x and y, we denote by xy the concatenation of x and y. For every 1 ≤ i ≤ j ≤ |x|, x[i] denotes the i-th letter of x, and x[i ; j] denotes the substring x[i]x[i+1]···x[j].
A cyclic string or necklace is a finite string in which the last symbol precedes the first one. It can be viewed as a linear string written on a torus with both ends joined.

Overlaps and agglomeration Let s, t, u be three strings of Σ⋆. We denote by ov(s,t) the maximum overlap from s over t; let pr(s,t) be the prefix of s such that s = pr(s,t).ov(s,t); then we denote the agglomeration of s over t by s ⊕ t := pr(s,t)t. Note that neither the overlap nor the agglomeration is symmetrical. Clearly, one has (s ⊕ t) ⊕ (t ⊕ u) = (s ⊕ t) ⊕ u.
Example 1. Let P := {abbaa, baabb, aabba}. One has ov(abbaa, baabb) = baa and abbaa ⊕ baabb = abbaabb. Considering possible agglomerations of these words, we get w1 = abbaa ⊕ baabb ⊕ aabba = abbaabb ⊕ aabba = abbaabba, w2 = aabba ⊕ abbaa ⊕ baabb = aabbaa ⊕ baabb = aabbaabb, and w3 = baabb ⊕ abbaa ⊕ aabba = baabbaa ⊕ aabba = baabbaabba. Thus, |w1| = |pr(abbaa, baabb)| + |pr(baabb, aabba)| + |aabba| = |ab| + |b| + |aabba| = 2 + 1 + 5 = 8, ||P|| − |w1| = 15 − 8 = 7 and |ov(abbaa, baabb)| + |ov(baabb, aabba)| = |baa| + |aabb| = 3 + 4 = 7.
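To make the notation concrete, here is a small Python sketch (ours, not part of the paper) that computes ov, pr and the agglomeration by brute force; the function names mirror the notation above and the assertions reproduce the values of Example 1.

def ov(s, t):
    # maximum overlap from s over t: the longest suffix of s that is a prefix of t
    for k in range(min(len(s), len(t)), 0, -1):
        if s[-k:] == t[:k]:
            return s[-k:]
    return ""

def pr(s, t):
    # prefix of s such that s = pr(s, t) . ov(s, t)
    return s[:len(s) - len(ov(s, t))]

def agg(s, t):
    # agglomeration of s over t: pr(s, t) followed by t
    return pr(s, t) + t

assert ov("abbaa", "baabb") == "baa"
assert agg("abbaa", "baabb") == "abbaabb"
w1 = agg(agg("abbaa", "baabb"), "aabba")
assert w1 == "abbaabba" and len(w1) == 8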

1.2 Notation on graphs

We consider directed graphs with weighted arcs. A directed graph G is a pair (VG, EG) comprising a set of nodes VG, and a set EG of directed edges called arcs. An arc is an ordered pair of nodes. Let w be a mapping from EG onto the set of non-negative integers (denoted N). The weighted directed graph G := (VG, EG, w) is a directed graph with the weights on its arcs given by w. A route of G is an oriented path of G, that is a subset of VG forming a chain between two nodes at its extremities. A cycle of G is a route of G where the same node is at both extremities. The weight of a route r equals the sum of the weights of its arcs. For simplicity, we extend the mapping w and let w(r) denote the weight of r. We investigate the performances of greedy algorithms for different types of covers of a graph, either by a route or by a set of cycles. Let X be a subset of arcs of G. X covers G if and only if each vertex v of G is the extremity of an arc of X.

1.3 Subset systems, extension, and greedy algorithms

A greedy algorithm builds a solution set by adding selected elements from a finite universe to maximise a given measure. In other words, the solution is iteratively extended. Subset systems are useful concepts to investigate how greedy algorithms can iteratively extend a current solution to a problem. A subset system is a pair (E, L) comprising a finite set of elements E, and a family L of subsets of E satisfying two conditions:
(HS1) L ≠ ∅,
(HS2) if A′ ⊆ A and A ∈ L, then A′ ∈ L, i.e., L is closed under taking subsets.
Let A, B ∈ L. One says that B is an extension of A if A ⊆ B and B ∈ L. A subset system (E, L) is said to be k-extendible if for all C ∈ L and x ∉ C such that C ∪ {x} ∈ L, and for any extension D of C, there exists a subset Y ⊆ D \ C with #(Y) ≤ k satisfying D \ Y ∪ {x} ∈ L.
The greedy algorithm associated with (E, L) and a weight function w is presented in Algorithm 1. Checking whether F ∪ {ei} ∈ L consists in verifying the system's conditions. In the sequel of this paper, we will simply use "the greedy algorithm" to mean the greedy algorithm associated to a subset system, if the system is clear from the context. Mestre has shown that a matroid is a 1-extendible subset system, thereby demonstrating that a subset system is a generalisation of a matroid [8, Theorem 1]. In addition, a theorem from Mestre links k-extendibility and the approximation ratio of the associated greedy algorithm.
Theorem 2 (Mestre [8]). Let (E, L) be a k-extendible subset system. The associated greedy algorithm defined for the problem (E, L) with weights w gives a 1/k approximation ratio.

1.4 Definitions of problems and related work

Graph covers Let G := (VG, EG, w) be a weighted directed graph. The well known Hamiltonian path problem on G requires that the cover is a single path, while the Cyclic Cover problem searches for a cover made of cycles. We consider

Algorithm 1: The greedy algorithm associated with the subset system (E, L) and weight function w.
Input: (E, L)
1 the elements ei of E sorted by non-increasing weight: w(e1) ≥ w(e2) ≥ ··· ≥ w(en)
2 F ← ∅
3 for i = 1 to n do
4   if F ∪ {ei} ∈ L then F ← F ∪ {ei}
5 return F
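As a concrete, if naive, rendering of Algorithm 1 (ours; the oracle name is_independent is a placeholder for the membership test F ∪ {ei} ∈ L of line 4, not a function defined in the paper):

def greedy(elements, weight, is_independent):
    # greedy algorithm associated with a subset system (E, L): scan the
    # elements from heaviest to lightest and keep an element whenever
    # adding it leaves the current solution inside L
    F = set()
    for e in sorted(elements, key=weight, reverse=True):
        if is_independent(F | {e}):
            F.add(e)
    return F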

6 return F the weighted versions of these two problems, where the solution must maximise the weight of the path or the sum of the weights of the cycles, respectively. In a general case, the graph is not symmetrical, and the weight function does not satisfy the Triangle inequality. When a Hamiltonian path is searched for, the problem is known as the Maximum Asymmetric Travelling Salesman Problem or Max-ATSP for short. Definition 3 (Max-ATSP). Let G be a weighted directed graph. Max-ATSP searches for a maximum weight Hamiltonian path on G. Max-ATSP is an important and well studied problem. It is known to be NP-hard and hard to approximate (precisely, Max-SNP hard). The best known approximation ratio of 2/3 is achieved by using a rounding technique on a Linear Programming relaxation of the problem [6]. However, the approximation ratio obtained by a sim- ple greedy algorithm remains an interesting question, especially since other approx- imation algorithms are usually less efficient than a greedy one. In fact, Turner has shown a 1/3 approximation ratio for Max-ATSP [14, Thm 2.4]. As later explained, Max-ATSP is strongly related to the Shortest Superstring Problem and the Maximal Compression problems on strings. If a set of cycles is needed as a cover the graph, the problem is called Maximum Cyclic Cover. In the general setup, cycles made of singletons are allowed in a solution.

Definition 4 (Max Cyclic Cover). Let G be a weighted directed graph. Maximum Cyclic Cover searches for a set of cycles of maximum weight that collectively cover G. To our knowledge, the performance of a greedy algorithm for Maximum Cyclic Cover (Max-CC) has not yet been established, although variants of Max-CC with binary weights or with cycles of predefined lengths have been studied [1].

Superstring and Maximal Compression
Definition 5 (Superstring). Let P = {s1, s2, ..., sp} be a set of p strings of Σ⋆. A superstring of P is a string s′ such that si is a substring of s′ for any i in [1, p].

Let us denote the sum of the lengths of the input strings by ||S|| := Σ_{si∈S} |si|. For any superstring s′, there exists a set {i1, ..., ip} = {1, ..., p} such that s′ = s_{i1} ⊕ s_{i2} ⊕ ··· ⊕ s_{ip}, and then ||S|| − |s′| = Σ_{j=1}^{p−1} |ov(s_{ij}, s_{ij+1})|.
Definition 6 (Shortest Superstring Problem (SSP)). Let p be a positive integer and P := {s1, s2, ..., sp} be a set of p strings over Σ. Find s′ a superstring of P of minimal length.

Two approximation measures can be optimised:

– the length of the obtained superstring, that is |s′|, or
– the compression of the input strings achieved by the superstring: ||P|| − |s′|.
The corresponding approximation problems are termed Shortest Superstring Problem in the first case, or Maximal Compression in the second.

Figure 2: Consider the set P := {abba, bbabab, ababa, babaa}. (a) The string abbababaa is a superstring of P of length 9; the figure shows the order of the words of P in the superstring. (b) The sum of the overlaps between adjacent words in abbababaa equals ||P|| − |abbababaa| = 3 + 4 + 4 = 11.
Figure 2 shows an input for MC or SSP, with a superstring of length 9, which achieves a compression of 11.

Example 7. Let P := {a^k b, b^{k+1}, b c^k} be a set of words; w_g = a^k b c^k b^{k+1} is a superstring found by the greedy algorithm and w_opt = a^k b^{k+1} c^k is an optimal superstring. Thus, the approximation ratio is |w_g|/|w_opt| = (3k+2)/(3k+1) → 1 as k → ∞, and the ratio of compression is 1/2. In other words, a greedy superstring may be almost optimal in length, but its compression is only 1/2.

Both Maximal Compression and the Shortest Superstring Problem are NP-hard [3] and Max-SNP hard [2]. Numerous, complex algorithms have been designed for them, or their variants. Many are quite similar and use a procedure to find a Maximum Cyclic Cover of the input strings. The best known approximation ratio for the Shortest Superstring Problem was obtained in 2012 and equals 2 11/23 [10], although an optimal ratio of 2 has been conjectured since the 1980s [13,2]. For the Maximal Compression problem, a recent algorithm gives a ratio of 3/4 [11]. A seminal work gave a proof of an approximation ratio of 1/2 by an algorithm that iteratively updates the input set by agglomerating two maximally overlapping strings until one string is left [13]. This algorithm was termed greedy but does not correspond to a greedy algorithm, for it modifies the original input set. We demonstrate in the Appendix that this algorithm yields the same result as a greedy algorithm defined for an appropriate subset system. Another proof of this ratio was given in [14]. Both proofs are quite intricate and include many subcases [13]. Thanks to subset systems, we provide a much simpler proof of this approximation ratio for Maximal Compression by a greedy algorithm, as well as an optimal and polynomial time greedy algorithm for the problem of Max Cyclic Cover on Strings.

Definition 8 (Shortest Cyclic Cover of Strings (SCCS)). Let p ∈ N and let P be a set of p linear strings over Σ: P := {s1, s2, ..., sp}. Find a set of cyclic strings of minimal cumulated length such that any input si, with 1 ≤ i ≤ p, is a substring of at least one cyclic string.

Several approximation algorithms for the Shortest Superstring Problem use a procedure to solve SCCS on the instance, which is based on a modification of a polynomial time algorithm for the assignment problem [12,2,4]. This further indicates the importance of SCCS. Both the Maximal Compression and the Shortest Cyclic Cover of Strings problems can be expressed as a cover of the Overlap Graph. In the Overlap Graph, the vertices represent the input strings, and an arc links si to sj with weight |ov(si, sj)|. Hence, the overlap graph is a complete graph with null or positive weights. A Hamiltonian path of this graph provides a permutation of the input strings; by agglomerating these strings in the order given by the permutation one obtains a superstring of P. Hence, the maximum weight Hamiltonian path induces a superstring that accumulates an optimal set of maximal overlaps, in other words a superstring that achieves maximal compression on P. Thus, a ρ approximation for Max-ATSP gives the same ratio for Maximal Compression. The same relation exists between the Shortest Cyclic Cover of Strings and Maximum Cyclic Cover on graphs. Indeed, SCCS optimises ||P|| − Σ_j |cj|, where each cj is a cyclic string in the solution, and Max-CC optimises the cumulated weight of the cycles of G. With the Overlap Graph, a minimal cyclic string is associated to each graph cycle by agglomerating the input strings in this cycle. Thus, the cumulated weight of a set of graph cycles corresponds to the compression achieved by the set of induced cyclic strings. In other words, Shortest Cyclic Cover of Strings could also be called the Maximal Compression Cyclic Cover of Strings problem (and seen as a maximisation problem). The performance of a greedy algorithm for the Shortest Cyclic Cover of Strings problem is declared to be open in [15], while a claim saying that greedy is an exact algorithm for this problem appears in [4].
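For illustration (ours, quadratic and naive), the Overlap Graph can be built directly from its definition; on the words of Figure 1 the sketch reproduces, for instance, the arc of weight 4 from baaba to aabab.

from itertools import permutations

def ov_len(s, t):
    # length of the maximum overlap from s over t
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

def overlap_graph(words):
    # complete directed graph on the input words, arc (s, t) weighted by |ov(s, t)|
    return {(s, t): ov_len(s, t) for s, t in permutations(words, 2)}

G = overlap_graph(["baaba", "babaa", "aabab", "babba"])
assert G[("baaba", "aabab")] == 4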

2 Maximum Asymmetric Travelling Salesman and Maximum Cyclic Cover Problems

Let w be a mapping from EG onto the set of non negative integers and let G := (VG,EG,w) be a directed graph with the weights on its arcs given by w. We first define a subset system for Max-ATSP and its accompanying greedy algorithm.

Definition 9. We define the pair (EG, LS), where LS is the family of subsets F of EG satisfying
(L1) ∀ x, y and z ∈ VG, (x,z) and (y,z) ∈ F implies x = y,
(L2) ∀ x, y and z ∈ VG, (z,x) and (z,y) ∈ F implies x = y,
(L3) for any r ∈ N⋆, there does not exist any cycle ((x1,x2), ..., (x_{r−1},xr), (xr,x1)) in F, where ∀ k ∈ {1, ..., r}, xk ∈ VG.
Remark 10.
– In other words, for a subset F of EG, Condition (L1) (resp. (L2)) allows only one ingoing (resp. outgoing) arc for each vertex of G.
– For all F ∈ LS and for any v ∈ VG, the arc (v,v) cannot belong to F, by Condition (L3) for r = 1.
– If in condition (L3) one changes the set of forbidden values for r, the subset system addresses a different problem. As the proofs in this section do not depend on r, all results remain valid for these problems as well. For instance, with r ∈ {1}, only cycles of length one are forbidden; the solution is either a maximal path or

cyclic cover with cycles of length larger than one. The 1/3 approximation ratio obtained in Theorem 13 remains valid. We will consider later the case where all cycles are allowed (i.e., r ∈ ∅).
Proposition 11. (EG, LS) is a subset system.
Proof. For (HS1), it suffices to note that ∅ ∈ LS. For (HS2), we must show that each subset of an element of LS is an element of LS. This is true since Conditions (L1), (L2), and (L3) are inherited by any subset of an element of LS.
Proposition 12 shows that the defined subset system is 3-extendible.

Proposition 12. (E , ) is 3-extendible. G LS

Proof. Let C ∈ LS and e ∉ C such that C ∪ {e} ∈ LS. Let D be an extension of C. One must show that there exists a subset Y ⊆ D \ C with #(Y) ≤ 3 such that D \ Y ∪ {e} belongs to LS.
As e ∈ EG, there exist x and y such that e = (x,y). Let Y be the set of elements of D \ C of the form (x,z), (z,y), and (z,x) for any z ∈ VG, where (z,x) belongs to a cycle in D ∪ {e}. As D is an extension of C, D belongs to LS and satisfies conditions (L1) and (L2). Hence, #(Y) ≤ 3.
It remains to show that D \ Y ∪ {e} ∈ LS. As C ∪ {e} ∈ LS, C ∪ {e} satisfies conditions (L1) and (L2), so we know that for each z ∈ VG \ {x,y}, the arcs (x,z) and (z,y) are not in C.
By the definition of Y, for each z ∈ VG, we have that (x,z) and (z,y) ∉ D \ C. Therefore, for all z ∈ VG, (x,z) and (z,y) ∉ D \ Y. Hence, D \ Y ∪ {e} satisfies conditions (L1) and (L2).
Now assume that D \ Y ∪ {e} violates Condition (L3). As D ∈ LS, D satisfies condition (L3) and D \ Y does too. The only element that can generate a cycle is e. As C ∪ {e} ∈ LS, e does not generate a cycle in C ∪ {e}, which implies that it generates a cycle in D \ (C ∪ Y). Hence, there exists z ∈ VG such that (z,x) ∈ D \ (C ∪ Y), which contradicts the definition of Y.

Now we derive the approximation ratio of the greedy algorithm for Max-ATSP. Another proof for this result originally published by [5] is given in [8, Theorem 6].

Theorem 13. The greedy algorithm of (EG, LS) yields a 1/3 approximation ratio for Max-ATSP.

Proof. By Proposition 12, (EG, LS) is 3-extendible. A direct application of Mestre's theorem (Theorem 2) yields the 1/3 approximation ratio for Max-ATSP.

Case of the Maximum Cyclic Cover problem If in condition (L3) we ask that r ∈ ∅, then (L3) is not a constraint anymore and all cycles are allowed. This defines a new subset system, denoted by (EG, LC). As in the proof of Proposition 12, it suffices now to set Y := {(x,z), (z,y)} (one does not need to remove an element of a cycle), and thus #(Y) ≤ 2. It follows that (EG, LC) is 2-extendible and that the greedy algorithm achieves a 1/2 approximation ratio for the Maximum Cyclic Cover problem.

3 Maximal Compression and Shortest Cyclic Cover of Strings

Blum and colleagues [2] have designed an algorithm called greedy that iteratively constructs a superstring for both the Shortest Superstring Problem and Maximal Compression problems. As mentioned in the introduction, this algorithm is not a greedy algorithm per se. Below, we define a subset system corresponding to that of Max-ATSP for the Overlap Graph, and study the approximation of the associated greedy algorithm. Before being able to conclude on the approximation ratio of the greedy algorithm of [2], we need to prove that greedy computes exactly the same superstring as the greedy algorithm of the subset system of Definition 14. This proof is given in the Appendix. Knowing that these two algorithms are equivalent in terms of output, the approximation ratio of Theorem 18 is valid for both of them.
From now on, let P := {s1, s2, ..., sp} be a set of p strings of Σ⋆. The subset system for Maximal Compression is similar to that of Max-ATSP. For any two strings s and t, s ⊙ t represents the maximum overlap of s over t¹. We set EP = {si ⊙ sj | si and sj ∈ P}. Hence, EP is the set of maximum overlaps between any two words of S.
Definition 14 (Subset system for Maximal Compression). Let LP be the set of F ⊆ EP such that:
(L1) ∀ si, sj and sk ∈ S, si ⊙ sk and sj ⊙ sk ∈ F ⇒ i = j, i.e., for each string, there is only one overlap to the left,
(L2) ∀ si, sj and sk ∈ S, sk ⊙ si and sk ⊙ sj ∈ F ⇒ i = j, and only one overlap to the right,
(L3) for any r ∈ N⋆, there exists no cycle (s_{i1} ⊙ s_{i2}, ..., s_{i_{r−1}} ⊙ s_{ir}, s_{ir} ⊙ s_{i1}) in F, such that ∀ k ∈ {1, ..., r}, s_{ik} ∈ S.

For each set F := {s_{i1} ⊙ s_{i2}, ..., s_{i_{p−1}} ⊙ s_{ip}} that is an inclusion-wise maximal element of LP, we denote by l(F) the superstring of S obtained by agglomerating the input strings of P according to the order induced by F: l(F) := s_{i1} ⊕ s_{i2} ⊕ ··· ⊕ s_{ip}.
First, knowing that Maximal Compression is equivalent to Max-ATSP on the Overlap Graph (see Section 1.4), we get a 1/3 approximation ratio for Maximal Compression as a corollary of Theorem 13. Another way to obtain this ratio is to show that the subset system is 3-extendible (the proof is identical to that of Proposition 12) and then use Theorem 2. However, the following example shows that the system (EP, LP) is not 2-extendible.
Example 15. The subset system (EP, LP) is not 2-extendible. Let P := {s1, s2, s3, s4, s5}, C := ∅, and x := s1 ⊙ s2. Then clearly C ∪ {x} belongs to LP and the set D := {s1 ⊙ s3, s4 ⊙ s2, s5 ⊙ s1, s2 ⊙ s5} is an extension of C. However, when searching for a set Y included in D \ C = D such that (D \ Y) ∪ {x} ∈ LP, then s1 ⊙ s3 and s4 ⊙ s2 must be removed to avoid violating (L1) or (L2), and at least one among s5 ⊙ s1, s2 ⊙ s5 must be removed to avoid violating (L3). It follows that #(Y) ≥ 3.
1 The notation s ⊙ t represents the fact that s can be aggregated with t according to their maximal overlap. ov(s,t) is a word representing a maximum overlap between s and t. Hence, s ⊙ t differs from ov(s,t).

To prove a better approximation ratio for the greedy algorithm, we will need the Monge inequality [9] adapted to word overlaps.

Lemma 16. Let s1, s2, s3 and s4 be four different words satisfying |ov(s1,s2)| ≥ |ov(s1,s4)| and |ov(s1,s2)| ≥ |ov(s3,s2)|. Then we have:
|ov(s1,s2)| + |ov(s3,s4)| ≥ |ov(s1,s4)| + |ov(s3,s2)|.
When, for three sets A, B, C, we write A ∪ B \ C, it means (A ∪ B) \ C. Let A ∈ LP and let opt(A) denote an extension of A of maximum weight. Thus, opt(∅) is an element of LP of maximum weight. The next lemma follows from this definition.
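The Monge inequality of Lemma 16 can be checked exhaustively on small binary words; this brute-force verification is ours and is only meant to make the statement tangible.

from itertools import product

def ov_len(s, t):
    # length of the maximum overlap from s over t
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

words = ["".join(w) for n in range(1, 4) for w in product("ab", repeat=n)]
for s1, s2, s3, s4 in product(words, repeat=4):
    if len({s1, s2, s3, s4}) == 4 and \
       ov_len(s1, s2) >= ov_len(s1, s4) and ov_len(s1, s2) >= ov_len(s3, s2):
        # |ov(s1,s2)| + |ov(s3,s4)| >= |ov(s1,s4)| + |ov(s3,s2)|
        assert ov_len(s1, s2) + ov_len(s3, s4) >= ov_len(s1, s4) + ov_len(s3, s2)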

Lemma 17. Let F ∈ LP and x ∈ EP; then w(opt(F ∪ {x})) ≤ w(opt(F)).
Now we can prove a better approximation ratio.

Theorem 18. The approximation ratio of the greedy algorithm for the Maximal Compression problem is 1/2.

Proof. To prove this ratio, we revisit the proof of Theorem 2 in [8]. Let x1, x2, ..., xl denote the elements in the order in which the greedy algorithm includes them in its solution F, and let F0 := ∅, ..., Fl denote the successive values of the set F during the algorithm; in other words Fi := F_{i−1} ∪ {xi} (see Algorithm 1 on p. 152). The structure of the proof is first to show, for any element xi incorporated by the greedy algorithm, the inequality w(opt(F_{i−1})) ≤ w(opt(Fi)) + w(xi), and second, to reason by induction on the sets Fi starting with F0.
One knows that opt(F_{i−1}) is an extension of F_{i−1}. By the greedy algorithm and by the definitions of F_{i−1} and xi, one gets F_{i−1} ∪ {xi} ∈ LP. As xi ∈ EP, there exist sp and so such that xi = sp ⊙ so. As in the proof of Proposition 12, let Yi denote the subset of elements of opt(F_{i−1}) \ F_{i−1} of the form sp ⊙ sk, sk ⊙ so, or sk ⊙ sp, where sk ⊙ sp belongs to a cycle in opt(F_{i−1}) ∪ {xi}. Thus, opt(F_{i−1}) \ Yi ∪ {xi} ∈ LP, and

w(opt(F_{i−1})) = w(opt(F_{i−1}) \ Yi ∪ {xi}) + w(Yi) − w(xi)
               ≤ w(opt(Fi)) + w(Yi) − w(xi).

Indeed, w(opt(F_{i−1}) \ Yi ∪ {xi}) ≤ w(opt(Fi)) because opt(F_{i−1}) \ Yi ∪ {xi} is an extension of F_{i−1} ∪ {xi} and because opt(Fi) is an extension of maximum weight of F_{i−1} ∪ {xi}.
Now let us show by contraposition that for any element y ∈ Yi, w(y) ≤ w(xi). Assume that there exists y ∈ Yi such that w(y) > w(xi). As y ∉ F_{i−1}, y has already been considered by the greedy algorithm and not incorporated in F. Hence, there exists j ≤ i such that Fj ∪ {y} ∉ LP, but Fj ∪ {y} ⊆ opt(F_{i−1}) ∈ LP, which is a contradiction. Thus, we obtain w(y) ≤ w(xi) for any y ∈ Yi.
Now we know that #(Yi) ≤ 3. Let us inspect two subcases.
Case 1: #(Yi) ≤ 2.
We have w(Yi) ≤ 2w(xi), hence w(opt(F_{i−1})) ≤ w(opt(Fi)) + w(xi).

Case 2: #(Yi) = 3.

There exist sk and sk′ such that sp ⊙ sk′ and sk ⊙ so are in Yi. By Lemma 16, we have w(xi) + w(sk ⊙ sk′) ≥ w(sp ⊙ sk′) + w(sk ⊙ so). As sp ⊙ sk′ and sk ⊙ so belong to opt(F_{i−1}), one deduces sk ⊙ sk′ ∉ opt(F_{i−1}).
We get opt(F_{i−1}) \ Yi ∪ {xi, sk ⊙ sk′} ∈ LP. Indeed, as Yi ⊆ opt(F_{i−1}), neither a right overlap of sk, nor a left overlap of sk′ can belong to opt(F_{i−1}) \ Yi. Furthermore, adding sk ⊙ sk′ to opt(F_{i−1}) \ Yi ∪ {xi} cannot create a cycle, since otherwise a cycle would have already existed in opt(F_{i−1}). This situation is illustrated in Figure 3.
We have w(opt(F_{i−1}) \ Yi ∪ {xi, sk ⊙ sk′}) ≤ w(opt(F_{i−1} ∪ {xi, sk ⊙ sk′})), because opt(F_{i−1}) \ Yi ∪ {xi, sk ⊙ sk′} is an extension of F_{i−1} ∪ {xi, sk ⊙ sk′} and opt(F_{i−1} ∪ {xi, sk ⊙ sk′}) is a maximum weight extension of F_{i−1} ∪ {xi, sk ⊙ sk′}. As w(opt(F_{i−1} ∪ {xi, sk ⊙ sk′})) ≤ w(opt(F_{i−1} ∪ {xi})), by Lemma 17 one gets:

w(opt(F_{i−1})) = w(opt(F_{i−1}) \ Yi ∪ {xi, sk ⊙ sk′}) + w(Yi) − w(xi) − w(sk ⊙ sk′)
               ≤ w(opt(F_{i−1} ∪ {xi, sk ⊙ sk′})) + w(Yi) − w(xi) − w(sk ⊙ sk′)
               ≤ w(opt(Fi)) + w(Yi) − w(xi) − w(sk ⊙ sk′).
As Yi = {sp ⊙ sk′, sk ⊙ so, sk′′ ⊙ sp}, one obtains
w(opt(F_{i−1})) ≤ w(opt(Fi)) − w(sk ⊙ sk′) + w(Yi) − w(xi)
               ≤ w(opt(Fi)) − w(sk ⊙ sk′) + w(sp ⊙ sk′) + w(sk ⊙ so) + w(sk′′ ⊙ sp) − w(xi)
               ≤ w(opt(Fi)) + w(sk′′ ⊙ sp)
               ≤ w(opt(Fi)) + w(xi).
Remembering that opt(∅) is an optimum solution, by induction one gets
w(opt(F0)) ≤ w(opt(Fl)) + Σ_{i=1}^{l} w(xi)
           ≤ w(Fl) + w(Fl)
           ≤ 2w(Fl).

We can substitute w(opt(Fl)) by w(Fl) since Fl has a maximal weight by definition. Let sopt be an optimal solution for Maximal Compression, so that ||P|| − |sopt| = w(opt(∅)). As Fl is maximum, l(Fl) is the superstring of P output by the greedy algorithm and thus ||P|| − |l(Fl)| = w(Fl). Therefore,
(1/2)(||P|| − |sopt|) ≤ ||P|| − |l(Fl)|.
Finally, we obtain the desired ratio: the greedy algorithm of the subset system achieves an approximation ratio of 1/2 for the Maximal Compression problem.

3.1 Shortest Cyclic Cover of Strings

A solution for MC must avoid overlaps forming cycles in the constructed superstring. However, for the Shortest Cyclic Cover of Strings problem, cycles of any positive length are allowed. As in Definition 14, we can define a subset system for SCCS as the pair (EP, LC), where LC is now the set of F ⊆ EP satisfying only conditions (L1) and (L2). A solution for this system with the weights defined as the length of maximal


Figure 3: Impossibility of creating a cycle by adding sk ⊙ sk′ to opt(F_{i−1}) \ Yi ∪ {xi} without an already existing cycle in opt(F_{i−1}). Since we are adding xi to opt(F_{i−1}), we need to remove the three elements in red: sk′′ ⊙ sp, sp ⊙ sk′, sk ⊙ so.
overlaps is a set of cyclic strings containing the input words of P as substrings. One can see that the proof of Theorem 18 giving the 1/2 ratio for MC can be simplified to show that the greedy algorithm associated with the subset system (EP, LC) achieves a 1/1 approximation ratio, in other words exactly solves SCCS.

Theorem 19. The greedy algorithm of (EP, LC) exactly solves the Shortest Cyclic Cover of Strings problem in polynomial time.
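A naive rendering of this greedy algorithm follows (ours; it uses quadratic overlap computation rather than the linear-time machinery the polynomial bound relies on). Arcs of the Overlap Graph are scanned by non-increasing weight under conditions (L1) and (L2) only, the resulting permutation is decomposed into cycles, and each cycle is turned into a cyclic string by concatenating the prefixes pr(·,·). Giving a self-arc the length of the longest proper border, so that a word left alone in a cycle wraps on itself, is our modelling choice for the sketch.

def overlap(s, t, proper):
    # longest overlap from s over t; restricted to proper overlaps for self-arcs
    top = min(len(s), len(t)) - (1 if proper else 0)
    for k in range(top, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

def greedy_cyclic_cover(words):
    n = len(words)
    w = {(i, j): overlap(words[i], words[j], proper=(i == j))
         for i in range(n) for j in range(n)}
    succ, has_in = {}, set()
    for (i, j), _ in sorted(w.items(), key=lambda a: -a[1]):
        if i not in succ and j not in has_in:   # conditions (L1) and (L2)
            succ[i] = j
            has_in.add(j)
    covers, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = succ[i]
        covers.append("".join(words[i][:len(words[i]) - w[(i, succ[i])]]
                              for i in cycle))
    return covers

# e.g. greedy_cyclic_cover(["aba", "bab"]) returns ["ab"], a cyclic string of length 2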

4 Conclusion

Greedy algorithms are algorithmically simpler, and usually easier to implement, than more complex approximation algorithms [7,2,6,10,11]. In this work, we investigated the approximation ratio of greedy algorithms on several well known problems using the power of subset systems. Our major result is to prove these ratios with a unified and simple line of proof. Moreover, this approach can likely be reused for variants of these problems [1]. For the cover of graphs with a maximum weight Hamiltonian path or set of cycles, the subset system and its associated greedy algorithm provide an approximation ratio for a variety of problems, since distinct kinds of cycles can be forbidden in the third condition of the subset system (see Def. 9 on p. 154). For the general Maximum Asymmetric Travelling Salesman Problem, it achieves a 1/3 ratio, and a 1/2 ratio for the Maximum Cyclic Cover problem. Today, the upper and lower bounds of approximation are still being refined for the Shortest Superstring Problem and Maximal Compression problems. It is important to know how good greedy algorithms are. Here, we have shown that the greedy algorithm solves the Shortest Cyclic Cover of Strings problem exactly, and gave an alternative proof of the 1/2 approximation ratio for Maximal Compression (Theorem 18). The latter is important for it shows that, besides 3-extendibility, one only needs the Monge property to achieve this bound. It also illustrates how a combinatorial property that is problem specific can help to extend the approach of Mestre, while still using the theory of subset systems [8].

References

1. M. Bläser and B. Manthey: Approximating maximum weight cycle covers in directed graphs with weights zero and one. Algorithmica, 42(2) 2005, pp. 121–139.
2. A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis: Linear approximation of shortest superstrings, in ACM Symposium on the Theory of Computing, 1991, pp. 328–336.

3. J. Gallant, D. Maier, and J. A. Storer: On finding minimal length superstrings. Journal of Computer and System Sciences, 20 1980, pp. 50–58.
4. D. Gusfield: Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.
5. T. A. Jenkyns: The greedy travelling salesman's problem. Networks, 9(4) 1979, pp. 363–373.
6. H. Kaplan, M. Lewenstein, N. Shafrir, and M. Sviridenko: Approximation algorithms for asymmetric TSP by decomposing directed regular multigraphs. J. of Association for Computing Machinery, 52(4) July 2005, pp. 602–626.
7. H. Kaplan and N. Shafrir: The greedy algorithm for shortest superstrings. Information Processing Letters, 93(1) 2005, pp. 13–17.
8. J. Mestre: Greedy in Approximation Algorithms, in Proceedings of 14th Annual European Symposium on Algorithms (ESA), vol. 4168 of Lecture Notes in Computer Science, Springer, 2006, pp. 528–539.
9. G. Monge: Mémoire sur la théorie des déblais et des remblais, in Mémoires de l'Académie Royale des Sciences, 1781, pp. 666–704.
10. M. Mucha: Lyndon words and short superstrings, in Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, 2013, pp. 958–972.
11. K. E. Paluch: Better approximation algorithms for maximum asymmetric traveling salesman and shortest superstring. CoRR, abs/1401.3670 2014.
12. C. H. Papadimitriou and K. Steiglitz: Combinatorial Optimization: Algorithms and Complexity, Dover Publications, Inc., 2nd ed., 1998, 496 p.
13. J. Tarhio and E. Ukkonen: A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science, 57 1988, pp. 131–145.
14. J. S. Turner: Approximation algorithms for the shortest common superstring problem. Information and Computation, 83(1) Oct. 1989, pp. 1–20.
15. M. Weinard and G. Schnitger: On the greedy superstring conjecture. SIAM Journal on Discrete Mathematics, 20(2) 2006, pp. 502–522.

Appendix

Here, we prove that the algorithm greedy defined by Tarhio and Ukkonen [13] and studied by Blum and colleagues [2] for the Maximal Compression problem computes exactly the same superstring as the greedy algorithm of the subset system (EP, LP) (see Definition 14 on p. 156). This shows that these two algorithms are equivalent in terms of output and that the approximation ratio of 1/2 of Theorem 18 is valid for both of them. Recall that the input, P := {s1, s2, ..., sp}, is a set of p strings of Σ⋆.
Proposition 20. Let F be a maximal element for inclusion of LP. Then there exists a permutation of the input strings, that is a set {i1, ..., ip} = {1, ..., p}, such that

F = {s_{i1} ⊙ s_{i2}, s_{i2} ⊙ s_{i3}, ..., s_{i_{p−1}} ⊙ s_{ip}}.

Proof. By condition (L3), cycles are forbidden in F. Hence there exist s_{d1}, sx ∈ S such that s_{d1} ⊙ sx ∈ F, and for all sy ∈ S, sy ⊙ s_{d1} ∉ F.
Thus, let (ij)_{j∈I} be the sequence of elements of P such that i1 = d1, for all j ∈ I such that j+1 ∈ I, s_{ij} ⊙ s_{i_{j+1}} ∈ F, and the size of I is maximum. As F has no cycle (condition (L3)), I is finite; then let us denote by t1 its largest element. We have, for all sy ∈ P, s_{t1} ⊙ sy ∉ F. Hence, ∪_{j∈I} ij is the interval comprised between s_{d1} and s_{t1}.
Assume that F \ {∪_{j∈I} ij} ≠ ∅. We iterate the reasoning by taking the interval between s_{d2} and s_{t2}, and so on until F is exhausted. We obtain that F is the set of intervals between s_{di} and s_{ti}. By conditions (L1) and (L2), s_{t1} (resp. s_{d2}) is in the interval between s_{dj} and s_{tj} ⇒ j = 1 (resp. j = 2). As s_{t1} ⊙ s_{d2} ∈ E, and F ∪ {s_{t1} ⊙ s_{d2}} ∈ LP, F is not maximal, which contradicts our hypothesis.
We obtain that F \ {∪_{j∈I} ij} = ∅, hence the result.

For each set F := {s_{i1} ⊙ s_{i2}, ..., s_{i_{p−1}} ⊙ s_{ip}} that is a maximal element of LP for inclusion, recall that l(F) denotes the superstring of S obtained by agglomerating the input strings of P according to the order induced by F:

l(F) := s_{i1} ⊕ s_{i2} ⊕ ··· ⊕ s_{ip}.
The algorithm greedy takes from the set P two words u and v having the largest maximum overlap, replaces u and v with u ⊕ v in P, and iterates until P is a singleton.
Proposition 21. Let F be the output of the greedy algorithm of the subset system (EP, LP), and S the output of Algorithm greedy for the input P. Then S = {l(F)}.

Proof. First, observe that for any i between 1 and p, there exist sj and sk such that ei = sj ⊙ sk. If F ∪ {ei} ∈ LP, then by Conditions (L1) and (L2), any other left overlap of sk and any other right overlap of sj are prohibited in the following steps. As cycles are forbidden by condition (L3), one will finally obtain the same superstring by exchanging the pair sj and sk with sj ⊕ sk in E.
The algorithm greedy from [13] can be seen as the greedy algorithm of the subset system (EP, LP). By the definition of the weight w, the latter also answers the Maximal Compression problem. Both algorithms are thus equivalent.

Closed Factorization⋆

Golnaz Badkobeh1, Hideo Bannai2, Keisuke Goto2, Tomohiro I2, Costas S. Iliopoulos3, Shunsuke Inenaga2, Simon J. Puglisi4, and Shiho Sugimoto2

1 Department of Computer Science, University of Sheffield, United Kingdom, [email protected]
2 Department of Informatics, Kyushu University, Japan, {bannai,keisuke.gotou,tomohiro.i,inenaga,shiho.sugimoto}@inf.kyushu-u.ac.jp
3 Department of Informatics, King's College London, United Kingdom, [email protected]
4 Department of Computer Science, University of Helsinki, Finland, [email protected]

Abstract. A closed string is a string with a proper substring that occurs in the string as a prefix and a suffix, but not elsewhere. Closed strings were introduced by Fici (Proc. WORDS, 2011) as objects of combinatorial interest in the study of Trapezoidal and Sturmian words. In this paper we consider algorithms for computing closed factors (substrings) in strings, and in particular for greedily factorizing a string into a sequence of longest closed factors. We describe an algorithm for this problem that uses linear time and space. We then consider the related problem of computing, for every position in the string, the longest closed factor starting at that position. We describe a simple algorithm for the problem that runs in O(n log n/ log log n) time.

1 Introduction

A closed string is a string with a proper substring that occurs as a prefix and a suffix but does not have internal occurrences. Closed strings were introduced by Fici [3] as objects of combinatorial interest in the study of Trapezoidal and Sturmian words. Since then, Badkobeh, Fici, and Liptak [1] have proved a tight lowerbound for the number of closed factors (substrings) in strings of given length and alphabet. In this paper we initiate the study of algorithms for computing closed factors. In particular we consider two algorithmic problems. The first, which we call the closed factorization problem, is to greedily factorize a given string into a sequence of longest closed factors (we give a formal definition of the problem below, in Section 2). We describe an algorithm for this problem that uses O(n) time and space, where n is the length of the given string. The second problem we consider is the closed factor array problem, which requires us to compute the length of the longest closed factor starting at each position in the

⋆ This research is partially supported by the Academy of Finland through grants 258308 and 250345 (CoECGR), and by the Japan Society for the Promotion of Science.



input string. We show that this problem can be solved in O(n log n / log log n) time, using techniques from computational geometry. This paper proceeds as follows. In the next section we set notation, define the problems more formally, and outline basic data structures and concepts. Section 3 describes an efficient solution to the closed factorization problem and Section 4 then considers the closed factor array. Reflections and outlook are offered in Section 5.

2 Preliminaries

2.1 Strings and Closed Factorization

Let Σ denote a fixed integer alphabet. An element of Σ∗ is called a string. For any strings W, X, Y, Z such that W = XYZ, the strings X, Y, Z are respectively called a prefix, substring, and suffix of W. The length of a string X will be denoted by |X|. Let ε denote the empty string of length 0, i.e., |ε| = 0. For any non-negative integer n, X[1,n] denotes a string X of length n. A prefix X of a string W with |X| < |W| is called a proper prefix of W. Similarly, a suffix Z of W with |Z| < |W| is called a proper suffix of W. For any string X and integer 1 ≤ i ≤ |X|, let X[i] denote the ith character of X, and for any integers 1 ≤ i ≤ j ≤ |X|, let X[i..j] denote the substring of X that starts at position i and ends at position j. For convenience, let X[i..j] be the empty string if j < i.

We remark that a closed factor Gj is a single character if and only if |G1 ··· G_{j−1}| + 1 is the rightmost (last) occurrence of character X[|G1 ··· G_{j−1}| + 1] in X. We also define the longest closed factor array of string X[1,n].
Definition 3 (Longest Closed Factor Array). The longest closed factor array of X[1,n] is an array A[1,n] of integers such that for any 1 ≤ i ≤ n, A[i] = ℓ if and only if ℓ is the length of the longest prefix of X[i..n] that is closed.
Example 4. For string X = ababaacbbbcbcc$, A = [5, 4, 3, 5, 2, 1, 6, 3, 2, 4, 3, 1, 2, 1, 1].
Clearly, given the longest closed factor array A[1,n] of string X, CF(X) can be computed in O(n) time. However, the algorithm we describe in Section 4 to compute A[1,n] requires O(n log n / log log n) time, and so using it to compute CF(X) would also take O(n log n / log log n) time overall. In Section 3 we present an optimal O(n)-time algorithm to compute CF(X) that does not require A[1,n].
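The definitions can be checked against Example 4 with a brute-force reference implementation (ours, far from linear time, intended only to illustrate the notions of closed factor, of A, and of CF(X)):

def is_closed(s):
    # strings of length <= 1 are closed; otherwise the longest border must
    # occur only as a prefix and as a suffix (no internal occurrence)
    if len(s) <= 1:
        return True
    for b in range(len(s) - 1, 0, -1):
        if s[:b] == s[-b:]:
            return s.find(s[:b], 1) == len(s) - b
    return False

def closed_factor_array(x):
    n = len(x)
    return [max(l for l in range(1, n - i + 1) if is_closed(x[i:i + l]))
            for i in range(n)]

def closed_factorization(x):
    a, factors, i = closed_factor_array(x), [], 0
    while i < len(x):
        factors.append(x[i:i + a[i]])
        i += a[i]
    return factors

X = "ababaacbbbcbcc$"
assert closed_factor_array(X) == [5, 4, 3, 5, 2, 1, 6, 3, 2, 4, 3, 1, 2, 1, 1]
assert closed_factorization(X) == ["ababa", "a", "cbbbcb", "cc", "$"]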


Figure 1. The suffix tree of string X = ababaacbbbcbcc$, where each leaf stores the beginning position of the corresponding suffix. All the branches from an internal node are sorted in ascending lexicographical order, assuming $ is the lexicographically smallest. The suffix array SA of X is [15, 5, 3, 1, 6, 4, 2, 8, 9, 10, 12, 14, 7, 11, 13], which corresponds to the sequence of leaves from left to right, and thus can be computed in linear time by a depth first traversal on the suffix tree.

2.2 Tools

The suffix array [5] SAX (we drop subscripts when they are clear from the context) of a string X is an array SA[1..n] which contains a permutation of the integers [1..n] such that X[SA[1]..n] < X[SA[2]..n] < < X[SA[n]..n]. In other words, SA[j] = i iff X[i..n] is the jth suffix of X in ascending··· lexicographical order. The suffix tree [8] of a string X[1,n] is a compacted trie consisting of all suffixes of X. Suffix trees can be represented in linear space, and can be constructed in linear time for integer alphabets [2]. Figure 1 illustrates the suffix tree for an example string. For a node w in the suffix tree let pathlabel(w) be the string spelt out by the letters on the edges on the path from the root to w. If there is a branch from node u to node v and u is an ancestor of v then we say u = parent(v). Assuming that every string X terminates with a special character $ which occurs nowhere else in X, there is a one-to-one correspondence between the suffixes of X and the leaves of the suffix tree of X. We assume the branches from a node u to each child v of u are stored in ascending lexicographical order of pathlabel(v). When this is the case, SA is simply the leaves of the suffix tree when read during a depth-first traversal. At each internal node v in the suffix tree we store two additional values v.s and v.e such that SA[v.s..v.e] contains the beginning positions of all the suffixes in the subtree rooted at v.
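For the running example, the suffix array given in the caption of Figure 1 can be reproduced by a naive quadratic construction (ours; the paper of course relies on linear-time suffix tree construction instead):

def naive_suffix_array(x):
    # 1-based starting positions of the suffixes of x in ascending lexicographic order;
    # '$' is smaller than the letters under ASCII ordering, as assumed in Figure 1
    return sorted(range(1, len(x) + 1), key=lambda i: x[i - 1:])

X = "ababaacbbbcbcc$"
assert naive_suffix_array(X) == [15, 5, 3, 1, 6, 4, 2, 8, 9, 10, 12, 14, 7, 11, 13]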

3 Greedy Longest Closed Factorization in Linear Time

In this section, we present how to compute the closed factorization CF(X) of a given string X[1,n]. Our high level strategy is to build a data structure that helps us to efficiently compute, for a given position i in X, the longest closed factor starting at i. The core of this data structure is the suffix tree for X, which we decorate in various ways.

Let S be the set of the beginning positions of the longest closed factors in CF(X). For any i ∈ S, let G = X[i..i + |G| − 1] be the longest closed factor of X starting at position i in X.
Let G′ be the unique border of the longest closed factor G starting at position i of X, and bi be its length, i.e., G′ = G[1..bi] = X[i..i + bi − 1] (if G is a single character, then G′ = ε and bi = 0). The following lemma shows that we can efficiently compute CF(X) if we know bi for all i ∈ S.

Lemma 5. Given bi for all i ∈ S, we can compute CF(X) in a total of O(n) time and space, independently of the alphabet size.

Proof. If bi = 0, then G = X[i]. Hence, in this case it clearly takes O(1) time and space to compute G. If bi ≥ 1, then we can compute G in O(|G|) time and O(bi) space, as follows. We preprocess the border G′ of G using the Knuth-Morris-Pratt (KMP) string matching algorithm [4]. This preprocessing takes O(bi) time and space. We then search for the first occurrence of G′ in X[i+1..n] (i.e. the next occurrence of the longest border of G to the right of the occurrence X[i..i + bi − 1]). The location of the next occurrence tells us where the end of the closed factor is, and so it also tells us G = X[i..i + |G| − 1]. The search takes O(|G|) time, i.e. time proportional to the length of the closed factor. Because the sum of the lengths of the closed factors is n, over the whole factorization we take O(n) time and space. The running time and space usage of the algorithm are clearly independent of the alphabet size. ⊓⊔
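The search step of Lemma 5 is easy to phrase directly (our sketch, with Python's str.find standing in for the KMP search; positions are 0-based here):

def closed_factor_from_border(x, i, b):
    # x: input string, i: start position, b: border length b_i of the factor
    if b == 0:
        return x[i]                      # rightmost occurrence of x[i]: single character
    j = x.find(x[i:i + b], i + 1)        # next occurrence of the border, to the right of i
    return x[i:j + b]                    # the factor ends where that occurrence ends

X = "ababaacbbbcbcc$"
assert closed_factor_from_border(X, 0, 3) == "ababa"    # border aba
assert closed_factor_from_border(X, 6, 2) == "cbbbcb"   # border cb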

What remains is to be able to efficiently compute bi for a given i ∈ S. The following lemma gives an efficient solution to this subproblem:

Lemma 6. We can preprocess the suffix tree of string X[1,n] in O(n) time and space, so that bi for each i ∈ S can be computed in O(1) time.

Proof. In each leaf of the suffix tree, we store the beginning position of the suffix corresponding to the leaf. For any internal node v of the suffix tree of X, let max(v) denote the maximum leaf value in the subtree rooted at v, i.e.,

max(v) = max{ i | X[i..i + |pathlabel(v)| − 1] = pathlabel(v), 1 ≤ i ≤ n − |pathlabel(v)| − 1 }.
We can compute max(v) for every v in a total of O(n) time via a depth first traversal. Next, let P[1,n] be an array of pointers to suffix tree nodes (to be computed next). Initially every P[i] is set to null. We traverse the suffix tree in pre-order, and for each node v we encounter we set P[max(v)] = v if P[max(v)] is null. At the end of the traversal P[i] will contain a pointer to the highest node w in the tree for which i is the maximum leaf value (i.e., i is the rightmost occurrence of pathlabel(w)).
We are now able to compute bi, the length of the unique border of the longest closed factor starting at any given i, as follows. First we retrieve node v = P[i]. Observe that, because of the definition of P[i], there are no occurrences of substring X[i..i + |pathlabel(v)|] to the right of i. Let u = parent(v). There are two cases to consider:


Figure 2. Illustration for Lemma 6. We consider the longest closed factor G starting at position i of string X. We retrieve node P[i] = v, which implies max(v) = i. Let u be the parent of v. The black circle represents a (possibly implicit) node which represents X[i..i + |pathlabel(u)|], which has the same set of occurrences as pathlabel(v). Hence bi = |pathlabel(u)|, and therefore G = X[i..j + |pathlabel(u)| − 1], where j is the leftmost occurrence of pathlabel(u) with j > i.

– If u is not the root, then observe that there always exists an occurrence of substring pathlabel(u) to the right of position i (otherwise i would be the rightmost occurrence of pathlabel(u), but this cannot be the case since u is higher than v, and we defined P[i] to be the highest node w with max(w) = i). Let j be the leftmost occurrence of pathlabel(u) to the right of i. Then, the longest closed factor starting at position i is X[i..j + |pathlabel(u)| − 1] (this position j is found by the KMP algorithm as in Lemma 5).
– If u is the root, then it turns out that i is the rightmost occurrence of character X[i] in X. Hence, the longest closed factor starting at position i is X[i].

The thing we have not shown is that |pathlabel(u)| = bi. This is indeed the case: since the set of occurrences of X[i..i + |pathlabel(u)|] (i.e., leaves in the subtree corresponding to the string) is equivalent to that of pathlabel(v), any substring starting at i that is longer than |pathlabel(u)| does not occur to the right of i, and thus bi cannot be any longer. Hence |pathlabel(u)| = bi. (See also Figure 2.)
Clearly v = P[i] can be retrieved in O(1) time for a given i, and then u = parent(v) can be obtained in O(1) time from v. This completes the proof. ⊓⊔
The main result of this section follows:
Theorem 7. Given a string X[1,n] over an integer alphabet, the closed factorization CF(X) = (G1, ..., Gk) of X can be computed in O(n) time and space.

Proof. G0 = ε by definition and so does not need to be computed. We compute the other Gj in ascending order of j = 1, ..., k. Let si be the beginning position of Gi in X, i.e., s1 = 1 and si = |G1 ··· G_{i−1}| + 1 for 1 < i ≤ k. We compute G1 in O(|G1|) time and space from b_{s1} using Lemma 5 and Lemma 6. Assume we have computed the first j − 1 factors G1, ..., G_{j−1} for any 1 ≤ j < k − 1. We then compute Gj in O(|Gj|) time and space from b_{sj}, again using Lemmas 5 and 6. Since Σ_{j=1}^{k} |Gj| = n, the proof completes. ⊓⊔
The following is an example of how the algorithm presented in this section computes CF(X) for a given string X.

Example 8. Consider the running example string X = ababaacbbbcbcc$, and see Figure 1, which shows the suffix tree of X.
1. We begin with node P[1] representing ababaacbbbcbcc$, whose parent represents aba. Hence we get b1 = |aba| = 3. We run the KMP algorithm with pattern aba and find the first factor G1 = ababa.
2. We then check node P[6] representing a. Since its parent is the root, we get b2 = 0 and therefore the second factor is G2 = a.
3. We then check node P[7] representing cbbbcbcc$, whose parent represents cb. Hence we get b3 = |cb| = 2. We run the KMP algorithm with pattern cb and find the third factor G3 = cbbbcb.
4. We then check node P[13] representing cc$, whose parent represents c. Hence we get b4 = |c| = 1. We run the KMP algorithm with pattern c and find the fourth factor G4 = cc.
5. We finally check node P[15] representing $. Since its parent is the root, we get b5 = 0 and therefore the fifth factor is G5 = $.
Consequently, we obtain CF(X) = (ababa, a, cbbbcb, cc, $), which coincides with Example 2.

4 Longest Closed Factor Array

A natural extension of the problem in the previous section is to compute the longest closed factor starting at every position in X in linear time, not just those involved in the factorization. Formally, we would like to compute the longest closed factor array of X, i.e., an array A[1,n] of integers such that A[i] = ℓ if and only if ℓ is the length of the longest closed factor starting at position i in X. Our algorithm for closed factorization computes the longest closed factor starting at a given position in time proportional to the factor's length, and so does not immediately provide a linear time algorithm for computing A; indeed, applying the algorithm naïvely at each position would take O(n²) time to compute A. In what follows, we present a more efficient solution:
Theorem 9. Given a string X[1,n] over an integer alphabet, the closed factor array of X can be computed in O(n log n / log log n) time and O(n) space.
Proof. We extend the data structure of the last section to allow A to be computed in O(n log n / log log n) time and O(n) space. The main change is to replace the KMP scanning in the first algorithm with a data structure that allows us to find the end of the closed factor in time independent of its length. We first preprocess the suffix array SA for range successor queries, building the data structure of Yu, Hon and Wang [9]. A range successor query rsq_SA(s,e,k) returns, given a range [s,e] ⊆ [1,n], the smallest value x ∈ SA[s..e] such that x > k, or null if there is no value larger than k in SA[s..e]. Yu et al.'s data structure allows range successor queries to be answered in O(log n / log log n) time each, takes O(n) space, and O(n log n / log log n) time to construct. Now, to compute the longest closed factor starting at a given position i in X (i.e. to compute A[i]) we do the following. First we compute bi, the length of the border of the longest closed factor starting at i, in O(1) time using Lemma 6. Recall that in the process of computing bi we determine the node u having pathlabel(u) = X[i..i + bi − 1].

To determine the end of the closed factor we must find the smallest j > i such that X[j..j + bi − 1] = X[i..i + bi − 1]. Observe that j, if it exists, is precisely the answer to rsq_SA(u.s, u.e, i). (See also the left diagram of Figure 2. Assuming that the leaves in the subtree rooted at u are sorted in lexicographical order, the leftmost and rightmost leaves in the subtree correspond to the u.s-th and u.e-th entries of SA, respectively. Hence, j = rsq_SA(u.s, u.e, i).) For each A[i] we spend O(log n / log log n) time and so overall the algorithm takes O(n log n / log log n) time. The space requirement is clearly O(n). ⊓⊔
We note that recently Navarro and Neckrich [7] described range successor data structures with faster O(√log n)-time queries, but straightforward construction takes O(n log n) time [6], so overall this does not improve the runtime of our algorithm.
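The query itself is simple to state; a naive linear scan (ours) makes its semantics concrete. On the running example, for the node spelling cb (which covers SA[13..14]) and position i = 7, it returns the next occurrence j = 11.

def rsq(sa, s, e, k):
    # smallest value x in SA[s..e] (1-based, inclusive) with x > k, or None
    candidates = [x for x in sa[s - 1:e] if x > k]
    return min(candidates) if candidates else None

SA = [15, 5, 3, 1, 6, 4, 2, 8, 9, 10, 12, 14, 7, 11, 13]
assert rsq(SA, 13, 14, 7) == 11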

5 Concluding Remarks

We have considered but two problems on closed factors here, and many others remain. For example, how efficiently can one compute all the closed factors in a string (or, say, the closed factors that occur at least k times)? Relatedly, how many closed factors does a string contain in the worst case and on average? One also wonders if the closed factor array can be computed in linear time, by somehow avoiding range successor queries.

References

1. G. Badkobeh, G. Fici, and Zs. Liptak´ : A note on words with the smallest number of closed factors. http://arxiv.org/abs/1305.6395, 2013. 2. M. Farach: Optimal suffix tree construction with large alphabets, in Proceedings of the 38th Annual Symposium on Foundations of Computer Science, 1997, pp. 137–143. 3. G. Fici: A classification of trapezoidal words, in Proc. 8th International Conference Words 2011 (WORDS 2011), Electronic Proceedings in Theoretical Computer Science 63, 2011, pp. 129–137, See also http://arxiv.org/abs/1108.3629v1. 4. D. E. Knuth, J. H. Morris, and V. R. Pratt: Fast pattern matching in strings. SIAM J. Comput., 6(2) 1977, pp. 323–350. 5. U. Manber and G. W. Myers: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5) 1993, pp. 935–948. 6. G. Navarro and Y. Neckrich: Personal Communication. 7. G. Navarro and Y. Neckrich: Sorted range reporting, in Proc. Scandinavian Workshop on Algorithm Theory, LNCS 7357, Springer, 2012, pp. 271–282. 8. P. Weiner: Linear pattern matching, in IEEE 14th Annual Symposium on Switching and Automata Theory, IEEE, 1973, pp. 1–11. 9. C.-C. Yu, W.-K. Hon, and B.-F. Wang: Improved data structures for the orthogonal range successor problem. Computational Geometry, 44(3) 2011, pp. 148–159. Alternative Algorithms for Lyndon Factorization⋆

Sukhpal Singh Ghuman1, Emanuele Giaquinta2, and Jorma Tarhio1

1 Department of Computer Science and Engineering, Aalto University, P.O.B. 15400, FI-00076 Aalto, Finland, {Sukhpal.Ghuman, Jorma.Tarhio}@aalto.fi
2 Department of Computer Science, P.O.B. 68, FI-00014 University of Helsinki, Finland, [email protected]

Abstract. We present two variations of Duval’s algorithm for computing the Lyndon factorization of a word. The first algorithm is designed for the case of small alphabets and is able to skip a significant portion of the characters of the string, for strings containing runs of the smallest character in the alphabet. Experimental results show that it is faster than Duval’s original algorithm, more than ten times in the case of long DNA strings. The second algorithm computes, given a run-length encoded string R of length ρ, the Lyndon factorization of R in O(ρ) time and constant space.

1 Introduction

Given two strings w and w′, w′ is a rotation of w if w = uv and w′ = vu, for some strings u and v. A string is a Lyndon word if it is lexicographically smaller than all its proper rotations. Every string has a unique factorization in Lyndon words such that the corresponding sequence of factors is nonincreasing with respect to lexico- graphical order. This factorization was introduced by Chen, Fox and Lyndon [2]. Duval’s classical algorithm [3] computes the factorization in linear time and constant space. The Lyndon factorization is a key ingredient in a recent method for sorting the suffixes of a text [8], which is a fundamental step in the construction of the Burrows- Wheeler transform and of the suffix array, as well as in the bijective variant of the Burrows-Wheeler transform [4] [6]. The Burrows-Wheeler transform is an invertible transformation of a string, based on the sorting of its rotations, while the suffix array is a lexicographically sorted array of the suffixes of a string. They are the basis for important data compression methods and text indexes. Although Duval’s algorithm runs in linear time and is thus efficient, it can still be useful to further improve the time for the computation of the Lyndon factorization in the cases where the string is either huge or compressible and given in a compressed form. Various alternative algorithms for the Lyndon factorization have been proposed in the last twenty years. Apostolico and Crochemore presented a parallel algorithm [1], while Roh et al. described an external memory algorithm [10]. Recently, I et al. showed how to compute the Lyndon factorization of a string given in grammar-compressed form and in Lempel-Ziv 78 encoding [5]. In this paper, we present two variations of Duval’s algorithm. The first variation is designed for the case of small alphabets like the DNA alphabet a, c, g, t . If the string contains runs of the smallest character, the algorithm is able to{ skip a significant} portion of the characters of the string. In our experiments, the new algorithm is more than ten times faster than the original one for long DNA strings.

⋆ Supported by the Academy of Finland (grant 134287).

Sukhpal Singh Ghuman, Emanuele Giaquinta, Jorma Tarhio: Alternative Algorithms for Lyndon Factorization, pp. 169–178. Proceedings of PSC 2014, Jan Holub and Jan Zdˇ ’arek´ (Eds.), ISBN 978-80-01-05547-2 c Czech Technical University in Prague, Czech Republic

170 Proceedings of the Prague Stringology Conference 2014

The second variation is for strings compressed with run-length encoding. The run-length encoding of a string is a simple encoding where each maximal consecutive sequence of the same symbol is encoded as a pair consisting of the symbol plus the length of the sequence. Given a run-length encoded string R of length ρ, our algorithm computes the Lyndon factorization of R in O(ρ) time and uses constant space. It is thus preferable to Duval’s algorithm in the cases in which the strings are stored or maintained in run-length encoding.

2 Basic definitions

Let Σ be a finite ordered alphabet of symbols and let Σ∗ be the set of words (strings) over Σ ordered by lexicographic order. The empty word ε is a word of length 0. Let + also Σ be equal to Σ∗ ε . Given a word w, we denote with w the length of w and with w[i] the i-th symbol\ { } of w, for 0 i< w . The concatenation| | of two words u and v is denoted by uv. Given two words≤ u and| |v, v is a substring of u if there are indices 0 i,j < u such that v = u[i] u[j]. If i =0(j = u 1) then v is a prefix (suffix) of≤u. We denote| | by u[i..j] the substring··· of u starting| at|− position i and ending at position j. For i>ju[i..j] = ε. We denote by uk the concatenation of k u’s, for u Σ+ and k 1. The longest border of a word w, denoted with β(w), is the longest ∈ ≥ proper prefix of w which is also a suffix of w. Let lcp(w,w′) denote the length of the longest common prefix of words w and w′. We write w

Theorem 1. Any word w admits a unique factorization CFL(w)= w1,w2,...,wm, such that w is a Lyndon word, for 1 i m, and w w w . i ≤ ≤ 1 ≥ 2 ≥···≥ m The run-length encoding (RLE) of a word w, denoted by rle(w), is a sequence of pairs (runs) (c1,l1), (c2,l2, ),..., (cρ,lρ) such that ci Σ, li 1, ci = ci+1 for 1 i

LF-Duval(w) 1. k 0 2. while← k < w do 3. i | |k + 1 4. j ← k + 2 5. while← true do 6. if j = w + 1 or w[j 1] w[i 1] then 13. i − k + 1 − 14. else ← 15. i i + 1 16. j j ←+ 1 ←

Figure 1. Duval’s algorithm to compute the Lyndon factorization of a string.

+ Lemma 2. Let w Σ and w be the longest prefix of w = w w′ which is in L. We ∈ 1 1 have CFL(w)= w1CFL(w′).

k + Lemma 3. P ′ = (uv) u u Σ∗,v Σ , k 1 and uv L . { | ∈ ∈ ≥ ∈ } k Lemma 4. Let w = (uav′) u, with u,v′ Σ∗, a Σ, k 1 and uav′ L. The following propositions hold: ∈ ∈ ≥ ∈

1. For a′ Σ and a>a′, wa′ / P ′; ∈ ∈ 2. For a′ Σ and a 0 be any position of a string w such that w

LF-skip(w) 1. e w 1 2. while←| e|− 0 and w[e] =c ¯ do 3. e ≥e 1 4. l w← 1− e 5. w←| w|−[0..e]− ← 6. s min Occ c¯c¯ (w) w 7. LF-Duval← (w{[0}..s ∪1]) {| |} 8. r 0 − 9. while← s< w do 10. w w|[s..| w 1] 11. while← w[r|] =|−c ¯ do 12. r r + 1 13. s w← 14. k ←|1 | 15. ← c¯rc c w[r] 16. jP ←0 { | ≤ } 17. for←i Occ (w): i>j do 18. h∈ lcpP(w,w[i.. w 1]) 19. if ←h = w i or| w|−[i + h]

Figure 2. The algorithm to compute the Lyndon factorization that can potentially skip symbols.

Lemma 6. Let w be a string with CFL(w) = w ,w ,...,w . It holds that w = 1 2 m | 1| min j w[j.. w 1] < w and w1 = w2 = = wk = w[0.. w1 1], where k = 1+ {lcp|(w,w[| w|−.. w 1])}/ w . ··· | |− ⌊ | 1| | |− | 1|⌋ Based on these Lemmas, Duval’s algorithm can be implemented by initializing j 1 and executing the following steps until w becomes ε: 1) compute h lcp(w,w[j.. w← 1]). 2) if j +h< w and w[h]

4 Improved algorithm for small alphabets

Let w be a word over an alphabet Σ with CFL(w) = w1,w2,...,wm and letc ¯ be the smallest symbol in Σ. Suppose that there exists k 2,i 1 such thatc ¯k is a ≥ ≥ prefix of wi. If the last symbol of w is notc ¯, then by Theorem 1 and by the properties k of Lyndon words,c ¯ is a prefix of each of wi+1,wi+1,...,wm. This property can be exploited to devise an algorithm for Lyndon factorization that can potentially skip symbols. Our algorithm is based on the alternative formulation of Duval’s algorithm by I et al.. Given a set of strings , let Occ (w) be the set of all (starting) positions in w corresponding to occurrencesP of the stringsP in . We start with the following Lemmas: P Sukhpal Singh Ghuman et al.: Alternative Algorithms for Lyndon Factorization 173

Lemma 7. Let w be a word and let s = max i w[i] > c¯ 1 . Then, CFL(w)= ( w 1 s) { | }∪{− } CFL(w[0..s])CFL(¯c | |− − ).

Proof. If s = 1 or s = w 1 the Lemma plainly holds. Otherwise, Let wi be the factor in CFL− (w) such| that|− s [a ,b ]. To prove the claim we have to show ∈ i i that bi = s. Suppose by contradiction that sa , which implies w 2. i i | i|≥ If s = bi then wi[ wi 1]=c ¯, which contradicts the second property of Lyndon words. Otherwise, since|w |−is a Lyndon word it must hold that w < rot(w ,s a ). This i i i − i implies at least that wi[0] = wi[1] =c ¯, which contradicts the hypothesis that s is the smallest element in Occ c¯c¯ (w). { } ⊓⊔ Lemma 9. Let w be a word such that w[0] = w[1] =c ¯ and w[ w 1] =c ¯. Let r be the smallest position in w such that w[r] =c ¯. Note that w[0..r| |−1] =6 c ¯r. Let also = c¯rc c w[r] . Then we have 6 − P { | ≤ }

b1 = min s Occ (w) w[s.. w 1]

Proof. By Lemma 6 we have that b1 = min s w[s.. w 1]

⊓⊔ Based on these Lemmas, we can devise a faster factorization algorithm for words containing runs ofc ¯. The key idea is that, using Lemma 9, it is possible to skip symbols in the computation of b1, if a suitable string matching algorithm is used to compute Occ (w). W.l.o.g. we assume that the last symbol of w is different fromc ¯. In the generalP case, by Lemma 7, we can reduce the factorization of w to the one of its longest prefix with last symbol different fromc ¯, as the remaining suffix is a concatenation ofc ¯ symbols, whose factorization is a sequence of factors equal toc ¯. Suppose thatc ¯c¯ occurs in w. By Lemma 8 we can split the factorization of w in CFL(u) and CFL(v) where uv = w and u = min Occ c¯c¯ (w). The factorization of CFL(u) can be computed using Duval’s original| | algorithm.{ } Concerning v, let r = min i v[i] =c ¯ . By definition v[0] = v[1] =c ¯ and v[ v 1] =¯c, and we can apply Lemma{ | 9 on v6 to} find the ending position s of the first factor| |− in 6CFL(v). To this end, we have to find the position min i Occ (v) v[i.. v 1] < v , where = c¯rc c v[r] . For this purpose, we can use{ any∈ algorithmP | for| multiple|− string} matchingP { to iteratively| ≤ } compute Occ (v) until either a position i is found that satisfies v[i.. v 1]

LF-rle-next(R = (c1,l1),..., (cρ,lρ) ,k) 1. i k h i 2. j ← k + 1 3. while← true do 4. if i>k and lj 1

  • ls and cj cs or lj >ls then 13. i k 14. else ← 15. i i + 1 16. j j ←+ 1 ←

    Figure 3. The algorithm to compute the Lyndon factorization of a run-length encoded string.

    we do not need to verify the positions i′ Occ (v) such that i′ i + h. Given that all the patterns in differ in the last symbol∈ only,P we can express≤ more succinctly using a characterP class for the last symbol and match this patternP using a string matching algorithm that supports character classes, such as the algorithms based on bit-parallelism. In this respect, SBNDM2 [11], a variation of the BNDM algorithm [9] is an ideal choice, as it is sublinear on average. Instead of , it is naturally possible to search forc ¯r, but that solution is slower in practice for smallP alphabets. Note that the same algorithm can also be used to compute min Occc¯c¯(w) in the first phase. Let h = lcp(v,v[s.. v 1]) and k =1+ h/s . Based on Lemma 6, the algorithm | |− ⌊ ⌋ then outputs v[0..s 1] k times and iteratively applies the above method on v′ = − v[sk.. v 1]. It is not hard to verify that, if v′ = ε, then v′ r +1, v′[0..r 1]=c ¯ | |− 6 | |≥ − and v′[ v′ 1] =c ¯, and so Lemma 9 can be used on v′. The code of the algorithm | |− 6 is shown in Figure 2. The computation of the value r′ = min i v′[i] =c ¯ for v′ { | 6 } takes advantage of the fact that v′[0..r 1] =c ¯, so as to avoid useless comparisons. If the the total time spent for the iteration− over the sets Occ (v) is O( w ), the full algorithm has also linear time complexity in the worst case. ToP see why,| it| is enough to observe that the positions i for which the algorithm verifies if v[i.. v 1]

    5 Computing the Lyndon factorization of a run-length encoded string In this section we present an algorithm to compute the Lyndon factorization of a string given in RLE form. The algorithm is based on Duval’s original algorithm and on a combinatorial property between the Lyndon factorization of a string and its RLE, and has O(ρ)-time and O(1)-space complexity, where ρ is the length of the RLE. We start with the following Lemma:

    Lemma 10. Let w be a word over Σ and let w1,w2,...,wm be its Lyndon factoriza- rle tion. For any 1 i rle(w) , let 1 j, k m, j k, such that ai [aj,bj] and brle [a ,b ]. Then,≤ ≤| either j =|k or w≤ = w≤ =1.≤ ∈ i ∈ k k | j| | k| Sukhpal Singh Ghuman et al.: Alternative Algorithms for Lyndon Factorization 175

    Proof. Suppose by contradiction that j < k and either w > 1 or w > 1. By | j| | k| definition of j, k, we have wj wk. Moreover, since both [aj,bj] and [ak,bk] overlap rle rle ≥ with [ai ,bi ], we also have wj[ wj 1] = wk[0]. If wj > 1, then, by definition of w , we have w [0] < w [ w | 1]|− = w [0]. Instead,| if w| > 1 and w = 1, we j j j | j|− k | k| | j| have that wj is a prefix of wk. Hence, in both cases we obtain wj < wk, which is a contradiction. ⊓⊔

    The consequence of this Lemma is that a run of length l in the RLE is either contained in one factor of the Lyndon factorization, or it corresponds to l unit-length factors. Formally:

    Corollary 11. Let w be a word over Σ and let w1,w2,...,wm be its Lyndon factor- ization. Then, for any 1 i rle(w) , either there exists w such that [arle,brle] is ≤ ≤| | j i i contained in [aj,bj] or there exist li factors wj,wj+1,...,wj+li 1 such that wj+k =1 and a [arle,brle], for 0 k

    This property can be exploited to obtain an algorithm for the Lyndon factorization that runs in O(ρ) time. First, we introduce the following definition: Definition 12. A word w is a LR word if it is either a Lyndon word or it is equal to ak, for some a Σ, k 2. The LR factorization of a word w is the factorization in LR words obtained∈ from≥ the Lyndon factorization of w by merging in a single factor the maximal sequences of unit-length factors with the same symbol.

    For example, the LR factorization of cctgccaa is cctg,cc,aa . Observe that this factorization is a (reversible) encoding of the Lyndonh factorization.i Moreover, in this encoding it holds that each run in the RLE is contained in one factor and thus the size of the LR factorization is O(ρ). Let L′ be the set of LR words. We now present the algorithm LF-rle-next(R, k) which computes, given an RLE sequence R and an integer k, the longest LR word in R starting at position k. Analogously to Duval’s algorithm, it reads the RLE sequence from left to right maintaining two integers, j and ℓ, which satisfy the following invariant:

    lk lj 1 ck cj −1 P ′; ··· − ∈ lk lj 1 rle(β(ck cj −1 )) if j k > 1, (1) ℓ = | ··· − | − (0 otherwise.

    The integer j, initialized to k + 1, is the index of the next run to read and is lk lj 1 incremented at each iteration until either j = R or ck cj −1 / P ′. The integer ℓ, | | ··· − ∈ lk lj 1 lk lj 1 initialized to 0, is the length in runs of the longest border of ck cj −1 , if ck cj −1 spans at least two runs, and equal to 0 otherwise. For example···, in− the case··· of the− word ab2ab2ab we have β(ab2ab2ab) = ab2ab and ℓ = 4. Let i = k + ℓ. In general, if ℓ> 0, we have

    lj 1 li 1,lk lj ℓ, − − − lk≤ lj 1 ≤ lk lk+1 li 2 lj 1 lk lj ℓ+1 lj 2 lj 1 β(ck cj −1 )= ck ck+1 ci −2 ci −1 = cj ℓcj −ℓ+1 cj −2 cj −1 . ··· − ··· − − − − ··· − − 176 Proceedings of the Prague Stringology Conference 2014

    Note that the longest border may not fully cover the last (first) run of the corre- sponding prefix (suffix). Such the case is for example for the word ab2a2b. However, lk lj 1 since ck cj −1 P ′ it must hold that lj ℓ = lk, i.e., the first run of the suffix is fully covered.··· Let − ∈ − 1 if ℓ> 0 lj 1

  • lk lj 1 Informally, the integer z is equal to 1 if the longest border of ck cj −1 does not ··· − lk lj 1 q fully cover the run (ci 1,li 1). By 1 we have that ck cj −1 can be written as (uv) u, where − − ··· − j k z q = −j −i ,r = z +(j k z) mod(j i), ⌊ − ⌋ − − − lj r lj 1 lk lj ℓ 1 li r lj r 1 u = cj −r cj −1 ,uv = ck cj −ℓ −1 = ci −r cj −r −1, − ··· − ··· − − − ··· − − uv L′ ∈

    For example, in the case of the word ab2ab2ab, for k = 0, we have j =6,i =4,q = 2,r = 2. The algorithm is based on the following Lemma: Lemma 13. Let j,ℓ be such that invariant 1 holds and let s = i z. Then, we have the following cases: −

    lk lj 1. If c c then c c L′ and 1 holds for j +1, ℓ′ =0; j s k ··· j ∈ Moreover, if z =0, we also have:

    lk lj 3. If c = c and l l , then c c P ′ and 1 holds for j +1, ℓ′ = ℓ +1; j i j ≤ i k ··· j ∈ lk lj lk lj 4. If c = c and l >l , either c c , c c L′ j i j i j i+1 k ··· j ∈ j i+1 k ··· j ∈ and 1 holds for j +1, ℓ′ =0.

    Proof. The idea is the following: we apply Lemma 4 with the word (uv)qu as defined above and symbol cj. Observe that cj is compared with symbol v[0], which is equal to ck+r 1 = ci 1 if z =1 and to ck+r = ci otherwise. − − First note that, if z = 1, cj = ci 1, since otherwise we would have cj 1 = ci 1 = cj. In the first three cases, we obtain6 the− first, second and third proposition− of Lemma− 4, lk lj 1 respectively, for the word ck cj −1 cj. Independently of the derived proposition, it is ··· − lk lj 1 m easy to verify that the same proposition also holds for ck cj −1 cj , m lj. Consider now the fourth case. By a similar reasoning, we have that··· the− third≤ proposition of lk li lk li Lemma 4 holds for ck cj . If we then apply Lemma 4 to ck cj and cj, cj is compared to c and we··· must have c = c as otherwise c =···c = c . Hence, i+1 j 6 i+1 i j i+1 either the first (if cj ci+1) proposition of Lemma 4 must hold for the word clk cli+1. k ··· j ⊓⊔ We prove by induction that invariant 1 is maintained. At the beginning, the vari- ables j and ℓ are initialized to k+1 and 0, respectively, so the base case trivially holds. lk lj Suppose that the invariant holds for j,ℓ. Then, by Lemma 13, either c c / P ′ or k ··· j ∈ it follows that the invariant also holds for j +1,ℓ′, where ℓ′ is equal to ℓ +1, if z = 0, lk lj cj = ci and lj li, and to 0 otherwise. When ck cj / P ′ the algorithm returns the pair (j i,q≤), i.e., the length of uv and the corresponding··· ∈ exponent. Based on − Sukhpal Singh Ghuman et al.: Alternative Algorithms for Lyndon Factorization 177

    Lemma 2, the factorization of R can then be computed by iteratively calling LF- rle-next. When a given call to LF-rle-next returns, the factorization algorithm outputs the q factors uv starting at positions k, k +(j i), ..., k +(q 1)(j i) and restarts the factorization at position k+q(j i). The code− of the algorithm− is shown− in Figure 3. We now prove that the algorithm− runs in O(ρ) time. First, observe that, by definition of LR factorization, the for loop at line 4 is executed O(ρ) times. Suppose that the number of iterations of the while loop at line 2 is n and let k1, k2,...,kn+1 be the corresponding values of k, with k = 0 and k = R . We now show that 1 n+1 | | the s-th call to LF-rle-next performs less than 2(ks+1 ks) iterations, which will yield O(ρ) number of iterations in total. This analysis is− analogous to the one used by Duval. Suppose that i′, j′ and z′ are the values of i, j and z at the end of the s-th call to LF-rle-next. The number of iterations performed during this call is j′ ks z equal to j′ ks. We have ks+1 = ks + q(j′ i′), where q = −j i− , which implies − − ⌊ − ′ ⌋ j′ k < 2(k k ) + 1, since, for any positive integers x,y, x< 2 x/y y holds. − s s+1 − s ⌊ ⌋

    6 Experiments with LF-Skip

    The experiments were run on MacBook Pro with the 2.4 GHz Intel Core 2 Duo processor and 2 GB memory. Programs were written in the C programming language and compiled with the gcc compiler (4.8.2) using the -O3 optimization level. We tested the LF-skip algorithm and Duval’s algorithm with various texts. With the protein sequence of the Saccharomyces cerevisiae genome (3 MB), LF-skip gave a speed-up of 3.5 times over Duval’s algorithm. Table 1 shows the speed-ups for random texts of 5 MB with various alphabets sizes. With longer texts, speed-ups were larger. For example, the speed-up for the 50 MB DNA text (without newlines) from the Pizza&Chili Corpus1 was 14.6 times.

    Σ Speed-up | | 2 9.0 3 7.7 4 7.2 5 6.1 6 4.8 8 4.3 10 3.5 12 3.4 15 2.4 20 2.5 25 2.2 30 1.9 Table 1. Speed-up of LF-skip with various alphabet sizes in a random text.

    We made also some tests with texts of natural language. Because runs are very short in natural language, the benefit of LF-skip is marginal. We even tried alphabet transformations in order to vary the smallest character of the text, but that did not help.

    1 http://pizzachili.dcc.uchile.cl/ 178 Proceedings of the Prague Stringology Conference 2014

    7 Conclusions

    In this paper we have presented two variations of Duval’s algorithm for computing the Lyndon factorization of a string. The first algorithm was designed for the case of small alphabets and is able to skip a significant portion of the characters of the string for strings containing runs of the smallest character in the alphabet. Experimental results show that the algorithm is considerably faster than Duval’s original algorithm. The second algorithm is for strings compressed with run-length encoding and computes the Lyndon factorization of a run-length encoded string of length ρ in O(ρ) time and constant space.

    References

    1. A. Apostolico and M. Crochemore: Fast parallel Lyndon factorization with applications. Mathematical Systems Theory, 28(2) 1995, pp. 89–108. 2. K. T. Chen, R. H. Fox, and R. C. Lyndon: Free differential calculus. IV. The quotient groups of the lower central series. Annals of Mathematics, 68(1) 1958, pp. 81–95. 3. J.-P. Duval: Factorizing words over an ordered alphabet. J. Algorithms, 4(4) 1983, pp. 363–381. 4. J. Y. Gil and D. A. Scott: A bijective string sorting transform. CoRR, abs/1201.3077 2012. 5. T. I, Y. Nakashima, S. Inenaga, H. Bannai, and M. Takeda: Faster lyndon factorization algorithms for SLP and LZ78 compressed text, in SPIRE, O. Kurland, M. Lewenstein, and E. Porat, eds., vol. 8214 of Lecture Notes in Computer Science, Springer, 2013, pp. 174–185. 6. M. Kufleitner: On bijective variants of the Burrows-Wheeler transform, in Proceedings of the Prague Stringology Conference 2009, J. Holub and J. Zˇd´arek,ˇ eds., 2009, pp. 65–79. 7. M. Lothaire: Combinatorics on Words, Cambridge Mathematical Library, Cambridge Uni- versity Press, 1997. 8. S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino: Sorting suffixes of a text via its Lyndon factorization, in Proceedings of the Prague Stringology Conference 2013, J. Holub and J. Zˇd´arek,ˇ eds., 2013, pp. 119–127. 9. G. Navarro and M. Raffinot: Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics, 5 2000. 10. K. Roh, M. Crochemore, C. S. Iliopoulos, and K. Park: External memory algorithms for string problems. Fundam. Inform., 84(1) 2008, pp. 17–32. 11. B. Durian,ˇ J. Holub, H. Peltola, and J. Tarhio: Improving practical exact string match- ing. Inf. Process. Lett., 110(4) 2010, pp. 148–152. Two Simple Full-Text Indexes Based on the Suffix Array

    Szymon Grabowski and Marcin Raniszewski

    Lodz University of Technology, Institute of Applied Computer Science, Al. Politechniki 11, 90–924L´od´z, Poland sgrabow|mranisz @kis.p.lodz.pl { }

    Abstract. We propose two suffix array inspired full-text indexes. One, called SA- hash, augments the suffix array with a hash table to speed up pattern searches due to significantly narrowed search interval before the binary search phase. The other, called FBCSA, is a compact data structure, similar to M¨akinen’s compact suffix array, but working on fixed sized blocks, which allows to arrange the data in multiples of 32 bits, beneficial for CPU access. Experimental results on the Pizza & Chili 200MB datasets show that SA-hash is about 2.5–3 times faster in pattern searches (counts) than the standard suffix array, for the price of requiring 0.3n 2.0n bytes of extra space, where n is the text length, and setting a minimum pattern− length. The latter limitation can be removed for the price of even more extra space. FBCSA is relatively fast in single cell accesses (a few times faster than related indexes at about the same or better compression), but not competitive if many consecutive cells are to be extracted. Still, for the task of extracting e.g. 10 successive cells its time-space relation remains attractive.

    Keywords: suffix array, compressed indexes, compact indexes, hashing

    1 Introduction

    The field of text-oriented data structures continues to bloom. Curiously, in many cases several years after ingenious theoretical solutions their more practical (which means: faster and/or simpler) counterparts are presented, to mention only recent advances in rank/select implementations [11] or the FM-index reaching the compression ratio bounded by k-th order entropy with very simple means [17]. Despite the great interest in compact or compressed1 full-text indexes in recent years [22], we believe that in some applications search speed is more important than memory savings, thus different space-time tradeoffs are worth being explored. The classic suffix array (SA) [21], combining speed, simplicity and often reasonable mem- ory use, may be a good starting point for such research. In this paper we present two SA-based full-text indexes, combining effectiveness and simplicity. One augments the standard SA with a hash table to speed up searches, for a moderate overhead in the memory use, the other is a byte-aligned variant of M¨akinen’s compact suffix array [19,20].

    1 By the latter we mean indexes with space use bounded by O(nH0) or even O(nHk) bits, where n is the text length, σ the alphabet size, and H0 and Hk respectively the order-0 and the order-k entropy. The former term, compact full-text indexes, is less definite, and, roughly speaking, may fit any structure with less than n log2 n bits of space, at least for “typical” texts.

    Szymon Grabowski, Marcin Raniszewski: Two Simple Full-Text Indexes Based on the Suffix Array, pp. 179–191. Proceedings of PSC 2014, Jan Holub and Jan Zdˇ ’arek´ (Eds.), ISBN 978-80-01-05547-2 c Czech Technical University in Prague, Czech Republic

    180 Proceedings of the Prague Stringology Conference 2014

    2 Preliminaries

    We use 0-based sequence notation, that is, a sequence S of length n is written as S[0 ...n 1], or equivalently as s0s1 sn 1. One may− define a full-text index···over− text T of length n as a data structure supporting at least two basic types of queries, both with respect to a pattern P of length m, both T and P over a common finite integer alphabet of size σ. One query type is count: return the number occ 0 of occurrences of P in T . The other query type is locate: for each pattern occurrence≥ report its position in T , that is, such j that P [0 ...m 1] = T [j . . . j + m 1]. The suffix− array SA[0 ...n −1] for text T is a permutation of the indexes 0, 1,...,n 1 such that T [SA[i] ...n 1]− T [SA[i +1] ...n 1] for all 0 i

    3 Related work

    The full-text indexing history starts with the suffix tree (ST) [25], a trie whose string collection is the set of all the suffixes of a given text, with an additional requirement that all non-branching paths of edges are converted into single edges. The structure can be built in linear time [25,3]. Assuming constant-time access to any child of a given node, the search in the ST takes only O(m + occ) time in the worst case. In practice, this is cumbersome for a large alphabet, as it requires using perfect hashing, which also makes the construction time linear only in expectation. A small alphabet is easier to handle, which goes in line with the wide use of the suffix tree in bioinformatics. The main problem with the suffix tree is its large space requirement. Even in the most economical version [18] the ST space use reaches almost 9n bytes on average and 16n in the worst case, plus the text, for σ 256, and even more for large alphabets. Most implementations need 20n bytes or more.≤ An important alternative to the suffix tree is the suffix array (SA) [21]. It is an array of n pointers to text suffixes arranged in the order of lexicographic ordering of the sequences (i.e., the suffixes) the pointers store references to. The SA needs n log n bits for its n suffix pointers (indexes), plus n log σ bits for the text, which typically translates to 5n bytes in total. The pattern search time is O(m log n) in the worst case and O(m logσ n + log n) on average, which can be improved to O(m + log n) in the worst case using the longest common prefix (lcp) table. Alternatively, the O(m+log n) time can be reached even without the lcp, in a more theoretical solution with a specific suffix permutation [8]. Yet Manber and Myers in their seminal paper [21] presented a nice trick saving several first steps in the binary search: if we know the SA intervals for all the possible first k symbols of the pattern, we can immediately start the binary search in a corresponding interval. We can set k close to logσ n, with O(n log n) extra bits of space, but constant expected size of the interval, which leads to O(m) average search time and only O( m/ cache line ) cache misses on average, where cache line is the cache line length expressed⌈ | in symbols,|⌉ typically 64 symbols / bytes in| a modern| CPU. Unfortunately, real texts are far from random, hence in practice, if text symbols are bytes, we can use k up to 3, which offers a limited (yet, non-negligible) benefit. This idea, later denoted as using a lookup table (LUT), is fairly well known, see e.g. its impact in the search over a suffix array on words [4]. S. Grabowski, M. Raniszewski: Two Simple Full-Text Indexes Based on the Suffix Array 181

    A number of suffix tree or suffix array inspired indexes have been proposed as well, including the suffix cactus [16] and the enhanced suffix array (ESA) [1], with space use usually between SA and ST, but according to our knowledge they generally are not faster than their famous predecessors in the count or locate queries. On a theoretical front, the suffix tray by Cole et al. [2] allows to achieve O(m + log σ) search time (with O(n) worst-case time construction and O(n log n) bits of space), which was recently improved by Fischer and Gawrychowski [7] to O(m + log log σ) deterministic time, with preserved construction cost complexities. The common wisdom about the practical performance of ST and SA is that they are comparable, but Grimsmo in his interesting experimental work [14] showed that a careful ST implementation may be up to about 50% faster than SA if the number of matches is very small (in particular, one hit), but if the number of hits grows, the SA becomes more competitive, sometimes being even about an order of magnitude faster. Another conclusion from Grimsmo’s experiments is that the ESA may also be moderately faster than SA if the alphabet is small (say, up to 8) but SA easily wins for a large alphabet. Since around 2000 we can witness a great interest in succinct data structures, in particular, text indexes. Two main ideas that deserve being mentioned are the compressed suffix array (CSA) [15,24] and the FM-index [6]; the reader is referred to the survey [22] for an extensive coverage of the area. It was noticed in extensive experimental comparisons [5,11] that compressed in- dexes are not much slower, and sometimes comparable, to the suffix array in count queries, but locate is 2–3 orders of magnitude slower if the number of matches is large. This instigated researchers to follow one of two paths in order to mitigate the locate cost for succinct indexes. One, pioneered by M¨akinen [19,20] and addressed in a different way by Gonz´alez et al. [12,13], exploits repetitions in the suffix array (the idea is explained in Section 5). The other approach is to build semi-external data structures (see [9,10] and references therein).

    4 Suffix array with deep buckets The mentioned idea of Manber and Myers with precomputed interval (bucket) bound- aries for k starting symbols tends to bring more gain with growing k, but also pre- computing costs grow exponentially. Obviously, σk integers are needed to be kept in the lookup table. Our proposal is to apply hashing on relatively long strings, with an extra trick to reduce the number of unnecessary references to the text. We start with building the hash table HT (Fig. 1). The hash function is calculated for the distinct k-symbol (k 2) prefixes of suffixes from the (previously built) suffix array. That is, we process the≥ suffixes in their SA order and if the current suffix shares its k-long prefix with its predecessor, it is skipped (line 08). The value written to HT (line 11) is a pair: (the position in the SA of the first suffix with the given prefix, the position in the SA of the last suffix with the given prefix). Linear probing is used as the collision resolution method. As for the hash function, we used sdbm (http://www.cse.yorku.ca/~oz/hash.html). Fig. 2 presents the pattern search (locate) procedure. It is assumed that the pat- tern length m is not less than k. First the range of rows in the suffix array corre- sponding to the first two symbols of the pattern is found in a “standard” lookup table (line 1); an empty range immediately terminates the search with no matches returned (line 2). Then, the hash function over the pattern prefix is calculated and a scan over 182 Proceedings of the Prague Stringology Conference 2014

    HT build(T [0 ...n 1], SA[0 ...n 1], k, z, h(.)) Precondition: k 2− − ≥ (01) allocate HT [0 ...z 1] (02) for j 0 to z 1 −do HT [j] NIL (03) prevStr← ε − ← (04) j NIL← (05) left← NIL; right NIL (06) for i← 0 to n 1 do← (07) if SA← [i] n − k then continue (08) if T [SA[≥i] ...SA− [i]+ k 1] = prevStr then (09) if j = NIL then − 6 (10) 6 right i 1 (11) HT [j]← (−left,right) (12) j h(T [SA←[i] ...SA[i]+ k 1]) (13) prevStr← T [SA[i] ...SA[i]+− k 1] (14) repeat ← − (15) if HT [j]= NIL then (16) left i (17) break← (18) else j (j + 1) % z (19) until false ← (20) HT [j] (right + 1, n 1) (21) return←HT −

    Figure 1. Building the hash table of a given size z the hash table performed until no extra collisions (line 5; return no matches) or found a match over the pattern prefix, which give us information about the range of suffixes starting with the current prefix (line 6). In this case, the binary search strategy is ap- plied to narrow down the SA interval to contain exactly the suffixes starting with the whole pattern. (As an implementation note: the binary search could be modified to ignore the first k symbols in the comparisons, but it did not help in our experiments, due to specifics of the used A strcmp function from the asmlib library2).

    Pattern search(T [0 ...n 1], SA[0 ...n 1], HT [0 ...z 1], k, h(.), P [0 ...m 1]) Precondition: m k 2− − − − ≥ ≥

    (1) beg, end LUT2[p0,p1] (2) if end <← beg then report no matches; return (3) j h(P [0 ...k 1]) (4) repeat← − (5) if HT [j]= NIL then report no matches; return (6) if (beg HT [j].left end) and (T [SA[HT [j].left] ...SA[HT [j].left]+ k 1] = P [0 ...k 1]) (7) then≤ binSearch(≤P [0 ...m 1],HT [j].left,HT [j].right); return − − (8) j (j + 1) % z − (9) until←false

    Figure 2. Pattern search

    2 http://www.agner.org/optimize/asmlib.zip, v2.34, by Agner Fog. S. Grabowski, M. Raniszewski: Two Simple Full-Text Indexes Based on the Suffix Array 183

    5 Fixed Block based Compact Suffix Array

    We propose a variant of M¨akinen’s compact suffix array [19,20], whose key feature is finding repeating suffix areas of fixed size, e.g., 32 bytes. This allows to maintain a byte aligned data layout, beneficial for speed and simplicity. Even more, by setting a natural restriction on one of the key parameters we force the structure’s building bricks to be multiples of 32 bits, which prevents misaligned access to data. M¨akinen’s index was the first opportunistic scheme for compressing a suffix array, that is such that uses less space on compressible texts. The key idea was to exploit runs in the SA, that is, maximal segments SA[i...i + ℓ 1] for which there exists another segment SA[j . . . j+ℓ 1], such that SA[j+s]= SA−[i+s]+1 for all 0 s<ℓ. This structure still allows for binary− search, only the accesses to SA cells require≤ local decompression.

    FBCSA build(SA[0 ...n 1], T BWT , bs, ss) − /* assume n is a multiple of bs */ (01) arr1 [ ]; arr2 [ ] (02) j 0← ← (03) repeat← /* current block of the suffix array is SA[j...j + bs 1] */ (04) find 3 most frequent symbols in T BWT [j...j + bs −1] and store them in MFS[0 ... 2] /* if there are less than 3 distinct symbols in T−BWT [j...j + bs 1], the trailing cells of MFS[0 ... 2] are set to NIL) */ − (05) for i 0 to bs 1 do ← BWT − (06) if T [j + i]= MFS[0] then arr1.append(00) BWT (07) else if T [j + i]= MFS[1] then arr1.append(01) BWT (08) else if T [j + i]= MFS[2] then arr1.append(10) (09) else arr1.append(11) BWT (10) pos0 = T [j...j + bs 1].pos(MFS[0]) BWT − (11) pos1 = T [j...j + bs 1].pos(MFS[1]) /* set NIL if MFS[1] = NIL */ (12) pos = T BWT [j...j + bs − 1].pos(MFS[2]) /* set NIL if MFS[2] = NIL */ 2 − (13) a2s = arr2 | | 1 (14) arr2.append(SA− [SA[j + pos0] 1]) 1 − (15) arr2.append(SA− [SA[j + pos1] 1]) /* append 1 if pos1 = NIL */ 1 − − (16) arr2.append(SA− [SA[j + pos2] 1]) /* append 1 if pos2 = NIL */ (17) for i 0 to bs 1 do − − (18) if←(T BWT [j−+ i] MFS[0],MFS[1],MFS[2] ) or (SA[j + i] % ss = 0) then 6∈ { } (19) arr1.append(1); arr2.append(SA[j + i]) (20) else arr1.append(0) (21) arr1.append(a2s) (22) j j + bs (23) if←j = n then break (24) until false (25) return (arr1,arr2)

    Figure 3. Building the fixed block based compact suffix array (FBCSA)

    We resign from maximal segments in our proposal. The construction algorithm for our structure, called fixed block based compact suffix array (FBCSA), is presented in Fig. 3. As a result, we obtain two arrays, arr1 and arr2, which are empty at the beginning, and their elements are always appended at the end during the construction. 184 Proceedings of the Prague Stringology Conference 2014

    The elements appended to arr1 are single bits or pairs of bits while arr2 stores suffix array indexes (32-bit integers). The construction makes use of the suffix array SA of text T , the inverse suffix 1 BWT BWT array SA− and T (which can be obtained from T and SA, that is, T [i] = T [(SA[i] 1) mod n]). Additionally,− there are two construction-time parameters: block size bs and sam- pling step ss. The block size tells how many successive SA indexes are encoded together and is assumed to be a multiple of 32, for int32-alignment of the structure layout. The parameter ss means that every ss-th SA index will be represented ver- batim. This sampling parameter is a time-space tradeoff; using larger ss reduces the overall space but decoding a particular SA index typically involves more recursive invocations. Let us describe the encoding procedure for one block, SA[j . . . j + bs 1], where j is a multiple of bs. − First we find the three most frequent symbols in T BWT [j . . . j + bs 1] and store them (in arbitrary order) in a small helper array MFS[0 ... 2] (line 04).− If the current block of T BWT does not contain three different symbols, the NIL value will be written in the last one or two cell(s) of MFS. Then we write information about the symbols BWT from MFS in the current block of T into arr1: we append 2-bit combination (00, 01 or 10) if a given symbol is from MFS and the remaining combination (11) otherwise (lines 05–09). We also store the positions of the first occurrences of the BWT symbols from MFS in the current block of T , using the variables pos0, pos1, pos2 (lines 10–12); again NIL values are used if needed. These positions allow to use links to runs of suffixes preceding subsets of the current ones marked by the respective symbols from MFS. We believe that a small example will be useful here. Let bs = 8 and the current block be SA[400 ... 407] (note this is a toy example and in the real implementation bs must be a multiple of 32). The SA block contains the indexes: 1000, 522, 801, 303, 906, 477, 52, 610. Let their preceding symbols (from T BWT ) be: a, b, a, c, d, d, b, b. The three most frequent symbols, written to MFS, are thus: b, a, d. The first occurrences of these symbols are at positions: 401, 400 and 404, respectively (that is, 400 + pos0 = 401, etc.). The SA offsets: 521 (= 522 1), 999 (= 1000 1) and 905 (= 906 1) will be linked to the current block. We conclude− that the pr−eceding groups of suffix− offsets are: [521, 522, 523] (as there are three symbols b in the current block of T BWT ), [999, 1000] and [905, 906]. We come back to the pseudocode. The described (up to three) links are obtained 1 thanks to SA− (lines 14–16) and are written to arr2. Finally, the offsets of the suffixes preceded with a symbol not from MFS (if any) have to be written to arr2 explicitly. Additionally, the sampled suffixes (i.e., those whose offset modulo ss is 0) are handled in the same way (line 18). To distinguish between referrentially encoded and explicitly written suffix offsets, we spent a bit per suffix and append them to arr1 (lines 19–20). To allow for easy synchronization between the portions of data in arr1 and arr2, the size of arr2 (in bytes) as it was before processing the current block is written to arr1 (line 21).

    6 Experimental results All experiments were run on a laptop computer with an Intel i3 2.1 GHz CPU, equipped with 8GB of DDR3 RAM and running Windows 7 Home Premium S. Grabowski, M. Raniszewski: Two Simple Full-Text Indexes Based on the Suffix Array 185

    SP1 64-bit. All codes were written in C++ and compiled with Microsoft Visual Studio 2010. The source codes for the FBCSA algorithm can be downloaded from http://ranisz.iis.p.lodz.pl/indexes/fbcsa/. The test datasets were taken from the popular Pizza & Chili site (http://pizzachili.dcc.uchile.cl/). We used the 200-megabyte versions of the files dna, english, proteins, sources and xml. In order to test the search algorithms, we generated 500 thousand patterns for each used pattern length; the patterns were extracted randomly from the corresponding datasets (i.e., each pattern returns at least one match). In the first experiment we compared pattern search (count) speed using the fol- lowing indexes: – plain suffix array (SA), – suffix array with a lookup table over the first 2 symbols (SA-LUT2), – suffix array with a lookup table over the first 3 symbols (SA-LUT3), – the proposed suffix array with deep buckets, with hashing the prefixes of length k = 8 (only for dna k = 12 and for proteins k = 5 is used); the load factor α in the hash table was set to 50% (SA-hash), – the proposed fixed block based compact suffix array with parameters bs = 32 and ss = 5 (FBCSA), – FBCSA (parameters as before) with a lookup table over the first 2 symbols (FBCSA-LUT2), – FBCSA (parameters as before) with a lookup table over the first 3 symbols (FBCSA-LUT3), – FBCSA (parameters as before) with a hashes of prefixes of length k = 8 (only for dna k = 12 and for proteins k = 5 is used); the load factor in the hash table was set to 50% (FBCSA-hash). The results are presented in Fig. 4. As expected, SA-hash is the fastest index among the tested ones. The reader may also look at Table 1 with a rundown of the achieved speedups, where the plain suffix array is the baseline index and its speed is denoted with 1.00.

    dna english proteins sources xml m = 16 SA 1.00 1.00 1.00 1.00 1.00 SA-LUT2 1.13 1.34 1.36 1.43 1.35 SA-LUT3 1.17 1.49 1.61 1.65 1.47 SA-hash 3.75 2.88 2.70 2.90 2.03 m = 64 SA 1.00 1.00 1.00 1.00 1.00 SA-LUT2 1.12 1.33 1.34 1.42 1.34 SA-LUT3 1.17 1.49 1.58 1.64 1.44 SA-hash 3.81 2.87 2.62 2.75 1.79

    Table 1. Speedups with regard to the search speed of the plain suffix array, for the five datasets and pattern lengths m = 16 and m = 64

    The SA-hash index has two drawbacks: it requires significantly more space than the standard SA and we assume (at construction time) a minimal pattern length mmin. The latter issue may be eliminated, but for the price of even more space use; namely, 186 Proceedings of the Prague Stringology Conference 2014

    dna.200MB english.200MB 60 60 SA-hash SA-hash FBCSA-hash FBCSA-hash SA SA 50 FBCSA 50 FBCSA SA-LUT2 SA-LUT2 FBCSA-LUT2 FBCSA-LUT2 SA-LUT3 SA-LUT3 40 40 FBCSA-LUT3 FBCSA-LUT3

    30 30

    20 20 search time / pattern (us) timepattern / search (us) timepattern / search

    10 10

    0 0 10 20 30 40 50 60 10 20 30 40 50 60 m m

    proteins.200MB sources.200MB 60 60 SA-hash SA-hash FBCSA-hash FBCSA-hash SA SA 50 FBCSA 50 FBCSA SA-LUT2 SA-LUT2 FBCSA-LUT2 FBCSA-LUT2 SA-LUT3 SA-LUT3 40 40 FBCSA-LUT3 FBCSA-LUT3

    30 30

    20 20 search time / pattern (us) timepattern / search (us) timepattern / search

    10 10

    0 0 10 20 30 40 50 60 10 20 30 40 50 60 m m

    xml.200MB 60 SA-hash FBCSA-hash SA 50 FBCSA SA-LUT2 FBCSA-LUT2 SA-LUT3 40 FBCSA-LUT3

    30

    20 search time / pattern (us) pattern time / search

    10

    0 10 20 30 40 50 60 m

    Figure 4. Pattern search time (count query). All times are averages over 500K random patterns of the same length m = mmin, 16, 32, 64 , where mmin is 8 for most datasets except for dna (12) and proteins (5). The patterns{ were extracted} from the respective texts. S. Grabowski, M. Raniszewski: Two Simple Full-Text Indexes Based on the Suffix Array 187

    we can build one hash table for each pattern length from 1 to mmin (counting queries for those short patterns do not ever need to perform binary search over the suffix array). For the shortest lengths ( 1, 2 or 1, 2, 3 ) lookup tables may be alternatively used. { } { } We have not implemented this “all-HT” variant, but it is easy to estimate the memory use for each dataset. To this end, one needs to know the number of distinct q-grams for q m (Table 2). ≤ min q dna english proteins sources xml 1 16 225 25 230 96 2 152 10,829 607 9,525 7,054 3 683 102,666 11,607 253,831 141,783 4 2,222 589,230 224,132 1,719,387 908,131 5 5,892 2,150,525 3,623,281 5,252,826 2,716,438 6 12,804 5,566,993 36,525,895 10,669,627 5,555,190 7 28,473 11,599,445 94,488,651 17,826,241 8,957,209 8 80,397 20,782,043 112,880,347 26,325,724 12,534,152 9 279,680 33,143,032 117,199,335 35,666,486 16,212,609 10 1,065,613 48,061,001 119,518,691 45,354,280 20,018,262

    Table 2. The number of distinct q-grams (1 ... 10) in the datasets. The number of distinct 12-grams for dna is 13,752,341.

    The number of bytes for one hash table with z entries and 0 <α 1 load factor is, in our implementation, z 8 (1/α), since each entry contains two 4-byte≤ integers. For example, in our experiments× × the hash table for english needed 20,782,043 16 = 332,512,688 bytes, i.e., 158.6% of the size of the text itself. × An obvious idea to reduce the HT space, in an open addressing scheme, is increas- ing its load factor α. The search times then are, however, likely to grow. We checked several values of α on two datasets (Table 3) to conclude that using α = 80% may be a reasonable alternative to α = 50%, as the pattern search times grow by only about 10%.

    HT load factor (%) 25 50 60 70 80 90 dna, m = 12 1.088 1.111 1.122 1.172 1.214 1.390 dna, m = 16 1.359 1.362 1.389 1.421 1.491 1.668 dna, m = 32 1.320 1.347 1.360 1.391 1.463 1.662 dna, m = 64 1.345 1.394 1.409 1.428 1.491 1.672 english, m = 8 1.292 1.386 1.402 1.487 1.524 1.617 english, m = 16 1.670 1.761 1.781 1.846 1.892 1.998 english, m = 32 1.665 1.762 1.813 1.858 1.931 2.015 english, m = 64 1.714 1.794 1.829 1.869 1.967 2.039

    Table 3. Average pattern search times (in µs) in function of the HT load factor α for the SA-hash algorithm

    Finally, in Table 4 we present the overall space use for the four non-compact SA variants: plain SA, SA-LUT2, SA-LUT3 and SA-hash, plus SA-allHT, which is a (not implemented) structure comprising a suffix array, a LUT2 and one hash table for each k 3, 4,...,mmin . The space is expressed as a multiple of the text length n (including∈ { the text), which} is for example 5.000 for the plain suffix array. We note that 188 Proceedings of the Prague Stringology Conference 2014 the lookup table structures become a relatively smaller fraction when larger texts are indexed. For the variants with hash tables we take two load factors: 50% and 80%.

    dna english proteins sources xml SA 5.000 5.000 5.000 5.000 5.000 SA-LUT2 5.001 5.001 5.001 5.001 5.001 SA-LUT3 5.321 5.321 5.321 5.321 5.321 SA-hash-50 6.050 6.587 5.278 7.010 5.958 SA-hash-80 5.657 5.992 5.174 6.257 5.600 SA-allHT-50 6.472 8.114 5.296 9.736 7.353 SA-allHT-80 5.920 6.947 5.185 7.960 6.471

    Table 4. Space use for the non-compact data structures as a multiple of the indexed text size (including the text), with the assumption that text symbols are represented in 1 byte each and SA offsets are represented in 4 bytes. The value of mmin for SA-hash-50 and SA-hash-80, used in the construction of these structures and affecting their size, is like in the experiments from Fig. 4. The index SA-allHT-* contains one hash table for each k 3, 4,...,mmin , when mmin depends on the current dataset, as explained. The -50 and -80 suffixes∈ { in the structure} names denote the hash load factors (in percent).

    In the next set of experiments we evaluated the FBCSA index. Its properties of interest, for various block size (bs) and sampling step (ss) parameters, are: the space use, pattern search times, times to access (extract) one random SA cell, times to access (extract) multiple consecutive SA cells. For bs we set the values 32 and 64. The ss was tested in a wider range (3, 5, 8, 16, 32). Using bs = 64 results in better compression but decoding a cell is also slightly slower (see Fig. 5).

    FBCSA, compression (bs = 32) FBCSA, compression (bs = 64) dna dna 3.5 3.5 english english proteins proteins 3.0 sources 3.0 sources xm l xm l

    2.5 2.5

    2.0 2.0 index size / text size index size / text size

    1.5 1.5

    1.0 1.0 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35

    ss ss

    A £ ¤ ¥ ¥¦§§ ¨ © ¦ ¨ ¤  §          !  " # $ % &' ( ) *+ F ¡¢ dom cell dom cell 3.0 dna 3.0 dna english english 2.5 proteins 2.5 proteins sources sources

    2.0 xm l 2.0 xm l

    1.5 1.5

    1.0 1.0 access time per cell (us) access time per cell (us)

    0.5 0.5

    0.0 0.0 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 ss ss Figure 5. FBCSA index sizes and cell access times with varying ss parameter (3, 5, 8, 16, 32). The parameter bs was set to 32 (left figures) or 64 (right figures). The times are averages over 10M random cell accesses. S. Grabowski, M. Raniszewski: Two Simple Full-Text Indexes Based on the Suffix Array 189

    Unfortunately, our tests were run under Windows and it was not easy for us to adapt other competitive compact indexes to run on our platform, yet from the comparison with the results presented in [13, Sect. 4] we conclude that FBCSA is a few times faster in single cell access than the other related algorithms, MakCSA [20] (augmented with a compressed bitmap from [23] to extract arbitrary ranges of the suffix array) and LCSA / LCSA-Psi [13], at similar or better compression. Extracting c consecutive cells is not however an efficient operation for FBCSA (as opposed to MakCSA and LCSA / LCSA-Psi, see Figs 5–7 in [13]), yet for small ss the time growth is slower than linear, due to a few sampled (and thus written explicitly) SA offsets in a typical block (Fig. 6). Therefore, in extracting only 5 or 10 successive cells our index is still competitive.

    FBCSA (bs = 32), extracting 5 consecutive cells FBCSA (bs = 64), extracting 5 consecutive cells 9 10 dna dna 9 8 english english

    proteins 8 proteins 7 sources sources 7 6 xml xml 6 5 5 4 4 time per query (us) timequery per (us) timequery per 3 3

    2 2

    1 1 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 ss ss

    FBCSA (bs = 32), extracting 10 consecutive cells FBCSA (bs = 64), extracting 10 consecutive cells 16 18 dna dna 14 english 16 english proteins proteins 12 14 sources sources

    10 xml 12 xml

    8 10

    6 8 time per query (us) timequery per (us) timequery per 4 6

    2 4

    0 2 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 ss ss

    Figure 6. FBCSA, extraction time for c = 5 (top figures) and c = 10 (bottom figures) consecutive cells, with varying ss parameter (3, 5, 8, 16, 32). The parameter bs was set to 32 (left figures) or 64 (right figures). The times are averages over 1M random cell run extractions.

    7 Conclusions

    We presented two simple full-text indexes. One, called SA-hash, speeds up standard suffix array searches with reducing significantly the initial search range, thanks to a hash table storing range boundaries of all intervals sharing a prefix of a specified length. Despite its simplicity, we are not aware of such use of hashing in exact pattern matching, and the approximately 3-fold speedups compared to a standard SA may be worth the extra space in many applications. 190 Proceedings of the Prague Stringology Conference 2014

    The other presented data structure is a compact variant of the suffix array, related to M¨akinen’s compact SA [20]. Our solution works on blocks of fixed size, which provides int32 alignment of the layout. This index is rather fast in single cell access, but not competitive if many (e.g., 100) consecutive cells are to be extracted. Several aspects of the presented indexes requires further study. In the SA-hash scheme collisions in the HT may be eliminated with using perfect hashing or cuckoo hashing. This should also reduce the overall space use. In case of plain text, the standard suffix array component may be replaced with a suffix array on words [4], with possibly new interesting space-time tradeoffs. The idea of deep buckets may be incorporated into some compressed indexes, e.g., to save on the several first LF- mapping steps in the FM-index.

    Acknowledgement

    The work was supported by the National Science Centre under the project DEC-2013/09/B/ST6/03117 (both authors).

    References

    1. M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch: The enhanced suffix array and its applications to genome analysis, in Algorithms in Bioinformatics, Springer, 2002, pp. 449–463. 2. R. Cole, T. Kopelowitz, and M. Lewenstein: Suffix trays and suffix trists: structures for faster text indexing, in Automata, Languages and Programming, Springer, 2006, pp. 358–369. 3. M. Farach: Optimal suffix tree construction with large alphabets, in Proceedings of the 38th IEEE Annual Symposium on Foundations of Computer Science, 1997, pp. 137–143. 4. P. Ferragina and J. Fischer: Suffix arrays on words, in CPM, vol. 4580 of Lecture Notes in Computer Science, Springer, 2007, pp. 328–339. 5. P. Ferragina, R. Gonzalez,´ G. Navarro, and R. Venturini: Compressed text indexes: From theory to practice. Journal of Experimental Algorithmics, 13(article 12) 2009, 30 pages. 6. P. Ferragina and G. Manzini: Opportunistic data structures with applications, in Proceed- ings of the 41st IEEE Annual Symposium on Foundations of Computer Science, 2000, pp. 390– 398. 7. J. Fischer and P. Gawrychowski: Alphabet-dependent string searching with wexponential search trees. arXiv preprint arXiv:1302.3347, 2013. 8. G. Franceschini and R. Grossi: No sorting? Better searching! ACM Transactions on Algorithms, 4(1) 2008. 9. S. Gog and A. Moffat: Adding compression and blended search to a compact two-level suffix array, in SPIRE, vol. 8214 of Lecture Notes in Computer Science, Springer, 2013, pp. 141–152. 10. S. Gog, A. Moffat, J. S. Culpepper, A. Turpin, and A. Wirth: Large-scale pattern search using reduced-space on-disk suffix arrays. IEEE Trans. Knowledge and Data Engineering (to appear), 2013, http://arxiv.org/abs/1303.6481. 11. S. Gog and M. Petri: Optimized succinct data structures for massive data. Software–Practice and Experience, 2013, DOI: 10.1002/spe.2198. 12. R. Gonzalez´ and G. Navarro: Compressed text indexes with fast locate, in CPM, vol. 4580 of Lecture Notes in Computer Science, Springer, 2007, pp. 216–227. 13. R. Gonzalez,´ G. Navarro, and H. Ferrada: Locally compressed suffix arrays. ACM Journal of Experimental Algorithmics, 2014, to appear. 14. N. Grimsmo: On performance and cache effects in substring indexes, Tech. Rep. IDI-TR-2007- 04, NTNU, Department of Computer and Information Science, Sem Salands vei 7-9, NO-7491 Trondheim, Norway, 2007. 15. R. Grossi and J. S. Vitter: Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in Proceedings of the 32nd ACM Symposium on the Theory of Computing, ACM Press, 2000, pp. 397–406. S. Grabowski, M. Raniszewski: Two Simple Full-Text Indexes Based on the Suffix Array 191

    16. J. Karkk¨ ainen¨ : Suffix cactus: A cross between suffix tree and suffix array, in CPM, vol. 937 of Lecture Notes in Computer Science, Springer, 1995, pp. 191–204. 17. J. Karkk¨ ainen¨ and S. J. Puglisi: Fixed block compression boosting in FM-indexes, in SPIRE, R. Grossi, F. Sebastiani, and F. Silvestri, eds., vol. 7024 of Lecture Notes in Computer Science, Springer, 2011, pp. 174–184. 18. S. Kurtz and B. Balkenhol: Space efficient linear time computation of the Burrows and Wheeler transformation, in Numbers, Information and Complexity, Kluwer Academic Publishers, 2000, pp. 375–383. 19. V. Makinen¨ : Compact suffix array, in CPM, R. Giancarlo and D. Sankoff, eds., vol. 1848 of Lecture Notes in Computer Science, Springer, 2000, pp. 305–319. 20. V. Makinen¨ : Compact suffix array – a space-efficient full-text index. Fundam. Inform., 56(1-2) 2003, pp. 191–210. 21. U. Manber and G. Myers: Suffix arrays: a new method for on-line string searches, in Proceed- ings of the 1st ACM-SIAM Annual Symposium on Discrete Algorithms, SIAM, 1990, pp. 319– 327. 22. G. Navarro and V. Makinen¨ : Compressed full-text indexes. ACM Computing Surveys, 39(1) 2007, p. article 2. 23. R. Raman, V. Raman, and S. S. Rao: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets, in SODA, 2002, pp. 233–242. 24. K. Sadakane: Succinct representations of lcp information and improvements in the compressed suffix arrays, in Proceedings of the 13th ACM-SIAM Annual Symposium on Discrete Algorithms, SIAM, 2002, pp. 225–232. 25. P. Weiner: Linear pattern matching algorithm, in Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, Washington, DC, 1973, pp. 1–11. Reducing Squares in Suffix Arrays

    Peter Leupold

    Institut f¨ur Informatik, Universit¨at Leipzig, Leipzig, Germany [email protected]

    Abstract. In contrast to other mutations, duplication leaves an easily detectable trace: a repetition. Therefore it is a convenient starting point for computing a phylogenetic network. Basically, all squares must be detected to compute all possible direct prede- cessors. To find all the possible, not necessarily direct predecessors, this process must be iterated for all the resulting strings. We show how to reuse the work for one string in this process for the detection of squares in all the strings that are derived from it. For the detection of squares we propose suffix arrays as data structure and show how they can be updated after the reduction of a square. Keywords: duplication, suffix array

    1 Introduction The duplication of parts of a sequence is a very frequent gene mutation in DNA [5]. The result of such a duplication is called a tandem repeat. As duplications occur frequently in genomes, they can offer a way of reconstructing the way in which dif- ferent populations of a given species have developed. For example, Wapinski et al. [8] wanted to determine the relations between seventeen different populations of a species of fungi. They did this by looking at the tandem repeats in the genes. From these they induced possible relationships. In a very simplified way, the approach works like this: suppose at the same location in the genome of different individuals (or entire populations) of the same species we have the following sequences: uv, uuv, uvuv, and uvuvvuv. In this case, it is possible and even probable that the latter sequences have evolved from the first one via duplication. The second and third sequences are direct derivations from the first. To reach the last sequence, several duplications are necessary. So the question is, whether the corresponding individual can be a descendant of one or both of the others. Further duplications in uuv cannot change the fact that there are two successive letters u in the string. Since uvuvvuv does not contain uu, it cannot be a descendant. On the other hand, it is obtained from uvuv by duplicating the last three letters. So in this case the structure of repeats suggests a relationship like uuv uv uvuv uvuvvuv. It is the only relationship that is possible using only dupli⇐cation.⇒ ⇒ In general, the representation of evolutionary relationships between different nu- cleotide sequences, genes, chromosomes, genomes, or species is called a phylogenetic network [1]. Duplication is a mutation that is relatively easy to detect. It results in a repeat and thus leaves a visible and detectable trace unlike a deletion or a substi- tution. Thus possible predecessors can be computed by only looking at the sequence under question instead of comparing it to candidates for predecessors. In order to find possible ancestors via duplication, in principle the entire possible duplication history of a string must be reconstructed. Figure 1 graphically displays such a history. The main task here is finding repetitions in and thus possible ancestors of the given string.



    aabcbabcbbc

    abcbabcbbc aabcbbc aabcbabcbc

    abcbbc abcbabcbc aabcbc

    abcbc abcbabc abcbc

    abc

    Figure 1. The duplication history for the string aabcbabcbbc. The direction of reductions is top to bottom.

There exist efficient algorithms for the detection of tandem repeats like the ones implemented in the mreps web interface [2]. These take into account biochemical conditions and do not necessarily search for perfect repetitions. For strings, the detection of perfect repetitions is frequently based on suffix arrays or similar data structures [6]. There are also good algorithms for the construction of these. However, if we want to search for more repetitions in the predecessors, we have to compute their suffix arrays from scratch. Since the change in the string is only local, the new suffix array will be very similar to the old one. We aim to characterize this similarity and to show how the new suffix array can be computed by modifying the old one rather than computing the new one from scratch. Here we address this problem for strings and perfect repetitions without regarding possible biochemical restrictions of tandem repeats in real DNA.
It is worth noting that such a computation might prove infeasible already due to the mere size of the solution. Note that at the end of each complete sequence of reductions there is a square-free string, i.e. one without any repetition. Obviously, no further repetition can be reduced once there is none left. When we consider the elimination of repetitions as a string-rewriting process, these square-free strings are the original string's normal forms. Even the number of these normal forms, which form only the end points in the paths of the duplication histories, can be exponential in the length of the string.

    Theorem 1 ([4]). For every positive integer n there are words of length n whose number N of normal forms under eliminating squares is bounded by:

(1/30) · 2^(n/110) ≤ N ≤ 2^(n/42).

Thus there can be no hope of finding an efficient algorithm for computing the entire duplication history for all possible strings. However, many strings will not reach this worst case. For them, we intend to find good methods for computing their duplication histories. The efficient detection of repetitions is a first step in such a computation.

    2 Preliminaries on Strings

We recall some basic concepts on strings and fix notations for them. For a string w we denote by w[i] the symbol at position i, where we start counting at 0. w[i . . . j] denotes the substring starting at position i and ending at position j. A string w has a positive integer k as a period, if for all i, j such that i ≡ j (mod k) we have w[i] = w[j], if both w[i] and w[j] are defined. The length of w is always a trivial period of w. If a string has a non-trivial period then it is called periodic.
A string u is a prefix of w if there exists an i ≤ |w| such that u = w[0 . . . i]; if i < |w|, then the prefix is called proper. Suffixes are the corresponding concept reading from the back of the word to the front. The longest common prefix of two strings u and v is the prefix lcp(u, v) = u[0 . . . i] = v[0 . . . i] such that either u[i + 1] ≠ v[i + 1] or at least one of the strings has length only i + 1. So the function lcp takes two strings as its arguments and returns the length of their longest common prefix.
The lexicographical order is defined as follows for strings u and v over an ordered alphabet: u ≤ v if u is a prefix of v, or if the two strings can be factorized as u = wau′ and v = wbv′ for some w and letters a and b such that a < b.
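A minimal Python sketch of these two notions (purely illustrative; the paper itself does not rely on any particular implementation) could look as follows, using the same 0-based indexing:

    def lcp(u, v):
        # length of the longest common prefix of u and v
        i = 0
        while i < len(u) and i < len(v) and u[i] == v[i]:
            i += 1
        return i

    def has_period(w, k):
        # k is a period of w if w[i] == w[j] whenever i and j are congruent
        # modulo k and both positions are defined; equivalently w[i] == w[i + k]
        # for every i with i + k < |w|
        return all(w[i] == w[i + k] for i in range(len(w) - k))

    assert lcp("bcba", "bcbbcba") == 3
    assert has_period("bcbcbc", 2) and not has_period("abcbcbc", 2)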

    3 Runs, not Squares

Before we start to reduce squares, let us take a look at the effect that this operation has on periodic factors. In the following example, we see that the reduction of any of the three squares in the periodic factor bcbcbc leads to the same result:

    abcbcbca abcbcbca abcbcbca

ց ↓ ւ
abcbca

Thus it would not be efficient to do all the three reductions and produce three times the same string. A maximal periodic factor like this is called a run. Maximal here means that if we choose a longer factor that includes the current one, then this longer factor does not have the same period any more. In the string above, bcbcbc is a run. It has period two, but its extensions abcbcbc to the left and bcbcbca to the right do not have period two any more. It is rather straightforward to see how the example from above generalizes to arbitrary periodic factors. For the sake of completeness we give a formal proof of this fact.

    Lemma 2. Let w be a string with period k. Then any deletion of a factor of length k will result in the same string.

Proof. Because the string w has period k, the deleted factor starts with a suffix and ends with a prefix of w[0 . . . k − 1]. Thus it is of the form

w[i + 1 . . . k − 1] w[0 . . . i]

for some i < k. If it starts at position ℓ + 1, then the string

w[0 . . . ℓ] w[i + 1 . . . k − 1] w[0 . . . i] w[ℓ + k + 1 . . . |w| − 1]

is converted to

w[0 . . . ℓ] w[ℓ + k + 1 . . . |w| − 1].

For a deletion at position ℓ, which is one step more to the left, we obtain

w[0 . . . ℓ − 1] w[i] w[ℓ + k + 1 . . . |w| − 1].

Since w[i] = w[ℓ + k], this letter is equal to w[ℓ] due to the period k, and thus

w[0 . . . ℓ] = w[0 . . . ℓ − 1] w[i],

and the two results are the same. In an analogous way the deleted factor can be moved to the right. The result is the same, also for several consecutive movements. ⊓⊔
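As a quick illustrative check of Lemma 2 (not part of the proof), one can delete every length-k window of a string with period k and verify in Python that all results coincide:

    def reductions_of_length(w, k):
        # all strings obtained from w by deleting one factor of length k
        return {w[:i] + w[i + k:] for i in range(len(w) - k + 1)}

    # bcbcbc has period 2: every deletion of a length-2 factor yields bcbc
    assert reductions_of_length("bcbcbc", 2) == {"bcbc"}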

So rather than looking for squares, we should actually look for runs and reduce only one square within each of them. Then all the resulting strings will be different from each other. As stated above, the most common algorithms for detecting runs are based on suffix arrays and related data structures [6]. Using these methods, we would employ a strategy along the lines of Algorithm 1.

Algorithm 1: Computing all the strings reachable from w by reduction of squares.
Input: string: w;
Data: stringlist: S (contains w);
 1  while (S nonempty) do
 2      x := POP(S);
 3      Construct the suffix array of x;
 4      if (there are runs in x) then
 5          foreach run r do
 6              Reduce one square in r;
 7              Add new string to S;
 8          end
 9      end
10      else output x;
11  end

This method would then again be applied to all the resulting strings which are not square-free. Our main aim is to improve line 3 by modifying the antecedent suffix array instead of constructing the new one from scratch. For this we first recall what a suffix array is.
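A small Python sketch of this strategy is given below. It is purely illustrative: it finds runs by a naive quadratic scan rather than by the suffix array of line 3, and the function names are invented for this sketch.

    def one_reduction_per_run(x):
        # successors of x: reduce exactly one square per run (naive scan)
        results, seen_runs = set(), set()
        for i in range(len(x)):
            for p in range(1, (len(x) - i) // 2 + 1):
                if x[i:i + p] == x[i + p:i + 2 * p]:       # square of root length p at i
                    start, end = i, i + 2 * p               # extend to a maximal run
                    while start > 0 and x[start - 1] == x[start - 1 + p]:
                        start -= 1
                    while end < len(x) and x[end] == x[end - p]:
                        end += 1
                    if (start, end, p) not in seen_runs:
                        seen_runs.add((start, end, p))
                        results.add(x[:i] + x[i + p:])      # delete one half of the square
        return results

    def all_reachable(w):
        # Algorithm 1: all strings reachable from w by square reductions,
        # together with the square-free normal forms among them
        stack, reachable, normal_forms = [w], {w}, set()
        while stack:
            x = stack.pop()
            successors = one_reduction_per_run(x)
            if not successors:
                normal_forms.add(x)
            for y in successors:
                if y not in reachable:
                    reachable.add(y)
                    stack.append(y)
        return reachable, normal_forms

    # all_reachable("aabcbabcbbc") should reproduce the strings of Figure 1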

    4 Suffix Arrays

In string algorithms suffix arrays are a very common data structure, because they allow fast search for patterns. A suffix array of a string w consists of the two tables depicted on the left-hand side of Figure 2: SA is the lexicographically ordered list of all the suffixes of w; typically their starting position is saved rather than the entire suffix. LCP is the list of the longest common prefixes between these suffixes. Here we only provide the values for direct neighbors. Depending on the application, they may be saved for all pairs. SA and LCP are also called an extended suffix array in contrast to SA alone.

        abcbbcba                          abcba (after the deletion)
        SA   LCP   suffix                 SA       LCP   suffix
         7    1    a                      7−3=4     1    a
         0    0    abcbbcba               0         0    abcba (new)
         6    1    ba                     6−3=3     1    ba
         3    1    bbcba          ⇒       —
         4    3    bcba                   4−3=1     0    bcba
         1    0    bcbbcba                —
         5    2    cba                    5−3=2          cba
         2         cbbcba                 —

    Figure 2. Modification of the suffix array by deletion of bcb in abcbbcba.

    Now let there be a run with period k that contains at least one square uu starting in position i in a string; this means the run has at least length 2k. Then the positions i and i + k have an LCP of at least k and are very close to each other in the suffix array, because both suffixes start with u. This is why suffix arrays can be used to detect runs without looking at the actual string again once the array is computed. On the right-hand side of Figure 2 we see how the deletion of bcb changes the suffix array. There is no change in the relative order nor in the LCP values for all the suffixes that start to the right of the deletion site; here it is more convenient to consider the first half of the square as the deleted one, because then we see immediately that also for the positions in the remaining right half nothing changes, see also Figure 3. The only new suffix is abcba. It starts with the same letter as abcbbcba, the one it comes from; also the following bcb is the same as before, because the deleted factor is replaced by another copy of itself — only after that there can be some change. Thus the new suffix will not be very far from the old one in lexicographic order. Formulating these observations in a more general and exact way will be the objective of the next section.
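For completeness, the two tables of Figure 2 can be reproduced with a naive Python construction (illustrative only; a real implementation would use one of the linear-time suffix array algorithms):

    def extended_suffix_array(w):
        # SA: starting positions of all suffixes of w in lexicographic order
        sa = sorted(range(len(w)), key=lambda i: w[i:])
        # LCP[t]: longest common prefix of the suffixes sa[t] and sa[t+1]
        lcp = []
        for a, b in zip(sa, sa[1:]):
            u, v = w[a:], w[b:]
            k = 0
            while k < len(u) and k < len(v) and u[k] == v[k]:
                k += 1
            lcp.append(k)
        return sa, lcp

    sa, lcp = extended_suffix_array("abcbbcba")
    # sa  == [7, 0, 6, 3, 4, 1, 5, 2]
    # lcp == [1, 0, 1, 1, 3, 0, 2]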

    5 Updating the Suffix Array

The problem we treat here is the following: Given a string w that contains a square uu with |u| = n starting at position k, and given the suffix array of w, compute the suffix array of w[0 . . . k − 1] w[k + n . . . |w| − 1]. So w[k . . . k + n − 1] is deleted from the original string, not w[k + n . . . k + 2n − 1]. The result is, of course, the same; however, it is convenient for our considerations to suppose that it is the first half of the square that is deleted. In this way, it is a little bit easier to see which suffixes of the original string are also suffixes of the new one.

    Figure 3 illustrates the simple fact that the positions to the right of a deleted square remain in the same order and that this is also true for the positions in the second half of the square, which is not deleted.

    Figure 3. The order and LCP of suffixes that start right of the deletion remain unchanged.

    For updating a suffix array, this means that only suffixes starting in positions to the left of the deleted site change and thus might also change their position in the suffix array. Figure 4 shows that no such change occurs if the LCP value is not greater than the sum of the length of the deleted factor (because this factor is replaced by another copy of itself from the right) and the distance from the start of the suffix to the start of the square (because it remains a prefix). Only if the LCP is greater than this sum the suffix changes some of its first LCP many symbols as depicted in Figure 5. In this case we have to check if the position of the suffix in the suffix array changes. Lemma 3 formally proves these conditions.


Figure 4. If the LCP is not greater than ℓ + n, then the order of the new suffix relative to the old ones remains unchanged, because its first ℓ + n symbols are the same as before the deletion.

Lemma 3. Let the LCP of two strings z and uvw be k and let z < uvw. Then z and uvvw have the same LCP and z < uvvw, unless LCP(z, uvw) ≥ |uv|; in the latter case also LCP(z, uvvw) ≥ |uv|.

Proof. If LCP(z, uvw) < |uv| then the first position from the left where z and uvw differ is within uv. As uv is also a prefix of uvvw, z and uvvw have their first difference in the same position. Thus LCP and the lexicographic order remain the same. If LCP(z, uvw) ≥ |uv|, then uv is a common prefix of z and uvvw. Thus also LCP(z, uvvw) ≥ |uv|. ⊓⊔

So we know that we only have to process suffixes starting to the left of the deleted factor. But also here not necessarily all suffixes have to be checked. To be more exact, as soon as, starting from the right, one suffix does not fulfill the conditions of Lemma 3, all the suffixes starting to the left of it will not fulfill these conditions either.

Lemma 4. Let LCP[j] = k in the suffix array of a string w of length n + 1. Then for i < j we always have LCP[i] ≤ k + j − i.


Figure 5. Only if the LCP is greater than ℓ + n can some letter within the prefix of length LCP change.

Proof. Let us suppose the contrary of the statement, i.e. LCP[i] > k + j − i. Further let the suffix of w that is lexicographically following w[i . . . n] start at position m. This suffix shares a prefix of length at least k + j − i + 1 with w[i . . . n], i.e.

w[m . . . m + k + 1] = w[i . . . i + k + 1].

Disregarding the first j − i letters we obtain

w[m + (j − i) . . . m + (j − i) + k + 1] = w[i + (j − i) . . . i + (j − i) + k + 1]

and this gives us

w[m + j − i . . . m + j − i + k + 1] = w[j . . . j + k + 1].

So w[j . . . n] shares k + 1 letters with the suffix w[m + j − i . . . n] of w. Further, since w[m . . . n] is lexicographically greater than w[i . . . n], also w[m + j − i . . . n] is greater than w[j . . . n]. Therefore LCP[j] must be at least k + 1, contradicting our assumption. ⊓⊔

Thus the best strategy seems to be to start at the first position left of the deleted factor and check the condition of Lemma 3. If it indicates that the position and LCP change, these changes are done and we move one position left. As soon as the condition of Lemma 3 says that no change can occur for the present position we can stop, because there will not be changes for the positions further to the left either.
Algorithm 2 implements this strategy, avoiding unnecessary work according to the observations of this section. First we treat all the suffixes that start behind the deletion. They are simply decreased by n, the length of the deleted factor. The positions k to k + n − 1 are thus deleted. For the other suffixes we need to find out whether they must be moved to a different position. The test in line 5 checks exactly the condition of Lemma 3. Lemma 4 shows that as soon as LCP[i] > n + k − i no longer holds we do not need to continue. The positions of the remaining suffixes can again be copied.
The variable m that appears in line 10 is not bound. The reason is the following: If in line 6 a suffix is moved down in the array, then it is best to insert it directly in SAnew. If, on the other hand, it is moved up, then this part of SAnew is not yet computed or copied. Therefore we should insert it into its position in the old SA and copy it in the final for-loop. So m should be the number of suffixes that are moved up in line 6 and it should be logged every time line 6 is executed. It remains to implement the reordering and the computation of the new LCP value.

Algorithm 2: Computing the new suffix array.
Input: string: w; arrays: SA, LCP; length and pos of square: n, k;
 1  for j = n + k to |w| − 1 do
 2      SAnew[j] := SA[j] − n;
 3  end
 4  i := k − 1;
 5  while (LCP[i] > n + k − i AND i ≥ 0) do
 6      compute SAnew of w[i . . . k − 1] w[k + n . . . |w| − 1];
 7      compute new LCP[i];
 8      i := i − 1;
 9  end
10  for j = 0 to i + m do
11      SAnew[j] := SA[j];
12  end
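The following Python sketch mirrors Algorithm 2 at a high level. It is purely illustrative: suffixes are kept as plain strings, the affected suffixes are re-inserted by a simple binary search instead of the narrowed search of Section 6, and the LCP update of line 7 is omitted.

    import bisect

    def update_after_reduction(w, sa, lcp, k, n):
        # Given the extended suffix array (sa, lcp) of w, with lcp[t] the lcp of
        # the suffixes sa[t] and sa[t+1], compute the suffix array of w after
        # deleting the square half w[k .. k+n-1].  Only suffixes left of the
        # deletion that fail the test of Lemma 3 are re-inserted; all other
        # entries are copied (and shifted by -n where necessary).
        new_w = w[:k] + w[k + n:]
        # LCP value addressed by suffix starting position
        lcp_at = {sa[t]: lcp[t] for t in range(len(sa) - 1)}
        lcp_at[sa[-1]] = 0
        # positions left of the deletion that may move (Lemma 3 / Lemma 4 test)
        affected = set()
        i = k - 1
        while i >= 0 and lcp_at[i] > n + k - i:
            affected.add(i)
            i -= 1
        kept, to_reinsert = [], []
        for p in sa:
            if k <= p < k + n:
                continue                        # this suffix disappeared
            q = p - n if p >= k + n else p      # its starting position in new_w
            (to_reinsert if p in affected else kept).append(q)
        # re-insert the few affected suffixes; Section 6 narrows this search
        new_sa = kept
        for q in to_reinsert:
            pos = bisect.bisect_left([new_w[s:] for s in new_sa], new_w[q:])
            new_sa.insert(pos, q)
        return new_sa

    # Figure 2: abcbbcba with the square half bcb deleted at position 1
    # update_after_reduction("abcbbcba", [7, 0, 6, 3, 4, 1, 5, 2],
    #                        [1, 0, 1, 1, 3, 0, 2], 1, 3) == [4, 0, 3, 1, 2]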

    6 Computing the Changes

    There are two tasks whose details are left open in Algorithm 2. For those suffixes whose position changes, we need to find the new position. Then also all of the affected LCP-values must be updated.

    6.1 Computing the New Position

Intuitively, the position of a new suffix is not very far from the position of the old suffix it is derived from. If a suffix wuuv is converted to wuv, then the prefix wu remains the same. Both w and u are non-empty in our case, because only suffixes that start left of the deletion can change their position, as Figure 3 showed. Therefore |wu| ≥ 2 and consequently lcp(wuuv, wuv) ≥ 2. Further, it is clear that also all the suffixes that are between the positions of wuuv and wuv must start with wu and thus have an LCP with both of them that is greater than or equal to |wu|. Therefore we can restrict our search for the position of wuv to the part of the suffix array around wuuv, where the LCP-values are not smaller than |wu|. Within this range we use the standard method for inserting a string into a suffix array.

    6.2 Updating the LCP

At this point, where a suffix is moved to another position, the LCP-table must be updated to contain the new LCP between the predecessor and the successor. This is computed via the following, well-known equality: for three consecutive suffixes u, v, and w in the suffix array, always lcp(u, w) = min(lcp(u, v), lcp(v, w)) holds, see also the work of Salson et al. that treats deletions in general [7].
When a new suffix is inserted into the suffix array or moved to a new position, we actually need to compute two LCP values: the ones with either of the neighboring positions. Here we distinguish two cases. If we have inserted at the position just after i + ℓ from the original position i, then we know that the LCP of wuv with the suffix at position j + 1 is at least |wu| as explained in Section 6.1. Therefore we only need to test, starting from the first letter after wu, whether there are further matching letters.

The LCP with the following position might be as low as zero. In this case, however, also the original LCP[i + ℓ] was zero, because the two suffixes start with the same letter as they share the prefix wu. More generally, if LCP[i + ℓ] was less than |wu|, then the inserted suffix has the same LCP with its successor. So we only need to compare more letters if the old LCP[i + ℓ] is greater than |wu|. Otherwise we can inherit the old value.
For the case that the new position in the suffix array is higher, we proceed symmetrically. The LCP with the successor is at least |wu|, and the LCP with the predecessor can be taken from the original list unless this value was greater than |wu|.

7 Conclusion and Perspectives

In the best case our method will immediately detect that essentially no change in the suffix array needs to be done. For long squares this is even probable, because the longer the square, the smaller the probability that the LCP-value will be even bigger. Then the test in line 5 of Algorithm 2 immediately fails, and we essentially copy the relevant parts of the old suffix array. On the other hand, even in the worst case we should save time compared to constructing the new suffix array from scratch. Everything behind the reduced square can be copied. And for the new suffixes that change position we can use some of the old information for making the finding of the new position and the computation of LCP easier. The real question is in how far we can do better than the general updating after a deletion that Salson et al. designed [7]. This can probably only be answered by testing actual implementations on data sets in analyses like the ones carried out by Léonard et al. [3].
We have only looked at how to update a suffix array efficiently. But for actually computing a duplication history several more problems must be handled in an efficient way. As one word can produce many descendants, many suffix arrays must be derived from the same one. Then all of these must be stored at the same time. Again, they are all similar to each other. The question is whether there are ways in which this similarity can be used to store them more compactly.
Further, a typical duplication history contains many paths to a given word. For example, for a word w1u²w2v²w3x²w4 that contains three squares that do not overlap, there is one normal form w1uw2vw3xw4 and there are six paths leading to this string. Every intermediate word is on two of these paths. The ones with one square left are reached from two different words, the normal form is reached from the three strings with only one square. How do we avoid computing a string more than once? Is there a way of knowing that the result was already obtained in an earlier reduction?
Depending on the goal of the computation, we can possibly do something about the length of the squares that are reduced. Squares of length one can be reduced first, if we do not want the entire reduction graph, but only the normal forms. For detecting and reducing these squares, it is faster to just run a window of size two over the string in low linear time without building the suffix array. After this, the value n + k − i from line 5 of the algorithm would always be at least two. Squares of length two can already overlap with others in a way that reduction of one square makes reduction of the other impossible, like in the string abcbabcbc; here reduction of the final bcbc leads to a square-free string, and the other normal form abc cannot be reached anymore.

    References

1. D. H. Huson, R. Rupp, and C. Scornavacca: Phylogenetic Networks, Cambridge University Press, Cambridge, 2010.
2. R. Kolpakov, G. Bana, and G. Kucherov: mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Research, 31(13) 2003, pp. 3672–3678.
3. M. Léonard, L. Mouchard, and M. Salson: On the number of elements to reorder when updating a suffix array. Journal of Discrete Algorithms, 11 2012, pp. 87–99.
4. P. Leupold: Reducing repetitions, in Prague Stringology Conference, J. Holub and J. Žďárek, eds., Prague Stringology Club Publications, Prague, 2009, pp. 225–236.
5. A. Meyer: Molecular evolution: Duplication, duplication. Nature, 421 2003, pp. 31–32.
6. S. J. Puglisi, W. F. Smyth, and M. Yusufu: Fast, practical algorithms for computing all the repeats in a string. Mathematics in Computer Science, 3(4) 2010, pp. 373–389.
7. M. Salson, T. Lecroq, M. Léonard, and L. Mouchard: Dynamic extended suffix arrays. Journal of Discrete Algorithms, 8(2) 2010, pp. 241–257.
8. I. Wapinski, A. Pfeffer, N. Friedman, and A. Regev: Natural history and evolutionary principles of gene duplication in fungi. Nature, 449 2007, pp. 54–61.

New Tabulation and Sparse Dynamic Programming Based Techniques for Sequence Similarity Problems

    Szymon Grabowski

Lodz University of Technology, Institute of Applied Computer Science, Al. Politechniki 11, 90-924 Łódź, Poland [email protected]

Abstract. Calculating the length of a longest common subsequence (LCS) of two strings, A of length n and B of length m, is a classic research topic, with many worst-case oriented results known. We present two algorithms for LCS length calculation with respectively O(mn log log n / log² n) and O(mn / log² n + r) time complexity, the latter working for r = o(mn/(log n log log n)), where r is the number of matches in the dynamic programming matrix. We also describe conditions for a given problem sufficient to apply our techniques, with several concrete examples presented, namely the edit distance, LCTS and MerLCS problems.

    Keywords: sequence similarity, longest common subsequence, sparse dynamic pro- gramming, tabulation

    1 Introduction

Measuring the similarity of sequences is an old research topic and many measures are known in the string matching literature. One classic example concerns the computation of a longest common subsequence (LCS), in which a subsequence that is common to all sequences and has the maximal possible length is looked for. A simple dynamic programming (DP) solution works in O(mn) time for two sequences of length n and m, respectively, but faster algorithms are known. The LCS problem has many applications in diverse areas, like version control systems, comparison of DNA strings, structural alignment of RNA sequences. Other related problems comprise calculating the edit (Levenshtein) distance between two sequences, the longest common transposition-invariant subsequence, or LCS with constraints, in which the longest common subsequence of two sequences must contain, or exclude, some other sequence.
Let us focus first on the LCS problem, for two sequences A and B. It is defined as follows. Given two sequences, A = a_1 ··· a_n and B = b_1 ··· b_m, over an alphabet Σ of size σ, find a longest subsequence ⟨a_{i_1}, a_{i_2}, . . . , a_{i_ℓ}⟩ of A such that a_{i_1} = b_{j_1}, a_{i_2} = b_{j_2}, . . . , a_{i_ℓ} = b_{j_ℓ}, where 1 ≤ i_1 < i_2 < ··· < i_ℓ ≤ n and 1 ≤ j_1 < j_2 < ··· < j_ℓ ≤ m. The found sequence may not be unique. W.l.o.g. we assume n ≥ m. To avoid uninteresting complications, we also assume that m = Ω(log² n). Additionally, we assume that σ = O(m). The case of a general alphabet, however, can be handled with standard means, i.e., we can initially map the sequences A and B onto an alphabet of size σ′ = O(m), in O(n log σ′) time, using a balanced binary search tree. We do not include this tentative preprocessing step in further complexity considerations. Often, a simplified version of the LCS problem is considered, when one is interested in telling only the length of a longest common subsequence (LLCS).



In this paper we present two techniques for finding the LCS length, one (Section 3) based on tabulation and improving the result of Bille and Farach-Colton [3] by a factor of log log n, the other (Section 4) combining tabulation and sparse dynamic programming and being slightly faster if the number of matches is appropriately limited. In Section 5 we show the conditions necessary to apply these algorithmic techniques. Some other, LCS-related, problems fulfill these conditions, so we immediately obtain new results for these problems as well.
Throughout the paper, we assume the word-RAM model of computation with machine word size w ≥ log n. All used logarithms are base 2.

2 Related work

A standard solution to the LCS problem is based on dynamic programming, and it is to fill a matrix M of size (n + 1) × (m + 1), where each cell value depends on a pair of compared symbols from A and B (that is, only on whether they match or not), and its (at most) three already computed neighbor cells. Each computed M[i, j] cell, 1 ≤ i ≤ n, 1 ≤ j ≤ m, stores the value of LLCS(A[1 . . . i], B[1 . . . j]). A well-known property describes adjacent cells: M(i, j) − M(i − 1, j) ∈ {0, 1} and M(i, j) − M(i, j − 1) ∈ {0, 1} for all valid i, j.
Despite almost 40 years of research, surprisingly little can be said about the worst-case time complexity of LCS. It is known that in the very restrictive model of unconstrained alphabet and comparisons with equal/unequal answers only, the lower bound is Ω(mn) [19], which is reached by a trivial DP algorithm. If the input alphabet is of constant size, the known lower bound is simply Ω(n), but if a total order between alphabet symbols exists and ≤-comparisons are allowed, then the lower bound grows to Ω(n log n) [10]. In other words, the gap between the proven lower bounds and the best worst-case algorithm is huge.
A simple idea proposed in 1977 by Hunt and Szymanski [12] has become a milestone in LCS research, and the departure point for theoretically better algorithms (e.g., [8]). The Hunt–Szymanski (HS) algorithm is essentially based on dynamic programming, but it visits only the matching cells of the matrix, typically a small fraction of the entire set of cells. This kind of selective scan over the DP matrix is called sparse dynamic programming (SDP). We note that the number of all matches in M, denoted with the symbol r, can be found in O(n) time, and after this (negligible) preprocessing we can decide if the HS approach is promising for the given data. More precisely, the HS algorithm works in O(n + r log m) or even O(n + r log log m) time. Note that in the worst case, i.e., for r = Θ(mn), this complexity is however superquadratic.
The Hunt–Szymanski concept was an inspiration for a number of subsequent algorithms for LCS calculation, and the best of them, the algorithm of Eppstein et al. [8], achieves O(D log log(min(D, mn/D))) worst-case time (plus O(nσ) preprocessing), where D ≤ r is the number of so-called dominant matches in M (a match (i, j) is called dominant iff M[i, j] = M[i − 1, j] + 1 = M[i, j − 1] + 1). Note that this complexity is O(mn) for any value of D. A more recent algorithm, by Sakai [18], is an improvement if the alphabet is very small (in particular, constant), as its time complexity is O(mσ + min(Dσ, p(m − q)) + n), where p = LLCS(A, B) and q = LLCS(A[1 . . . m], B).
A different approach is to divide the dynamic matrix into small blocks, such that the number of essentially different blocks is small enough to be precomputed before the main processing phase.
In this way, each block may be processed in constant time, making use of a prebuilt lookup table (LUT). This "Four Russians" technique was first used for the LCS problem by Masek and Paterson [15], for a constant alphabet, and refined by Bille and Farach-Colton [3] to work with an arbitrary alphabet. The obtained time complexities were O(mn/ log² n) and O(mn(log log n)²/ log² n), respectively, with linear space.
A related, but different approach is to use bit-parallelism to compute several cells of the dynamic programming matrix at a time. There are a few such variants (see [13] and references therein), all of them working in O(⌈m/w⌉ n) worst-case time, after O(σ⌈m/w⌉ + m)-time and O(σm)-space preprocessing, where w is the machine word size.
Yet another line of research considers the input sequences in compressed form. There exist such LCS algorithms for RLE-, LZ- and grammar-compressed inputs. We briefly mention two results. Crochemore et al. [4] exploited the LZ78-factorization of the input sequences over a constant alphabet, to achieve O(hmn/ log n) time, where h ≤ 1 is the entropy of the inputs. Gawrychowski [9] considered the case of two strings described by SLPs (straight line programs) of total size k, to show a solution computing their edit distance in O(kn√(log(n/k))) time, where n is the sum of their (non-compressed) lengths. Some other LCS-related results can be found in the surveys [1,2].
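For reference, the classic O(mn) dynamic programming recurrence described at the beginning of this section can be written down in a few lines of Python (an illustrative sketch, not part of the algorithms of this paper):

    def llcs(A, B):
        # classic O(mn) dynamic programming; M[i][j] = LLCS(A[1..i], B[1..j])
        n, m = len(A), len(B)
        M = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if A[i - 1] == B[j - 1]:
                    M[i][j] = M[i - 1][j - 1] + 1
                else:
                    M[i][j] = max(M[i - 1][j], M[i][j - 1])
        return M[n][m]

    assert llcs("abcbdab", "bdcaba") == 4        # e.g. "bcba" or "bdab"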

3 LCS in O(mn log log n / log² n) time

In this section we modify the technique of Bille and Farach-Colton (BFC) [3, Sect. 4], improving its worst-case time by a factor of log log n, to achieve O(mn log log n / log² n) time complexity, with linear space. First we present the original BFC idea, and then signal how our algorithm diverges from it. In the presentation, some unimportant details of the BFC solution are changed, to make the description more compatible with our variant.
The dynamic programming matrix M[0 . . . n, 0 . . . m] is divided into rectangular blocks with shared borders, of size (x1 + 1) × (x2 + 1), and the matrix is processed in horizontal stripes of x2 rows. By "shared borders" we mean that e.g. the bottom row of some block being part of its output is also part of the input of the block below. Values inside each block depend on:

(i) x1 corresponding symbols from sequence A,
(ii) x2 corresponding symbols from sequence B,
(iii) the top row of the block, which can be encoded differentially in x1 bits,
(iv) the leftmost column of the block, which can be encoded differentially in x2 bits.

The output of each block will be found via a lookup table built in a preprocessing stage. The key idea of the BFC technique is alphabet remapping in superblocks of size y × y. W.l.o.g. we assume that x1 divides y and x2 divides y. Consider one superblock of the matrix, corresponding to the two substrings: A[i′y + 1 . . . (i′ + 1)y] and B[j′y + 1 . . . (j′ + 1)y], for some i′ and j′. For the substring B[j′y + 1 . . . (j′ + 1)y] its symbols are sorted and q ≤ y unique symbols are found. Then, the y symbols are remapped to Σ_B^{j′} = {0 . . . q − 1}, using a balanced BST. Next, for each symbol from the snippet A[i′y + 1 . . . (i′ + 1)y] we find its encoding in Σ_B^{j′}, or assign q to it if it wasn't found there. This takes O(log y) time per symbol, thus the substrings of A and B associated with the superblock are remapped in O(y log y) time. The overall alphabet remapping time for the whole matrix is thus O((mn log y)/y).
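The superblock remapping step just described can be sketched in Python as follows (illustrative only; the balanced BST is replaced by a dictionary, and the function name is invented here):

    def remap_superblock(a_snip, b_snip):
        # the q <= y distinct symbols of the B-substring define the local alphabet
        unique_b = sorted(set(b_snip))
        code = {c: r for r, c in enumerate(unique_b)}    # balanced BST in the paper
        q = len(unique_b)
        # symbols of A not occurring in the B-substring are all mapped to q
        a_mapped = [code.get(c, q) for c in a_snip]
        b_mapped = [code[c] for c in b_snip]
        return a_mapped, b_mapped, q

    # remap_superblock("acca", "badb") == ([0, 3, 3, 0], [1, 0, 2, 1], 3)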

Figure 1. One horizontal stripe of the DP matrix, with 4 blocks of size 5 × 5 (x1 = x2 = 4). The corresponding snippets from sequence A and B are abbea and fgadf, respectively. These snippets are translated to a new alphabet (the procedure for creating the new alphabet is not shown here) of size 6, where the characters from A are mapped onto the alphabet {0, 1, . . . , 4} and value 5 is used for the characters from B not used in the encoding of the symbols from A belonging to the current superblock (the superblock is not shown here). The LCS values are stored explicitly in the dark shaded cells. The white and dark shaded cells with arrows are part of the input, and their LCS values are encoded differentially, with regard to their left or upper neighbor. The diagonally shaded cells are the output cells, also encoded differentially. The bottom right corner (BR) is stored in three forms: as the difference to its left neighbor (0 or 1), as the difference to its upper neighbor (0 or 1), and the value of UL (upper-left corner) plus the difference between BR and UL. The difference between BR and UL is part of the LUT output for the current block.

This remapping technique allows us to represent the symbols from the input components (i) and (ii) on O(log min(y + 1, σ)) bits each, rather than Θ(log σ) bits. It works because not the actual symbols from A and B are important for LCS computations, but only the equality relations between them. To simplify notation, let us assume a large enough alphabet so that min(y + 1, σ) = y + 1.
In this way, the input per block, comprising the components (i)–(iv) listed above, takes x1 log(y + 1) + x2 log(y + 1) + x1 + x2 bits, which cannot sum to ω(log n) bits, otherwise the preprocessing time and space for building the LUT handling all possible blocks would be superpolynomial in n. Setting y = x1² and x1 = x2 = log n/(6 log log n), we obtain O(mn(log log n)²/ log² n) overall time, with sublinear LUT space.
Now, we present our idea. Again, the matrix is processed in horizontal stripes of x2 rows and the alphabet remapping in superblocks of size y × y is used. The difference concerns the lookup table; instead of one, we build many of them. More precisely, for each (remapped) substring of length x2 from sequence B we build a lookup table for fast handling of the blocks in one horizontal stripe. Once a stripe is processed, its LUT is discarded to save space. This requires to compute the answers for all possible inputs in components (i), (iii) and (iv) (the component (ii) is fixed for a given stripe). The input thus takes x1 log(y + 1) + x1 + x2 = x1 log(2(y + 1)) + x2 bits.
The return value associated with each LUT key are the bottom and the right border of a block, in differential form (the lowest cell in the right border and the rightmost cell in the bottom border are the same cell, which is represented twice; once as a difference (0 or 1) to its left neighbor in the bottom border and once as a difference (0 or 1) to its upper neighbor in the right border) and the difference between the values of the bottom right and the top left corner (to know the explicit value of M in the bottom right corner), requiring x1 + x2 + log(min(x1, x2) + 1) bits in total. Fig. 1 illustrates.
As long as the input and the output of an LUT fit in a machine word, i.e., do not exceed w bits, we will process one block in constant time. Still, as in the original BFC algorithm, the LUT building costs also impose a limitation. More precisely, we are going to minimize the total time of remapping the alphabet in all the superblocks, building all O(m/x2) LUTs and finally processing all the blocks, which is described by the formula:

O(m log y + (mn log y)/y + (m/x2) · 2^{x1 log(2(y+1)) + x2} · x1 x2 + mn/(x1 x2)),

    x1 log(2(y+1))+x2 where 2 is the number of all possible LUT inputs and the x1x2 multi- plier corresponds to the computation time per one LUT cell. Let us set y = log2 n/2, 2 x1 = log n/(4log log n) and x2 = log n/4. In total we obtain O(mn log log n/ log n) time with o(n) extra space (for the lookup tables, used one at a time, and alphabet remapping), which improves the Bille and Farach-Colton result by factor loglog n. The improvement is achieved thanks to using multiple lookup tables (one per hori- zontal stripe). Formally, we obtain the following theorem.

Theorem 1. The length of the longest common subsequence (LCS) between two sequences, A, of length n, and B, of length m, where n ≥ m ≥ log² n, both over an integer alphabet, can be computed in O(mn log log n/ log² n) worst-case time. The algorithm needs o(n) words of space, apart from the two sequences themselves.
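To make the block mechanics tangible, the following Python sketch computes one block's output from the inputs (i)–(iv). It is illustrative only: it works on explicit lists rather than the bit-packed machine-word keys and precomputed lookup tables of the actual algorithm.

    def process_block(a_syms, b_syms, top_diffs, left_diffs):
        # One block of the LCS matrix with x1 columns and x2 rows (plus the
        # shared input borders).  Inputs:
        #   a_syms, b_syms  - the x1 resp. x2 remapped symbols (components (i), (ii))
        #   top_diffs       - x1 differences M[0][j] - M[0][j-1]   (component (iii))
        #   left_diffs      - x2 differences M[i][0] - M[i-1][0]   (component (iv))
        # Output: differential bottom row, differential right column and the
        # difference BR - UL, i.e. the quantities one LUT entry would store.
        x1, x2 = len(a_syms), len(b_syms)
        M = [[0] * (x1 + 1) for _ in range(x2 + 1)]
        for j in range(1, x1 + 1):
            M[0][j] = M[0][j - 1] + top_diffs[j - 1]
        for i in range(1, x2 + 1):
            M[i][0] = M[i - 1][0] + left_diffs[i - 1]
        for i in range(1, x2 + 1):              # rows follow the symbols of B
            for j in range(1, x1 + 1):          # columns follow the symbols of A
                best = max(M[i - 1][j], M[i][j - 1])
                if a_syms[j - 1] == b_syms[i - 1]:
                    best = max(best, M[i - 1][j - 1] + 1)
                M[i][j] = best
        bottom = [M[x2][j] - M[x2][j - 1] for j in range(1, x1 + 1)]
        right = [M[i][x1] - M[i - 1][x1] for i in range(1, x2 + 1)]
        return bottom, right, M[x2][x1] - M[0][0]

    # process_block([0, 1], [1, 0], [1, 0], [0, 1]) == ([0, 1], [1, 0], 2)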

4 LCS in O(mn/ log² n + r) time (for some r)

In this algorithm we also work in blocks, of size (b + 1) × (b + 1), but divide them into two groups: sparse blocks are those which contain at most K matches and dense blocks are those which contain more than K matches. Obviously, we do not count possible matches on the input boundaries of a block. We observe that knowing the left and top boundary of a block plus the location of all the matches in the block is enough to compute the remaining (right and bottom) boundaries. This is a nice property as it eliminates the need to (explicitly) access the corresponding substrings of A and B. The sparse block input will be encoded as:

(i) the top row of the block, represented differentially in b bits,
(ii) the leftmost column of the block, represented differentially in b bits,
(iii) the match locations inside the block, each in log(b²) bits, totalling O(K log b) bits.

Each sparse block will be computed in constant time, thanks to an LUT. Dense blocks, on the other hand, will be partitioned into smaller blocks, which in turn will be handled with our algorithm from Section 3. Clearly, we have b = O(log n) (otherwise the LUT build costs would be dominating) and b = ω(log n/√(log log n)) (otherwise this algorithm would never be better than the one from Section 3), which implies that K = Θ(log n/ log log n), with an appropriate constant.
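The observation that a sparse block is fully determined by its boundaries and its match positions can be sketched as follows (illustrative Python, again without any bit packing or lookup table; it mirrors the earlier per-symbol block sketch, with the symbol comparison replaced by a lookup in the match set):

    def process_sparse_block(b, top_diffs, left_diffs, matches):
        # Same computation as the per-symbol block sketch of Section 3, except
        # that the symbol comparison is replaced by a membership test in the
        # set of match positions (i, j), 1 <= i, j <= b; the substrings of
        # A and B themselves are never consulted.
        M = [[0] * (b + 1) for _ in range(b + 1)]
        for j in range(1, b + 1):
            M[0][j] = M[0][j - 1] + top_diffs[j - 1]
        for i in range(1, b + 1):
            M[i][0] = M[i - 1][0] + left_diffs[i - 1]
        for i in range(1, b + 1):
            for j in range(1, b + 1):
                best = max(M[i - 1][j], M[i][j - 1])
                if (i, j) in matches:
                    best = max(best, M[i - 1][j - 1] + 1)
                M[i][j] = best
        bottom = [M[b][j] - M[b][j - 1] for j in range(1, b + 1)]
        right = [M[i][b] - M[i - 1][b] for i in range(1, b + 1)]
        return bottom, right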

As this algorithm's worst-case time is Ω(mn/ log² n), it is easy to notice that the preprocessing costs for building required LUTs and alphabet mapping will not dominate. Each dense block is divided into smaller blocks of size Θ(log n/ log log n) × Θ(b). Let the fraction of dense blocks in the matrix be denoted as f_d (for example, if half of the (b + 1) × (b + 1) blocks in the matrix are dense, f_d = 0.5). The total time complexity (without preprocessing) is then

O((1 − f_d) mn/b² + f_d (mn log log n/(b log n))).

The fraction f_d must be o(1), otherwise this algorithm is not better in complexity than the previous one. This also means that 1 − f_d may be replaced with 1 in further complexity considerations.
Recall that r is the number of matches in the dynamic programming matrix. We have f_d = O((r/K)/(mn/b²)) = O(rb² log log n/(mn log n)). From the f_d = o(1) condition we also obtain that rb² = o(mn log n/ log log n). If r = o(mn/(log n log log n)), then we can safely use the maximum possible value of b, i.e., b = Θ(log n), and obtain the time of O(mn/ log² n). Unfortunately, in the preprocessing we have to find and encode all matches in all sparse blocks, which requires O(n + r) time. Overall, this leads to the following theorem.

Theorem 2. The length of the longest common subsequence (LCS) between two sequences, A, of length n, and B, of length m, where n ≥ m ≥ log² n, both over an integer alphabet, can be computed in O(mn/ log² n + r) worst-case time, assuming r = o(mn/(log n log log n)), where r is the number of matching pairs of symbols between A and B.

Considering the presented restriction on r, the achieved complexity is better than the result from the previous section. On the other hand, it is essential to compare the obtained time complexity with the one from the Eppstein et al. algorithm [8]. All we can say about the number of dominant matches D is the D ≤ r inequality¹, so we replace D with r in their complexity formula to obtain O(r log log(min(r, mn/r))) in the worst case. Our result is better if r = ω(mn/(log² n log log log n)) and r = o(mn). Overall, it gives the niche of r = ω(mn/(log² n log log log n)) and r = o(mn log log n/ log² n) in which the algorithm presented in this section is competitive.
The alphabet size is yet another constraint. From the comparison to Sakai's algorithm [18] we conclude that our algorithm needs σ = ω(log log log n) to dominate for the case of r = ω(mn/(log² n log log log n)).

¹ A slightly more precise bound on D is min(r, m²), but it may matter, in complexity terms, only if m = o(n) (cf. also [18, Th. 1]), which is a less interesting case.

    5 Algorithmic applications

The techniques presented in the two previous sections may be applied to any sequence similarity problem fulfilling certain properties. The conditions are specified in the following lemma.

Lemma 3. Let Q be a sequence similarity problem returning the length of a desired subsequence, involving two sequences, A of length n and B of length m, both over a

common integer alphabet Σ of size σ = O(m). We assume that 1 ≤ m ≤ n. Let Q admit a dynamic programming solution in which M(i, j) − M(i − 1, j) ∈ {−1, 0, 1}, M(i, j) − M(i, j − 1) ∈ {−1, 0, 1} for all valid i and j, and M(i, j) depends only on the values of its (at most) three neighbors M(i − 1, j), M(i, j − 1), M(i − 1, j − 1), and whether A_i = B_j.
There exists a solution to problem Q with O(mn log log n/ log² n) worst-case time. There also exists a solution to Q with O(mn/ log² n + r) worst-case time, for r = o(mn/(log n log log n)), where r is the number of symbol pairs A_i, B_j such that A_i = B_j. The space use in both solutions is O(n) words.

Proof. We straightforwardly apply the ideas presented in the previous two sections. The only modification is to allow a broader range of differences ({−1, 0, 1}) between adjacent cells in the dynamic programming matrix. This only affects a constant factor in parameter setting. ⊓⊔

Lemma 3 immediately serves to calculate the edit (Levenshtein) distance between two sequences (in fact, the BFC technique was presented in [3] in terms of the edit distance). We therefore obtain the following theorem.

Theorem 4. The edit distance between two sequences, A, of length n, and B, of length m, where n ≥ m ≥ log² n, both over an integer alphabet, can be computed in O(mn log log n/ log² n) worst-case time. Alternatively, the distance can be found in O(mn/ log² n + r) worst-case time, for r = o(mn/(log n log log n)), where r is the number of symbol pairs A_i, B_j such that A_i = B_j. The space use in both solutions is O(n) words.

Another feasible problem is the longest common transposition-invariant subsequence (LCTS) [14,5], in which we look for a longest subsequence of the form (s_1 + t)(s_2 + t) ··· (s_ℓ + t) such that all s_i belong to A (in increasing order), all corresponding values s_i + t belong to B (in increasing order), and t ∈ {−σ + 1, . . . , σ − 1} is some integer, called a transposition. This problem is motivated by music information retrieval. The best known results for LCTS are O(mn log log σ) [16,5] and O(mnσ(log log n)²/ log² n) if the BFC technique is applied for all transpositions (which is O(mn) if σ = O(log² n/(log log n)²)).
Applying the former result from Lemma 3, for all possible transpositions, gives immediately O(mnσ log log n/ log² n) time complexity (if σ = O(n^{1−ε}), for any ε > 0, otherwise the LUT build costs would dominate). Applying the latter result requires more care. First we notice that the numbers of matches over all the transpositions sum up to mn, so Θ(mn) is the total preprocessing cost. Let us divide the transpositions into dense ones and sparse ones, where the dense ones are those that have at least mn log log n/σ matches. The number of dense transpositions is thus limited to O(σ/ log log n). We handle dense transpositions with the technique from Section 3 and sparse ones with the technique from Section 4. This gives us O(mn + mn(σ/ log log n) log log n/ log² n + mnσ/ log² n) = O(mn(1 + σ/ log² n)) total time, assuming that σ = ω(log n(log log n)²), as this condition on σ implies that the number of matches in each sparse transposition is limited to o(mn/(log n log log n)), as required.
We note that σ = ω(log² n/(log log n)²) and σ = O(log² n) is the niche in which our algorithm is the first one to achieve O(mn) total time.

Theorem 5. The length of the longest common transposition-invariant subsequence (LCTS) between two sequences, A, of length n, and B, of length m, where n ≥ m ≥ log² n, both over an integer alphabet of size σ, can be computed in O(mn(1 + σ/ log² n)) worst-case time, assuming that σ = ω(log n(log log n)²).
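As a small illustration of the dense/sparse split of transpositions (only a sketch; the threshold value and the subsequent per-transposition LCS computations are outside its scope), the match counts per transposition can be obtained from symbol frequencies:

    from collections import Counter

    def classify_transpositions(A, B, sigma, threshold):
        # Count, for every transposition t, the number of pairs (a_i, b_j) with
        # b_j - a_i = t, and split the transpositions into dense and sparse ones.
        # Symbols are assumed to be integers in 0 .. sigma-1; counting via symbol
        # frequencies takes O(n + m + sigma^2) time here, whereas the paper
        # charges Theta(mn) for locating the matches themselves.
        count_a, count_b = Counter(A), Counter(B)
        matches = {t: sum(count_a[c] * count_b[c + t] for c in range(sigma))
                   for t in range(-sigma + 1, sigma)}
        dense = [t for t, cnt in matches.items() if cnt >= threshold]
        sparse = [t for t, cnt in matches.items() if cnt < threshold]
        return dense, sparse

    # following the analysis above, threshold would be m*n*log(log(n))/sigma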

A natural extension of Lemma 3 is to involve more than two (yet a constant number of) sequences. In particular, problems on three sequences have practical importance.

Lemma 6. Let Q be a sequence similarity problem returning the length of a desired subsequence, involving three sequences, A of length n, B of length m and P of length u, all over a common integer alphabet Σ of size σ = O(m). We assume that 1 ≤ m ≤ n and u = Ω(n^c), for some constant c > 0. Let Q admit a dynamic programming solution in which M(i, j, k) − M(i − 1, j, k) ∈ {−1, 0, 1}, M(i, j, k) − M(i, j − 1, k) ∈ {−1, 0, 1} and M(i, j, k) − M(i, j, k − 1) ∈ {−1, 0, 1}, for all valid i, j and k, and M(i, j, k) depends only on the values of its (at most) seven neighbors: M(i − 1, j, k), M(i, j − 1, k), M(i − 1, j − 1, k), M(i, j, k − 1), M(i − 1, j, k − 1), M(i, j − 1, k − 1) and M(i − 1, j − 1, k − 1), and whether A_i = B_j, A_i = P_k and B_j = P_k.
There exists a solution to Q with O(mnu/ log^{3/2} n) worst-case time. The space use is O(n) words.

Proof. The solution works on cubes of size b × b × b, setting b = Θ(√(log n)) with an appropriate constant. Instead of horizontal stripes, 3D "columns" of size b × b × u are now used. The LUT input consists of b symbols from sequence P, encoded with respect to a supercube in O(log log n) bits each, and three walls, of size b × b each, in differential representation. The output consists of the three opposite walls of a cube. The restriction u = Ω(n^c) implies that the overall time formula without the LUT build times is Ω(mn^{1+c}/ log^{3/2} n), which is Ω(mn^{1+c′}), for some constant c′, c ≥ c′ > 0, e.g., for c′ = c/2. The build time for all LUTs can be made O(mn^{1+c′′}), for any constant c′′ > 0, if the constant associated with b is chosen appropriately. We now set c′′ = c′ to show the build time for the LUTs is not dominating. ⊓⊔

As an application of Lemma 6 we present the merged longest common subsequence (MerLCS) problem [11], which involves three sequences, A, B and P, and its returned value is a longest sequence T that is a subsequence of P and can be split into two subsequences T′ and T″ such that T′ is a subsequence of A and T″ is a subsequence of B. Deorowicz and Danek [6] showed that in the DP formula for this problem M(i, j, k) is equal to or larger by 1 than any of the neighbors: M(i − 1, j, k), M(i, j − 1, k) and M(i, j, k − 1). They also gave an algorithm working in O(⌈u/w⌉ mn log w) time. Peng et al. [17] gave an algorithm with O(ℓmn) time complexity, where ℓ ≤ n is the length of the result. Motivations for the MerLCS problem, from bioinformatics and signal processing, can be found e.g. in [6]. Based on the cited DP formula property [6] we can apply Lemma 6 to obtain O(mnu/ log^{3/2} n) time for MerLCS (if u = Ω(n^c) for some c > 0), which may be competitive with existing solutions.

Theorem 7. The length of the merged longest common subsequence (MerLCS) involving three sequences, A, B and P, of length respectively n, m and u, where m ≤ n and u = Ω(n^c), for some constant c > 0, all over an integer alphabet of size σ, can be computed in O(mnu/ log^{3/2} n) worst-case time.

    6 Conclusions

On the example of the longest common subsequence problem we presented two algorithmic techniques, making use of the tabulation and sparse dynamic programming paradigms, which allow us to obtain competitive time complexities. Then we generalized the ideas by specifying conditions on DP dependencies whose fulfilment leads to immediate applications of these techniques. The actual problems considered here as applications comprise the edit distance, LCTS and MerLCS.
As a future work, we are going to relax the DP dependencies, which may for example improve the SEQ-EC-LCS result from [7]. Another research option is to try to improve the tabulation based result on compressible sequences.

    Acknowledgments

    The author wishes to thank Sebastian Deorowicz and an anonymous reviewer for helpful comments on a preliminary version of the manuscript.

    References

1. A. Apostolico: String editing and longest common subsequences, in Handbook of Formal Languages, vol. 2 Linear Modeling: Background and Application, Springer, 1997, ch. 8, pp. 361–398.
2. L. Bergroth, H. Hakonen, and T. Raita: A survey of longest common subsequence algorithms, in SPIRE, IEEE Computer Society, 2000, pp. 39–48.
3. P. Bille and M. Farach-Colton: Fast and compact regular expression matching. Theoret. Comput. Sci., 409(3) 2008, pp. 486–496.
4. M. Crochemore, G. M. Landau, and M. Ziv-Ukelson: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput., 32(6) 2003, pp. 1654–1673.
5. S. Deorowicz: Speeding up transposition-invariant string matching. Inform. Process. Lett., 100(1) 2006, pp. 14–20.
6. S. Deorowicz and A. Danek: Bit-parallel algorithms for the merged longest common subsequence problem. Internat. J. Foundations Comput. Sci., 24(08) 2013, pp. 1281–1298.
7. S. Deorowicz and S. Grabowski: Subcubic algorithms for the sequence excluded LCS problem, in Man-Machine Interactions 3, 2014, pp. 503–510.
8. D. Eppstein, Z. Galil, R. Giancarlo, and G. F. Italiano: Sparse dynamic programming I: Linear cost functions. J. ACM, 39(3) 1992, pp. 519–545.
9. P. Gawrychowski: Faster algorithm for computing the edit distance between SLP-compressed strings, in SPIRE, 2012, pp. 229–236.
10. D. S. Hirschberg: An information-theoretic lower bound for the longest common subsequence problem. Inform. Process. Lett., 7(1) 1978, pp. 40–41.
11. K.-S. Huang, C.-B. Yang, K.-T. Tseng, H.-Y. Ann, and Y.-H. Peng: Efficient algorithm for finding interleaving relationship between sequences. Inform. Process. Lett., 105(5) 2008, pp. 188–193.
12. J. W. Hunt and T. G. Szymanski: A fast algorithm for computing longest common subsequences. Comm. ACM, 20(5) 1977, pp. 350–353.
13. H. Hyyrö: Bit-parallel LCS-length computation revisited, in AWOCA, Australia, 2004, University of Sydney, pp. 16–27.
14. V. Mäkinen, G. Navarro, and E. Ukkonen: Transposition invariant string matching. Journal of Algorithms, 56(2) 2005, pp. 124–153.
15. W. Masek and M. Paterson: A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1) 1980, pp. 18–31.
16. G. Navarro, S. Grabowski, V. Mäkinen, and S. Deorowicz: Improved time and space complexities for transposition invariant string matching, Technical Report TR/DCC-2005-4, Department of Computer Science, University of Chile, 2005.

    17. Y.-H. Peng, C.-B. Yang, K.-S. Huang, C.-T. Tseng, and C.-Y. Hor: Efficient sparse dynamic programming for the merged LCS problem with block constraints. International Journal of Innovative Computing, Information and Control, 6(4) 2010, pp. 1935–1947. 18. Y. Sakai: A fast on-line algorithm for the longest common subsequence problem with constant alphabet. IEICE Transactions, 95-A(1) 2012, pp. 354–361. 19. C. K. Wong and A. K. Chandra: Bounds for the string editing problem. J. ACM, 23(1) 1976, pp. 13–16.

    Author Index

Arimura, Hiroki, 3
Badkobeh, Golnaz, 162
Bai, Haoyue, 52
Bannai, Hideo, 43, 162
Cantone, Domenico, 30
Cazaux, Bastien, 148
Chhabra, Tamanna, 71
Cleophas, Loek, 17, 84
Ďurian, Branislav, 71
Faro, Simone, 30
Franek, Frantisek, 1, 52
Fredriksson, Kimmo, 59
Ghuman, Sukhpal Singh, 71, 169
Giaquinta, Emanuele, 169
Goto, Keisuke, 162
Grabowski, Szymon, 59, 179, 202
Hirvola, Tommi, 71
I, Tomohiro, 162
Iliopoulos, Costas S., 162
Inenaga, Shunsuke, 43, 162
Klein, Shmuel T., 96, 139
Kourie, Derrick G., 17, 84
Kurai, Ryutaro, 3
Leupold, Peter, 192
Matsuda, Shohei, 43
Minato, Shin-ichi, 3
Nagayama, Shinobu, 3
Peltola, Hannu, 71
Pollari-Malmi, Kerttu, 110
Puglisi, Simon J., 162
Raniszewski, Marcin, 179
Rautio, Jussi, 110
Rivals, Eric, 148
Shapira, Dana, 96, 139
Smyth, William F., 52
Strauss, Tinus, 17
Sugimoto, Shiho, 162
Susik, Robert, 59
Takeda, Masayuki, 43
Tarhio, Jorma, 71, 110, 169
Tiskin, Alexander, 124
Watson, Bruce W., 17, 84
Yasuda, Norihito, 3

Proceedings of the Prague Stringology Conference 2014
Edited by Jan Holub and Jan Žďárek
Published by: Prague Stringology Club
Department of Theoretical Computer Science
Faculty of Information Technology
Czech Technical University in Prague
Thákurova 9, Praha 6, 160 00, Czech Republic.
ISBN 978-80-01-05547-2
URL: http://www.stringology.org/
E-mail: [email protected]
Phone: +420-2-2435-9811
Printed by Česká technika – Nakladatelství ČVUT
Thákurova 550/1, Praha 6, 160 41, Czech Republic
© Czech Technical University in Prague, Czech Republic, 2014