Efficient Data Structures for Internal Queries in Texts

University of Warsaw Faculty of Mathematics, Informatics and Mechanics Tomasz Kociumaka Efficient Data Structures for Internal Queries in Texts PhD dissertation Supervisor prof. dr hab. Wojciech Rytter Institute of Informatics University of Warsaw October 2018 Author’s declaration: I hereby declare that this dissertation is my own work. October 26, 2018 . Tomasz Kociumaka Supervisor’s declaration: The dissertation is ready to be reviewed October 26, 2018 . prof. dr hab. Wojciech Rytter i Abstract This thesis is devoted to internal queries in texts, which ask to solve classic text-processing problems for substrings of a given text. More precisely, the task is to preprocess a static string T of length n (called the text) and construct a data structure answering certain questions about the substrings of T . The substrings involved in each query are specified in constant space by their occurrences in T , called fragments of T , identified by the start and the end positions. Components for internal queries often become parts of more complex data structures, and they are used in many algorithms for text processing. Longest Common Extension Queries, asking for the length of the longest common prefix of two substrings of the text T , are by far the most popular internal queries. They are used for checking if two fragments match (represent the same string) and for lexicographic comparison of substrings. Due to an optimal solution in the standard setting of texts over polynomially-bounded integer alphabets, with O(1)-time queries, O(n) size, and O(n) construction time, they have found numerous applications across stringology. In this dissertation, we provide the first optimal data structure for smaller alphabets of size σ n: it handles queries in O(1) time, takes O(n/ logσ n) space, and admits an O(n/ logσ n)-time construction (from the packed representation of T with Θ(logσ n) characters in each machine word). We then go back to alphabets of size σ polynomial in n and focus on more complex internal queries. Our first data structure supports Internal Pattern Matching Queries, which ask for the occurrences of one substring x within another substring y. After O(n)-time preprocessing of the text T , it answers these queries in time proportional to the quotient |y|/|x| of substrings’ lengths, which is required due to the information content of the output. We also use this data structure for Period Queries, asking for the periods of a given substring. Here, our logarithmic query time is also optimal by a similar information-theoretic argument. Further data structures are designed for Minimal Suffix and Minimal Rota- tion Queries, asking to compute the lexicographically smallest non-empty suffix and cyclic rotation of a given substring, respectively. They are answered in O(1) time after O(n)-time preprocessing. We also consider a more general problem of simulating the suffix array of a given substring (Substring Suffix Selection Queries, asking for the kth lexicographically smallest suffix of a substring) and its inverse suffix array (Substring Suffix Rank Queries, asking for the lexicographic rank of a substring’s suffix). Our data structure supports√ these queries in O(log n) time, takes O(n) space, and can be constructed in O(n log n) time. The tools developed in this dissertation additionally yield improved results for several kinds of Substring Compression Queries, which ask for the compressed representation of a given substring obtained using a specific method; we consider schemes based on the Lempel–Ziv parsing and the Burrows–Wheeler transform. Our results combine text-processing tools with combinatorics on words and state-of-the-art general-purpose data structures. The key technical contribution is a novel locally consistent symmetry-breaking scheme, formalized in terms of synchronizing functions, which is central to our solutions for Longest Common Extension Queries and Internal Pattern Matching Queries. 2012 ACM Subject Classification: Theory of computation → Pattern matching Keywords: Data structures, Longest common extension, Longest common prefix, Local consistency, Minimal Suffix, Period, Substring compression, Substring queries ii Streszczenie Rozprawa dotyczy zapytań wewnętrznych w tekstach, czyli rozwiązywania kla- sycznych problemów algorytmiki tekstów dla podsłów danego słowa. Formalnie, zadanie polega na wstępnym przetworzeniu statycznego słowa T długości n (tekstu), tak aby skonstruować strukturę danych odpowiadającą na pewne pytania dotyczące jego podsłów. Podsłowa te są identyfikowane w stałej pamięci za pomocą wystąpień w T , zwanych fragmentami tekstu i reprezentowanych przez pozycje początkową oraz końcową. Komponenty dla zapytań wewnętrznych są używane w wielu algorytmach tekstowych i wchodzą w skład bardziej rozbudowanych struktur danych. Zapytania o długość najdłuższego wspólnego prefiksu dwóch podsłów tekstu T stanowią zdecydowanie najpowszechniejszy problem tego rodzaju. Używa się ich między innymi do sprawdzania, czy dwa fragmenty pasują do siebie (reprezentują to samo słowo), i do porównywania podsłów w porządku leksykograficznym. Mnogość zastosowań wynika z istnienia optymalnego rozwiązania w standardowym modelu tekstów (nad alfabetem złożonym z liczb naturalnych wielkości wielomianowej ze względu na n), czyli struktury danych wielkości O(n), która odpowiada na zapytania w czasie O(1) i posiada procedurę konstrukcji w czasie O(n). W rozprawie przedstawiamy pierwszą optymalną strukturę danych dla małych alfabetów rozmiaru σ n: strukturę wielkości O(n/ logσ n) charakteryzującą się czasem zapytania O(1) i czasem konstrukcji O(n/ logσ n) (na podstawie spakowanej reprezentacji tekstu, gdzie każda komórka pamięci zawiera Θ(logσ n) symboli). W dalszej części rozprawy wracamy do tekstów nad alfabetem wielkości wielomianowej ze względu na n i skupiamy się na bardziej złożonych zapytaniach wewnętrznych. Nasza pierwsza struktura danych rozwiązuje problem wewnętrznego wyszukiwania wzorca, czyli zapytań o wystąpienia jednego podsłowa x w obrębie drugiego podsłowa y. Po przetworzeniu tekstu T w czasie O(n), odpowiada ona na takie zapytania w czasie proporcjonalnym do ilorazu |y|/|x| długości podsłów, który jest wymagany ze względu na pesymistyczną złożoność informacji zawartej w wyniku. Tej samej struktury danych używamy także do zapytań o wszystkie okresy wskazanego podsłowa; uzyskany tutaj czas logarytmiczny również jest optymalny. Kolejne struktury danych stworzone zostały do obsługi zapytań o najmniejszy leksykograficznie sufiks i najmniejszą leksykograficznie rotację cykliczną wskazanego podsłowa. Odpowiadamy na nie w czasie O(1) po przetworzeniu tekstu w czasie O(n). Rozważamy także bardziej ogólne pytania o k-ty leksykograficznie sufiks podsłowa (czyli o k-ty elementy tablicy sufiksowej podsłowa) oraz o leksykograficzną rangę wskazanego sufiksu podsłowa (czyli element odwrotnej tablicy sufiksowej). Nasza struktura danych przetwarza zapytania obydwu rodzajów w√ czasie O(log n), zajmuje O(n) komórek pamięci i można ją zbudować w czasie O(n log n). Stworzone w rozprawie narzędzia dają także lepsze od znanych dotychczas rozwiązania problemu kompresji podsłów, czyli zapytań o reprezentację podsłowa według ustalonego algorytmu kompresji; rozważamy w tym kontekście metody oparte na parsowaniu Lempela-Ziva i transformacie Burrowsa-Wheelera. Rezultaty rozprawy łączą narzędzia do przetwarzania tekstów z kombinatoryką słów i najnowszymi osiągnięciami w dziedzinie abstrakcyjnych struktur danych. Naszą kluczową technikę stanowi innowacyjna metoda lokalnie zgodnego łamania symetrii między pozycjami tekstu, która stoi za nowymi rozwiązaniami dla zapytań o długość najdłuższego wspólnego prefiksu i dla wewnętrznego wyszukiwania wzorca. Tytuł rozprawy w języku polskim: Efektywne struktury danych dla zapytań wewnętrznych w tekstach Słowa kluczowe: struktury danych, najdłuższy wspólny prefiks, lokalna zgodność, najmniejszy sufiks, okres, kompresja podsłów, zapytania o podsłowa iii Acknowledgements First, I wish to thank Wojciech Rytter, my advisor, for introducing me to text algorithms and guidance over the course of my undergraduate and graduate studies. I am also indebted to Jakub Radoszewski, who was the first to expose me to research problems and since then has become my closest collaborator. My further gratitude goes to Tomasz Waleń, with whom we have created a successful team at the University of Warsaw. Special thanks must be made to Paweł Gawrychowski, to whom I owe some of the most interesting problems I have been working on, as well as several short research visits in Saarbrücken, Haifa, and Wrocław. I am also obliged to Tatiana Starikovskaya and Maxim Babenko, collaboration with whom on Minimal Suffix Queries inspired most of the directions I followed preparing this thesis. I particularly appreciate numerous conversations with Tatiana and her warm welcome during my short stays in Moscow. Listed above are the coauthors of the papers that this dissertation is based on. However, a few other people have also made a significant impact on its contents. I have greatly benefited from Ely Porat, who has invited me to Bar-Ilan Univer- sity and offered me an early access to his results on local consistency and LCE Queries [24]. I wish to thank Moshe Lewenstein for drawing my attention to Generalized LZ Substring Compression Queries, as well as Djamal Belaz- zougui, Travis Gagie, and Simon J. Puglisi, who suggested applications related to the Burrows–Wheeler transform. I am also grateful to Dominik Kempa for proofreading the previously unpublished Chapter 4 of

Efficient Data Structures for Internal Queries in Texts

The Book Review Column1 by Frederic Green Department of Mathematics and Computer Science Clark University Worcester, MA 02465 Email: [email protected]

Text Searching

{PDF} Jewels of Stringology: Text Algorithms

Stringology Ewels of Stringology This Page Is Intentionally Left Blank .^ X

IMPROVED ALGORITHMS for STRING SEARCHING PROBLEMS Doctoral Dissertation

Faster Recovery of Approximate Periods Over Edit Distance

Proceedings of the Prague Stringology Conference 2014