Topics in combinatorial pattern matching
Vildhøj, Hjalte Wedel
Publication date: 2015
Citation (APA): Vildhøj, H. W. (2015). Topics in combinatorial pattern matching. Technical University of Denmark. DTU Compute PHD-2014 No. 348
TOPICS IN COMBINATORIAL PATTERN MATCHING
Hjalte Wedel Vildhøj
PHD-2014-348
Technical University of Denmark
Department of Applied Mathematics and Computer Science
Richard Petersens Plads, Building 324, 2800 Kongens Lyngby, Denmark
Phone +45 4525 3031
[email protected]
www.compute.dtu.dk
PHD-2014-348 ISSN: 0909-3192

PREFACE
This doctoral dissertation was prepared at the Department of Applied Mathematics and Computer Science at the Technical University of Denmark in partial fulfilment of the requirements for acquiring a doctoral degree. The dissertation presents selected results within the general topic of theoretical computer science and more specifically combinatorial pattern matching. All included results were obtained and published in peer-reviewed conference proceedings or journals during my enrollment as a PhD student from September 2011 to September 2014.
Acknowledgements I am grateful to my advisors Inge Li Gørtz and Philip Bille who have provided excellent guidance and support during my studies and have introduced me to countless interesting problems, papers and people. I also thank my colleagues, office mates and friends Frederik Rye Skjoldjensen, Patrick Hagge Cording, and Søren Vind for making the past three years anything but lonely. I am further grateful to our section head Paul Fischer and the rest of our section for creating a fantastic atmosphere for research. I also owe my thanks to Gadi Landau and Oren Weimann for hosting me during my external stay at Haifa University in Israel, and for broadening my mind about their beautiful country. I also thank Raphaël Clifford and Benjamin Sach for hosting me during a memorable research visit at Bristol University. Moreover, I wish to thank all my co-authors Philip Bille, Søren Bøg, Patrick Hagge Cording, Johannes Fischer, Inge Li Gørtz, Tomasz Kociumaka, Tsvi Kopelowitz, Benjamin Sach, Tatiana Starikovskaya, Morten Stöckel, Søren Vind and David Kofoed Wind. Last but not least, I thank Maja for her love and support.
Hjalte Wedel Vildhøj Copenhagen, September 2014
ABSTRACT
This dissertation studies problems in the general theme of combinatorial pattern matching. More specifically, we study the following topics:
Longest Common Extensions. We revisit the longest common extension (LCE) problem, that is, preprocess a string T into a compact data structure that supports fast LCE queries. An LCE query takes a pair (i, j) of indices in T and returns the length of the longest common prefix of the suffixes of T starting at positions i and j. Such queries are also commonly known as longest common prefix (LCP) queries. We study the time-space trade-offs for the problem, that is, the space used for the data structure vs. the worst-case time for answering an LCE query. Let n be the length of T. Given a parameter τ, 1 ≤ τ ≤ n, we show how to achieve either O(n/√τ) space and O(τ) query time, or O(n/τ) space and O(τ log(|LCE(i, j)|/τ)) query time, where |LCE(i, j)| denotes the length of the LCE returned by the query. These bounds provide the first smooth trade-offs for the LCE problem and almost match the previously known bounds at the extremes when τ = 1 or τ = n. We apply the result to obtain improved bounds for several applications where the LCE problem is the computational bottleneck, including approximate string matching and computing palindromes. We also present an efficient technique to reduce LCE queries on two strings to LCE queries on one string. Finally, we give a lower bound on the time-space product for LCE data structures in the non-uniform cell probe model, showing that our second trade-off is nearly optimal.
Fingerprints in Compressed Strings. The Karp-Rabin fingerprint of a string is a type of hash value that, due to its strong properties, has been used in many string algorithms. We show how to construct a data structure for a string S of size N compressed by a context-free grammar of size n that supports fingerprint queries. That is, given indices i and j, the answer to a query is the fingerprint of the substring S[i, j]. We present the first O(n) space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get O(log N) query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get O(log log N) query time. Hence, our data structures have the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time O(log N log ℓ) and O(log ℓ log log ℓ + log log N) for SLPs and Linear SLPs, respectively. Here, ℓ = |LCE(i, j)| denotes the length of the LCE.

Sparse Text Indexing. We present efficient algorithms for constructing sparse suffix trees, sparse suffix arrays and sparse position heaps for b arbitrary positions of a text T of length n while using only O(b) words of space during the construction. Our main contribution is to show that the sparse suffix tree (and array) can be constructed in O(n log² b) time. To achieve this we develop a technique that allows us to efficiently answer b longest common prefix queries on suffixes of T, using only O(b) space. Our first solution is Monte-Carlo and outputs the correct tree with high probability. We then give a Las-Vegas algorithm which also uses O(b) space and runs in the same time bounds with high probability when b = O(√n). Furthermore, additional trade-offs between the space usage and the construction time for the Monte-Carlo algorithm are given. Finally, we show that at the expense of slower pattern queries, it is possible to construct sparse position heaps in O(n + b log b) time and O(b) space.

The Longest Common Substring Problem. Given m documents of total length n, we consider the problem of finding a longest string common to at least d ≥ 2 of the documents.
This problem is known as the longest common substring (LCS) problem and has a classic O(n) space and O(n) time solution (Weiner [FOCS'73], Hui [CPM'92]). However, the use of linear space is impractical in many applications. We show several time-space trade-offs for this problem. Our main result is that for any trade-off parameter 1 ≤ τ ≤ n, the LCS problem can be solved in O(τ) space and O(n²/τ) time, thus providing the first smooth deterministic time-space trade-off from constant to linear space. The result uses a new and very simple algorithm, which computes a τ-additive approximation to the LCS in O(n²/τ) time and O(1) space. We also show a time-space trade-off lower bound for deterministic branching programs, which implies that any deterministic RAM algorithm solving the LCS problem on documents from a sufficiently large alphabet in O(τ) space must use Ω(n·√(log(n/(τ log n)) / log log(n/(τ log n)))) time.
Structural Properties of Suffix Trees. We study structural and combinatorial properties of suffix trees. Given an unlabeled tree T on n nodes and suffix links of its internal nodes, we ask the question "Is T a suffix tree?", i.e., is there a string S whose suffix tree has the same topological structure as T? We place no restrictions on S; in particular, we do not require that S ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are for example needed to allow efficient updates when suffix trees are built online. We prove that T is a suffix tree if and only if it is realized by a string S of length n − 1, and we give a linear-time algorithm for inferring S when the first letter on each edge is known.
DANISH ABSTRACT
Denne afhandling studerer problemer inden for det generelle område kombinatorisk mønstergenkendelse. Vi studerer følgende emner:
Længste fælles præfiks. Vi vender tilbage til længste-fælles-præfiks-problemet, det vil sige præprocesser en streng T til en kompakt datastruktur, der understøtter hurtige LCE-forespørgsler. En LCE-forespørgsel tager et par (i, j) af positioner i T og returnerer det længste fælles præfiks af de to suffikser, der starter på position i og j i T. Sådanne forespørgsler er også kendt som LCP-forespørgsler. Vi studerer mulige afvejninger af tid og plads for problemet – det vil sige den plads, som datastrukturen anvender, versus den tid, den skal bruge til at svare på en LCE-forespørgsel. Lad n betegne længden af T. Vi viser, at givet en parameter τ, 1 ≤ τ ≤ n, så kan problemet løses i enten O(n/√τ) plads og O(τ) forespørgselstid eller O(n/τ) plads og O(τ log(|LCE(i, j)|/τ)) forespørgselstid, hvor |LCE(i, j)| betegner længden af det længste fælles præfiks, som forespørgslen returnerer. Disse grænser giver de første jævne afvejninger for LCE-problemet og svarer næsten til de kendte grænser ved de to ekstremiteter τ = 1 eller τ = n. Vi bruger dette resultat til at forbedre grænserne for adskillige anvendelser, hvor LCE-forespørgsler er den beregningsmæssige flaskehals, inklusiv approksimativ mønstergenkendelse og beregning af palindromer. Vi viser også en effektiv måde at reducere LCE-forespørgsler på to strenge til én streng. Endelig giver vi en nedre grænse for tidspladsproduktet af LCE-datastrukturer i den ikke-uniforme cell-probe model, der viser, at vores sidste algoritme næsten er optimal.
Fingeraftryk i komprimerede strenge. Karp-Rabin-fingeraftrykket af en streng er en slags hashværdi, der på grund af sine stærke egenskaber har været anvendt i mange strengalgoritmer. Vi viser, hvordan man konstruerer en datastruktur
for en streng S af længde N, der er komprimeret af en kontekstfri grammatik G af størrelse n, og som kan svare på fingeraftryksforespørgsler. For positionerne i og j er svaret på denne forespørgsel fingeraftrykket af delstrengen S[i, j]. Vi giver den første datastruktur, der bruger O(n) plads og svarer på fingeraftryksforespørgsler uden at dekomprimere nogen symboler. For Straight Line Programs (SLP) opnår vi O(log N) forespørgselstid, og for lineære SLP'er (en SLP-afledning, der omfatter LZ78-kompression og dens varianter) opnår vi O(log log N) forespørgselstid. Således har vores datastrukturer samme tids- og pladskompleksitet som for tilfældig adressering i SLP'er. Vi anvender fingeraftryksdatastrukturen til at løse længste-fælles-præfiks-problemet med en forespørgselstid på henholdsvis O(log N log ℓ) og O(log ℓ log log ℓ + log log N) for SLP'er og lineære SLP'er. Her betegner ℓ længden af det længste fælles præfiks.
Tynd tekstindeksering. Vi præsenterer de første effektive algoritmer, der konstruerer tynde suffikstræer, tynde suffikstabeller og tynde positionsdynger for b arbitrære positioner i en tekst T af længde n, alt imens der under hele konstruktionen kun anvendes O(b) plads. Vores største bidrag er at vise, at det tynde suffikstræ (samt tabel) kan konstrueres i O(n log² b) tid. For at opnå dette udvikler vi en ny teknik, der effektivt kan svare på b LCE-forespørgsler på T, mens der kun bruges O(b) plads. Vores første løsning er Monte-Carlo og returnerer med stor sandsynlighed det korrekte træ. Vi giver derefter en Las-Vegas-algoritme, der også bruger O(b) plads og med stor sandsynlighed har samme tidsgrænse, så længe b = O(√n). Endvidere viser vi nogle yderligere tidspladsafvejninger for Monte-Carlo-algoritmen, og til sidst viser vi, hvordan det med langsommere mønsterforespørgsler er muligt at konstruere tynde positionsdynger i O(n + b log b) tid og O(b) plads.

Længste fælles delstrenge. Vi studerer følgende problem: Givet m dokumenter af samlet længde n, find den længste fælles delstreng, der optræder i mindst d ≥ 2 af dokumenterne. Dette problem bliver kaldt længste-fælles-delstrengsproblemet (LCS-problemet) og har en klassisk O(n) plads og O(n) tids løsning (Weiner [FOCS'73], Hui [CPM'92]). Imidlertid kan forbruget af lineær plads være upraktisk i mange anvendelser. Vi viser flere tidspladsafvejninger for dette problem. Vores hovedbidrag er, at for en vilkårlig afvejningsparameter 1 ≤ τ ≤ n, så kan LCS-problemet blive løst i O(τ) plads og O(n²/τ) tid, hvilket giver den første jævne deterministiske tidspladsafvejning helt fra konstant til lineær plads. Resultatet gør brug af en ny og meget simpel algoritme, der beregner en τ-additiv approksimation til den længste fælles delstreng i O(n²/τ) tid og O(1) plads.
Vi viser også en nedre grænse for tidspladsafvejninger, der medfører, at alle deterministiske RAM-algoritmer, der løser LCS-problemet på dokumenter fra et tilstrækkeligt stort alfabet med O(τ) plads, nødvendigvis må anvende Ω(n·√(log(n/(τ log n)) / log log(n/(τ log n)))) tid.
Strukturelle egenskaber af suffikstræer. Vi studerer strukturelle og kombinatoriske egenskaber af suffikstræer. For et umærket træ T på n knuder med suffikspegere på dets interne knuder stiller vi spørgsmålet: "Er T et suffikstræ?", det vil sige, findes der en streng, hvis suffikstræ har den samme topologiske struktur som T? Vi stiller ingen krav til strengen S, og specifikt antager vi ikke, at S ender med et unikt symbol. Dette svarer til at betragte den mere generelle definition af implicitte eller udvidede suffikstræer. Disse generelle suffikstræer har mange anvendelser og er for eksempel nødvendige for at opnå hurtige opdateringer, når suffikstræer bygges online. Vi beviser, at T er et suffikstræ, hvis og kun hvis det kan realiseres af en streng af længde n − 1, og vi giver en algoritme, der i lineær tid kan udlede S, når det første symbol på hver kant kendes.
CONTENTS
Preface
Abstract
Danish Abstract
Contents
1 Introduction
  1.1 Overview and Outline
    1.1.1 Additional Publications
  1.2 Model of Computation
  1.3 Fundamental Techniques
    1.3.1 Karp-Rabin Fingerprints
    1.3.2 Suffix Trees
  1.4 On Chapter 2: Time-Space Trade-Offs for Longest Common Extensions
    1.4.1 Our Contributions
    1.4.2 Future Directions
  1.5 On Chapter 3: Fingerprints in Compressed Strings
    1.5.1 Our Contributions
    1.5.2 Future Directions
  1.6 On Chapter 4: Sparse Text Indexing in Small Space
    1.6.1 Our Contributions
    1.6.2 Future Directions
  1.7 On Chapters 5 & 6: The Longest Common Substring Problem
    1.7.1 Our Contributions
    1.7.2 Future Directions
  1.8 On Chapter 7: A Suffix Tree or Not A Suffix Tree?
    1.8.1 Our Contributions
    1.8.2 Future Directions
2 Time-Space Trade-Offs for Longest Common Extensions
  2.1 Introduction
    2.1.1 Our Results
    2.1.2 Techniques
    2.1.3 Applications
  2.2 The Deterministic Data Structure
    2.2.1 Difference Covers
    2.2.2 The Data Structure
  2.3 The Monte-Carlo Data Structure
    2.3.1 Rabin-Karp fingerprints
    2.3.2 The Data Structure
  2.4 The Las-Vegas Data Structure
    2.4.1 The Algorithm
  2.5 Longest Common Extensions on Two Strings
    2.5.1 The Data Structure
  2.6 Lower Bound
  2.7 Conclusions and Open Problems
3 Fingerprints in Compressed Strings
  3.1 Introduction
    3.1.1 Longest common extension in compressed strings
  3.2 Preliminaries
    3.2.1 Fingerprinting
  3.3 Basic fingerprint queries in SLPs
  3.4 Faster fingerprints in SLPs
  3.5 Faster fingerprints in Linear SLPs
  3.6 Finger fingerprints in Linear SLPs
    3.6.1 Finger Predecessor
    3.6.2 Finger Fingerprints
  3.7 Longest Common Extensions in Compressed Strings
    3.7.1 Computing Longest Common Extensions with Fingerprints
    3.7.2 Verifying the Fingerprint Function
4 Sparse Text Indexing in Small Space
  4.1 Introduction
    4.1.1 Our Results
  4.2 Preliminaries
  4.3 Batched LCP Queries
    4.3.1 The Algorithm
    4.3.2 Runtime and Correctness
  4.4 Constructing the Sparse Suffix Tree
  4.5 Verifying the Sparse Suffix and LCP Arrays
    4.5.1 Proof of Lemma 4.4
  4.6 Time-Space Tradeoffs for Batched LCP Queries
  4.7 Sparse Position Heaps
    4.7.1 Position Heaps
    4.7.2 A Monte-Carlo Construction Algorithm
  4.8 Conclusions
5 Time-Space Trade-Offs for the Longest Common Substring Problem
  5.1 Introduction
    5.1.1 Known Solutions
    5.1.2 Our Results
  5.2 Preliminaries
    5.2.1 Suffix Trees
    5.2.2 Difference Cover Sparse Suffix Arrays
  5.3 Longest Common Substring of Two Strings
    5.3.1 A Solution for Long LCS
    5.3.2 A Solution for Short LCS
  5.4 Longest Common Substring of Multiple Strings
    5.4.1 A General Solution for Long LCS
    5.4.2 A General Solution for Short LCS
  5.5 Open Problems
6 Sublinear Space Algorithms for the Longest Common Substring Problem
  6.1 Introduction
    6.1.1 Our Results
  6.2 Upper Bounds
    6.2.1 Approximating LCS in Constant Space
    6.2.2 An O(τ)-Space and O(n²/τ)-Time Solution
    6.2.3 Large alphabets
  6.3 A Time-Space Trade-Off Lower Bound
  6.4 Conclusions
7 A Suffix Tree Or Not A Suffix Tree?
  7.1 Introduction
    7.1.1 Our Results
    7.1.2 Related Work
  7.2 Suffix Trees
  7.3 The Suffix Tour Graph
    7.3.1 Suffix tour graph of a $-suffix tree
    7.3.2 Suffix tour graph of a suffix tree
  7.4 A Suffix Tree Decision Algorithm
  7.5 Conclusion and Open Problems
Bibliography

CHAPTER 1
INTRODUCTION
Over the years the expression combinatorial pattern matching has become a synonym for the field of theoretical computer science concerned with the study of combinatorial algorithms on strings and related structures. The term combinatorial emphasizes that these are algorithms based on mathematical properties and a deep understanding of the individual problems, in contrast to, e.g., statistical or machine learning approaches, where general frameworks are often applied to model and solve the problems. Work in this field began in the 1960s with the study of how to efficiently find all occurrences of a pattern string P in a text T. The seminal work by Knuth, Morris, Pratt, Boyer, Moore, Weiner and many others through the 1970s showed that this problem is solvable in linear time in the lengths of P and T, and started several decades of research on algorithms and data structures for strings. Today the field has matured and we have come far in our understanding of its fundamental problems, but with the ever-increasing amount of digitized textual information, the study of efficient and theoretically well-founded algorithms for strings remains more relevant than ever.

In this dissertation we study several different, but fundamental problems on strings. A common theme in our work is the design of time-space trade-offs. In practical situations, space can be a more precious resource than time. Prominent examples include embedded devices with small amounts of writable memory, and data structures with a space requirement that exceeds the capacity of the fast memory. Under such circumstances we are interested in algorithms that allow their space complexity to be reduced at the cost of increasing their
running time. From a purely theoretical perspective it is also intriguing why some problems allow time-space trade-offs and others do not. To highlight a specific example from our work, consider the problem of finding the longest common substring of two documents consisting of a total of n characters. Solving this problem efficiently is relevant in, for instance, plagiarism detection, and algorithms using O(n) time and O(n) space have been known since 1973. In our work we provide the first time-space trade-off, which implies that the problem can be solved in O(n^(1+ε)) time and O(n^(1−ε)) space for any choice of ε ∈ [0, 1]. For ε = 0 this captures the known linear time and space solution, and at the other extreme it provides an algorithm that solves the problem in constant space and quadratic time.
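For concreteness, a textbook dynamic-programming baseline (illustrative only; this is not the trade-off algorithm of this dissertation, and the function name is our own) finds a longest common substring of two strings in quadratic time, keeping a single O(n)-size table row:

```python
def longest_common_substring(a: str, b: str) -> str:
    """Textbook DP: cur[j] = length of the longest common suffix of a[:i] and b[:j]."""
    best_len, best_end = 0, 0          # length and end position (in a) of the best match
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]
```

This sketch uses O(n²) time and O(n) working space; the point of the trade-off above is that the space can be driven all the way down to O(1) while the time grows only to O(n²).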
1.1 Overview and Outline
In addition to this general introduction, the dissertation consists of the following papers, which have all been written and published (or accepted for publication) during my PhD studies from 2011-2014.
Chapter 2 Time-Space Trade-Offs for Longest Common Extensions. Philip Bille, Inge Li Gørtz, Benjamin Sach and Hjalte Wedel Vildhøj. In Journal of Discrete Algorithms (2013). An extended abstract of this paper appeared in the proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching.
Chapter 3 Fingerprints in Compressed Strings. Philip Bille, Patrick Hagge Cording, Inge Li Gørtz, Benjamin Sach, Hjalte Wedel Vildhøj and Søren Vind. In proceedings of the 13th Algorithms And Data Structures Symposium (WADS 2013).
Chapter 4 Sparse Suffix Tree Construction in Small Space. Philip Bille, Johannes Fischer, Inge Li Gørtz, Tsvi Kopelowitz and Benjamin Sach and Hjalte Wedel Vildhøj. In proceedings of the 40th International Colloquium on Automata, Languages and Programming (ICALP 2013).
Chapter 5 Time-Space Trade-Offs for the Longest Common Substring Problem. Tatiana Starikovskaya and Hjalte Wedel Vildhøj. In proceedings of the 24th Annual Symposium on Combinatorial Pattern Matching (CPM 2013).
Chapter 6 Sublinear Space Algorithms for the Longest Common Substring Problem.
Tomasz Kociumaka, Tatiana Starikovskaya and Hjalte Wedel Vildhøj. In proceedings of the 22nd European Symposium on Algorithms (ESA 2014).

Chapter 7 A Suffix Tree or Not A Suffix Tree? Tatiana Starikovskaya and Hjalte Wedel Vildhøj. In proceedings of the 25th International Workshop on Combinatorial Algorithms (IWOCA 2014).
With minor exceptions, the papers appear in their original published form. As a consequence, notation, terminology and language are not always consistent across chapters. Some of the conference papers have been revised or extended and are currently in submission for a journal. The updated versions in this dissertation can therefore differ slightly from the published versions. The titles of the papers have not been changed in the revised versions, with the exception of Sparse Suffix Tree Construction in Small Space, which in this dissertation has the title Sparse Text Indexing in Small Space.

The remaining part of this chapter describes some important concepts common to many of the above papers, and in turn introduces the problems and contributions of each paper. The introduction to each paper establishes a broader context of our work and summarizes the most important results, techniques and ideas. We conclude the introduction of each paper by discussing problems left open by our work, very recent progress, and future directions of research.
1.1.1 Additional Publications

In addition to the above papers I have published the following papers during my PhD, which are not part of this dissertation.

String Matching with Variable Length Gaps. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj and David Kofoed Wind. Theoretical Computer Science (2012).

String Indexing for Patterns with Wildcards. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj and Søren Vind. Theory of Computing Systems (2013).

The Hardness of the Functional Orientation 2-Color Problem. Søren Bøg, Morten Stöckel and Hjalte Wedel Vildhøj. The Australasian Journal of Combinatorics vol. 56 (2013).

The first two papers contain results partially obtained prior to my PhD studies and are thus omitted for formal reasons. The third paper was written during my PhD, but falls outside the theme of combinatorial pattern matching.
1.2 Model of Computation
Unless otherwise noted, our algorithms are designed for and analyzed in the word-RAM model [76]. This theoretical model of computation is an abstraction of any real-world computing unit based on a processor and a random access memory. We briefly summarize the most important concepts of this model. In the word-RAM model computation is performed on a random access machine with access to an unlimited number of memory registers, or cells, each capable of storing a w-bit integer, which we refer to as a word. The parameter w is called the word size, and we adopt the standard assumption that w ≥ log n, where n is the number of cells required to store the input to our algorithm. Under this assumption, a word can hold a pointer (or address) to any input cell. Moreover, since all our algorithms and data structures use at most n^c cells for some constant c, accessing any relevant cell can be done in O(1) time. The machine can perform basic arithmetic operations on words including addition, subtraction, multiplication, division, comparisons and standard bitwise operations in unit time, and these operations are allowed to compute addresses of other cells. The time used by an algorithm is the total number of unit operations it performs. The space used is the number of distinct cells the algorithm writes to during its operation. We assume that the input cells are available in read-only memory, and we emphasize that the input cells are not counted in the space used by the algorithm. The inputs to many of our algorithms are strings, i.e., sequences of characters from some alphabet. We will generally assume that the size of the alphabet is at most polynomial in the length of the input, so any character can be stored in O(1) words.

1.3 Fundamental Techniques
In the following two sections we introduce the important concepts of Karp-Rabin fingerprints and suffix trees, which appear as core techniques in much of our work.
1.3.1 Karp-Rabin Fingerprints

The task of comparing substrings of some string T for equality is often a bottleneck in algorithms on strings. When a large number of substrings is to be tested for equality, comparing them character by character is very expensive. To speed up such algorithms we can use randomization and compare the hash
values φ(S1) and φ(S2) instead of comparing the substrings S1 and S2 directly. For this to work, we need a hash function φ that with high probability guarantees that φ(S1) = φ(S2) if and only if S1 = S2. If S1 = S2 then we also have that φ(S1) = φ(S2), but it can happen that φ(S1) = φ(S2) even though S1 ≠ S2. In this case we say that φ has a collision on S1 and S2. That φ is collision free guarantees correctness of the computation. To actually gain a speedup when comparing many pairs of strings, we need to be able to compute φ(S1) and φ(S2) without examining the individual symbols in S1 and S2 one by one.

The Karp-Rabin fingerprinting function [97] provides both of these properties. It maps arbitrary strings to integer hash values, which we call fingerprints. More specifically, if we need to compare substrings of a string of length n and we want φ to be collision free with probability at least 1 − 1/n^c, for an arbitrary constant c, then the Karp-Rabin fingerprinting function φ can be defined as follows,

    φ(S) = ( ∑_{i=1}^{|S|} S[i] · b^i ) mod p ,

where p is an arbitrary prime in the range [2n^(c+4), 4n^(c+4)], and b is chosen uniformly at random in Z_p. Note that the upper bound on p ensures that a fingerprint fits in a constant number of machine words. The lower bound ensures that Z_p is large enough that the probability of a collision for any fixed substring pair is upper bounded by n/p.¹ Consequently, a union bound over all Θ(n³) substring pairs shows that φ is collision free with probability at least 1 − 1/n^c.

The fingerprint φ(S1) can be computed in O(|S1|) time by standard modular exponentiation. However, the crucial property of the Karp-Rabin fingerprinting function is that fingerprints can be composed efficiently from other fingerprints in constant time, thereby eliminating the need to explicitly compute some fingerprints.
As an example, suppose we have computed φ(S1) and φ(S2); then we can compute the fingerprint of the concatenation of S1 and S2, i.e., φ(S1S2), in constant time as follows:

    φ(S1S2) = ( φ(S1) + φ(S2) · b^|S1| ) mod p .
To perform this computation in constant time, we need the number b^|S1| mod p, which we will assume is always stored together with the fingerprint φ(S1).
¹ This probability bound follows easily from well-known properties of abstract algebra.
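As a small illustration of this convention, the following sketch (hypothetical helper names fp and concat; the fixed prime P = 2^61 − 1 merely stands in for a prime sampled from [2n^(c+4), 4n^(c+4)]) carries each fingerprint as the pair (φ(S), b^|S| mod p), which makes concatenation a constant-time operation:

```python
import random

P = 2**61 - 1               # illustrative fixed prime; the text samples p from [2n^(c+4), 4n^(c+4)]
B = random.randrange(1, P)  # b chosen uniformly at random in Z_p

def fp(s: str):
    """Return (phi(s), b^|s| mod p), storing the exponent alongside the fingerprint."""
    h, bp = 0, 1
    for ch in s:
        bp = bp * B % P              # bp = b^i mod p
        h = (h + ord(ch) * bp) % P   # phi accumulates S[i] * b^i
    return h, bp

def concat(f1, f2):
    """Compose phi(S1 S2) = (phi(S1) + phi(S2) * b^|S1|) mod p in O(1) time."""
    (h1, e1), (h2, e2) = f1, f2
    return (h1 + h2 * e1) % P, e1 * e2 % P
```

Here e1 · e2 mod p is exactly the stored exponent b^(|S1|+|S2|) mod p of the concatenation.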
Note that this assumption is without loss of generality, since in particular, we can obtain the exponent b^(|S1|+|S2|) mod p in constant time from b^|S1| mod p and b^|S2| mod p. This important composition property of the Karp-Rabin fingerprint function is what allows us to speed up algorithms over the naive approach of comparing substrings character by character.

As an example, consider the exact pattern matching problem, in which we want to report the occurrences of a pattern string P of length m in a text T of length n. Karp and Rabin introduced fingerprints [97] as a means to efficiently solve this problem. The idea is to compare the fingerprint φ(P) to the fingerprints of all substrings of T of length m. Evaluating the fingerprints of these n − m + 1 substrings of T by directly applying the definition of φ would lead to an O(nm) time algorithm, similar to the naive approach. But by exploiting the composition property, we can obtain an O(n + m) time and constant space algorithm. The trick is to use φ(T[i..i + m − 1]) to compute φ(T[i + 1..i + m]) in constant time, which implies that in O(n) time, we can compute all the relevant fingerprints by sliding a window of length m over T. This technique is commonly known as a sliding window, and hash functions allowing this technique are also known as rolling hash functions.

Algorithms that use Karp-Rabin fingerprints are Monte Carlo, meaning that there is a small probability that they encounter a collision and consequently output an incorrect answer. Even though this error probability can be made arbitrarily small, we sometimes wish to obtain algorithms that output the correct answer with certainty. To do so, we typically design a deterministic verification algorithm, which can check the correctness of the result. If the output is incorrect, we pick a new random number b ∈ Z_p for use in φ, run the algorithm again, and so on. The resulting algorithm is called a Las Vegas algorithm and always outputs the correct answer.
Let t_a(n) and s_a(n) denote the time and space used by the Monte Carlo algorithm, and similarly, let t_b(n) and s_b(n) be the time and space used by the verifier. The Las Vegas algorithm then runs in O(t_a(n) + t_b(n)) time with high probability, and uses O(s_a(n) + s_b(n)) space. Obtaining Las Vegas algorithms typically comes at the cost of increasing the time or space complexity, since typically t_b(n) = ω(t_a(n)) or s_b(n) = ω(s_a(n)). For example, we can design a generic verifier with t_b(n) = O(n²) and s_b(n) = O(n) by checking all Θ(n³) substring pairs in T for collisions using a hash table. However, in most applications, this is too slow. Instead we typically exploit problem-specific properties, or the fact that not all Θ(n³) substring pairs can be compared, to design better verifiers.
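The sliding-window matcher can be sketched as follows (a Monte Carlo sketch under illustrative choices: the fixed Mersenne prime P = 2^61 − 1 stands in for a sampled prime, and no verification step is included, so reported positions are correct only with high probability):

```python
import random

P = 2**61 - 1               # illustrative fixed prime standing in for p
B = random.randrange(1, P)  # b chosen uniformly at random in Z_p

def fingerprint(s: str) -> int:
    """phi(S) = (sum_{i=1}^{|S|} S[i] * b^i) mod p, computed in O(|S|) time."""
    h, bp = 0, 1
    for ch in s:
        bp = bp * B % P
        h = (h + ord(ch) * bp) % P
    return h

def karp_rabin_matches(pattern: str, text: str) -> list:
    """Report all i with text[i:i+m] == pattern (whp), in O(n + m) total time."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    target = fingerprint(pattern)
    b_inv = pow(B, P - 2, P)     # modular inverse of b; exists since P is prime
    b_m = pow(B, m, P)
    h = fingerprint(text[:m])    # fingerprint of the first window
    hits = [0] if h == target else []
    for i in range(1, n - m + 1):
        # slide: drop text[i-1] (weight b), shift all weights down by one, add text[i+m-1] (weight b^m)
        h = (h - ord(text[i - 1]) * B) % P
        h = h * b_inv % P
        h = (h + ord(text[i + m - 1]) * b_m) % P
        if h == target:
            hits.append(i)
    return hits
```

Each window update is a constant number of modular operations, which is exactly the rolling-hash property described above.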
1.3.2 Suffix Trees

A trie is a data structure that stores a set of strings S from an alphabet Σ in an ordered, rooted tree T where each edge is labeled with a character from Σ. Sibling edges must be labeled with distinct characters, and sorted according to the lexicographic ordering of their labels. Each string x ∈ S is stored in T as a path starting from the root and ending in a node v, i.e., x = str(v), where str(v) denotes the string obtained by concatenating the labels on the path from the root to v. The leaves of T must all correspond to strings in S. A compacted trie is a trie in which all nodes with a single child have been removed by joining their parent edge with their child edge. The resulting edge is labeled by the concatenation of the parent and child edge labels. It is easy to verify that for an arbitrary set of strings S, both the trie and the compacted trie on S are uniquely defined. Given a string S, the suffix tree of S is the compacted trie on the set of all suffixes of S, i.e., S = {S[1..n], S[2..n], ..., S[n]}. Figure 1.1(a) shows the suffix tree for the string S = acacbacbacc. In most applications, we append a unique character $ to S before constructing the suffix tree. This ensures a one-to-one correspondence between the leaves in the suffix tree and the suffixes of S$, and is also required by some construction algorithms. See Figure 1.1(b). We refer to nodes in the suffix tree as explicit nodes, and we use implicit nodes to refer to locations on edges corresponding to nodes only appearing in the associated uncompacted suffix trie. Nodes that are labeled by suffixes of S are called suffix nodes, and can be either implicit or explicit. If the suffix tree is built for a string ending with $ ∉ Σ, the suffix nodes are precisely the leaves. In Figure 1.1 the suffix nodes have been numbered according to the suffix they represent. The internal explicit nodes in a suffix tree are often annotated with suffix links.
The suffix link of a node v labeled by the string x = str(v) is a pointer from v to the node labeled by the string x[2..|x|]. The suffix links are shown as dotted lines in Figure 1.1. It is a well-known property that the suffix link of an internal explicit node always points to another internal explicit node [147]. The suffix tree has O(n) explicit nodes, and can be stored in O(n) space if the edge labels are represented as pointers to substrings of S.
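As a small illustration of why compaction matters, the following sketch builds the uncompacted trie of all suffixes as nested dictionaries. Such a trie can have Θ(n²) nodes, whereas the compacted suffix tree always has O(n) explicit nodes.

```python
def suffix_trie(s):
    """Build the (uncompacted) trie of all suffixes of s as nested dicts.
    Node count can be quadratic in len(s), which is what compaction avoids."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def count_nodes(node):
    """Total number of trie nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in node.values())
```

For example, a string with all characters distinct of length n produces n(n + 1)/2 + 1 trie nodes, while a highly repetitive string like "aaa" shares a single path.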
History

We briefly summarize some important historical developments. For a more detailed account, we refer to [14] and references therein.
(a) The suffix tree of acacbacbacc. (b) The suffix tree of acacbacbacc$.
Figure 1.1: Examples of suffix trees and suffix links.
The suffix tree was introduced by Weiner in 1973 [147], who showed how to construct it in O(n) time for a string S of length n from a constant size alphabet. Weiner's algorithm constructed the suffix tree by inserting the suffixes of S from right to left. In 1976 McCreight [123] gave an algorithm that inserted suffixes from left to right. Weiner and McCreight's algorithms were both offline algorithms in the sense that they required the complete input string S before they could start. In 1995 Ukkonen [144] gave an online algorithm that maintained the suffix tree of increasing prefixes of S, thereby constructing the suffix tree for S in O(n) time. However, Weiner, McCreight and Ukkonen's algorithms were all linear time only in the case of a constant size alphabet, and for general alphabets they required O(n log n) time. In 1997 Farach [54] showed how to construct the suffix tree in O(n) time for polynomial sized alphabets. Contrary to the previous construction algorithms, Farach's algorithm used a divide-and-conquer approach by first constructing suffix trees restricted to the odd and even positions of S, before merging them in linear time. This approach generalizes to constructing the suffix tree in sorting complexity in other models of computation as well [56].
Applications

The applications of the suffix tree are far too many to list here. Instead, we provide an overview of the common techniques that are most important to our work. See [44, 46, 73] for examples of the many uses of suffix trees. At the fundamental level, the suffix tree of S provides a linear space index of the substrings of S. We usually assume that each explicit node in the suffix tree stores its outgoing edges in a perfect hash table [61] using the first character on the edge as the key. Given a pattern string P of length m, this allows us to search or traverse the suffix tree for P in O(m) time, and implies that we can report all occ substrings of S that match P in O(m + occ) time. In most applications suffix trees are combined with other data structures. Very often, we preprocess the suffix tree in linear time and space to support constant time nearest common ancestor (NCA) queries², also known as lowest common ancestor queries [78]. Such a query takes two explicit nodes u and v and returns the deepest common ancestor of u and v. This provides an efficient way of computing the longest common prefix between any two suffixes of S, which is a fundamental primitive in many string algorithms. Specifically, it also provides a deterministic (although space consuming) alternative to Karp-Rabin fingerprints, since it allows us to compare substrings of S for equality in O(1) time. Level and weighted ancestor queries are two other widely used primitives on suffix trees. A level ancestor query takes an explicit node v and an integer i and returns the ith ancestor of v. After O(n) time and space preprocessing, level ancestor queries can be supported in constant time [5, 22, 24, 48]. A weighted ancestor query takes the same input, but returns the (possibly implicit) node in the suffix tree corresponding to the level ancestor of v in the uncompacted suffix trie.
Weighted ancestor queries can be defined for arbitrary edge weighted rooted trees, and in that general case it is known that any data structure for weighted ancestor queries using O(n polylog(n)) space must have Ω(log log n) query time. However, for suffix trees, Gawrychowski et al. [69] very recently showed that weighted ancestor queries can be supported in constant time after O(n) time and space preprocessing. The suffix tree is also very often combined with range reporting data structures. Without going into details, the longest common prefix of two suffixes can also be computed in constant time as a one-dimensional range minimum query [63] on the LCP array with the help of the suffix array [94, 121]. In many applications suffix trees are also used in combination with 2D range reporting data structures. See [111] by Lewenstein for a recent comprehensive survey.
²Also sometimes known as lowest common ancestor, or LCA queries, in the literature.
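The suffix-array alternative mentioned above can be sketched as follows. The construction here is deliberately naive, and `min()` over a slice stands in for a constant-time range minimum structure; only the reduction from longest-common-prefix queries to range minima on the LCP array is the point.

```python
def suffix_array(s):
    """Naive O(n^2 log n) suffix array: indices sorted by suffix."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """lcp[k] = longest common prefix of suffixes sa[k-1] and sa[k]."""
    def lcp(i, j):
        k = 0
        while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
            k += 1
        return k
    return [0] + [lcp(sa[k - 1], sa[k]) for k in range(1, len(sa))]

def lcp_query(s, sa, lcp, rank, i, j):
    """LCP of suffixes i and j as a range minimum over the LCP array.
    min() over the slice stands in for a constant-time RMQ structure."""
    if i == j:
        return len(s) - i
    a, b = sorted((rank[i], rank[j]))
    return min(lcp[a + 1:b + 1])
```

With a linear-time suffix array construction and an O(1) range minimum structure, this yields the constant-time LCP queries mentioned above.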
1.4 On Chapter 2: Time-Space Trade-Offs for Longest Common Extensions
In this chapter we study the longest common extension (LCE) problem. This is the problem of constructing a data structure for a string T of length n that supports LCE queries. Such a query takes a pair (i, j) of indices into T and returns the length of the longest common prefix of the ith and jth suffix of T. We denote this length by |LCE(i, j)|. LCE queries are also commonly known as LCP (longest common prefix) queries and they are used as a fundamental primitive in a wide range of string matching algorithms. For example, Landau and Vishkin [109] showed that the approximate string matching problem can be solved efficiently using LCE queries. More specifically, this is the problem of finding all approximate occurrences of some pattern P in T. Here an approximate occurrence of P in T is a substring of T that is within edit or Hamming distance k of P. Examples of other string matching algorithms that directly use LCE queries include algorithms for finding palindromes and tandem repeats [74, 108, 117]. Motivated by these important applications, we study space-efficient solutions for the LCE problem. That is, we are interested in obtaining time-space trade-offs between the space usage of the data structure and the query time. There are two simple and well-known solutions to this problem. At one extreme we can construct a data structure that uses linear space and answers queries in constant time by storing the suffix tree combined with a nearest common ancestor data structure. At the other extreme, we can answer queries on the fly by sequentially comparing characters until we encounter a mismatch. This results in an O(1) space data structure with query time O(|LCE(i, j)|), which is Ω(n) in the worst case.
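The constant-space extreme of the trade-off fits in a few lines:

```python
def lce(t, i, j):
    """Length of the longest common prefix of t[i:] and t[j:].
    O(1) extra space; O(|LCE(i, j)|) time, i.e., Omega(n) in the worst case."""
    k = 0
    while i + k < len(t) and j + k < len(t) and t[i + k] == t[j + k]:
        k += 1
    return k
```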
1.4.1 Our Contributions

We show that it is possible to obtain an almost smooth trade-off between these two extreme solutions. We present two different data structures, both parameterized by a trade-off parameter τ ∈ [1, n]. The first solution is a deterministic data structure that uses O(n/√τ) space and has query time O(τ). Here the main idea is to store a sparse sample S of O(n/√τ) suffixes of T in a data structure supporting constant time LCE queries. We use a combinatorial construction known as difference covers to choose the sample S in a way that guarantees that for any pair of indices i, j in T, there exists some integer δ < τ such that S contains both the suffix of T starting at position i + δ and the one starting at position j + δ. This implies that queries can be answered in O(τ) time.
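A minimal sketch of the difference cover idea follows. The construction below assumes τ is a perfect square for simplicity (real constructions, e.g. Colbourn and Ling's, handle any τ and are smaller); it yields a set D ⊆ Z_τ of size about 2√τ such that for any pair of positions i, j some shift δ < τ lands both positions on sampled residues, which is exactly the property used to answer queries in O(τ) time.

```python
import math

def difference_cover(tau):
    """A simple difference cover of Z_tau of size ~2*sqrt(tau).
    Sketch: assumes tau is a perfect square."""
    r = math.isqrt(tau)
    assert r * r == tau, "sketch assumes tau is a perfect square"
    return set(range(1, r + 1)) | {(i * r) % tau for i in range(r)}

def shift_into_cover(D, tau, i, j):
    """Return some delta < tau with (i+delta) % tau in D and (j+delta) % tau in D."""
    for delta in range(tau):
        if (i + delta) % tau in D and (j + delta) % tau in D:
            return delta
    return None  # never happens when D is a difference cover of Z_tau
```

Sampling the suffixes at positions p with p mod τ ∈ D gives the O(n/√τ)-size sample; a query scans at most δ < τ characters and then falls back on a constant-time LCE query within the sample.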
Our second solution is a randomized data structure, which uses O(n/τ) space and supports LCE queries in O(τ log(|LCE(i, j)|/τ)) time. The data structure can be constructed in O(n) time, and with high probability³ it answers all LCE queries on T correctly. We also give a Las Vegas version of the data structure that with certainty answers all queries correctly and with high probability meets the preprocessing time bound of O(n log n). The main idea is to store O(n/τ) Karp-Rabin fingerprints, and use these to answer an LCE query by comparing O(log(|LCE(i, j)|/τ)) substrings of T in an exponential search. We demonstrate how these new trade-offs for longest common extensions imply time-space trade-offs for approximate string matching and the problems of finding palindromes or tandem repeats. In particular we obtain new sublinear space solutions for the approximate string matching problem. We also give a lower bound on the time-space product of LCE data structures in the non-uniform cell-probe model. More precisely, we show that any LCE data structure using O(n/τ) bits of space in addition to the string T must use at least Ω(τ) time to answer an LCE query. We obtain this bound by showing that any LCE data structure can be used to answer range minimum queries on a binary array A in the same time and space bounds.
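The exponential-search idea can be sketched as follows. For simplicity the sketch stores a fingerprint for every prefix, i.e., O(n) space rather than the O(n/τ) of the actual data structure, and the base is a fixed illustrative constant; the point is that only O(log |LCE(i, j)|) substring comparisons, each a constant-time fingerprint comparison, are needed.

```python
MOD = (1 << 61) - 1
B = 31  # illustrative; chosen at random in practice

def prefix_fps(t):
    """fps[k] = fingerprint of t[:k]; pw[k] = B^k (both mod MOD)."""
    fps, pw = [0], [1]
    for ch in t:
        fps.append((fps[-1] * B + ord(ch)) % MOD)
        pw.append(pw[-1] * B % MOD)
    return fps, pw

def substr_fp(fps, pw, i, k):
    """Fingerprint of t[i:i+k], by the composition property."""
    return (fps[i + k] - fps[i] * pw[k]) % MOD

def lce_fp(t, fps, pw, i, j):
    """Exponential search: double the tested length while the substrings'
    fingerprints agree, then binary search for the exact LCE length.
    Correct unless a fingerprint collision occurs."""
    n = len(t)
    hi = 1
    while max(i, j) + hi <= n and substr_fp(fps, pw, i, hi) == substr_fp(fps, pw, j, hi):
        hi *= 2
    lo, hi = hi // 2, min(hi, n - max(i, j))
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if substr_fp(fps, pw, i, mid) == substr_fp(fps, pw, j, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```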
1.4.2 Future Directions

There is a significant gap between the time-space products of our deterministic and randomized data structures. At a high level this gap can be explained in part by the fact that difference covers of Z_τ need to have size Ω(√τ), which often limits their practicality in algorithm design. On the other hand, our randomized data structure demonstrates that it is possible to obtain trade-offs with a time-space product that almost matches the Ω(n/log n) lower bound. Consequently, an obvious focus of future research on this problem would be on improving our deterministic trade-off using new techniques, and ideally obtaining a clean O(n/τ) space and O(τ) time trade-off.
1.5 On Chapter 3: Fingerprints in Compressed Strings
³With probability at least 1 − 1/n^c for any constant c.

The enormous volume of digitized textual information of today makes it increasingly important to be able to store text data in a compressed form while still being able to answer questions about the underlying text. This challenge has resulted in a large body of research concerned with designing algorithms that work directly on the compressed representation of a string. Such algorithms not only save space by avoiding decompression, they also have the potential to solve the problems exponentially faster than algorithms that operate on the uncompressed string. One of the central problems in this general area is that of finding a pattern P in a compressed text. This problem is known as compressed pattern matching and was originally introduced by Amir and Benson [6] with the study of pattern matching in two-dimensional run-length encoded documents. Subsequently, algorithms for pattern matching in strings compressed by many popular schemes have been invented [65, 100, 114, 120, 129]. For the Lempel-Ziv family, Amir et al. [7] gave an algorithm for LZW compressed strings, and later Farach and Thorup [55] gave a compressed pattern matching algorithm for LZ77. Recently, these results were improved by Gawrychowski [66, 67]. Compressed pattern matching has also been studied for grammar compressed strings [98, 113, 114, 126] and for fully compressed pattern matching, where the pattern P is also given in a compressed form [79, 90, 112, 138]. Furthermore, much recent work has been devoted to solutions for approximate compressed pattern matching, which asks to find approximate occurrences of P in the compressed string [25, 47, 93, 118, 130, 140, 142]. The focus of our work in Chapter 3 is on building strong primitives for use in algorithms on grammar compressed strings. Grammar compression is a widely-studied and general compression scheme that represents a string S of length N as a context-free grammar G of size n that exactly produces S. For highly compressible strings the size of G can be exponentially smaller than N.
Grammar compression provides a powerful paradigm that with little or no overhead captures several popular compression schemes including run-length encoding, the Lempel-Ziv schemes LZ77, LZ78 and LZW [139, 148,150, 151] and numerous others [15, 101, 110, 114, 132].
1.5.1 Our Contributions

We study the problem of constructing a data structure for a context-free grammar G that supports fingerprint queries. Such a query FINGERPRINT(i, j) returns the Karp-Rabin fingerprint φ(S[i, j]) of the substring S[i, j], where S is the string compressed by G. By storing the Karp-Rabin fingerprints for all prefixes of S, φ(S[1, i]) for i = 1 ... N, a fingerprint query can be answered in O(1) time. However, this data structure uses O(N) space, which can be exponential in n. Another approach is to use the data structure of Gąsieniec et al. [71], which supports linear time decompression of a prefix or suffix of the string generated by a node. To answer a query we find the deepest node that generates a string containing S[i] and S[j] and decompress the appropriate suffix of its left child and prefix of its right child. Consequently, the space usage is O(n) and the query time is O(h + j − i), where h is the height of the grammar. The O(h) time to find the correct node can be improved to O(log N) using the data structure by Bille et al. [27], giving O(log N + j − i) time for a FINGERPRINT(i, j) query. Note that the query time depends on the length of the decompressed string, which can be large. For the case of balanced grammars (by height or weight) Gagie et al. [64] showed how to efficiently compute fingerprints for indexing Lempel-Ziv compressed strings. We present the first data structures that answer fingerprint queries on general grammar compressed strings without decompressing any characters, and improve all of the above time-space trade-offs. We assume without loss of generality that G is a Straight Line Program (SLP), i.e., G produces a single string and every nonterminal in G has exactly two children (Chomsky normal form). Our main result is a data structure for an SLP G that can answer FINGERPRINT(i, j) queries in O(log N) time.
The data structure uses O(n) space, and can thus be stored together with the SLP at no additional overhead. This matches the best known bounds for supporting random access in grammar compressed strings [27]. We also show that for linear SLPs, which are a special variant of SLPs that captures LZ78, we can support fingerprint queries in O(log log N) time and O(n) space. As an application, we demonstrate how to efficiently support longest common extension queries on the compressed string S in O(log N log ℓ) time for general SLPs and O(log ℓ log log ℓ + log log N) time for linear SLPs. Here ℓ denotes the length of the LCE. We also show how to obtain a Las Vegas version of both data structures by verifying that the fingerprinting function is collision free.
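The composition property underlying these data structures can be illustrated on a toy SLP: precompute the length and fingerprint of every nonterminal bottom-up, then answer a prefix-fingerprint query by descending the grammar, spending constant time per level. The rule encoding below is an illustrative choice, and this sketch runs in O(depth) per query rather than the O(log N) of our actual data structure.

```python
MOD = (1 << 61) - 1
B = 31  # illustrative random base

# An SLP in Chomsky normal form: rules[v] is either ('char', c) or
# ('pair', left, right), with children listed before their parents.
def preprocess(rules):
    """Bottom-up: length and fingerprint of the string derived by each rule."""
    length, fp = {}, {}
    for v, rule in enumerate(rules):
        if rule[0] == 'char':
            length[v], fp[v] = 1, ord(rule[1]) % MOD
        else:
            _, l, r = rule
            length[v] = length[l] + length[r]
            # composition property: phi(xy) = phi(x) * B^|y| + phi(y)
            fp[v] = (fp[l] * pow(B, length[r], MOD) + fp[r]) % MOD
    return length, fp

def prefix_fp(rules, length, fp, v, k):
    """Fingerprint of the first k characters of the string derived by v,
    by descending the grammar without decompressing any characters."""
    if k == 0:
        return 0
    rule = rules[v]
    if rule[0] == 'char' or k == length[v]:
        return fp[v]
    _, l, r = rule
    if k <= length[l]:
        return prefix_fp(rules, length, fp, l, k)
    rest = k - length[l]
    return (fp[l] * pow(B, rest, MOD)
            + prefix_fp(rules, length, fp, r, rest)) % MOD
```

Two prefix fingerprints then combine, again via composition, into the fingerprint of any substring S[i, j].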
1.5.2 Future Directions

The generality of grammar compression makes it an ideal target model for fundamental data structures and algorithms on compressed strings. The suffix tree has been incredibly successful in combination with other data structures, and we expect the same could be the case for SLPs in the future. Besides supporting new primitive operations on strings compressed by SLPs, our work leaves open the following interesting question: For uncompressed strings we know how to support the three fundamental primitives of random access, Karp-Rabin fingerprints and longest common extensions efficiently. Using linear space, we can in all cases answer a query in constant time. However, for grammar compressed strings the situation is different. Here, using O(n) space, we obtain query times of O(log N) for random access and Karp-Rabin fingerprints, but O(log² N) for longest common extensions. It would be nice if this apparent asymmetry could be eliminated.
1.6 On Chapter 4: Sparse Text Indexing in Small Space
In this chapter we study the sparse text indexing problem. Given a string T of length n and a list of b interesting positions in T, the goal is to construct an index for only those b positions, while using only O(b) space during the construction process (in addition to storing the string T). Here, by index we mean a data structure allowing for the quick location of all occurrences of a pattern P starting at interesting positions in T only. The ideal solution to the sparse text indexing problem would be an algorithm that fully generalizes the linear time and space construction bounds for full text indexes. That is, an algorithm which in O(n) time and O(b) space can construct a sparse index for b arbitrary positions. Moreover, the index constructed should support pattern matching queries for a pattern P of length m in O(m + occ) time. However, we are still some way from achieving this goal. The first partial results were only obtained in 1996, when Andersson et al. [10, 11] and Kärkkäinen and Ukkonen [95] considered restricted variants of the sparse text indexing problem: the first [10, 11] assumed that the interesting positions coincide with natural word boundaries of the text, and the authors achieved expected linear running time using O(b) space. The expectation was later removed [57, 87], and the result was recently generalized to variable length codes such as Huffman codes [143]. The second restricted case [95] assumed that the interesting positions are evenly spaced, i.e., every kth position in the text. They achieved linear running time and optimal O(b) space. It should be mentioned that the data structure by Kärkkäinen and Ukkonen [95] was not necessarily meant for finding only pattern occurrences starting at the evenly spaced indexed positions, as a large portion of the paper is devoted to recovering all occurrences from the indexed ones. Their technique has recently been refined by Kolpakov et al. [104].
Another restricted case admitting an O(b) space solution is if the interesting positions have the same period ρ (i.e., if position i is interesting then so is position i + ρ). In this case the sparse suffix array can be constructed in O(bρ + b log b) time. This was shown by Burkhardt and Kärkkäinen [34], who used it to sort difference cover samples, leading to a clever technique for constructing the full suffix array in sublinear space. Interestingly, their technique also implies a time-space trade-off for sorting b arbitrary suffixes in O(v + n/√v) space and O(√v·n + (n/√v) log(n/√v) + vb + b log b) time for any v ∈ [2, n].

1.6.1 Our Contributions

Our work focuses on construction algorithms for three sparse text indexing data structures: sparse suffix trees, sparse suffix arrays and sparse position heaps. For sparse suffix trees (and arrays) we give an O(n log² b) time and O(b) space Monte Carlo algorithm that with high probability correctly constructs the data structure. For sparse position heaps we show that they can be constructed slightly faster, in O(n + b log b) time and O(b) space; however, pattern matching queries then take O(m² + occ) time. In more detail, our construction for sparse suffix trees implies a general Monte Carlo time-space trade-off: for any α ∈ [2, n], we can construct the sparse suffix tree in

O(n·(log² b)/(log α) + α·b·(log² b)/(log α))

time and O(αb) space. Consequently, by using O(b^(1+ε)) space for any constant ε > 0 (i.e., slightly more than the O(b) requirement), we can improve the construction time of the sparse suffix tree from O(n log² b) to O(n log b). Finally, we give a deterministic verification algorithm that can verify the correctness of the sparse suffix tree output by our Monte Carlo algorithm in O(n log² b + b² log b) time and O(b) space. This implies a Las Vegas algorithm that with certainty constructs the correct sparse suffix tree in O(b) space and uses O(n log² b + b² log b) time with high probability.
The main idea in our construction is to use a new technique, which we call batched longest common extension queries, to efficiently sort the b suffixes and obtain the suffix and LCP arrays. We show that given a batch of q pairs of indices into T, we can compute the longest common extension of all pairs in O((n + q) log q) time and O(q) space with a Monte Carlo algorithm. This allows us to sort the b suffixes using a quicksort approach, where in each round we pick a random pivot suffix, to which we compare all other suffixes using a batched LCE query.
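The quicksort-style idea can be illustrated as follows. Here a naive character-by-character LCE stands in for the batched Monte Carlo LCE queries, so this sketch only demonstrates how LCE queries drive the suffix comparisons, not the stated time bounds.

```python
from functools import cmp_to_key

def lce(t, i, j):
    """Naive LCE; the chapter computes these in batches instead."""
    k = 0
    while i + k < len(t) and j + k < len(t) and t[i + k] == t[j + k]:
        k += 1
    return k

def sparse_suffix_array(t, positions):
    """Sort only the suffixes starting at the given interesting positions.
    Each comparison skips the common prefix via an LCE query and then
    compares a single character (or handles the prefix-of-the-other case)."""
    n = len(t)
    def cmp(i, j):
        if i == j:
            return 0
        k = lce(t, i, j)
        if i + k == n:      # suffix i is a proper prefix of suffix j
            return -1
        if j + k == n:
            return 1
        return -1 if t[i + k] < t[j + k] else 1
    return sorted(positions, key=cmp_to_key(cmp))
```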
1.6.2 Future Directions

I et al. [85] very recently improved upon our O(n log² b) time bound and showed how to construct the sparse suffix tree in O(n log b) time and O(b) space. They did so by introducing the clever notion of an ℓ-strict compact trie, which relaxes the normal requirement that sibling edges in compact tries must have distinct first characters, and instead allows labels on sibling edges to share a common prefix of length ℓ − 1. They start off with the n-strict compact trie on the b suffixes (which is easy to construct) and over the course of O(log n) rounds gradually refine it to a normal (1-strict) compact trie. The key trick in the refinement process is to use fingerprints to compare and group edge labels, and the challenge is to compute them efficiently using little space. More precisely, I et al. [85] obtain a Monte Carlo time-space trade-off and show that for any s ∈ [b, n] the sparse suffix tree can be constructed in O(n + (bn/s) log s) time and O(s) space. The construction is correct with high probability. They also give a deterministic O(n log b) time and O(b) space verification algorithm, which improves upon our O(n log² b + b² log b) time verification algorithm, and implies a Las Vegas construction algorithm that correctly constructs the sparse suffix tree in O(b) space and with high probability uses O(n log b) time. Recall that the central open problem in sparse text indexing is whether it is possible to construct sparse text indexes in O(n) time and O(b) space that support queries for patterns of length m in O(m + occ) time. Very interestingly, the time-space trade-off of I et al. [85] implies that using just O(b log b) space, the sparse suffix tree can be constructed (correctly with high probability) in O(n) time. Thus it seems we could be close to achieving the goal of having sparse text index constructions that fully generalize those of full text indexes.
However, other interesting questions remain. Specifically, the central role of fingerprints, in both our work and that of I et al. [85], raises the interesting question of finding fast and space-efficient deterministic constructions of sparse indexes.
1.7 On Chapters 5 & 6: The Longest Common Substring Problem
In Chapter 5 and Chapter 6 we study time-space trade-offs for the longest common substring (LCS) problem. This problem should not be confused with the longest common subsequence problem, which is also often abbreviated LCS. We are considering a general version of the problem in which the input consists of m strings T1, ..., Tm of total length n and an integer 2 ≤ d ≤ m. The output is the longest substring common to at least d of the input strings. Notably, the special case where d = m = 2 captures the simplest form of the problem, where the goal is to find the longest common substring of two strings. Historically, this fundamental problem has received the most attention, and work on it dates back to the very early days of combinatorial pattern matching. In the seminal paper by Knuth et al. [102], exhibiting a linear time algorithm for exact pattern matching, the authors write the following historical remark about the longest common substring problem:
“It seemed at first that there might be a way to find the longest common substring of two given strings, in time O(n)⁴; but the algorithm of this paper does not readily support any such extension, and Knuth conjectured in 1970 that such efficiency would be impossible to achieve.” [102]
However, Knuth's conjecture did not stand long. In 1972 Karp et al. [96] gave an O(n log n) time algorithm, and the year after, Weiner published his paper introducing suffix trees and showed that the longest common substring of two strings from a constant size alphabet can be found in O(n) time [147]. The solution was particularly simple: Build the suffix tree over the concatenation of the two strings and find the deepest node that contains a suffix from both strings in its subtree. The general version of the problem where 2 ≤ d ≤ m was not dealt with until 1992, when Hui [80] showed that a tree on n nodes with colored leaves can be preprocessed in O(n) time so every node stores the number of distinctly colored leaves in its subtree. With this information available, it is easy to find the LCS by traversing the suffix tree of the input strings in linear time to locate the deepest node having d distinctly colored leaves below it. The assumption of a constant size alphabet was eliminated in 1997 when Farach [54] showed how to construct the suffix tree in O(n) time and space for strings from a polynomial sized alphabet, thereby also showing that the general version of the longest common substring problem allows an O(n) time and space solution for strings from such alphabets. In Chapter 5 and Chapter 6 we revisit the longest common substring problem, specifically focusing on the space complexity of the problem. The suffix tree approach inherently requires Ω(n) space, which is infeasible in many practical situations where the strings are long. We investigate how the longest common substring problem can be efficiently solved in sublinear space, i.e., O(n^(1−ε)) space for a parameter ε > 0. In the following, ε refers to an arbitrary function of n in the range [0, 1], and thus not necessarily a constant. Before our work, very little was known about the possible time-space trade-offs for this problem.
⁴In [102] the running time is stated as O(m + n), where m and n are the lengths of the two strings.
Case         Space        Time                               Trade-off interval   Description
d = m = 2    O(1)         O(n²|LCS|)                         —                    Naive solution
d = m = 2    O(n^(1−ε))   O(n^(2(1+ε)))                      0 ≤ ε ≤ 1/2          Deterministic LCE d.s. [26]
d = m = 2    O(n^(1−ε))   O(n^(2+ε) log |LCS|) w.h.p.        0 ≤ ε ≤ 1            Fingerprint LCE d.s. [26]
d = m = 2    O(n^(1−ε))   O(n^(1+ε))                         0 < ε ≤ 1/3          Chapter 5
2 ≤ d ≤ m    O(n^(1−ε))   O(n^(1+ε) log |LCS|)               0 ≤ ε ≤ 1            Fingerprints. Correct w.h.p.
2 ≤ d ≤ m    O(n^(1−ε))   O(n^(1+ε) log² n (d log² n + d²))  0 ≤ ε < 1/3          Chapter 5
2 ≤ d ≤ m    O(n^(1−ε))   O(n^(1+ε))                         0 ≤ ε ≤ 1            Chapter 6
2 ≤ d ≤ m    O(n)         O(n)                               —                    Suffix tree [80, 147]

Table 1.1: Overview of the known time-space trade-offs for the longest common substring problem in relation to our results in Chapter 5 and Chapter 6. |LCS| is the length of the longest common substring.
Table 1.1 summarizes the known solutions in comparison to our new results. In the following we provide a brief description of the trade-offs in the table. For more details see Chapter 5.1.1. In the special case d = m = 2, we can obtain an O(1) space solution by naively comparing all pairs of substrings in time O(n²|LCS|), where |LCS| is the length of the longest common substring. This solution can be generalized into a time-space trade-off by using the deterministic or randomized sublinear space LCE data structure presented in Chapter 2 to perform the Θ(n²) LCE queries. For the general case of 2 ≤ d ≤ m, we can combine hashing and Karp-Rabin fingerprints to obtain an O(n^(1−ε)) space and O(n^(1+ε) log |LCS|) time solution for any 0 ≤ ε ≤ 1. The main idea is to repeatedly consider batches of O(n^(1−ε)) substrings of the same length and use fingerprints and a sliding window to identify the longest substring in the batch that occurs in at least d strings. Additionally, we have to binary search for the length of the LCS, which results in a solution using O(n^(1−ε)) space and O(n^(1+ε) log |LCS|) time.

1.7.1 Our Contributions

In Chapter 5 we start by establishing time-space trade-offs for the d = m = 2 case as well as the general case. For the special case of d = m = 2, we obtain an O(n^(1−ε)) space and O(n^(1+ε)) time solution, but for the general case, we obtain a time bound of O(n^(1+ε) log² n (d log² n + d²)), thus yielding a rather poor trade-off for large values of d. Moreover, both of these trade-offs are restricted in the sense that they only work for ε (roughly) in the range [0, 1/3]. In Chapter 6 we address these shortcomings, and, using a very different approach, we manage to obtain a clean O(n^(1−ε)) space and O(n^(1+ε)) time trade-off for the general case 2 ≤ d ≤ m that holds for any ε ∈ [0, 1]⁵. This provides the first smooth time-space trade-off from constant to linear space matching the time-space product of O(n²) of the classic suffix tree solution.
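The binary-search idea can be sketched for d = m = 2 as follows. Exact substring slices stored in a Python set stand in for Karp-Rabin sliding-window fingerprints (so no collisions can occur here, and the per-length scan is not implemented in the stated bounds); the structure of the binary search over the LCS length is the point.

```python
def lcs_length(t1, t2):
    """Length of the longest common substring, by binary search on the length.

    check(L) collects all length-L substrings of t1 and tests those of t2
    against them; a real implementation uses Karp-Rabin sliding-window
    fingerprints here and verifies candidates to rule out collisions.
    """
    def check(L):
        if L == 0:
            return True
        seen = {t1[i:i + L] for i in range(len(t1) - L + 1)}
        return any(t2[j:j + L] in seen for j in range(len(t2) - L + 1))

    # check is monotone: a common substring of length L yields one of
    # length L - 1, so binary search over L is valid.
    lo, hi = 0, min(len(t1), len(t2))
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if check(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

The binary search contributes the log |LCS|-style factor that appears in the fingerprint-based rows of Table 1.1.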
In the last part of Chapter 6 we show a time-space trade-off lower bound for the LCS problem. Let T1 and T2 be two arbitrary strings of total length n from an alphabet of size at least n². We prove that any deterministic RAM algorithm that solves the LCS problem on T1 and T2 using O(n^(1−ε)) space, where ε ∈ [log log n/log n, 1], must use Ω(n·√(ε log n/log(ε log n))) time. In particular for ε = 1, this means that any constant space algorithm that solves the LCS problem on two strings must use Ω(n·√(log n/log log n)) time. At the other extreme we obtain that any algorithm using O(n^(1−log log n/log n)) space must use Ω(n·√(log log n/log log log n)) time. So in a sense, Knuth was right when he conjectured that the problem requires superlinear time, assuming he was thinking of algorithms that use little space.
1.7.2 Future Directions

The main problem left open by our work is to settle the optimal time-space product for the LCS problem. While it is tempting to guess that the answer lies in the vicinity of Θ(n²), it seems really difficult to substantially improve our lower bound. Strong time-space product lower bounds have so far only been established in weaker models (e.g., the comparison model) or for multi-output problems (e.g., sorting an array, outputting its distinct elements and various pattern matching problems). Proving an Ω(n²) time-space product lower bound in the RAM model for any problem where the output fits in a constant number of words (e.g., the LCS problem) is a major open problem. Moreover, we draw attention to a recent result by Beame et al. [18], who gave a randomized algorithm for the element distinctness problem with an O(n^(3/2) polylog n) time-space product. Although it does not immediately generalize to element bidistinctness, this result shows that one should be careful about ruling out the possibility of a major improvement of our O(n²) upper bound.

⁵In the chapter this trade-off is stated as O(τ) space and O(n²/τ) time for 1 ≤ τ ≤ n. Here we substituted τ = n^(1−ε) to more easily compare it with the previous work.
In particular, one could speculate that a randomized approach using Karp-Rabin fingerprints could lead to an algorithm for the longest common substring problem with a subquadratic time-space product. Another interesting research direction is to study approximate versions of the longest common substring problem. Given two strings T1 and T2 of total length n and an integer k, this problem asks us to find the longest substrings S1 of T1 and S2 of T2 such that the edit or Hamming distance between S1 and S2 is at most k. For Hamming distance, Flouri et al. [60] very recently showed that this problem allows a constant space and O(n²) time algorithm for any k. Moreover, they show that for k = 1 the problem can be solved in O(n log n) time and O(n) space. Notably, their new constant space algorithm for approximate LCS completely generalizes the constant space and O(n²) time algorithm for k = 0 (Corollary 6.1) that we develop as a stepping stone to our main result in Chapter 6. These results introduce a third trade-off dimension to the longest common substring problem and raise a number of interesting questions. In particular, is it possible to obtain a time-space trade-off of the constant space and O(n²) time solution for any k? Furthermore, it would be very interesting to consider edit distance, as well as investigate whether similar solutions and trade-offs can be obtained for the general LCS problem where 2 ≤ d ≤ m.

1.8 On Chapter 7: A Suffix Tree or Not a Suffix Tree?
Since their introduction in 1973 [147], suffix trees have been incredibly successful in the field of combinatorial pattern matching. Specifically, all papers in this dissertation use suffix trees in some way or another. But despite their success and the recent celebration of their 40th anniversary, many structural properties of suffix trees are still not well understood. In Chapter 7 we study combinatorial properties of suffix trees, and in particular the problem of characterizing the topological structure of suffix trees. We focus on suffix trees constructed for arbitrary strings, i.e., strings that do not necessarily end with a unique symbol. In such suffix trees some suffixes can coincide with internal nodes or end in implicit nodes on the edges. The nature of this problem can be illustrated by the two similar trees in Figure 1.2. If one constructs the suffix tree of the string acacbacbacc (see Figure 1.1(a)), it has the same topological structure as the tree shown in Figure 1.2(a). However, by careful inspection and case analysis, one can argue that there is no string that has a suffix tree with the structure shown in Figure 1.2(b). We are interested in understanding why suffix trees can have some topological
Figure 1.2: Two similar trees, only one of which is a suffix tree. (a) A suffix tree. (b) Not a suffix tree.

structures but not others. Ideally, we want to find a general criterion that characterizes the topological structure of suffix trees. About the search for such a characterization, it has been remarked that
“the problem [of characterizing suffix trees] is so natural that anyone working with suffix trees, eventually will ask themselves this question.”
Amihood Amir, Bar-Ilan University, 2013 (personal communication)
To discuss the problem in a general framework, we introduce the following informal notion of partially specified suffix trees. A partially specified suffix tree 𝒯 is a specification of a subset of the structure of a suffix tree. For example, 𝒯 can be an unlabeled ordered rooted tree as in Figure 1.2, but it can also be annotated with suffix links, and partially or fully specify some of the edge labels⁶. Given a partially specified suffix tree 𝒯, the suffix tree decision problem is to decide if there exists a string S such that the suffix tree of S has the structure specified by 𝒯. If such a string exists, we say that 𝒯 is a suffix tree and that S realizes 𝒯. If 𝒯 can be realized by a string S having a unique end symbol $, we additionally say that 𝒯 is a $-suffix tree. For example, the tree in Figure 1.2(a) is a suffix tree and is realized by the string acacbacbacc. However, the same
⁶In Chapter 7 the symbol τ is used to denote a partially specified suffix tree. Here we use 𝒯 to avoid confusion with the trade-off parameter τ used in the previous sections.

tree is not a $-suffix tree, since the suffix labeled by the unique character $ must correspond to a leaf which is a child of the root. One of the challenging aspects of the suffix tree decision problem is that, in general, a string S that realizes a partially specified suffix tree is not unique. For example, the tree in Figure 1.2(a) is also realized by the string caacabcabcac. Intuitively, the suffix tree decision problem becomes easier the more information 𝒯 specifies. Also, it is generally easier to decide if 𝒯 is a $-suffix tree than a suffix tree, since for $-suffix trees we can infer the length of the string S from the number of leaves in 𝒯. In the general case where 𝒯 is just an unlabeled ordered rooted tree (as in Figure 1.2), a polynomial time algorithm for deciding if 𝒯 is a suffix tree or a $-suffix tree is not known. Obviously, one can decide whether 𝒯 is a $-suffix tree by an exhaustive search that enumerates the suffix trees of all strings of length equal to the number of leaves in 𝒯. However, when deciding whether 𝒯 is a suffix tree, the number of leaves in 𝒯 only provides a lower bound on the length of the string, and without an upper bound, even an exhaustive search algorithm is not obvious. That said, it is easy to find simple necessary properties that an unlabeled rooted tree 𝒯 must satisfy in order to be a suffix tree. For instance, internal nodes in 𝒯 must have at least two children, and no node can have more children than the root. More strongly, it holds that
Observation 1.1 If an unlabeled rooted tree 𝒯 is a suffix tree, then for all subtrees 𝒯_v of 𝒯, a tree isomorphic to 𝒯_v can be obtained from 𝒯 by the process of repeatedly contracting an edge in 𝒯 that goes to a node with one or zero children.
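On small trees this necessary condition can be checked mechanically. The following sketch is our own illustration, not code from Chapter 7: it encodes an unordered tree shape as a sorted tuple of child shapes (a leaf is the empty tuple ()), enumerates all shapes reachable by the stated contractions, and tests that every subtree occurs among them:

```python
def canon(children):
    """Canonical (unordered) tree shape: a sorted tuple of child shapes."""
    return tuple(sorted(children))

def one_contractions(t):
    """All shapes obtainable from t by contracting a single edge that
    goes to a node with one or zero children."""
    out = set()
    for i, child in enumerate(t):
        rest = t[:i] + t[i + 1:]
        if len(child) == 0:                   # edge to a leaf: leaf disappears
            out.add(canon(rest))
        elif len(child) == 1:                 # edge to a unary node: splice it
            out.add(canon(rest + (child[0],)))
        for sub in one_contractions(child):   # contract deeper in the subtree
            out.add(canon(rest + (sub,)))
    return out

def reachable(t):
    """All shapes obtainable by repeatedly contracting such edges."""
    seen, stack = {t}, [t]
    while stack:
        for nxt in one_contractions(stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def subtrees(t):
    yield t
    for child in t:
        yield from subtrees(child)

def satisfies_observation(t):
    """Does every subtree of t occur among the contractions of t?"""
    r = reachable(t)
    return all(s in r for s in subtrees(t))
```

For instance, a root with a leaf child and a binary child passes the check, while a root with a leaf child and a ternary child fails it: the root's child count can never grow under these contractions, so the ternary subtree is unreachable.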
Unfortunately, this condition is not sufficient for 𝒯 to be a suffix tree. For example, all subtrees of the tree in Figure 1.2(b) satisfy the above criterion, and yet the tree is not a suffix tree. To approach this seemingly difficult problem, I et al. [84] considered the case where 𝒯 specifies more information about the suffix tree. More precisely, they assume that 𝒯 is an ordered rooted tree on n nodes, which is annotated with the suffix links of the internal nodes as well as the first character of all edge labels. For this case they give an O(n) time algorithm for deciding if 𝒯 is a $-suffix tree. The main idea in their solution is to exploit the suffix links of the internal nodes in 𝒯 to infer a valid permutation of the leaves. To do so, they define a special graph, the suffix tour graph, and show that this graph has an Eulerian cycle if and only if 𝒯 is a $-suffix tree. Moreover, the order in which the Eulerian cycle visits the leaves in the suffix tour graph defines a string that realizes 𝒯. They also show how to remove the assumption that 𝒯 specifies the first character of all edge labels, for the special case of deciding whether 𝒯 is a $-suffix tree for a string drawn from a binary alphabet.
1.8.1 Our Contributions

In Chapter 7 we study the same variant of the suffix tree decision problem considered by I et al. [84], but we focus on the general case of deciding whether 𝒯 is a suffix tree. As previously mentioned, this problem is more challenging, in part because we cannot infer the length of the string S from the number of leaves in 𝒯. We start by addressing this issue and show that a tree 𝒯 on n nodes is a suffix tree if and only if it is realized by a string of length n − 1. This bound is tight, since the tree consisting of a root and n − 1 leaves needs a string of length at least n − 1. The bound implies an exhaustive search algorithm for deciding whether 𝒯 is a suffix tree, even when it is just an unlabeled ordered rooted tree.

In the case considered by I et al. [84], where 𝒯 also specifies the suffix links and the first character on every edge, we show how to decide whether 𝒯 is a suffix tree in O(n) time. If 𝒯 is a suffix tree, our algorithm also outputs a string that realizes 𝒯. This generalizes the O(n) time algorithm provided by I et al. [84] for deciding if 𝒯 is a $-suffix tree. To obtain our linear time algorithm, we extend the suffix tour graph technique to suffix trees. The main challenge is that if 𝒯 is a suffix tree, but not a $-suffix tree, then the suffix tour graph can be disconnected, and we must use non-trivial properties to infer a string that realizes 𝒯. We show several new properties of suffix trees and use these to characterize the relationship between the suffix tour graphs of $-suffix trees and suffix trees.
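The exhaustive search implied by the length bound can be sketched concretely for tiny trees. The code below is our own illustration under two assumptions drawn from the text: a shape on n nodes is a suffix tree iff some string of length at most n − 1 realizes it, and the alphabet can be limited to the degree of the root (every character occurring in S starts some suffix and hence an edge out of the root). Tree shapes are compared unordered, encoded as sorted tuples of child shapes:

```python
from itertools import product

def suffix_tree_shape(s):
    """Canonical unordered shape of the suffix tree of s (no end marker):
    the compacted trie of all suffixes of s, with non-branching internal
    nodes spliced into their edges."""
    root = {}
    for i in range(len(s)):                 # insert every suffix into a trie
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})

    def canon(node):
        kids = []
        for child in node.values():
            while len(child) == 1:          # unary trie node: lies inside an edge
                child = next(iter(child.values()))
            kids.append(canon(child))
        return tuple(sorted(kids))          # sorted for unordered comparison

    return canon(root)

def count_nodes(shape):
    return 1 + sum(count_nodes(c) for c in shape)

def realizing_string(shape):
    """Exhaustive search for a string realizing the given shape, trying
    all strings of length at most n - 1 over an alphabet whose size
    equals the root's degree."""
    n = count_nodes(shape)
    sigma = "abcdefghijklmnopqrstuvwxyz"[:max(1, len(shape))]
    for length in range(1, n):
        for letters in product(sigma, repeat=length):
            s = "".join(letters)
            if suffix_tree_shape(s) == shape:
                return s
    return None
```

For example, the shape consisting of a root and two leaves is realized by the string "ab", whereas a root whose only child is a binary node is realized by no string at all.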
1.8.2 Future Directions

Cazaux and Rivals [35] very recently studied the $-suffix tree decision problem. They also consider the variant where 𝒯 contains the suffix links of the internal nodes, but they remove the assumption that 𝒯 specifies the first character of all edge labels. This improves on the work of I et al. [84], who only showed how to solve the problem without first characters for binary alphabets. The main idea of Cazaux and Rivals is to replace the suffix tour graph with a new graph, which is defined only on the internal nodes of 𝒯. Similar to the suffix tour graph approach, they show that this graph contains an Eulerian cycle with a special property if and only if 𝒯 is a $-suffix tree. However, the efficiency of this approach remains unclear, and Cazaux and Rivals do not explicitly bound the time complexity of their algorithm. In the worst case it seems their algorithm might have to explore an exponential number of Eulerian cycles in the graph.

While our work together with that of I et al. [84] and Cazaux and Rivals [35] does provide many new non-trivial insights about suffix trees, we are still far from the goal of having a simple characterization of the topological structure of suffix trees that can be efficiently tested by an algorithm. To this end, the central open problem is to settle the time complexity of deciding whether an unlabeled tree 𝒯 is a suffix tree when neither suffix links nor first characters are specified. Can we exploit properties of suffix links to decide this in polynomial time, or can we prove that the problem is intractable?

CHAPTER 2
TIME-SPACE TRADE-OFFS FOR LONGEST COMMON EXTENSIONS
Philip Bille∗ Inge Li Gørtz∗ Benjamin Sach† Hjalte Wedel Vildhøj∗
∗ Technical University of Denmark, DTU Compute † University of Warwick, Department of Computer Science
Abstract
We revisit the longest common extension (LCE) problem, that is, preprocess a string T into a compact data structure that supports fast LCE queries. An LCE query takes a pair (i, j) of indices in T and returns the length of the longest common prefix of the suffixes of T starting at positions i and j. We study the time-space trade-offs for the problem, that is, the space used for the data structure vs. the worst-case time for answering an LCE query. Let n be the length of T. Given a parameter τ, 1 ≤ τ ≤ n, we show how to achieve either O(n/√τ) space and O(τ) query time, or O(n/τ) space and O(τ log(|LCE(i, j)|/τ)) query time, where |LCE(i, j)| denotes the length of the LCE returned by the query. These bounds provide the first smooth trade-offs for the LCE problem and almost match the previously known bounds at the extremes when τ = 1 or τ = n. We apply the result to obtain improved bounds for several applications where the LCE problem is the computational bottleneck, including approximate string matching and computing palindromes. We also present an efficient technique to reduce LCE queries on two strings to one string. Finally, we give a lower bound on the time-space product for LCE data structures in the non-uniform cell probe model showing that our second trade-off is nearly optimal.
2.1 Introduction
Given a string T, the longest common extension of suffixes i and j, denoted LCE(i, j), is the length of the longest common prefix of the suffixes of T starting at positions i and j. The longest common extension problem (LCE problem) is to preprocess T into a compact data structure supporting fast longest common extension queries. The LCE problem is a basic primitive that appears as a subproblem in a wide range of string matching problems such as approximate string matching and its variations [9, 40, 107, 109, 128], computing exact or approximate tandem repeats [74, 108, 117], and computing palindromes. In many of the applications, the LCE problem is the computational bottleneck. In this paper we study the time-space trade-offs for the LCE problem, that is, the space used by the preprocessed data structure vs. the worst-case time used by LCE queries. We assume that the input string is given in read-only memory and is not counted in the space complexity. There are essentially only two time-space trade-offs known: At one extreme we can store a suffix tree combined with an efficient nearest common ancestor (NCA) data structure [78] (other combinations of O(n) space data structures for the string can also be used to achieve this bound, e.g. [59]). This solution uses O(n) space and supports LCE queries in O(1) time. At the other extreme we do not store any data structure and instead answer queries simply by comparing characters from left to right in T. This solution uses O(1) space and answers an LCE(i, j) query in O(|LCE(i, j)|) = O(n) time. This approach was recently shown to be very practical [86].
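The second extreme fits in a few lines; the sketch below is a minimal illustration of the character-by-character query:

```python
def lce(T, i, j):
    """Naive LCE(i, j): length of the longest common prefix of the
    suffixes of T starting at positions i and j, using O(1) space and
    O(|LCE(i, j)|) time per query."""
    k = 0
    while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
        k += 1
    return k
```

For example, on T = "banana", the suffixes starting at positions 1 and 3 are "anana" and "ana", so LCE(1, 3) = 3.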
2.1.1 Our Results

We show the following main result for the longest common extension problem.
Theorem 2.1 For a string T of length n and any parameter τ, 1 ≤ τ ≤ n, T can be preprocessed into a data structure supporting LCE(i, j) queries on T. This can be done such that the data structure