Data Structures for Efficient String Algorithms
Johannes Fischer
München 2007

Dissertation at the Faculty of Mathematics, Computer Science and Statistics of the Ludwig-Maximilians-Universität München
Submitted by Johannes Fischer from Kronberg i. Ts.
München, 10 October 2007

First reviewer: Prof. Dr. Volker Heun
Second reviewer: Prof. Dr. Enno Ohlebusch
Date of the oral examination: 8 October 2007

Abstract

This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs, for Range Minimum Queries). The space for this data structure is 2n + o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved without affecting the time bounds. For the two-dimensional variant of the RMQ-problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption by a factor of log n.

It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ-problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space and time consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching.
In particular, we will see that, in conjunction with the suffix- and LCP-array, 2n + o(n) bits of additional space (coming from our RMQ-scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(mσ + occ) time, where σ denotes the size of the alphabet. This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m log σ + occ) locating time, replacing our original scheme with a data structure of size ≈ 2.54n bits. Again by using RMQs, we then show how to solve frequency-related string mining tasks in optimal time. In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern.

Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on real-world problem instances. In most cases our algorithms clearly outperform previous approaches in both running time and space consumption.

Summary (Zusammenfassung)

This thesis deals with data structures that are used in particular in string matching and string mining. The main result is an O(n)-time construction algorithm for a data structure of size 2n + o(n) bits that allows the position of a minimum element in an arbitrary query interval of an array of n numbers to be found in constant time (so-called RMQs, for Range Minimum Queries). This improves all previous results on this problem. The main techniques leading to this result rely on combinatorial properties of arrays and so-called Cartesian Trees. We also show that this scheme for preprocessing an array for constant-time RMQs is asymptotically optimal in the general case, but that further space can be saved for compressible input arrays without sacrificing time optimality. For the two-dimensional variant of the same problem we present a quasi-time-optimal algorithm whose space requirement is, however, somewhat higher (asymptotically by a factor of log n).

Besides the already known algorithms that use RMQs to solve certain problems (e.g., the computation of lowest common ancestors in trees), the second part of this dissertation considers new applications of the RMQ-problem. We show that the data structure mentioned above (and a variant thereof) leads to improvements in the time and space requirements of the Enhanced Suffix Array, a collection of arrays for solving many string matching problems. In particular, we will see that the 2n + o(n) bits of additional space from our RMQ structure suffice, together with the suffix- and LCP-array, to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(mσ + occ) time, where σ denotes the size of the alphabet. For constant alphabet sizes this is certainly optimal; for non-constant alphabet sizes, however, we can improve this result: approximately 2.54n + o(n) additional bits suffice to find, with the suffix- and LCP-array, all occ occurrences of a pattern in O(m log σ + occ) time. Again by applying our result on RMQs, we then show how to answer frequency-based string mining queries in text databases in optimal time. In a concluding chapter we give a time- and space-optimal construction algorithm for suffix arrays on texts that are divided into logical units (e.g., natural-language words), if one is only interested in matches that begin at word boundaries.

Most of the results presented in this dissertation have practical significance in addition to their theoretical relevance, a fact that is underlined by empirical tests and comparisons on real data. In most cases our new algorithms clearly improve on the running times and space requirements of earlier approaches.

CONTENTS

1 Introduction
  1.1 Synopsis
  1.2 Practicability of the Results
  1.3 Previous Publications
  1.4 A Note on Citation Policy
2 Basic Concepts
  2.1 Arrays and Strings
  2.2 Basic String Matching Problems
  2.3 Suffix Trees
  2.4 Suffix- and LCP-Arrays
  2.5 Text Compressibility and Empirical Entropy
  2.6 Compact Data Structures for Rank and Select
  2.7 Succinct Representations of Suffix- and LCP-Arrays
  2.8 Balanced Parentheses Representation of Trees
3 An Optimal Preprocessing Scheme for RMQ
  3.1 Chapter Introduction
  3.2 Preliminary Definitions
  3.3 Applications of RMQ
    3.3.1 Computing Lowest Common Ancestors in Trees
    3.3.2 Computing Longest Common Extensions of Suffixes
    3.3.3 (Re-)Construction of Suffix Links
    3.3.4 Document Retrieval Queries
    3.3.5 Maximum-Sum Segment Queries
    3.3.6 Lempel-Ziv-77 Data Compression
  3.4 Previous Results on RMQs
    3.4.1 The Berkman-Vishkin Algorithm
    3.4.2 Alstrup et al.'s Idea for Handling In-Block-Queries
    3.4.3 Sadakane's Succinct RMQ-Algorithm
  3.5 An Improved Algorithm
    3.5.1 A New Code for Binary Trees
    3.5.2 The Algorithm
  3.6 A Lower Bound
  3.7 Applications of the New Algorithm
    3.7.1 LCAs in Trees with Small Average Degree
    3.7.2 An Improved Algorithm for Longest Common Extensions
  3.8 Practical Considerations
  3.9 How to Further Reduce Space
    3.9.1 Small "Alphabets" Don't Help!
    3.9.2 Compression Techniques
    3.9.3 Outlook
  3.10 Summary and Discussion
4 2-Dimensional RMQs
  4.1 Chapter Introduction
  4.2 Preliminaries
  4.3 Methods
    4.3.1 A General Trick for Query Precomputation
    4.3.2 Linear Preprocessing of the First Level
    4.3.3 Recursive Partitioning
    4.3.4 What's Left: How to Find the Right Grid
  4.4 Summary and Outlook
5 Improvements in the Enhanced Suffix Array
  5.1 Chapter Introduction
  5.2 Enhanced Suffix Arrays
  5.3 An RMQ-based Representation of the Child-Table
  5.4 Pattern Matching in O(m|Σ|) Time
  5.5 Pattern Matching in O(m log |Σ|) Time
    5.5.1 Basic Idea
    5.5.2 A Pseudo-Median Algorithm for RMQ on LCP
    5.5.3 Summing Up
6 String Mining Problems
  6.1 Chapter Introduction
  6.2 Formal Problem Definition
  6.3 The New Algorithm
    6.3.1 Preprocessing
    6.3.2 Labeling
    6.3.3 Extraction
    6.3.4 Reducing the Size of the Output
  6.4 Practical Performance
  6.5 Conclusions
7 Suffix Arrays on Words
  7.1 Chapter Introduction
  7.2 Chapter Outline
  7.3 Definitions
  7.4 Optimal Construction of the Word Suffix Array
  7.5 Word-Based LCP-Arrays
  7.6 Searching in the Word Suffix Array
    7.6.1 Searching in O(m log k) Time
    7.6.2 Searching in O(m + log k) Time
    7.6.3 Searching in O(m|Σ|) and O(m log |Σ|) Time
  7.7 Experimental Results
  7.8 Conclusions
Summary of Notation
Bibliography

CHAPTER 1
Introduction

The title of this thesis, “Data Structures for Efficient String Algorithms,” already suggests that this work is situated between two main areas of computer science: data structures and string algorithmics, the latter being meant to include both string matching and string mining. However, although almost all of our results on data structures do have applications in string matching, these two terms should not exclusively be seen in conjunction, as more than half of the present text is devoted to a problem whose applicability goes far beyond the area of stringology.
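
For orientation, the range minimum query (RMQ) problem named in the abstract can be illustrated with a minimal sketch. The code below uses the sparse table, a well-known textbook baseline that needs O(n log n) words of space and answers queries in constant time; it is not the 2n + o(n)-bit scheme developed in this thesis. The class name SparseTableRMQ and its method names are hypothetical and chosen only for this illustration.

```python
from typing import List, Sequence


class SparseTableRMQ:
    """Range minimum queries via a sparse table (textbook baseline,
    NOT the succinct scheme of the thesis).

    Preprocessing: O(n log n) time and space. Query: O(1) time.
    """

    def __init__(self, values: Sequence[int]) -> None:
        self.values = list(values)
        n = len(self.values)
        # table[j][i] = position of a minimum in values[i : i + 2**j]
        self.table: List[List[int]] = [list(range(n))]
        j = 1
        while (1 << j) <= n:
            prev = self.table[j - 1]
            half = 1 << (j - 1)
            row = []
            for i in range(n - (1 << j) + 1):
                left, right = prev[i], prev[i + half]
                # Combine the two half-length blocks; prefer the left on ties.
                row.append(left if self.values[left] <= self.values[right] else right)
            self.table.append(row)
            j += 1

    def query(self, lo: int, hi: int) -> int:
        """Return the position of a minimum in values[lo..hi] (inclusive)."""
        if not 0 <= lo <= hi < len(self.values):
            raise IndexError("invalid query interval")
        # Largest power of two that fits into the interval length.
        j = (hi - lo + 1).bit_length() - 1
        # Two (possibly overlapping) blocks of length 2**j cover [lo, hi].
        left = self.table[j][lo]
        right = self.table[j][hi - (1 << j) + 1]
        return left if self.values[left] <= self.values[right] else right
```

A query such as SparseTableRMQ([3, 1, 4, 1, 5, 9, 2, 6]).query(4, 7) returns the position of a minimum within positions 4 to 7. The baseline trades space for simplicity: it stores a logarithmic number of index rows, whereas the thesis aims at a representation of only 2n + o(n) bits.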