Analysis of Algorithms and Data Structures for Text Indexing
FAKULTÄT FÜR INFORMATIK
TECHNISCHE UNIVERSITÄT MÜNCHEN
Lehrstuhl für Effiziente Algorithmen

Moritz G. Maaß

Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Univ.-Prof. Dr. Dr. h.c. mult. Wilfried Brauer
Examiners of the dissertation:
1. Univ.-Prof. Dr. Ernst W. Mayr
2. Prof. Robert Sedgewick, Ph.D. (Princeton University, New Jersey, USA)

The dissertation was submitted to the Technische Universität München on 12 April 2005 and accepted by the Fakultät für Informatik on 26 June 2006.

Abstract

Large amounts of textual data like document collections, DNA sequence data, or the Internet call for fast look-up methods that avoid searching the whole corpus. This is often accomplished using tree-based data structures for text indexing such as tries, PATRICIA trees, or suffix trees. We present and analyze improved algorithms and index data structures for exact and error-tolerant search.

Affix trees are a data structure for exact indexing. They are a generalization of suffix trees, allowing a bidirectional search by extending a pattern to the left and to the right during retrieval. We present an algorithm that constructs affix trees on-line in both directions, i.e., by augmenting the underlying string in both directions. An amortized analysis shows that the algorithm has linear worst-case time complexity.

A space-efficient method for error-tolerant searching in a dictionary for a pattern allowing some mismatches can be implemented with a trie or a PATRICIA tree.
For a given mismatch probability $q$ and a given maximum of allowed mismatches $d$, we study the average-case complexity of the number of comparisons for searching in a trie with $n$ strings over an alphabet of size $\sigma$. Using methods from complex analysis, we derive a sublinear behavior for $d < \log_\sigma n$. For constant $d$, we can distinguish three cases depending upon $q$. For example, the search complexity for the Hamming distance is $\frac{\sigma(\sigma-1)^d}{(d+1)!} \log_\sigma^{d+1} n + O(\log^d n)$.

To enable an even more efficient search, we utilize an index of a limited $d$-neighborhood of the text corpus. We show how the index can be used for various search problems requiring error-tolerant look-up. An average-case analysis proves that the index size is $O(n \log^d n)$ while the look-up time is optimal in the worst case with respect to the pattern size and the number of reported occurrences. It is possible to modify the data structure so that its size is bounded in the worst case while the bound on the look-up time becomes average-case.

Acknowledgments

First, I thank my advisor Ernst W. Mayr for his helpful guidance and generous support throughout the time of researching and writing this thesis. Furthermore, I am also thankful to the current and former members of the Lehrstuhl für Effiziente Algorithmen for interesting discussions and encouragement, especially to Thomas Erlebach, Sven Kosub, Hanjo Täubig, and Johannes Nowak. Lastly, I am grateful to Anja Heilmann and my mother for proofreading the final work.

Contents

1 Introduction
  1.1 From Suffix Trees to Affix Trees
  1.2 Approximate Text Indexing
    1.2.1 Tree-Based Accelerators for Approximate Text Indexing
    1.2.2 Registers for Approximate Text Indexing
  1.3 Thesis Organization
  1.4 Publications
2 Preliminaries
  2.1 Elementary Concepts
    2.1.1 Strings
    2.1.2 Trees over Strings
    2.1.3 String Distances
  2.2 Basic Principles of Algorithm Analysis
    2.2.1 Complexity Measures
    2.2.2 Amortized Analysis
    2.2.3 Average-Case Analysis
  2.3 Basic Data Structures
    2.3.1 Tree-Based Data Structures for Text Indexing
    2.3.2 Range Queries
  2.4 Text Indexing Problems
  2.5 Rice's Integrals
3 Linear Construction of Affix Trees
  3.1 Definitions and Data Structures for Affix Trees
    3.1.1 Basic Properties of Suffix Links
    3.1.2 Affix Trees
    3.1.3 Additional Data for the Construction of Affix Trees
    3.1.4 Implementation Issues
  3.2 Construction of Compact Suffix Trees
    3.2.1 On-Line Construction of Suffix Trees
    3.2.2 Anti-On-Line Suffix Tree Construction with Additional Information
  3.3 Constructing Compact Affix Trees On-Line
    3.3.1 Overview
    3.3.2 Detailed Description
    3.3.3 An Example Iteration
    3.3.4 Complexity
  3.4 Bidirectional Construction of Compact Affix Trees
    3.4.1 Additional Steps
    3.4.2 Improving the Running Time
    3.4.3 Analysis of the Bidirectional Construction
4 Approximate Trie Search
  4.1 Problem Statement
  4.2 Average-Case Analysis of the LS Algorithm
  4.3 Average-Case Analysis of the TS Algorithm
    4.3.1 An Exact Formula
    4.3.2 Approximation of Integrals with the Beta Function
    4.3.3 The Average Compactification Number
    4.3.4 Allowing a Constant Number of Errors
    4.3.5 Allowing a Logarithmic Number of Errors
    4.3.6 Remarks on the Complexity of the TS Algorithm
  4.4 Applications
5 Text Indexing with Errors
  5.1 Definitions and Data Structures
    5.1.1 A Closer Look at the Edit Distance
    5.1.2 Weak Tries
  5.2 Main Indexing Data Structure
    5.2.1 Intuition
    5.2.2 Definition of the Basic Data Structure
    5.2.3 Construction and Size
    5.2.4 Main Properties
    5.2.5 Search Algorithms
  5.3 Worst-Case Optimal Search-Time
  5.4 Bounded Preprocessing Time and Space
6 Conclusion
Bibliography
Index

Figures and Algorithms

2.1 Examples of $\Sigma^+$-trees
2.2 Illustration of the linear-time Cartesian tree algorithm
3.1 Suffix trees, suffix link tree, and affix trees for ababc
3.2 The affix trees for aabababa and for aabababaa
3.3 The procedure canonize()
3.4 Constructing ST(acabaabac) from ST(acabaaba)
3.5 The procedure decanonize()
3.6 The procedure update-new-suffix()
3.7 Constructing ST(cabaabaca) from ST(abaabaca)
3.8 The function getTargetNodeVirtualEdge()
3.9 Constructing AT(acabaabac) from AT(acabaaba), suffix view
3.10 Constructing AT(acabaabac) from AT(acabaaba), prefix view
4.1 The LS algorithm
4.2 The TS algorithm
4.3 Illustration for the proof of Lemma 4.6
4.4 Location of the poles of g(z) in the complex plane
4.5 Illustration for Theorem 4.14
4.6 Parameters for selected comparison-based string distances
5.1 The relevant edit graph for international and interaction
5.2 Examples of a compact and weak tries

Chapter 1

Introduction

Computers, although initially invented for numeric calculations, have changed the way that text is written, processed, and archived. Even though many more texts than initially expected or hoped for are still printed on paper, writing is mostly done electronically today. Having text available in a digital form has many advantages: It can be archived in very little space, it can be reproduced with little effort and without loss in quality, and it can be searched by a computer.
The latter is a great improvement over previous methods, especially because it is not necessary to create a fixed set of terms that are used to index the documents. When employing a computer, one commonly allows every word to be used for searching, often called full-text search.

Efficient methods for searching in texts are studied in the realm of pattern matching. Pattern matching is already a rather mature field of research with numerous textbooks available [CR94, AG97, Gus97]; the basic algorithms are also part of standard books on algorithms [CLR90, GBY91]. Searching a text for an arbitrary pattern is usually almost as fast as reading the text (especially if reading is slowed down because the text is stored on a slow medium such as a hard disk).

The ease with which electronic documents can be stored has also led to an enormous increase in the amount of textual data available. The Internet (in particular the World Wide Web), for example, is a tremendous and growing collection of text documents. The widely used search engine Google reports that it indexes more than eight billion web pages.¹ Another type of textual data is biological sequence data (e.g., DNA sequences, protein sequences). A popular database for DNA sequences, GenBank, was reported to store over 33 billion bases for over 140 000 species in August 2003 and to grow at a rate of over 1700 new species per month [BKML+04].

The sheer size of these text collections makes on-line searches infeasible. Therefore, a lot of research has been devoted to methods for efficient text retrieval [FBY92] and text indexes. A text index is a data structure prepared for a document or a collection of documents that facilitates efficient queries for a search pattern. These queries can have different types.
For a query with a single pattern, one can search, e.g., for all occurrences of a search string, for all occurrences of an element of a set of strings, or for all occurrences of a substring that has length twenty and contains at least ten equal characters; the possibilities for different criteria are endless and usually the result of

¹ These numbers were taken from http://www.google.com/corporate/facts.html as of March 2005.