Analysis of Algorithms and Data Structures for Text Indexing

FAKULTÄT FÜR INFORMATIK, TECHNISCHE UNIVERSITÄT MÜNCHEN
Lehrstuhl für Effiziente Algorithmen

Analysis of Algorithms and Data Structures for Text Indexing

Moritz G. Maaß

Full reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Univ.-Prof. Dr. Dr. h.c. mult. Wilfried Brauer
Examiners: 1. Univ.-Prof. Dr. Ernst W. Mayr, 2. Prof. Robert Sedgewick, Ph.D. (Princeton University, New Jersey, USA)

The dissertation was submitted to the Technische Universität München on April 12, 2005 and accepted by the Fakultät für Informatik on June 26, 2006.

Abstract

Large amounts of textual data like document collections, DNA sequence data, or the Internet call for fast look-up methods that avoid searching the whole corpus. This is often accomplished using tree-based data structures for text indexing such as tries, PATRICIA trees, or suffix trees. We present and analyze improved algorithms and index data structures for exact and error-tolerant search.

Affix trees are a data structure for exact indexing. They are a generalization of suffix trees, allowing a bidirectional search by extending a pattern to the left and to the right during retrieval. We present an algorithm that constructs affix trees on-line in both directions, i.e., by augmenting the underlying string in both directions. An amortized analysis shows that the algorithm has a linear-time worst-case complexity.

A space-efficient method for error-tolerant searching in a dictionary for a pattern allowing some mismatches can be implemented with a trie or a PATRICIA tree. For a given mismatch probability q and a given maximum of allowed mismatches d, we study the average-case complexity of the number of comparisons for searching in a trie with n strings over an alphabet of size σ. Using methods from complex analysis, we derive a sublinear behavior for d < log_σ n. For constant d, we can distinguish three cases depending upon q. For example, the search complexity for the Hamming distance is (σ(σ−1)^d / (d+1)!) log_σ^{d+1} n + O(log_σ^d n).

To enable an even more efficient search, we utilize an index of a limited d-neighborhood of the text corpus. We show how the index can be used for various search problems requiring error-tolerant look-up. An average-case analysis proves that the index size is O(n log^d n), while the look-up time is optimal in the worst case with respect to the pattern size and the number of reported occurrences. It is possible to modify the data structure so that its size is bounded in the worst case while the bound on the look-up time becomes average-case.

Acknowledgments

First, I thank my advisor Ernst W. Mayr for his helpful guidance and generous support throughout the time of researching and writing this thesis. Furthermore, I am also thankful to the current and former members of the Lehrstuhl für Effiziente Algorithmen for interesting discussions and encouragement, especially to Thomas Erlebach, Sven Kosub, Hanjo Täubig, and Johannes Nowak. Lastly, I am grateful to Anja Heilmann and my mother for proofreading the final work.
Contents

1 Introduction
  1.1 From Suffix Trees to Affix Trees
  1.2 Approximate Text Indexing
    1.2.1 Tree-Based Accelerators for Approximate Text Indexing
    1.2.2 Registers for Approximate Text Indexing
  1.3 Thesis Organization
  1.4 Publications
2 Preliminaries
  2.1 Elementary Concepts
    2.1.1 Strings
    2.1.2 Trees over Strings
    2.1.3 String Distances
  2.2 Basic Principles of Algorithm Analysis
    2.2.1 Complexity Measures
    2.2.2 Amortized Analysis
    2.2.3 Average-Case Analysis
  2.3 Basic Data Structures
    2.3.1 Tree-Based Data Structures for Text Indexing
    2.3.2 Range Queries
  2.4 Text Indexing Problems
  2.5 Rice's Integrals
3 Linear Construction of Affix Trees
  3.1 Definitions and Data Structures for Affix Trees
    3.1.1 Basic Properties of Suffix Links
    3.1.2 Affix Trees
    3.1.3 Additional Data for the Construction of Affix Trees
    3.1.4 Implementation Issues
  3.2 Construction of Compact Suffix Trees
    3.2.1 On-Line Construction of Suffix Trees
    3.2.2 Anti-On-Line Suffix Tree Construction with Additional Information
  3.3 Constructing Compact Affix Trees On-Line
    3.3.1 Overview
    3.3.2 Detailed Description
    3.3.3 An Example Iteration
    3.3.4 Complexity
  3.4 Bidirectional Construction of Compact Affix Trees
    3.4.1 Additional Steps
    3.4.2 Improving the Running Time
    3.4.3 Analysis of the Bidirectional Construction
4 Approximate Trie Search
  4.1 Problem Statement
  4.2 Average-Case Analysis of the LS Algorithm
  4.3 Average-Case Analysis of the TS Algorithm
    4.3.1 An Exact Formula
    4.3.2 Approximation of Integrals with the Beta Function
    4.3.3 The Average Compactification Number
    4.3.4 Allowing a Constant Number of Errors
    4.3.5 Allowing a Logarithmic Number of Errors
    4.3.6 Remarks on the Complexity of the TS Algorithm
  4.4 Applications
5 Text Indexing with Errors
  5.1 Definitions and Data Structures
    5.1.1 A Closer Look at the Edit Distance
    5.1.2 Weak Tries
  5.2 Main Indexing Data Structure
    5.2.1 Intuition
    5.2.2 Definition of the Basic Data Structure
    5.2.3 Construction and Size
    5.2.4 Main Properties
    5.2.5 Search Algorithms
  5.3 Worst-Case Optimal Search-Time
  5.4 Bounded Preprocessing Time and Space
6 Conclusion
Bibliography
Index

Figures and Algorithms

2.1 Examples of Σ+-trees
2.2 Illustration of the linear-time Cartesian tree algorithm
3.1 Suffix trees, suffix link tree, and affix trees for ababc
3.2 The affix trees for aabababa and for aabababaa
3.3 The procedure canonize()
3.4 Constructing ST(acabaabac) from ST(acabaaba)
3.5 The procedure decanonize()
3.6 The procedure update-new-suffix()
3.7 Constructing ST(cabaabaca) from ST(abaabaca)
3.8 The function getTargetNodeVirtualEdge()
3.9 Constructing AT(acabaabac) from AT(acabaaba), suffix view
3.10 Constructing AT(acabaabac) from AT(acabaaba), prefix view
4.1 The LS algorithm
4.2 The TS algorithm
4.3 Illustration for the proof of Lemma 4.6
4.4 Location of the poles of g(z) in the complex plane
4.5 Illustration for Theorem 4.14
4.6 Parameters for selected comparison-based string distances
5.1 The relevant edit graph for international and interaction
5.2 Examples of compact and weak tries

Chapter 1

Introduction

Computers, although initially invented for numeric calculations, have changed the way that text is written, processed, and archived. Even though many more texts than initially expected or hoped for are still printed to paper, writing is mostly done electronically today. Having text available in a digital form has many advantages: It can be archived in very little space, it can be reproduced with little effort and without loss in quality, and it can be searched by a computer. The latter is a great improvement over previous methods, especially because it is not necessary to create a fixed set of terms that are used to index the documents. When employing a computer, one commonly allows every word to be used for searching, often called full-text search. Efficient methods for searching in texts are studied in the realm of pattern matching. Pattern matching is already a rather mature field of research with numerous textbooks available [CR94, AG97, Gus97]; the basic algorithms are also part of standard books on algorithms [CLR90, GBY91]. Searching a text for an arbitrary pattern is usually almost as fast as reading the text (especially if reading is hampered in speed because the text is stored on a slow medium such as a hard disk).

The ease with which electronic documents can be stored has also led to an enormous increase in the amount of textual data available. The Internet (in particular the World Wide Web), for example, is a tremendous and growing collection of text documents. The widely used search engine Google reports to index more than eight billion web pages.¹ Another type of textual data is biological sequence data (e.g., DNA sequences, protein sequences). A popular database for DNA sequences, GenBank, was reported to store over 33 billion bases for over 140 000 species in August 2003 and to grow at a rate of over 1700 new species per month [BKML+04].

The sheer size of these text collections makes on-line searches infeasible. Therefore, a lot of research has been devoted to methods for efficient text retrieval [FBY92] and text indexes. A text index is a data structure prepared for a document or a collection of documents that facilitates efficient queries for a search pattern. These queries can be of different types. For a query with a single pattern, one can search, e.g., for all occurrences of a search string, for all occurrences of an element of a set of strings, or for all occurrences of a substring that has length twenty and contains at least ten equal characters—the possibilities for different criteria are endless and usually the result of

¹ These numbers were taken from http://www.google.com/corporate/facts.html as of March 2005.
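To make the notion of a text index concrete, the following sketch builds one of the simplest practical indexes, a suffix array obtained by sorting all suffix start positions, and answers substring queries by binary search. It is only an illustration of preprocessing a corpus for fast look-up; it is not the affix-tree or error-tolerant machinery developed in this thesis, and the function names are ours.

```python
# A minimal text-index sketch (not the thesis's data structures): a suffix
# array built by sorting suffixes, queried by binary search.

def build_suffix_array(text):
    # Toy O(n^2 log n) construction; linear-time constructions exist.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    # All suffixes that start with `pattern` form a contiguous range in sa.
    n, m = len(sa), len(pattern)
    lo, hi = 0, n
    while lo < hi:                                  # leftmost suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, n
    while lo < hi:                                  # leftmost suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])                     # starting positions of all matches

text = "abaabaca"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "aba"))            # -> [0, 3]
```

Each query costs O(|P| log n) character comparisons; the tree-based indexes discussed in this thesis and in the publications listed below support richer queries, such as bidirectional and error-tolerant search.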
Recommended publications
  • Compressed Suffix Trees with Full Functionality
    Compressed Suffix Trees with Full Functionality. Kunihiko Sadakane, Department of Computer Science and Communication Engineering, Kyushu University, Hakozaki 6-10-1, Higashi-ku, Fukuoka 812-8581, Japan. [email protected]
    Abstract. We introduce new data structures for compressed suffix trees whose size is linear in the text size. The size is measured in bits; thus they occupy only O(n log |A|) bits for a text of length n on an alphabet A. This is a remarkable improvement on current suffix trees, which require O(n log n) bits. Though some components of suffix trees have been compressed, there is no linear-size data structure for suffix trees with full functionality such as computing suffix links, string-depths, and lowest common ancestors. The data structure proposed in this paper is the first one that has linear size and supports all operations efficiently. Any algorithm running on a suffix tree can also be executed on our compressed suffix trees with a slight slowdown of a factor of polylog(n).
    1 Introduction. Suffix trees are basic data structures for string algorithms [13]. A pattern can be found in a text in time proportional to the pattern length by constructing the suffix tree of the text in advance. The suffix tree can also be used for more complicated problems, for example finding the longest repeated substring in linear time. Many efficient string algorithms are based on the use of suffix trees because this does not increase the asymptotic time complexity. A suffix tree of a string can be constructed in linear time in the string length [28, 21, 27, 5].
  • Suffix Trees
    JASS 2008, Trees - the ubiquitous structure in computer science and mathematics
    Suffix Trees. Caroline Löbhard. St. Petersburg, 9.3. - 19.3. 2008
    Contents: 1 Introduction to Suffix Trees (1.1 Basics; 1.2 Getting a first feeling for the nice structure of suffix trees; 1.3 A historical overview of algorithms); 2 Ukkonen's on-line space-economic linear-time algorithm (2.1 High-level description; 2.2 Using suffix links; 2.3 Edge-label compression and the skip/count trick; 2.4 Two more observations); 3 Generalised Suffix Trees; 4 Applications of Suffix Trees; References
    1 Introduction to Suffix Trees. A suffix tree is a tree-like data structure for strings, which affords fast algorithms to find all occurrences of substrings. A given string S is preprocessed in O(|S|) time. Afterwards, for any other string P, one can decide in O(|P|) time whether P occurs in S and report all of its exact positions in S. This linear worst-case time bound, depending only on the length of the (shorter) string P, is special and important for suffix trees, since many string-processing applications have to deal with large strings S.
    1.1 Basics. In this paper, we will denote the fixed alphabet with Σ, single characters with lower-case letters x, y, ..., strings over Σ with upper-case or Greek letters P, S, ..., α, σ, τ, ..., trees with script letters T, ..., and inner nodes of trees (that is, all nodes other than the root and the leaves) with lower-case letters u, v, ...
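    As a toy illustration of the O(|P|) query mentioned above, the sketch below builds a plain (uncompressed) suffix trie in quadratic time and space and walks it to decide substring membership. Ukkonen's linear-time construction and the edge-label compression from these notes are not reproduced here, and the helper names are ours.

```python
# Naive suffix trie: O(|S|^2) construction, O(|P|) substring queries.
def build_suffix_trie(s):
    root = {}
    for i in range(len(s)):            # insert every suffix s[i:]
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root

def occurs(trie, p):
    node = trie
    for c in p:                        # one trie step per pattern character
        if c not in node:
            return False
        node = node[c]
    return True

trie = build_suffix_trie("mississippi")
print(occurs(trie, "sip"))     # True
print(occurs(trie, "sips"))    # False
```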
  • Dictionary Look-Up Within Small Edit Distance
    Dictionary Look-Up Within Small Edit Distance. Abdullah N. Arslan and Ömer Eğecioğlu, Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA, USA. {arslan,omer}@cs.ucsb.edu
    Abstract. Let W be a dictionary consisting of n binary strings of length m each, represented as a trie. The usual d-query asks if there exists a string in W within Hamming distance d of a given binary query string q. We present an algorithm to determine if there is a member in W within edit distance d of a given query string q of length m. The method takes time O(dm^{d+1}) in the RAM model, independent of n, and requires O(dm) additional space.
    Introduction. Let W be a dictionary consisting of n binary strings of length m each. A d-query asks if there exists a string in W within Hamming distance d of a given binary query string q. Algorithms for answering d-queries efficiently have been a topic of interest for some time, and the problem has also been studied as the approximate query and the approximate query retrieval problems in the literature. The problem was originally posed by Minsky and Papert, who asked if there is a data structure that supports fast d-queries. The cases of small d and large d for this problem seem to require different techniques for their solutions. The case when d is small was studied by Yao and Yao; Dolev et al. and Greene et al. have made some progress when d is relatively large. There are efficient algorithms only when d = 1, proposed by Brodal and Venkatesh, Yao and Yao, and Brodal and Gąsieniec. The small-d case has applications
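    To illustrate the trie-based d-query described above, here is a small sketch of the Hamming-distance case: a depth-first walk of a binary trie that tracks the remaining mismatch budget. It is not the paper's O(dm^{d+1}) edit-distance algorithm, only an illustration of the problem, and the function names are ours.

```python
# Minimal sketch of a Hamming d-query on a binary trie (not the paper's
# edit-distance algorithm): depth-first search with a mismatch budget.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})
        node["$"] = True          # end-of-word marker
    return root

def d_query(node, q, i, budget):
    # Is some dictionary string within Hamming distance `budget` of q[i:]?
    if i == len(q):
        return "$" in node
    for c, child in node.items():
        if c == "$":
            continue
        cost = 0 if c == q[i] else 1
        if cost <= budget and d_query(child, q, i + 1, budget - cost):
            return True
    return False

W = ["0101", "1100", "1111"]
trie = build_trie(W)
print(d_query(trie, "1101", 0, 1))   # True: e.g. "1100" differs in one position
print(d_query(trie, "0010", 0, 1))   # False: every word differs in >= 2 positions
```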
  • Search Trees
    Lecture III
    “Trees are the earth’s endless effort to speak to the listening heaven.” – Rabindranath Tagore, Fireflies, 1928
    Alice was walking beside the White Knight in Looking Glass Land. “You are sad.” the Knight said in an anxious tone: “let me sing you a song to comfort you.” “Is it very long?” Alice asked, for she had heard a good deal of poetry that day. “It’s long.” said the Knight, “but it’s very, very beautiful. Everybody that hears me sing it - either it brings tears to their eyes, or else -” “Or else what?” said Alice, for the Knight had made a sudden pause. “Or else it doesn’t, you know. The name of the song is called ‘Haddocks’ Eyes.’” “Oh, that’s the name of the song, is it?” Alice said, trying to feel interested. “No, you don’t understand,” the Knight said, looking a little vexed. “That’s what the name is called. The name really is ‘The Aged, Aged Man.’” “Then I ought to have said ‘That’s what the song is called’?” Alice corrected herself. “No you oughtn’t: that’s another thing. The song is called ‘Ways and Means’ but that’s only what it’s called, you know!” “Well, what is the song then?” said Alice, who was by this time completely bewildered. “I was coming to that,” the Knight said. “The song really is ‘A-sitting On a Gate’: and the tune’s my own invention.” So saying, he stopped his horse and let the reins fall on its neck: then slowly beating time with one hand, and with a faint smile lighting up his gentle, foolish face, he began...
  • Approximate String Matching with Reduced Alphabet
    Approximate String Matching with Reduced Alphabet. Leena Salmela (1) and Jorma Tarhio (2). (1) University of Helsinki, Department of Computer Science, [email protected]. (2) Aalto University, Department of Computer Science and Engineering, [email protected].
    Abstract. We present a method to speed up approximate string matching by mapping the factual alphabet to a smaller alphabet. We apply the alphabet reduction scheme to a tuned version of the approximate Boyer–Moore algorithm utilizing the Four-Russians technique. Our experiments show that the alphabet reduction makes the algorithm faster. Especially in the k-mismatch case, the new variation is faster than earlier algorithms for English data with small values of k.
    1 Introduction. The approximate string matching problem is defined as follows. We have a pattern P[1...m] of m characters drawn from an alphabet Σ of size σ, a text T[1...n] of n characters over the same alphabet, and an integer k. We need to find all such positions i of the text that the distance between the pattern and a substring of the text ending at that position is at most k. In the k-difference problem the distance between two strings is the standard edit distance, where substitutions, deletions, and insertions are allowed. The k-mismatch problem is a more restricted one using the Hamming distance, where only substitutions are allowed. Among the most cited papers on approximate string matching are the classical articles [1,2] by Esko Ukkonen. Besides them he has studied this topic extensively [3,4,5,6,7,8,9,10,11].
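    The following sketch illustrates the alphabet-reduction idea in its simplest form: characters are mapped to a handful of buckets (here balanced by text frequency, which is an assumption of ours rather than the paper's exact scheme), candidate windows are filtered in the reduced alphabet, and survivors are verified in the original alphabet. Since equal characters stay equal after the mapping, the mapping can only hide mismatches, never create them, so the filter is safe for the k-mismatch problem. This is not the paper's tuned Boyer–Moore/Four-Russians variant.

```python
# Sketch of alphabet reduction for k-mismatch filtering (illustrative only).
from collections import Counter

def build_mapping(text, sigma_hat):
    # Assign frequent characters first, always to the currently lightest bucket.
    mapping, load = {}, [0] * sigma_hat
    for ch, cnt in Counter(text).most_common():
        b = load.index(min(load))
        mapping[ch] = b
        load[b] += cnt
    return mapping

def k_mismatch_positions(text, pattern, k, sigma_hat=4):
    mapping = build_mapping(text, sigma_hat)
    reduce = lambda s: [mapping.get(c, 0) for c in s]   # unseen chars -> bucket 0
    t, p, m = reduce(text), reduce(pattern), len(pattern)
    out = []
    for i in range(len(text) - m + 1):
        # Filter in the reduced alphabet: a window with more than k reduced
        # mismatches cannot have at most k mismatches in the original alphabet.
        if sum(a != b for a, b in zip(t[i:i + m], p)) <= k:
            # Verify the surviving candidate in the original alphabet.
            if sum(a != b for a, b in zip(text[i:i + m], pattern)) <= k:
                out.append(i)
    return out

print(k_mismatch_positions("approximate matching approximately", "proximate", 1))
# -> [2, 23]
```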
  • Lecture 1: Introduction
    Lecture 1: Introduction
    Agenda:
    • Welcome to CS 238 — Algorithmic Techniques in Computational Biology
    • Official course information
    • Course description
    • Announcements
    • Basic concepts in molecular biology
    Official course information
    • Grading weights:
      – 50% assignments (3-4)
      – 50% participation, presentation, and course report.
    • Text and reference books:
      – T. Jiang, Y. Xu and M. Zhang (co-eds), Current Topics in Computational Biology, MIT Press, 2002. (co-published by Tsinghua University Press in China)
      – D. Gusfield, Algorithms for Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge Press, 1997.
      – D. Krane and M. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2003.
      – P. Pevzner, Computational Molecular Biology: An Algorithmic Approach, 2000, the MIT Press.
      – M. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman and Hall, 1995.
      – These notes (typeset by Guohui Lin and Tao Jiang).
    • Instructor: Tao Jiang
      – Surge Building 330, x82991, [email protected]
      – Office hours: Tuesday & Thursday 3-4pm
    Course overview
    • Topics covered:
      – Biological background introduction
      – My research topics
      – Sequence homology search and comparison
      – Sequence structural comparison
      – String matching algorithms and suffix tree
      – Genome rearrangement
      – Protein structure and function prediction
      – Phylogenetic reconstruction
      * DNA sequencing, sequence assembly
      * Physical/restriction mapping
      * Prediction of regulatory elements
    • Announcements:
  • Approximate Boyer-Moore String Matching
    APPROXIMATE BOYER-MOORE STRING MATCHING. Jorma Tarhio and Esko Ukkonen, University of Helsinki, Department of Computer Science, Teollisuuskatu 23, SF-00510 Helsinki, Finland. Draft.
    Abstract. The Boyer-Moore idea applied in exact string matching is generalized to approximate string matching. Two versions of the problem are considered. The k mismatches problem is to find all approximate occurrences of a pattern string (length m) in a text string (length n) with at most k mismatches. Our generalized Boyer-Moore algorithm is shown (under a mild independence assumption) to solve the problem in expected time O(kn(1/(m−k) + k/c)), where c is the size of the alphabet. A related algorithm is developed for the k differences problem, where the task is to find all approximate occurrences of a pattern in a text with ≤ k differences (insertions, deletions, changes). Experimental evaluation of the algorithms is reported, showing that the new algorithms are often significantly faster than the old ones. Both algorithms are functionally equivalent with the Horspool version of the Boyer-Moore algorithm when k = 0.
    Key words: String matching, edit distance, Boyer-Moore algorithm, k mismatches problem, k differences problem. AMS (MOS) subject classifications: 68C05, 68C25, 68H05. Abbreviated title: Approximate Boyer-Moore Matching.
    1. Introduction. The fastest known exact string matching algorithms are based on the Boyer-Moore idea [BoM77, KMP77]. Such algorithms are “sublinear” on the average in the sense that it is not necessary to check every symbol in the text. The larger the alphabet and the longer the pattern, the faster the algorithm works.
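    For reference, here is a hedged sketch of the k = 0 special case mentioned at the end of the abstract, the Horspool variant of Boyer-Moore: the window is shifted by a precomputed distance determined by the text character aligned with the last pattern position, so on average not every text symbol is inspected. The generalization to k mismatches analyzed in the paper is not reproduced here, and the function name is ours.

```python
# Horspool variant of Boyer-Moore (exact matching, i.e. the k = 0 case).
def horspool(text, pattern):
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # Shift: distance from the rightmost occurrence of c in pattern[:-1] to the end.
    shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    out, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            out.append(i)
        i += shift.get(text[i + m - 1], m)   # characters not in pattern[:-1] shift by m
    return out

print(horspool("abaabaca", "aba"))   # -> [0, 3]
```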
  • Suffix Trees for Fast Sensor Data Forwarding
    Suffix Trees for Fast Sensor Data Forwarding. Jui-Chieh Wu, Hsueh-I Lu, Polly Huang. Department of Electrical Engineering, Department of Computer Science and Information Engineering, Graduate Institute of Networking and Multimedia, National Taiwan University. [email protected], [email protected], [email protected]
    Abstract—In data-centric wireless sensor networks, data are no longer sent by the sink node's address. Instead, the sink node sends an explicit interest for a particular type of data. The source and the intermediate nodes then forward the data according to the routing states set by the corresponding interest. This data-centric style of communication is promising in that it alleviates the effort of node addressing and address reconfiguration. However, when the number of interests from different sinks increases, the size of the interest table grows. It could take 10s to 100s of milliseconds to match an incoming data item to a particular interest, which is orders of magnitude higher than the typical transmission and propagation delay in a wireless sensor network. The interest table lookup process is the bottleneck of packet ...
    ... alleviates the effort of node addressing and address reconfiguration in large-scale mobile sensor networks. Forwarding in data-centric sensor networks is particularly challenging. It involves matching of the data content, i.e., string-based attributes and values, instead of numeric addresses. This content-based forwarding problem is well studied in the domain of publish-subscribe systems. It is estimated in [3] that the time it takes to match an incoming data item to a particular interest ranges from 10s to 100s of milliseconds. This processing delay is several orders of magnitude higher than the propagation and transmission delay.
  • Problem Set 7 Solutions
    Introduction to Algorithms, November 18, 2005. Massachusetts Institute of Technology, 6.046J/18.410J. Professors Erik D. Demaine and Charles E. Leiserson. Handout 25: Problem Set 7 Solutions
    Problem 7-1. Edit distance. In this problem you will write a program to compute edit distance. This problem is mandatory. Failure to turn in a solution will result in a serious and negative impact on your term grade! We advise you to start this programming assignment as soon as possible, because getting all the details right in a program can take longer than you think.
    Many word processors and keyword search engines have a spelling correction feature. If you type in a misspelled word x, the word processor or search engine can suggest a correction y. The correction y should be a word that is close to x. One way to measure the similarity in spelling between two text strings is by “edit distance.” The notion of edit distance is useful in other fields as well. For example, biologists use edit distance to characterize the similarity of DNA or protein sequences.
    The edit distance d(x, y) of two strings of text, x[1..m] and y[1..n], is defined to be the minimum possible cost of a sequence of “transformation operations” (defined below) that transforms string x[1..m] into string y[1..n]. To define the effect of the transformation operations, we use an auxiliary string z[1..s] that holds the intermediate results. At the beginning of the transformation sequence, s = m and z[1..s] = x[1..m] (i.e., we start with string x[1..m]).
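    A minimal sketch of the requested program is given below. The handout defines a general cost model for its transformation operations; the sketch uses the common unit-cost simplification (Levenshtein distance), filling a table d[i][j] = minimum cost of transforming x[1..i] into y[1..j]. The function name and example are ours.

```python
# Unit-cost edit distance (Levenshtein) by dynamic programming -- a simplified
# sketch; the handout's general per-operation costs would replace the 0/1 terms.
def edit_distance(x, y):
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                       # i deletions
    for j in range(1, n + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            keep_or_sub = d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1)
            d[i][j] = min(keep_or_sub,
                          d[i - 1][j] + 1,    # delete x[i]
                          d[i][j - 1] + 1)    # insert y[j]
    return d[m][n]

print(edit_distance("kitten", "sitting"))     # -> 3
```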
  • Suffix Trees and Suffix Arrays in Primary and Secondary Storage Pang Ko Iowa State University
    Iowa State University, Retrospective Theses and Dissertations, 2007
    Suffix trees and suffix arrays in primary and secondary storage. Pang Ko, Iowa State University
    Follow this and additional works at: https://lib.dr.iastate.edu/rtd (Part of the Bioinformatics Commons and the Computer Sciences Commons)
    Recommended Citation: Ko, Pang, "Suffix trees and suffix arrays in primary and secondary storage" (2007). Retrospective Theses and Dissertations. 15942. https://lib.dr.iastate.edu/rtd/15942
    This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Retrospective Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].
    Suffix trees and suffix arrays in primary and secondary storage, by Pang Ko. A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Major: Computer Engineering. Program of Study Committee: Srinivas Aluru, Major Professor; David Fernández-Baca; Suraj Kothari; Patrick Schnable; Srikanta Tirthapura. Iowa State University, Ames, Iowa, 2007.
    UMI Number: 3274885. Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346.
    Dedication: To my parents
    Table of Contents: List of Tables; List of Figures; Acknowledgements; Abstract; Chapter 1. Introduction; 1.1 Suffix Array in Main Memory ...
  • 3. Approximate String Matching
    3. Approximate String Matching
    Often in applications we want to search a text for something that is similar to the pattern but not necessarily exactly the same. To formalize this problem, we have to specify what “similar” means. This can be done by defining a similarity or a distance measure. A natural and popular distance measure for strings is the edit distance, also known as the Levenshtein distance.
    Edit distance. The edit distance ed(A, B) of two strings A and B is the minimum number of edit operations needed to change A into B. The allowed edit operations are:
    S Substitution of a single character with another character.
    I Insertion of a single character.
    D Deletion of a single character.
    Example 3.1: Let A = Lewensteinn and B = Levenshtein. Then ed(A, B) = 3. The set of edit operations can be described with an edit sequence: NNSNNNINNNND, or with an alignment:
    Lewens-teinn
    Levenshtein-
    In the edit sequence, N means No edit.
    There are many variations and extensions of the edit distance, for example:
    • Hamming distance allows only the substitution operation.
    • Damerau–Levenshtein distance adds an edit operation: T Transposition swaps two adjacent characters.
    • With weighted edit distance, each operation has a cost or weight, which can be other than one.
    • Allow insertions and deletions (indels) of factors at a cost that is lower than the sum of character indels.
    We will focus on the basic Levenshtein distance. Levenshtein distance has the following two useful properties, which are not shared by all variations (exercise): Levenshtein distance is a metric.
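    The edit sequence from Example 3.1 can be checked mechanically: applying NNSNNNINNNND to A reproduces B with exactly three edits (one S, one I, one D). The sketch below is ours; it only replays an edit sequence against the two strings, it does not compute the distance itself.

```python
# Replay an edit sequence (N = no edit, S = substitution, I = insertion,
# D = deletion) and count the edits it uses.
def apply_edit_sequence(a, b, ops):
    out, i, j, edits = [], 0, 0, 0
    for op in ops:
        if op == "N":          # keep a[i] unchanged
            out.append(a[i]); i += 1; j += 1
        elif op == "S":        # substitute a[i] by b[j]
            out.append(b[j]); i += 1; j += 1; edits += 1
        elif op == "I":        # insert b[j]
            out.append(b[j]); j += 1; edits += 1
        elif op == "D":        # delete a[i]
            i += 1; edits += 1
    return "".join(out), edits

result, cost = apply_edit_sequence("Lewensteinn", "Levenshtein", "NNSNNNINNNND")
print(result, cost)   # -> Levenshtein 3
```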
  • Finger Search Trees
    11 Finger Search Trees
    11.1 Finger Searching
    11.2 Dynamic Finger Search Trees
    11.3 Level Linked (2,4)-Trees
    11.4 Randomized Finger Search Trees: Treaps • Skip Lists
    11.5 Applications: Optimal Merging and Set Operations • Arbitrary Merging Order • List Splitting • Adaptive Merging and Sorting
    Gerth Stølting Brodal, University of Aarhus
    11.1 Finger Searching. One of the most studied problems in computer science is the problem of maintaining a sorted sequence of elements to facilitate efficient searches. The prominent solution to the problem is to organize the sorted sequence as a balanced search tree, enabling insertions, deletions and searches in logarithmic time. Many different search trees have been developed and studied intensively in the literature. A discussion of balanced binary search trees can e.g. be found in [4]. This chapter is devoted to finger search trees, which are search trees supporting fingers, i.e. pointers, to elements in the search trees and supporting efficient updates and searches in the vicinity of the fingers.
    If the sorted sequence is a static set of n elements then a simple and space efficient representation is a sorted array. Searches can be performed by binary search using 1 + ⌊log n⌋ comparisons (we throughout this chapter let log x denote log_2 max{2, x}). A finger search starting at a particular element of the array can be performed by an exponential search, by inspecting elements at distance 2^i − 1 from the finger for increasing i, followed by a binary search in a range of 2^{⌊log d⌋} − 1 elements, where d is the rank difference in the sequence between the finger and the search element.
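    The exponential-plus-binary finger search on a sorted array described above can be sketched directly: the probe distance from the finger is doubled until the search key is bracketed, and a binary search finishes inside that range, giving O(log d) comparisons for rank distance d. The function name and the array example are ours.

```python
# Finger search in a sorted array: exponential probing from the finger,
# then binary search inside the bracketed range.
from bisect import bisect_left

def finger_search(a, finger, x):
    # Returns the insertion index of x in the sorted list a, starting near `finger`.
    n = len(a)
    if x >= a[finger]:
        step, lo = 1, finger
        while finger + step < n and a[finger + step] < x:
            lo, step = finger + step, step * 2   # exponential probe to the right
        hi = min(finger + step, n - 1) + 1
    else:
        step, hi = 1, finger
        while finger - step >= 0 and a[finger - step] >= x:
            hi, step = finger - step, step * 2   # exponential probe to the left
        lo = max(finger - step, 0)
    return bisect_left(a, x, lo, hi)             # binary search inside the bracket

a = list(range(0, 1000, 2))                 # 0, 2, 4, ..., 998
print(finger_search(a, finger=10, x=37))    # -> 19 (a[19] = 38 is the first element >= 37)
```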