String Matching Algorithms and Their Applicability in Various Applications
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Pattern Matching Using Similarity Measures
Pattern matching using similarity measures Patroonvergelijking met behulp van gelijkenismaten (met een samenvatting in het Nederlands) PROEFSCHRIFT ter verkrijging van de graad van doctor aan de Universiteit Utrecht op gezag van Rector Magnificus, Prof. Dr. H. O. Voorma, ingevolge het besluit van het College voor Promoties in het openbaar te verdedigen op maandag 18 september 2000 des morgens te 10:30 uur door Michiel Hagedoorn geboren op 13 juli 1972, te Renkum promotor: Prof. Dr. M. H. Overmars Faculteit Wiskunde & Informatica co-promotor: Dr. R. C. Veltkamp Faculteit Wiskunde & Informatica ISBN 90-393-2460-3 PHILIPS '$ The&&% research% described in this thesis has been made possible by financial support from Philips Research Laboratories. The work in this thesis has been carried out in the graduate school ASCI. Contents 1 Introduction 1 1.1Patternmatching.......................... 1 1.2Applications............................. 4 1.3Obtaininggeometricpatterns................... 7 1.4 Paradigms in geometric pattern matching . 8 1.5Similaritymeasurebasedpatternmatching........... 11 1.6Overviewofthisthesis....................... 16 2 A theory of similarity measures 21 2.1Pseudometricspaces........................ 22 2.2Pseudometricpatternspaces................... 30 2.3Embeddingpatternsinafunctionspace............. 40 2.4TheHausdorffmetric........................ 46 2.5Thevolumeofsymmetricdifference............... 54 2.6 Reflection visibility based distances . 60 2.7Summary.............................. 71 2.8Experimentalresults....................... -
Pattern Matching Using Fuzzy Methods David Bell and Lynn Palmer, State of California: Genetic Disease Branch
Pattern Matching Using Fuzzy Methods David Bell and Lynn Palmer, State of California: Genetic Disease Branch ABSTRACT The formula is : Two major methods of detecting similarities Euclidean Distance and 2 a "fuzzy Hamming distance" presented in a paper entitled "F %%y Dij : ; ( <3/i - yj 4 Hamming Distance: A New Dissimilarity Measure" by Bookstein, Klein, and Raita, will be compared using entropy calculations to Euclidean distance is especially useful for comparing determine their abilities to detect patterns in data and match data matching whole word object elements of 2 vectors. The records. .e find that both means of measuring distance are useful code in Java is: depending on the context. .hile fuzzy Hamming distance outperforms in supervised learning situations such as searches, the Listing 1: Euclidean Distance Euclidean distance measure is more useful in unsupervised pattern matching and clustering of data. /** * EuclideanDistance.java INTRODUCTION * * Pattern matching is becoming more of a necessity given the needs for * Created: Fri Oct 07 08:46:40 2002 such methods for detecting similarities in records, epidemiological * * @author David Bell: DHS-GENETICS patterns, genetics and even in the emerging fields of criminal * @version 1.0 behavioral pattern analysis and disease outbreak analysis due to */ possible terrorist activity. 0nfortunately, traditional methods for /** Abstact Distance class linking or matching records, data fields, etc. rely on exact data * @param None matches rather than looking for close matches or patterns. 1f course * @return Distance Template proximity pattern matches are often necessary when dealing with **/ messy data, data that has inexact values and/or data with missing key abstract class distance{ values. -
Dictionary Look-Up Within Small Edit Distance
Dictionary Lo okUp Within Small Edit Distance Ab dullah N Arslan and Omer Egeciog lu Department of Computer Science University of California Santa Barbara Santa Barbara CA USA farslanomergcsucsbedu Abstract Let W b e a dictionary consisting of n binary strings of length m each represented as a trie The usual dquery asks if there exists a string in W within Hamming distance d of a given binary query string q We present an algorithm to determine if there is a memb er in W within edit distance d of a given query string q of length m The metho d d+1 takes time O dm in the RAM mo del indep endent of n and requires O dm additional space Intro duction Let W b e a dictionary consisting of n binary strings of length m each A dquery asks if there exists a string in W within Hamming distance d of a given binary query string q Algorithms for answering dqueries eciently has b een a topic of interest for some time and have also b een studied as the approximate query and the approximate query retrieval problems in the literature The problem was originally p osed by Minsky and Pap ert in in which they asked if there is a data structure that supp orts fast dqueries The cases of small d and large d for this problem seem to require dierent techniques for their solutions The case when d is small was studied by Yao and Yao Dolev et al and Greene et al have made some progress when d is relatively large There are ecient algorithms only when d prop osed by Bro dal and Venkadesh Yao and Yao and Bro dal and Gasieniec The small d case has applications -
Approximate String Matching with Reduced Alphabet
Approximate String Matching with Reduced Alphabet Leena Salmela1 and Jorma Tarhio2 1 University of Helsinki, Department of Computer Science [email protected] 2 Aalto University Deptartment of Computer Science and Engineering [email protected] Abstract. We present a method to speed up approximate string match- ing by mapping the factual alphabet to a smaller alphabet. We apply the alphabet reduction scheme to a tuned version of the approximate Boyer– Moore algorithm utilizing the Four-Russians technique. Our experiments show that the alphabet reduction makes the algorithm faster. Especially in the k-mismatch case, the new variation is faster than earlier algorithms for English data with small values of k. 1 Introduction The approximate string matching problem is defined as follows. We have a pat- tern P [1...m]ofm characters drawn from an alphabet Σ of size σ,atextT [1...n] of n characters over the same alphabet, and an integer k. We need to find all such positions i of the text that the distance between the pattern and a sub- string of the text ending at that position is at most k.Inthek-difference problem the distance between two strings is the standard edit distance where substitu- tions, deletions, and insertions are allowed. The k-mismatch problem is a more restricted one using the Hamming distance where only substitutions are allowed. Among the most cited papers on approximate string matching are the classical articles [1,2] by Esko Ukkonen. Besides them he has studied this topic extensively [3,4,5,6,7,8,9,10,11]. -
CSCI 2041: Pattern Matching Basics
CSCI 2041: Pattern Matching Basics Chris Kauffman Last Updated: Fri Sep 28 08:52:58 CDT 2018 1 Logistics Reading Assignment 2 I OCaml System Manual: Ch I Demo in lecture 1.4 - 1.5 I Post today/tomorrow I Practical OCaml: Ch 4 Next Week Goals I Mon: Review I Code patterns I Wed: Exam 1 I Pattern Matching I Fri: Lecture 2 Consider: Summing Adjacent Elements 1 (* match_basics.ml: basic demo of pattern matching *) 2 3 (* Create a list comprised of the sum of adjacent pairs of 4 elements in list. The last element in an odd-length list is 5 part of the return as is. *) 6 let rec sum_adj_ie list = 7 if list = [] then (* CASE of empty list *) 8 [] (* base case *) 9 else 10 let a = List.hd list in (* DESTRUCTURE list *) 11 let atail = List.tl list in (* bind names *) 12 if atail = [] then (* CASE of 1 elem left *) 13 [a] (* base case *) 14 else (* CASE of 2 or more elems left *) 15 let b = List.hd atail in (* destructure list *) 16 let tail = List.tl atail in (* bind names *) 17 (a+b) :: (sum_adj_ie tail) (* recursive case *) The above function follows a common paradigm: I Select between Cases during a computation I Cases are based on structure of data I Data is Destructured to bind names to parts of it 3 Pattern Matching in Programming Languages I Pattern Matching as a programming language feature checks that data matches a certain structure the executes if so I Can take many forms such as processing lines of input files that match a regular expression I Pattern Matching in OCaml/ML combines I Case analysis: does the data match a certain structure I Destructure Binding: bind names to parts of the data I Pattern Matching gives OCaml/ML a certain "cool" factor I Associated with the match/with syntax as follows match something with | pattern1 -> result1 (* pattern1 gives result1 *) | pattern2 -> (* pattern 2.. -
Compiling Pattern Matching to Good Decision Trees
Submitted to ML’08 Compiling Pattern Matching to good Decision Trees Luc Maranget INRIA Luc.marangetinria.fr Abstract In this paper we study compilation to decision tree, whose We address the issue of compiling ML pattern matching to efficient primary advantage is never testing a given subterm of the subject decisions trees. Traditionally, compilation to decision trees is op- value more than once (and whose primary drawback is potential timized by (1) implementing decision trees as dags with maximal code size explosion). Our aim is to refine naive compilation to sharing; (2) guiding a simple compiler with heuristics. We first de- decision trees, and to compare the output of such an optimizing sign new heuristics that are inspired by necessity, a notion from compiler with optimized backtracking automata. lazy pattern matching that we rephrase in terms of decision tree se- Compilation to decision can be very sensitive to the testing mantics. Thereby, we simplify previous semantical frameworks and order of subject value subterms. The situation can be explained demonstrate a direct connection between necessity and decision by the example of an human programmer attempting to translate a ML program into a lower-level language without pattern matching. tree runtime efficiency. We complete our study by experiments, 1 showing that optimized compilation to decision trees is competi- Let f be the following function defined on triples of booleans : tive. We also suggest some heuristics precisely. l e t f x y z = match x,y,z with | _,F,T -> 1 Categories and Subject Descriptors D 3. 3 [Programming Lan- | F,T,_ -> 2 guages]: Language Constructs and Features—Patterns | _,_,F -> 3 | _,_,T -> 4 General Terms Design, Performance, Sequentiality. -
Approximate Boyer-Moore String Matching
APPROXIMATE BOYER-MOORE STRING MATCHING JORMA TARHIO AND ESKO UKKONEN University of Helsinki, Department of Computer Science Teollisuuskatu 23, SF-00510 Helsinki, Finland Draft Abstract. The Boyer-Moore idea applied in exact string matching is generalized to approximate string matching. Two versions of the problem are considered. The k mismatches problem is to find all approximate occurrences of a pattern string (length m) in a text string (length n) with at most k mis- matches. Our generalized Boyer-Moore algorithm is shown (under a mild 1 independence assumption) to solve the problem in expected time O(kn( + m – k k)) where c is the size of the alphabet. A related algorithm is developed for the c k differences problem where the task is to find all approximate occurrences of a pattern in a text with ≤ k differences (insertions, deletions, changes). Experimental evaluation of the algorithms is reported showing that the new algorithms are often significantly faster than the old ones. Both algorithms are functionally equivalent with the Horspool version of the Boyer-Moore algorithm when k = 0. Key words: String matching, edit distance, Boyer-Moore algorithm, k mismatches problem, k differences problem AMS (MOS) subject classifications: 68C05, 68C25, 68H05 Abbreviated title: Approximate Boyer-Moore Matching 2 1. Introduction The fastest known exact string matching algorithms are based on the Boyer- Moore idea [BoM77, KMP77]. Such algorithms are “sublinear” on the average in the sense that it is not necessary to check every symbol in the text. The larger is the alphabet and the longer is the pattern, the faster the algorithm works. -
Combinatorial Pattern Matching
Combinatorial Pattern Matching 1 A Recurring Problem Finding patterns within sequences Variants on this idea Finding repeated motifs amoungst a set of strings What are the most frequent k-mers How many time does a specific k-mer appear Fundamental problem: Pattern Matching Find all positions of a particular substring in given sequence? 2 Pattern Matching Goal: Find all occurrences of a pattern in a text Input: Pattern p = p1, p2, … pn and text t = t1, t2, … tm Output: All positions 1 < i < (m – n + 1) such that the n-letter substring of t starting at i matches p def bruteForcePatternMatching(p, t): locations = [] for i in xrange(0, len(t)-len(p)+1): if t[i:i+len(p)] == p: locations.append(i) return locations print bruteForcePatternMatching("ssi", "imissmissmississippi") [11, 14] 3 Pattern Matching Performance Performance: m - length of the text t n - the length of the pattern p Search Loop - executed O(m) times Comparison - O(n) symbols compared Total cost - O(mn) per pattern In practice, most comparisons terminate early Worst-case: p = "AAAT" t = "AAAAAAAAAAAAAAAAAAAAAAAT" 4 We can do better! If we preprocess our pattern we can search more effciently (O(n)) Example: imissmissmississippi 1. s 2. s 3. s 4. SSi 5. s 6. SSi 7. s 8. SSI - match at 11 9. SSI - match at 14 10. s 11. s 12. s At steps 4 and 6 after finding the mismatch i ≠ m we can skip over all positions tested because we know that the suffix "sm" is not a prefix of our pattern "ssi" Even works for our worst-case example "AAAAT" in "AAAAAAAAAAAAAAT" by recognizing the shared prefixes ("AAA" in "AAAA"). -
Problem Set 7 Solutions
Introduction to Algorithms November 18, 2005 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik D. Demaine and Charles E. Leiserson Handout 25 Problem Set 7 Solutions Problem 7-1. Edit distance In this problem you will write a program to compute edit distance. This problem is mandatory. Failure to turn in a solution will result in a serious and negative impact on your term grade! We advise you to start this programming assignment as soon as possible, because getting all the details right in a program can take longer than you think. Many word processors and keyword search engines have a spelling correction feature. If you type in a misspelled word x, the word processor or search engine can suggest a correction y. The correction y should be a word that is close to x. One way to measure the similarity in spelling between two text strings is by “edit distance.” The notion of edit distance is useful in other fields as well. For example, biologists use edit distance to characterize the similarity of DNA or protein sequences. The edit distance d(x; y) of two strings of text, x[1 : : m] and y[1 : : n], is defined to be the minimum possible cost of a sequence of “transformation operations” (defined below) that transforms string x[1 : : m] into string y[1 : : n].1 To define the effect of the transformation operations, we use an auxiliary string z[1 : : s] that holds the intermediate results. At the beginning of the transformation sequence, s = m and z[1 : : s] = x[1 : : m] (i.e., we start with string x[1 : : m]). -
3. Approximate String Matching
3. Approximate String Matching Often in applications we want to search a text for something that is similar to the pattern but not necessarily exactly the same. To formalize this problem, we have to specify what does “similar” mean. This can be done by defining a similarity or a distance measure. A natural and popular distance measure for strings is the edit distance, also known as the Levenshtein distance. 109 Edit distance The edit distance ed(A, B) of two strings A and B is the minimum number of edit operations needed to change A into B. The allowed edit operations are: S Substitution of a single character with another character. I Insertion of a single character. D Deletion of a single character. Example 3.1: Let A = Lewensteinn and B = Levenshtein. Then ed(A, B) = 3. The set of edit operations can be described with an edit sequence: NNSNNNINNNND or with an alignment: Lewens-teinn Levenshtein- In the edit sequence, N means No edit. 110 There are many variations and extension of the edit distance, for example: Hamming distance allows only the subtitution operation. • Damerau–Levenshtein distance adds an edit operation: • T Transposition swaps two adjacent characters. With weighted edit distance, each operation has a cost or weight, • which can be other than one. Allow insertions and deletions (indels) of factors at a cost that is lower • than the sum of character indels. We will focus on the basic Levenshtein distance. Levenshtein distance has the following two useful properties, which are not shared by all variations (exercise): Levenshtein distance is a metric. -
Integrating Pattern Matching Within String Scanning a Thesis
Integrating Pattern Matching Within String Scanning A Thesis Presented in Partial Fulfillment of the Requirements for the Degree of Master of Science with a Major in Computer Science in the College of Graduate Studies University of Idaho by John H. Goettsche Major Professor: Clinton Jeffery, Ph.D. Committee Members: Robert Heckendorn, Ph.D.; Robert Rinker, Ph.D. Department Administrator: Frederick Sheldon, Ph.D. July 2015 ii Authorization to Submit Thesis This Thesis of John H. Goettsche, submitted for the degree of Master of Science with a Major in Computer Science and titled \Integrating Pattern Matching Within String Scanning," has been reviewed in final form. Permission, as indicated by the signatures and dates below, is now granted to submit final copies to the College of Graduate Studies for approval. Major Professor: Date: Clinton Jeffery, Ph.D. Committee Members: Date: Robert Heckendorn, Ph.D. Date: Robert Rinker, Ph.D. Department Administrator: Date: Frederick Sheldon, Ph.D. iii Abstract A SNOBOL4 like pattern data type and pattern matching operation were introduced to the Unicon language in 2005, but patterns were not integrated with the Unicon string scanning control structure and hence, the SNOBOL style patterns were not adopted as part of the language at that time. The goal of this project is to make the pattern data type accessible to the Unicon string scanning control structure and vice versa; and also make the pattern operators and functions lexically consistent with Unicon. To accomplish these goals, a Unicon string matching operator was changed to allow the execution of a pattern match in the anchored mode, pattern matching unevaluated expressions were revised to handle complex string scanning functions, and the pattern matching lexemes were revised to be more consistent with the Unicon language. -
Practical Authenticated Pattern Matching with Optimal Proof Size
Practical Authenticated Pattern Matching with Optimal Proof Size Dimitrios Papadopoulos Charalampos Papamanthou Boston University University of Maryland [email protected] [email protected] Roberto Tamassia Nikos Triandopoulos Brown University RSA Laboratories & Boston University [email protected] [email protected] ABSTRACT otherwise. More elaborate models for pattern matching involve We address the problem of authenticating pattern matching queries queries expressed as regular expressions over Σ or returning multi- over textual data that is outsourced to an untrusted cloud server. By ple occurrences of p, and databases allowing search over multiple employing cryptographic accumulators in a novel optimal integrity- texts or other (semi-)structured data (e.g., XML data). This core checking tool built directly over a suffix tree, we design the first data-processing problem has numerous applications in a wide range authenticated data structure for verifiable answers to pattern match- of topics including intrusion detection, spam filtering, web search ing queries featuring fast generation of constant-size proofs. We engines, molecular biology and natural language processing. present two main applications of our new construction to authen- Previous works on authenticated pattern matching include the ticate: (i) pattern matching queries over text documents, and (ii) schemes by Martel et al. [28] for text pattern matching, and by De- exact path queries over XML documents. Answers to queries are vanbu et al. [16] and Bertino et al. [10] for XML search. In essence, verified by proofs of size at most 500 bytes for text pattern match- these works adopt the same general framework: First, by hierarchi- ing, and at most 243 bytes for exact path XML search, indepen- cally applying a cryptographic hash function (e.g., SHA-2) over the dently of the document or answer size.