How Hard Is Computing the Edit Distance?1
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
CS601 DTIME and DSPACE Lecture 5 Time and Space Functions: T, S
CS601 DTIME and DSPACE Lecture 5 Time and Space functions: t, s : N → N+ Definition 5.1 A set A ⊆ U is in DTIME[t(n)] iff there exists a deterministic, multi-tape TM, M, and a constant c, such that, 1. A = L(M) ≡ w ∈ U M(w)=1 , and 2. ∀w ∈ U, M(w) halts within c · t(|w|) steps. Definition 5.2 A set A ⊆ U is in DSPACE[s(n)] iff there exists a deterministic, multi-tape TM, M, and a constant c, such that, 1. A = L(M), and 2. ∀w ∈ U, M(w) uses at most c · s(|w|) work-tape cells. (Input tape is “read-only” and not counted as space used.) Example: PALINDROMES ∈ DTIME[n], DSPACE[n]. In fact, PALINDROMES ∈ DSPACE[log n]. [Exercise] 1 CS601 F(DTIME) and F(DSPACE) Lecture 5 Definition 5.3 f : U → U is in F (DTIME[t(n)]) iff there exists a deterministic, multi-tape TM, M, and a constant c, such that, 1. f = M(·); 2. ∀w ∈ U, M(w) halts within c · t(|w|) steps; 3. |f(w)|≤|w|O(1), i.e., f is polynomially bounded. Definition 5.4 f : U → U is in F (DSPACE[s(n)]) iff there exists a deterministic, multi-tape TM, M, and a constant c, such that, 1. f = M(·); 2. ∀w ∈ U, M(w) uses at most c · s(|w|) work-tape cells; 3. |f(w)|≤|w|O(1), i.e., f is polynomially bounded. (Input tape is “read-only”; Output tape is “write-only”. -
The Complexity of Space Bounded Interactive Proof Systems
The Complexity of Space Bounded Interactive Proof Systems ANNE CONDON Computer Science Department, University of Wisconsin-Madison 1 INTRODUCTION Some of the most exciting developments in complexity theory in recent years concern the complexity of interactive proof systems, defined by Goldwasser, Micali and Rackoff (1985) and independently by Babai (1985). In this paper, we survey results on the complexity of space bounded interactive proof systems and their applications. An early motivation for the study of interactive proof systems was to extend the notion of NP as the class of problems with efficient \proofs of membership". Informally, a prover can convince a verifier in polynomial time that a string is in an NP language, by presenting a witness of that fact to the verifier. Suppose that the power of the verifier is extended so that it can flip coins and can interact with the prover during the course of a proof. In this way, a verifier can gather statistical evidence that an input is in a language. As we will see, the interactive proof system model precisely captures this in- teraction between a prover P and a verifier V . In the model, the computation of V is probabilistic, but is typically restricted in time or space. A language is accepted by the interactive proof system if, for all inputs in the language, V accepts with high probability, based on the communication with the \honest" prover P . However, on inputs not in the language, V rejects with high prob- ability, even when communicating with a \dishonest" prover. In the general model, V can keep its coin flips secret from the prover. -
Lecture 14 (Feb 28): Probabilistically Checkable Proofs (PCP) 14.1 Motivation: Some Problems and Their Approximation Algorithms
CMPUT 675: Computational Complexity Theory Winter 2019 Lecture 14 (Feb 28): Probabilistically Checkable Proofs (PCP) Lecturer: Zachary Friggstad Scribe: Haozhou Pang 14.1 Motivation: Some problems and their approximation algorithms Many optimization problems are NP-hard, and therefore it is unlikely to efficiently find optimal solutions. However, in some situations, finding a provably good-enough solution (approximation of the optimal) is also acceptable. We say an algorithm is an α-approximation algorithm for an optimization problem iff for every instance of the problem it can find a solution whose cost is a factor of α of the optimum solution cost. Here are some selected problems and their approximation algorithms: Example: Min Vertex Cover is the problem that given a graph G = (V; E), the goal is to find a vertex cover C ⊆ V of minimum size. The following algorithm gives a 2-approximation for Min Vertex Cover: Algorithm 1 Min Vertex Cover Algorithm Input: Graph G = (V; E) Output: a vertex cover C of G. C ; while some edge (u; v) has u; v2 = C do C C [ fu; vg end while return C Claim 1 Let C∗ be an optimal solution, the returned solution C satisfies jCj ≤ 2jC∗j. Proof. Let M be the edges considered by the algorithm in the loop, we have jCj = 2jMj. Also, jC∗j covers M and no two edges in M share an endpoint (M is a matching), so jC∗j ≥ jMj. Therefore, jCj = 2jMj ≤ 2jC∗j. Example: Max SAT is the problem that given a CNF formula, the goal is to find a assignment that satisfies as many clauses as possible. -
Dictionary Look-Up Within Small Edit Distance
Dictionary Lo okUp Within Small Edit Distance Ab dullah N Arslan and Omer Egeciog lu Department of Computer Science University of California Santa Barbara Santa Barbara CA USA farslanomergcsucsbedu Abstract Let W b e a dictionary consisting of n binary strings of length m each represented as a trie The usual dquery asks if there exists a string in W within Hamming distance d of a given binary query string q We present an algorithm to determine if there is a memb er in W within edit distance d of a given query string q of length m The metho d d+1 takes time O dm in the RAM mo del indep endent of n and requires O dm additional space Intro duction Let W b e a dictionary consisting of n binary strings of length m each A dquery asks if there exists a string in W within Hamming distance d of a given binary query string q Algorithms for answering dqueries eciently has b een a topic of interest for some time and have also b een studied as the approximate query and the approximate query retrieval problems in the literature The problem was originally p osed by Minsky and Pap ert in in which they asked if there is a data structure that supp orts fast dqueries The cases of small d and large d for this problem seem to require dierent techniques for their solutions The case when d is small was studied by Yao and Yao Dolev et al and Greene et al have made some progress when d is relatively large There are ecient algorithms only when d prop osed by Bro dal and Venkadesh Yao and Yao and Bro dal and Gasieniec The small d case has applications -
Approximate String Matching with Reduced Alphabet
Approximate String Matching with Reduced Alphabet Leena Salmela1 and Jorma Tarhio2 1 University of Helsinki, Department of Computer Science [email protected] 2 Aalto University Deptartment of Computer Science and Engineering [email protected] Abstract. We present a method to speed up approximate string match- ing by mapping the factual alphabet to a smaller alphabet. We apply the alphabet reduction scheme to a tuned version of the approximate Boyer– Moore algorithm utilizing the Four-Russians technique. Our experiments show that the alphabet reduction makes the algorithm faster. Especially in the k-mismatch case, the new variation is faster than earlier algorithms for English data with small values of k. 1 Introduction The approximate string matching problem is defined as follows. We have a pat- tern P [1...m]ofm characters drawn from an alphabet Σ of size σ,atextT [1...n] of n characters over the same alphabet, and an integer k. We need to find all such positions i of the text that the distance between the pattern and a sub- string of the text ending at that position is at most k.Inthek-difference problem the distance between two strings is the standard edit distance where substitu- tions, deletions, and insertions are allowed. The k-mismatch problem is a more restricted one using the Hamming distance where only substitutions are allowed. Among the most cited papers on approximate string matching are the classical articles [1,2] by Esko Ukkonen. Besides them he has studied this topic extensively [3,4,5,6,7,8,9,10,11]. -
Backtrack Parsing Context-Free Grammar Context-Free Grammar
Context-free Grammar Problems with Regular Context-free Grammar Language and Is English a regular language? Bad question! We do not even know what English is! Two eggs and bacon make(s) a big breakfast Backtrack Parsing Can you slide me the salt? He didn't ought to do that But—No! Martin Kay I put the wine you brought in the fridge I put the wine you brought for Sandy in the fridge Should we bring the wine you put in the fridge out Stanford University now? and University of the Saarland You said you thought nobody had the right to claim that they were above the law Martin Kay Context-free Grammar 1 Martin Kay Context-free Grammar 2 Problems with Regular Problems with Regular Language Language You said you thought nobody had the right to claim [You said you thought [nobody had the right [to claim that they were above the law that [they were above the law]]]] Martin Kay Context-free Grammar 3 Martin Kay Context-free Grammar 4 Problems with Regular Context-free Grammar Language Nonterminal symbols ~ grammatical categories Is English mophology a regular language? Bad question! We do not even know what English Terminal Symbols ~ words morphology is! They sell collectables of all sorts Productions ~ (unordered) (rewriting) rules This concerns unredecontaminatability Distinguished Symbol This really is an untiable knot. But—Probably! (Not sure about Swahili, though) Not all that important • Terminals and nonterminals are disjoint • Distinguished symbol Martin Kay Context-free Grammar 5 Martin Kay Context-free Grammar 6 Context-free Grammar Context-free -
Adaptive LL(*) Parsing: the Power of Dynamic Analysis
Adaptive LL(*) Parsing: The Power of Dynamic Analysis Terence Parr Sam Harwell Kathleen Fisher University of San Francisco University of Texas at Austin Tufts University [email protected] [email protected] kfi[email protected] Abstract PEGs are unambiguous by definition but have a quirk where Despite the advances made by modern parsing strategies such rule A ! a j ab (meaning “A matches either a or ab”) can never as PEG, LL(*), GLR, and GLL, parsing is not a solved prob- match ab since PEGs choose the first alternative that matches lem. Existing approaches suffer from a number of weaknesses, a prefix of the remaining input. Nested backtracking makes de- including difficulties supporting side-effecting embedded ac- bugging PEGs difficult. tions, slow and/or unpredictable performance, and counter- Second, side-effecting programmer-supplied actions (muta- intuitive matching strategies. This paper introduces the ALL(*) tors) like print statements should be avoided in any strategy that parsing strategy that combines the simplicity, efficiency, and continuously speculates (PEG) or supports multiple interpreta- predictability of conventional top-down LL(k) parsers with the tions of the input (GLL and GLR) because such actions may power of a GLR-like mechanism to make parsing decisions. never really take place [17]. (Though DParser [24] supports The critical innovation is to move grammar analysis to parse- “final” actions when the programmer is certain a reduction is time, which lets ALL(*) handle any non-left-recursive context- part of an unambiguous final parse.) Without side effects, ac- free grammar. ALL(*) is O(n4) in theory but consistently per- tions must buffer data for all interpretations in immutable data forms linearly on grammars used in practice, outperforming structures or provide undo actions. -
Approximate Boyer-Moore String Matching
APPROXIMATE BOYER-MOORE STRING MATCHING JORMA TARHIO AND ESKO UKKONEN University of Helsinki, Department of Computer Science Teollisuuskatu 23, SF-00510 Helsinki, Finland Draft Abstract. The Boyer-Moore idea applied in exact string matching is generalized to approximate string matching. Two versions of the problem are considered. The k mismatches problem is to find all approximate occurrences of a pattern string (length m) in a text string (length n) with at most k mis- matches. Our generalized Boyer-Moore algorithm is shown (under a mild 1 independence assumption) to solve the problem in expected time O(kn( + m – k k)) where c is the size of the alphabet. A related algorithm is developed for the c k differences problem where the task is to find all approximate occurrences of a pattern in a text with ≤ k differences (insertions, deletions, changes). Experimental evaluation of the algorithms is reported showing that the new algorithms are often significantly faster than the old ones. Both algorithms are functionally equivalent with the Horspool version of the Boyer-Moore algorithm when k = 0. Key words: String matching, edit distance, Boyer-Moore algorithm, k mismatches problem, k differences problem AMS (MOS) subject classifications: 68C05, 68C25, 68H05 Abbreviated title: Approximate Boyer-Moore Matching 2 1. Introduction The fastest known exact string matching algorithms are based on the Boyer- Moore idea [BoM77, KMP77]. Such algorithms are “sublinear” on the average in the sense that it is not necessary to check every symbol in the text. The larger is the alphabet and the longer is the pattern, the faster the algorithm works. -
Sequence Alignment/Map Format Specification
Sequence Alignment/Map Format Specification The SAM/BAM Format Specification Working Group 3 Jun 2021 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing is version 53752fa from that repository, last modified on the date shown above. 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments. Header lines start with `@', while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields for flexible or aligner specific information. This specification is for version 1.6 of the SAM and BAM formats. Each SAM and BAMfilemay optionally specify the version being used via the @HD VN tag. For full version history see Appendix B. Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII 1 in using the POSIX / C locale. Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax. 1.1 An example Suppose we have the following alignment with bases in lowercase clipped from the alignment. Read r001/1 and r001/2 constitute a read pair; r003 is a chimeric read; r004 represents a split alignment. Coor 12345678901234 5678901234567890123456789012345 ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT +r001/1 TTAGATAAAGGATA*CTG +r002 aaaAGATAA*GGATA +r003 gcctaAGCTAA +r004 ATAGCT..............TCAGC -r003 ttagctTAGGC -r001/2 CAGCGGCAT The corresponding SAM format is:2 1Charset ANSI X3.4-1968 as defined in RFC1345. -
Lexing and Parsing with ANTLR4
Lab 2 Lexing and Parsing with ANTLR4 Objective • Understand the software architecture of ANTLR4. • Be able to write simple grammars and correct grammar issues in ANTLR4. EXERCISE #1 Lab preparation Ï In the cap-labs directory: git pull will provide you all the necessary files for this lab in TP02. You also have to install ANTLR4. 2.1 User install for ANTLR4 and ANTLR4 Python runtime User installation steps: mkdir ~/lib cd ~/lib wget http://www.antlr.org/download/antlr-4.7-complete.jar pip3 install antlr4-python3-runtime --user Then in your .bashrc: export CLASSPATH=".:$HOME/lib/antlr-4.7-complete.jar:$CLASSPATH" export ANTLR4="java -jar $HOME/lib/antlr-4.7-complete.jar" alias antlr4="java -jar $HOME/lib/antlr-4.7-complete.jar" alias grun='java org.antlr.v4.gui.TestRig' Then source your .bashrc: source ~/.bashrc 2.2 Structure of a .g4 file and compilation Links to a bit of ANTLR4 syntax : • Lexical rules (extended regular expressions): https://github.com/antlr/antlr4/blob/4.7/doc/ lexer-rules.md • Parser rules (grammars) https://github.com/antlr/antlr4/blob/4.7/doc/parser-rules.md The compilation of a given .g4 (for the PYTHON back-end) is done by the following command line: java -jar ~/lib/antlr-4.7-complete.jar -Dlanguage=Python3 filename.g4 or if you modified your .bashrc properly: antlr4 -Dlanguage=Python3 filename.g4 2.3 Simple examples with ANTLR4 EXERCISE #2 Demo files Ï Work your way through the five examples in the directory demo_files: Aurore Alcolei, Laure Gonnord, Valentin Lorentz. 1/4 ENS de Lyon, Département Informatique, M1 CAP Lab #2 – Automne 2017 ex1 with ANTLR4 + Java : A very simple lexical analysis1 for simple arithmetic expressions of the form x+3. -
Problem Set 7 Solutions
Introduction to Algorithms November 18, 2005 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik D. Demaine and Charles E. Leiserson Handout 25 Problem Set 7 Solutions Problem 7-1. Edit distance In this problem you will write a program to compute edit distance. This problem is mandatory. Failure to turn in a solution will result in a serious and negative impact on your term grade! We advise you to start this programming assignment as soon as possible, because getting all the details right in a program can take longer than you think. Many word processors and keyword search engines have a spelling correction feature. If you type in a misspelled word x, the word processor or search engine can suggest a correction y. The correction y should be a word that is close to x. One way to measure the similarity in spelling between two text strings is by “edit distance.” The notion of edit distance is useful in other fields as well. For example, biologists use edit distance to characterize the similarity of DNA or protein sequences. The edit distance d(x; y) of two strings of text, x[1 : : m] and y[1 : : n], is defined to be the minimum possible cost of a sequence of “transformation operations” (defined below) that transforms string x[1 : : m] into string y[1 : : n].1 To define the effect of the transformation operations, we use an auxiliary string z[1 : : s] that holds the intermediate results. At the beginning of the transformation sequence, s = m and z[1 : : s] = x[1 : : m] (i.e., we start with string x[1 : : m]). -
Probabilistic Proof Systems: a Primer
Probabilistic Proof Systems: A Primer Oded Goldreich Department of Computer Science and Applied Mathematics Weizmann Institute of Science, Rehovot, Israel. June 30, 2008 Contents Preface 1 Conventions and Organization 3 1 Interactive Proof Systems 4 1.1 Motivation and Perspective ::::::::::::::::::::::: 4 1.1.1 A static object versus an interactive process :::::::::: 5 1.1.2 Prover and Veri¯er :::::::::::::::::::::::: 6 1.1.3 Completeness and Soundness :::::::::::::::::: 6 1.2 De¯nition ::::::::::::::::::::::::::::::::: 7 1.3 The Power of Interactive Proofs ::::::::::::::::::::: 9 1.3.1 A simple example :::::::::::::::::::::::: 9 1.3.2 The full power of interactive proofs ::::::::::::::: 11 1.4 Variants and ¯ner structure: an overview ::::::::::::::: 16 1.4.1 Arthur-Merlin games a.k.a public-coin proof systems ::::: 16 1.4.2 Interactive proof systems with two-sided error ::::::::: 16 1.4.3 A hierarchy of interactive proof systems :::::::::::: 17 1.4.4 Something completely di®erent ::::::::::::::::: 18 1.5 On computationally bounded provers: an overview :::::::::: 18 1.5.1 How powerful should the prover be? :::::::::::::: 19 1.5.2 Computational Soundness :::::::::::::::::::: 20 2 Zero-Knowledge Proof Systems 22 2.1 De¯nitional Issues :::::::::::::::::::::::::::: 23 2.1.1 A wider perspective: the simulation paradigm ::::::::: 23 2.1.2 The basic de¯nitions ::::::::::::::::::::::: 24 2.2 The Power of Zero-Knowledge :::::::::::::::::::::: 26 2.2.1 A simple example :::::::::::::::::::::::: 26 2.2.2 The full power of zero-knowledge proofs ::::::::::::