A Typed, Algebraic Approach to Parsing


Neel Krishnaswami, University of Cambridge
Jeremy Yallop, University of Cambridge

Abstract

In this paper, we recall the definition of the context-free expressions (or µ-regular expressions), an algebraic presentation of the context-free languages. Then, we define a core type system for the context-free expressions which gives a compositional criterion for identifying those context-free expressions which can be parsed unambiguously by predictive algorithms in the style of recursive descent or LL(1).

Next, we show how these typed grammar expressions can be used to derive a parser combinator library which both guarantees linear-time parsing with no backtracking and single-token lookahead, and which respects the natural denotational semantics of context-free expressions. Finally, we show how to exploit the type information to write a staged version of this library, which produces dramatic increases in performance, even outperforming code generated by the standard parser generator tool ocamlyacc.

1 Introduction

The theory of parsing is one of the oldest and most well-developed areas in computer science: the bibliography to Grune and Jacobs's Parsing Techniques: A Practical Guide lists over 1700 references! Nevertheless, the foundations of the subject have remained remarkably stable: context-free languages are specified in Backus–Naur form, and parsers for these specifications are implemented using algorithms derived from automata theory. This integration of theory and practice has yielded many benefits: we have linear-time algorithms for parsing unambiguous grammars efficiently [Knuth 1965; Lewis and Stearns 1968] and with excellent error reporting for bad input strings [Jeffery 2003].

However, in languages with good support for higher-order functions (such as ML, Haskell, and Scala) it is very popular to use parser combinators [Hutton 1992] instead. These libraries typically build backtracking recursive descent parsers by composing higher-order functions. This allows building up parsers with the ordinary abstraction features of the programming language (variables, data structures and functions), which is sufficiently beneficial that these libraries are popular in spite of the lack of a clear declarative reading of the accepted language and bad worst-case performance (often exponential in the length of the input).

Naturally, we want to have our cake and eat it too: we want to combine the benefits of parser generators (efficiency and a clear declarative reading) with the benefits of parser combinators (ease of building high-level abstractions). The two main difficulties are that the binding structure of BNF (a collection of globally-visible, mutually-recursive nonterminals) is at odds with the structure of programming languages (variables are lexically scoped), and that generating efficient parsers requires static analysis of the input grammar.

This paper begins by recalling the old observation that the context-free languages can be understood as the extension of regular expressions with variables and a least-fixed point operator (aka the "µ-regular expressions" [Leiß 1991]). These features replace the concept of nonterminal from BNF, and so facilitate embedding context-free grammars into programming languages. Our contributions are then:

• First, we extend the framework of µ-regular expressions by giving them a semantic notion of type, building on Brüggemann-Klein and Wood's characterization of unambiguous regular expressions [1992]. We then use this notion of type as the basis for a syntactic type system which checks whether a context-free expression is suitable for predictive (or recursive descent) parsing; that is, we give a type system for left-factored, non-left-recursive, unambiguous grammars. We prove that this type system is well-behaved (i.e., syntactic substitution preserves types, and sound with respect to the denotational semantics), and that all well-typed grammars are unambiguous.

• Next, we describe a parser combinator library for OCaml based upon this type system. This library exposes basically the standard applicative-style API with a higher-order fixed point operator, but internally it builds a first-order representation of well-typed, binding-safe grammars. Having a first-order representation available lets us reject untypeable grammars.
  When a parser function is built from this AST, the resulting parsers have extremely predictable performance: they are guaranteed to be linear-time and non-backtracking, using a single token of lookahead. In addition, our API has no purely-operational features to block backtracking, so the simple reading of disjunction as union of languages and sequential composition as concatenation of languages remains valid (unlike in other formalisms like packrat parsers). However, while the performance is predictable, it remains worse than the performance of code generated by conventional parser generators such as ocamlyacc.

• Finally, we build a staged version of this library using MetaOCaml. Staging lets us entirely eliminate the abstraction overhead of parser combinator libraries. The grammar type system proves beneficial, as the types give very useful information in guiding the generation of staged code (indeed, the output of staging looks very much like handwritten recursive descent parser code). Our resulting parsers are very fast, typically outperforming even ocamlyacc. Table-driven parsers can be viewed as interpreters for a state machine, and staging lets us eliminate this interpretive overhead!

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PLDI '19, June 22–26, 2019, Phoenix, AZ, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6712-7/19/06...$15.00
https://doi.org/10.1145/3314221.3314625

2 API Overview and Examples

Before getting into the technical details, we show the user-level API in our OCaml implementation, and give both examples and non-examples of its use.

2.1 The High Level Interface

From a user perspective, we offer a very conventional combinator parsing API. We have an abstract type 'a t representing parsers which read a stream of tokens and produce a value of type 'a. Elements of this abstract type can be constructed with the following set of constants and functions:

    type 'a t
    val eps: unit t
    val chr: char -> char t
    val seq: 'a t -> 'b t -> ('a * 'b) t
    val bot: 'a t
    val alt: 'a t -> 'a t -> 'a t
    val fix: ('b t -> 'b t) -> 'b t
    val map: ('a -> 'b) -> 'a t -> 'b t

Here eps is a parser matching the empty string; chr c is a parser matching the single character c; and seq l r is a parser that parses strings with a prefix parsed by l and a suffix parsed by r. The value bot is a parser that never succeeds, and alt l r is a parser that parses strings parsed either by l or by r. Finally we have a fixed point operator fix for constructing recursive grammars.

[...] that our API is in the "monoidal" style (we give sequencing the type seq: 'a t -> 'b t -> ('a * 'b) t), while theirs is in the "applicative" style (i.e., seq: ('a -> 'b) t -> 'a t -> 'b t). However, as McBride and Paterson [2008] observe, the two styles are equivalent.

These combinators build an abstract grammar, which can be converted into an actual parser (i.e., a function from streams to values, raising an exception on strings not in the grammar) using the parser function.

    exception TypeError of string
    val parser: 'a t -> (char Stream.t -> 'a)

Critically, however, the parser function does not succeed on all abstract grammars: instead, it will raise a TypeError exception if the grammar is ill-formed (i.e., ambiguous or left-recursive). As a result, parser is able to provide a stronger guarantee when it does succeed: if it returns a parsing function, that function is guaranteed to parse in linear time with single-token lookahead.

2.2 Examples

General Utilities

One benefit of combinator parsing is that it permits programmers to build up a library of abstractions over the basic parsing primitives. Before showing how this works, we will define a few utility functions and operators to make the examples more readable. The expression always x is a constant function that always returns x:

    let always x = fun _ -> x

We define (++) as an infix version of the seq function, and (==>) as an infix map operator to resemble semantic actions in traditional parser generators:

    let (++) = seq
    let (==>) p f = map f p

We also define any, an n-ary version of the binary alternative alt, using a list fold:

    val any: 'a t list -> 'a t
    let any gs = List.fold_left alt bot gs

The parser option r parses either the empty string or alternatively anything that r parses:

    val option: 'a t -> 'a option t
    let option r = any [eps ==> always None;
                        r ==> (fun x -> Some x)]

The operations can be combined with the fixed point operator to define the Kleene star function. The parser star g parses a list of the input parsed by g: either the empty list,
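The utilities from Section 2.2 can be tried out against a toy stand-in for the abstract type 'a t. The sketch below is an illustration under stated assumptions, not the library described in the paper: it represents a parser as a list-of-successes function over a string and an index, backtracks freely, performs no grammar type checking, never raises TypeError, and gives no linear-time guarantee. The function run is a hypothetical substitute for parser, and the body of star is one plausible reading of the description above (an empty list, or one g followed by more), not a definition taken from the text.

```ocaml
(* ASSUMPTION: toy representation, not the paper's. A parser maps the input
   and a start index to every (value, next index) pair it can produce. *)
type 'a t = string -> int -> ('a * int) list

let eps : unit t = fun _ i -> [ ((), i) ]

let chr (c : char) : char t =
  fun s i -> if i < String.length s && s.[i] = c then [ (c, i + 1) ] else []

(* Concatenation: run l, then run r from every position l can stop at. *)
let seq (l : 'a t) (r : 'b t) : ('a * 'b) t =
  fun s i ->
    List.concat_map
      (fun (a, j) -> List.map (fun (b, k) -> ((a, b), k)) (r s j))
      (l s i)

let bot : 'a t = fun _ _ -> []

(* Union: keep the results of both branches. *)
let alt (l : 'a t) (r : 'a t) : 'a t = fun s i -> l s i @ r s i

(* The recursive occurrence is unfolded only when input is actually parsed. *)
let fix (f : 'b t -> 'b t) : 'b t =
  let rec p s i = f p s i in
  p

let map (f : 'a -> 'b) (p : 'a t) : 'b t =
  fun s i -> List.map (fun (a, j) -> (f a, j)) (p s i)

(* Hypothetical driver: first result that consumes the whole string. *)
let run (p : 'a t) (s : string) : 'a =
  match List.filter (fun (_, j) -> j = String.length s) (p s 0) with
  | (v, _) :: _ -> v
  | [] -> failwith "no parse"

(* Utilities as given in Section 2.2. *)
let always x = fun _ -> x
let ( ++ ) = seq
let ( ==> ) p f = map f p
let any gs = List.fold_left alt bot gs
let option r = any [ eps ==> always None; r ==> (fun x -> Some x) ]

(* ASSUMPTION: a guess at star, built from fix, any, (++) and (==>). *)
let star g =
  fix (fun rest ->
      any [ eps ==> always []; g ++ rest ==> (fun (x, xs) -> x :: xs) ])

let () =
  assert (run (option (chr 'x')) "" = None);
  assert (run (option (chr 'x')) "x" = Some 'x');
  assert (run (star (chr 'a')) "aaa" = [ 'a'; 'a'; 'a' ])
```

Because alt here returns the results of both branches, the sketch keeps the reading of alternation as language union and sequencing as concatenation, at the cost of the backtracking that the typed library is designed to rule out. A left-recursive grammar passed to fix would make this toy loop forever instead of being rejected up front.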