An Efficient Parser Generator for Natural Language


An Efficient Parser Generator for Natural Language

Masayuki ISHII* (Fujitsu Inc., masayuki@nak.math.keio.ac.jp), Kazuhisa OHTA (Apple Technology, Inc., k-ohta@kobo.apple.com), Hiroaki SAITO (Keio University, hxs@nak.math.keio.ac.jp)

Abstract

We have developed a parser generator for natural language processing. The generator, named "NLyacc", accepts grammar rules written in the Yacc format. NLyacc, unlike Yacc, can handle arbitrary context-free grammars using the generalized LR parsing algorithm. The parser produced by NLyacc efficiently parses given sentences and executes semantic actions. NLyacc, which is free and sharable software, runs on UNIX workstations and personal computers.

1 Parser Generator for NLP

Yacc [4] was designed for unambiguous programming languages. Thus, Yacc cannot elegantly handle a script language with a natural language flavor, i.e. Yacc forces a grammar writer to use tricks for handling ambiguities. To remedy this situation we have developed NLyacc, which can handle arbitrary context-free grammars¹ and allows a grammar writer to write natural rules and semantic actions. Although there are several parsing algorithms for a general context-free language, such as ATN, CYK, and Earley, "the generalized LR parsing algorithm [2]" would be the best in terms of its compatibility with Yacc and its efficiency.

An ambiguous grammar causes a conflict in the parsing table, a state which has more than one action in an entry. Generalized LR parsing proceeds exactly the same way as the standard one except when it encounters a conflict. The standard deterministic LR parser chooses only one action in this situation. The generalized LR parser, on the other hand, performs all the actions in the multiple entry by splitting the parse stack for each action. The parser merges the divided stack branches only when they have the same top state. This merge operation is important for efficiency. As a result, the stack becomes a graph instead of a simple, linear state sequence.

There is already a generalized LR parser for natural language processing developed at Carnegie Mellon University [3]. NLyacc differs from CMU's system in the following points.

• NLyacc is written in C, while CMU's is in Lisp.

• CMU's cannot handle ε rules, while NLyacc does. ε rules are handy for writing natural rules.

• The way to execute semantic actions differs. CMU's evaluates an LFG-like semantic action attached to each rule when a reduce action is performed on that rule. NLyacc executes semantic actions at two levels; one is performed during parsing for syntactic control and the other is performed on each successful final parse. We will describe the details of NLyacc's approach in the next section.

NLyacc is upper-compatible with Yacc. NLyacc consists of three modules: a reader, a parsing table constructor, and a driver routine for generalized LR parsing. The reader accepts grammar rules in the Yacc format. The table constructor produces a generalized LR parsing table instead of the standard LR parsing table. We describe the details of the parser in the next section.

* This work was done while Ishii stayed at Dept. of Computer Science, Keio University, Japan.
¹ To be exact, NLyacc cannot handle a circular rule like "A → A".

2 Execution of Semantic Actions

NLyacc differs from Yacc mainly in the execution process of semantic actions attached to each grammar rule. Namely, Yacc evaluates a semantic action as it parses the input. We examine here whether this evaluation mechanism is suitable for generalized LR parsing. If we can assume that there is only one final parse, the parser can evaluate semantic actions when only one branch exists on top of the stack. Although having only one final parse is often the case in practical applications, the constraint of being unambiguous is too strong in general.

2.1 Handling Side Effects

Next, we examine what would happen if semantic actions were executed during parsing. When a reduce action is performed, the parser evaluates the action attached to the current rule. As described in the previous section, the parse stack grows in a graph form. Thus, when the action contains side effects like an assignment operation to a variable shared by different actions, that side effect must not propagate to the other paths in the graph. If an environment, which is a set of values of variables, is prepared for each path of the parse branches, such side effects can be encapsulated. When a stack splits, a copy of the environment should be created for each branch. When parse branches are merged, however, the environments cannot be merged. Instead, the merged state must hold all the environments. Thus, the number of environments grows exponentially as parsing proceeds. Therefore this approach decreases parsing efficiency drastically. Also, this high-cost operation would be in vain when the parse fails in the middle. To sum up, although this approach retains compatibility with Yacc, it sacrifices too much efficiency.
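To make the cost concrete, here is a minimal sketch in C of the rejected environment-per-branch approach discussed above. The type and function names (env, branch, split, merge) and the fixed variable count are our own illustration, not NLyacc code.

    /* Sketch of per-branch environments for a graph-structured stack.
     * On a stack split each branch gets its own copy of every environment,
     * and on a merge the environments cannot be unified, so the merged node
     * must keep all of them.  Illustration only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NVARS 8                  /* shared variables per environment */

    typedef struct env {
        int vars[NVARS];
        struct env *next;            /* environments are kept in a list */
    } env;

    static env *env_clone(const env *src) {
        env *e = malloc(sizeof *e);
        memcpy(e->vars, src->vars, sizeof e->vars);
        e->next = NULL;
        return e;
    }

    typedef struct branch {
        int top_state;
        env *envs;                   /* all environments living on this branch */
    } branch;

    /* A split copies every environment for the new branch. */
    static branch split(const branch *b, int new_state) {
        branch nb = { new_state, NULL };
        for (env *e = b->envs; e; e = e->next) {
            env *c = env_clone(e);
            c->next = nb.envs;
            nb.envs = c;
        }
        return nb;
    }

    /* A merge (same top state) must keep the union of both environment lists,
     * so repeated splits and merges multiply the number of environments. */
    static void merge(branch *dst, branch *src) {
        env **tail = &dst->envs;
        while (*tail) tail = &(*tail)->next;
        *tail = src->envs;
        src->envs = NULL;
    }

    int main(void) {
        env e0 = { {0}, NULL };
        branch b = { 0, &e0 };
        branch b1 = split(&b, 1), b2 = split(&b, 2);
        b2.top_state = 1;            /* pretend both branches reach state 1 */
        merge(&b1, &b2);
        int n = 0;
        for (env *e = b1.envs; e; e = e->next) n++;
        printf("environments after one split/merge: %d\n", n);   /* prints 2 */
        return 0;
    }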
2.2 Two Kinds of Semantic Actions

We, therefore, take another approach to handling semantic actions in NLyacc. Namely, the parser just keeps a list of actions to be executed and performs all the actions after parsing is done. This method can avoid the problem described above during parsing. After parsing is done, the semantic action evaluator performs the task as it traces all the history paths one by one. This approach retains parsing efficiency and avoids the execution of useless semantic actions. A drawback of this approach is that semantic actions cannot control the syntactic parsing, because actions are not evaluated until parsing is done. To compensate for this, we have introduced a new kind of semantic action, enclosed with [ ], to enable a user to discard semantically incorrect parses in the middle of parsing. Namely, there are two types of semantic actions:

• An action enclosed with [ ] is executed during parsing, just as done in Yacc. If 'return 0;' is executed in the action, the partial parse having invoked this action fails and is discarded.

• An action enclosed with { } is executed after the syntactic parsing.

In the example below, the bracketed action checks whether the subtraction result is negative and, if true, discards its partial parse.

    number : number '-' number
        [ $$ = $1 - $3; if ($$ < 0) return 0; ]
        { $$ = $1 - $3; printf(....., $1, $3, $$); }

2.3 Keeping Parse History

Our generalized LR parsing algorithm is different from the original one [2] in that our algorithm keeps a history of parse actions in order to execute semantic actions after the syntactic parsing. The original algorithm uses a packed forest representation for the stack, whereas our algorithm uses a list representation.

The algorithm for keeping the parse history is as follows.

1) If the next action is "shift s", then make <s> the history, where <s> is a list of the single element s.

2) If the next action is "reduce r : A → B1 B2 ... Bn", then make append(H1, H2, ..., Hn, [r]) the history, where Hi is the history of Bi, r is the rule number of the production "A → B1 B2 ... Bn", and the function 'append' concatenates multiple lists and returns the result.

Now we describe how to execute semantic actions using the parse history. First, before starting to parse, the parser calls the "yyinit" function to initialize variables in the semantic actions. Our system requires the user to define "yyinit" to set initial values for the variables. Next, the parser starts parsing, then performs a shift action or a reduce action according to the parse history and evaluates the appropriate semantic actions.
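As an illustration of steps 1) and 2), the following sketch builds such history lists in C. The list type and the function names (hist, shift_history, reduce_history) are ours, not NLyacc's.

    /* Sketch of the parse-history lists of Section 2.3.  Illustration only. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct hist {
        int item;                      /* a state (shift) or a rule number (reduce) */
        struct hist *next;
    } hist;

    static hist *cons(int item, hist *next) {
        hist *h = malloc(sizeof *h);
        h->item = item;
        h->next = next;
        return h;
    }

    /* append returns the concatenation of two histories (copying the first). */
    static hist *append(const hist *a, hist *b) {
        if (a == NULL) return b;
        return cons(a->item, append(a->next, b));
    }

    /* 1) "shift s" yields the one-element history <s>. */
    static hist *shift_history(int s) { return cons(s, NULL); }

    /* 2) "reduce r : A -> B1 ... Bn" yields append(H1, ..., Hn, [r]). */
    static hist *reduce_history(hist **H, int n, int r) {
        hist *result = cons(r, NULL);
        for (int i = n - 1; i >= 0; i--)
            result = append(H[i], result);
        return result;
    }

    int main(void) {
        /* e.g. shifts into states 3 and 5, then a reduce by rule 2 */
        hist *H[2] = { shift_history(3), shift_history(5) };
        hist *h = reduce_history(H, 2, 2);
        for (const hist *p = h; p; p = p->next)
            printf("%d ", p->item);        /* prints: 3 5 2 */
        printf("\n");
        return 0;
    }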
2.4 Efficient Memory Management

We use a list structure to implement the parse stack, because the stack becomes a complex graph structure as described previously. Because the parser discards failed branches of the stack, the system reclaims the memory allocated for the discarded parses using the "mark and sweep garbage collection algorithm [1]" in order to use memory efficiently. This garbage collection is triggered only when memory is exhausted in our current implementation.

3 Distribution

Portability
Currently, NLyacc runs on UNIX workstations and DOS personal computers.

Debugging Grammars
For grammar debugging, NLyacc provides parse trace information such as a history of shift/reduce actions and execution information of '[ ] actions'. When NLyacc encounters an error state, the "yyerror" function is called just as in Yacc.

References

[3] M. Tomita and J. G. Carbonell. The universal parser architecture for knowledge-based machine translation. In Proceedings, 10th International Joint Conference on Artificial Intelligence (IJCAI), Milan, August 1987.

[4] yacc - yet another compiler-compiler: parsing program generator. In UNIX manual.

Appendix - Sample Runs

A sample grammar below covers a small set of English sentences. The parser produces syntactic trees of a given sentence. Agreement checking is done by the semantic actions.

    /* grammar.y */
    %{
    #include <stdio.h>
    #include <stdlib.h>
    #include "grammar.h"
    #include "proto.h"
    %}

    %token NOUN VERB DET PREP
    %%

    SS : S
         { pr_tree($1); }

    S  : NP VP
         [ return check1($1, $2); ]
         { $$ = mk_tree2("S", $1, $2); }

    S  : S PP
         { $$ = mk_tree2("S", $1, $2); }

    NP : NOUN
         [ $$ = $1; ]
         { $$ = mk_tree1("NP", $1); }

    NP : DET NOUN
         [ $$ = $2; return check2($1, $2); ]
         { $$ = mk_tree2("NP", $1, $2); }
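The helper functions pr_tree, mk_tree1, mk_tree2, check1, and check2 are presumably declared in the included proto.h but are not shown in this excerpt. As a purely hypothetical illustration of how a bracketed agreement check might be written, check2 could compare determiner and noun number features and return 0 to discard a disagreeing partial parse (the struct and its fields are our assumption, not NLyacc's sample code):

    #include <stdio.h>

    enum number { SG, PL };

    struct node {                       /* a token or tree node with a number feature */
        const char *word;
        enum number num;
    };

    /* Returns nonzero when determiner and noun agree in number, so that the
     * grammar's "[ $$ = $2; return check2($1, $2); ]" action keeps the partial
     * parse; returning 0 makes the generalized LR parser discard that branch. */
    int check2(struct node *det, struct node *noun)
    {
        if (det == NULL || noun == NULL)
            return 0;
        return det->num == noun->num;
    }

    int main(void)
    {
        struct node these = { "these", PL }, book = { "book", SG }, books = { "books", PL };
        printf("these book : %s\n", check2(&these, &book)  ? "keep" : "discard");
        printf("these books: %s\n", check2(&these, &books) ? "keep" : "discard");
        return 0;
    }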
Recommended publications
  • Buffered Shift-Reduce Parsing*
    BUFFERED SHIFT-REDUCE PARSING* Bing Swen (Sun Bin) Dept. of Computer Science, Peking University, Beijing 100871, China [email protected] Abstract A parsing method called buffered shift-reduce parsing is presented, which adds an intermediate buffer (queue) to the usual LR parser. The buffer’s usage is analogous to that of the wait-and-see parsing, but it has unlimited buffer length, and may serve as a separate reduction (pruning) stack. The general structure of its parser and some features of its grammars and parsing tables are discussed. 1. Introduction It is well known that the LR parsing is hitherto the most general shift-reduce method of bottom-up parsing [1], and is among the most widely used. For example, almost all compilers of mainstream programming languages employ the LR-like parsing (via an LALR(1) compiler generator such as YACC or GNU Bison [2]), and many NLP systems use a generalized version of LR parsing (GLR, also called the Tomita algorithm [3]). In some earlier research work on LR(k) extensions for NLP [4] and generalization of compiler generator [5], an idea naturally occurred that one can make a modification to the LR parsing so that after each reduction, the resulting symbol may not be necessarily pushed onto the analysis stack, but instead put back into the input, waiting for decisions in the following steps. This strategy postpones the time of some reduction actions in traditional LR parser, partly deviating from leftmost pruning (Leftmost Reduction). This may cause performance loss for some grammars, with the gained opportunities of earlier error detecting and avoidance of false reductions, as well as later handling of ambiguities, which is significant for the processing of highly ambiguous input strings like natural language sentences.
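The abstract gives no code, but the core idea, a reduction's result going back into an input-side buffer instead of directly onto the parse stack, can be sketched roughly as follows. The toy grammar, the deque layout, and the control strategy are our own placeholders, not the paper's.

    /* Toy sketch of the "buffered" idea: the result of a reduction is pushed
     * back to the front of an input-side buffer (a deque) instead of onto the
     * parse stack.  Invented grammar for illustration:  N -> d  |  N '+' N. */
    #include <stdio.h>
    #include <string.h>

    #define MAX 256

    static char buf[MAX]; static int bhead, btail;     /* deque of symbols */
    static char stack[MAX]; static int sp = 0;         /* parse stack */

    static void push_front(char c) { buf[--bhead] = c; }
    static char pop_front(void)    { return buf[bhead++]; }
    static int  buf_empty(void)    { return bhead == btail; }

    static int reduce(void) {
        if (sp >= 3 && stack[sp-3] == 'N' && stack[sp-2] == '+' && stack[sp-1] == 'N') {
            sp -= 3;                 /* N '+' N => N, result goes back to the buffer */
            push_front('N');
            return 1;
        }
        return 0;
    }

    int main(void) {
        const char *input = "d+d+d";
        bhead = btail = MAX / 2;                        /* start the deque in the middle */
        for (size_t i = strlen(input); i > 0; i--) push_front(input[i-1]);

        for (;;) {
            if (!buf_empty() && buf[bhead] == 'd') {    /* N -> d, result back into buffer */
                pop_front();
                push_front('N');
            } else if (reduce()) {
                /* reduce() already requeued the result */
            } else if (!buf_empty()) {
                stack[sp++] = pop_front();              /* ordinary shift */
            } else {
                break;
            }
        }
        puts(sp == 1 && stack[0] == 'N' ? "accepted" : "rejected");
        return 0;
    }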
  • Efficient Computation of LALR(1) Look-Ahead Sets
Efficient Computation of LALR(1) Look-Ahead Sets FRANK DeREMER and THOMAS PENNELLO University of California, Santa Cruz, and MetaWare Incorporated. Two relations that capture the essential structure of the problem of computing LALR(1) look-ahead sets are defined, and an efficient algorithm is presented to compute the sets in time linear in the size of the relations. In particular, for a PASCAL grammar, the algorithm performs fewer than 15 percent of the set unions performed by the popular compiler-compiler YACC. When a grammar is not LALR(1), the relations, represented explicitly, provide for printing user-oriented error messages that specifically indicate how the look-ahead problem arose. In addition, certain loops in the digraphs induced by these relations indicate that the grammar is not LR(k) for any k. Finally, an oft-discovered and used but incorrect look-ahead set algorithm is similarly based on two other relations defined for the first time here. The formal presentation of this algorithm should help prevent its rediscovery. Categories and Subject Descriptors: D.3.1 [Programming Languages]: Formal Definitions and Theory--syntax; D.3.4 [Programming Languages]: Processors--translator writing systems and compiler generators; F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems--parsing. General Terms: Algorithms, Languages, Theory. Additional Key Words and Phrases: Context-free grammar, Backus-Naur form, strongly connected component, LALR(1), LR(k), grammar debugging. 1. INTRODUCTION Since the invention of LALR(1) grammars by DeRemer [9], LALR grammar analysis and parsing techniques have been popular as a component of translator writing systems and compiler-compilers.
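The relations-based computation can be pictured as propagating token sets along the edges of a digraph until nothing changes. The naive fixpoint below is our own illustration of the kind of set unions being counted when YACC is compared against the paper's linear-time, SCC-based "digraph" algorithm; it is deliberately not that algorithm.

    /* Naive fixpoint propagation of token sets along a relation: the
     * brute-force version of what DeRemer & Pennello compute in linear time.
     * The tiny example relation and all names are ours. */
    #include <stdio.h>

    #define N 4                /* number of nodes, e.g. (state, nonterminal) pairs */

    int main(void) {
        /* edge[x][y] != 0 means F(x) must include F(y) */
        int edge[N][N] = {
            {0, 1, 0, 0},
            {0, 0, 1, 0},
            {0, 0, 0, 1},
            {0, 0, 0, 0},
        };
        unsigned F[N] = { 0x1, 0x2, 0x0, 0x8 };   /* initial token sets as bitmasks */

        int changed = 1, unions = 0;
        while (changed) {                          /* iterate to a fixpoint */
            changed = 0;
            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++)
                    if (edge[x][y]) {
                        unsigned before = F[x];
                        F[x] |= F[y];
                        unions++;
                        if (F[x] != before) changed = 1;
                    }
        }
        for (int x = 0; x < N; x++)
            printf("F(%d) = 0x%x\n", x, F[x]);
        printf("set unions performed: %d\n", unions);
        return 0;
    }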
  • LLLR Parsing: a Combination of LL and LR Parsing
    LLLR Parsing: a Combination of LL and LR Parsing Boštjan Slivnik University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia [email protected] Abstract A new parsing method called LLLR parsing is defined and a method for producing LLLR parsers is described. An LLLR parser uses an LL parser as its backbone and parses as much of its input string using LL parsing as possible. To resolve LL conflicts it triggers small embedded LR parsers. An embedded LR parser starts parsing the remaining input and once the LL conflict is resolved, the LR parser produces the left parse of the substring it has just parsed and passes the control back to the backbone LL parser. The LLLR(k) parser can be constructed for any LR(k) grammar. It produces the left parse of the input string without any backtracking and, if used for a syntax-directed translation, it evaluates semantic actions using the top-down strategy just like the canonical LL(k) parser. An LLLR(k) parser is appropriate for grammars where the LL(k) conflicting nonterminals either appear relatively close to the bottom of the derivation trees or produce short substrings. In such cases an LLLR parser can perform a significantly better error recovery than an LR parser since the most part of the input string is parsed with the backbone LL parser. LLLR parsing is similar to LL(∗) parsing except that it (a) uses LR(k) parsers instead of finite automata to resolve the LL(k) conflicts and (b) does not perform any backtracking.
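LLLR proper embeds real LR(k) parsers inside an LL backbone and has them emit the left parse of the substring they consume. The sketch below only mimics that control flow under our own assumptions: a recursive-descent backbone hands the conflicting sub-language (expressions) to an embedded bottom-up, operator-precedence routine and then resumes; the grammar, names, and input are invented.

    /* Control-flow sketch only: LL backbone + embedded bottom-up routine.
     * Invented grammar:  stmt -> "print" expr ";"   with expr over digits, '+', '*'. */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static const char *p;                       /* cursor into the input */

    /* Embedded bottom-up routine: operand stack + operator stack
     * (operator-precedence parsing), standing in for a real embedded LR parser. */
    static int embedded_expr(void) {
        int vals[32], nv = 0;
        char ops[32]; int no = 0;
        for (;;) {
            while (isspace((unsigned char)*p)) p++;
            vals[nv++] = *p++ - '0';            /* expect a single-digit operand */
            while (isspace((unsigned char)*p)) p++;
            if (*p != '+' && *p != '*') break;  /* the conflicting region ends here */
            char op = *p++;
            while (no > 0 && (ops[no-1] == '*' || op == '+')) {   /* reduce */
                int b = vals[--nv], a = vals[--nv];
                vals[nv++] = ops[--no] == '*' ? a * b : a + b;
            }
            ops[no++] = op;                     /* shift the operator */
        }
        while (no > 0) {                        /* reduce what is left */
            int b = vals[--nv], a = vals[--nv];
            vals[nv++] = ops[--no] == '*' ? a * b : a + b;
        }
        return vals[0];
    }

    /* LL backbone: parses the statement, delegates, then resumes with ';'. */
    static int stmt(int *result) {
        if (strncmp(p, "print", 5) != 0) return 0;
        p += 5;
        *result = embedded_expr();
        while (isspace((unsigned char)*p)) p++;
        if (*p != ';') return 0;
        p++;
        return 1;
    }

    int main(void) {
        p = "print 2+3*4;";
        int v;
        if (stmt(&v)) printf("parsed, value = %d\n", v);   /* 14 */
        else puts("syntax error");
        return 0;
    }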
  • Parser Generation Bottom-Up Parsing LR Parsing Constructing LR Parser
CS421 COMPILERS AND INTERPRETERS

Parser Generation
• Main Problem: given a grammar G, how to build a top-down parser or a bottom-up parser for it?
• parser: a program that, given a sentence, reconstructs a derivation for that sentence ---- if done successfully, it "recognizes" the sentence
• all parsers read their input left-to-right, but construct the parse tree differently:
  bottom-up parsers --- construct the tree from leaves to root (shift-reduce, LR, SLR, LALR, operator precedence)
  top-down parsers --- construct the tree from root to leaves (recursive descent, predictive parsing, LL(1) parsing)

Bottom-Up Parsing
• Construct the parse tree "bottom-up" --- from leaves to the root
• Bottom-up parsing always constructs a right-most derivation
• Important parsing algorithms: shift-reduce, LR parsing
• LR parser components: input (a1 a2 a3 a4 ... an $), stack (strings of grammar symbols and states: s0 X1 s1 ... Xm sm), driver routine, parsing tables, output

LR Parsing
• A sequence of new state symbols s0, s1, s2, ..., sm ----- each state summarizes the information contained in the stack below it.
• Parsing configurations: (stack, remaining input) written as (s0 X1 s1 X2 s2 ... Xm sm, ai ai+1 ai+2 ... an $); the next "move" is determined by sm and ai
• Parsing tables: ACTION[s,a] and GOTO[s,X]

Constructing LR Parser
• How to construct the parsing tables ACTION and GOTO?
• basic idea: first construct a DFA to recognize handles, then use the DFA to construct the parsing tables; different parsing tables yield different LR parsers: SLR(1), LR(1), or LALR(1)
• augmented grammar for a context-free grammar G = G(T, N, P, S) is defined as G' = G'(T, N ∪ {S'}, P ∪ {S' -> S}, S') ------ adding the non-terminal S' and the production S' -> S; S' is the new start symbol.

Copyright 1994 - 2000 Carsten Schürmann, Zhong Shao, Yale University
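To make the ACTION/GOTO machinery concrete, here is a self-contained table-driven driver for the tiny grammar E -> E '+' 'n' | 'n'. The grammar and the hand-built SLR(1) tables are our own example, not the course's.

    /* Table-driven LR driver for the toy grammar
     *     (1) E -> E '+' 'n'      (2) E -> 'n'
     * with SLR(1) tables worked out by hand for this example. */
    #include <stdio.h>

    enum { SHIFT, REDUCE, ACCEPT, ERROR };
    struct act { int kind, arg; };

    /* terminal columns: 0 = 'n', 1 = '+', 2 = '$' */
    static const struct act ACTION[5][3] = {
        /* state 0 */ { {SHIFT,2},  {ERROR,0},  {ERROR,0}  },
        /* state 1 */ { {ERROR,0},  {SHIFT,3},  {ACCEPT,0} },
        /* state 2 */ { {ERROR,0},  {REDUCE,2}, {REDUCE,2} },
        /* state 3 */ { {SHIFT,4},  {ERROR,0},  {ERROR,0}  },
        /* state 4 */ { {ERROR,0},  {REDUCE,1}, {REDUCE,1} },
    };
    static const int GOTO_E[5] = { 1, -1, -1, -1, -1 };   /* GOTO[state, E] */
    static const int RHS_LEN[3] = { 0, 3, 1 };            /* right-hand-side lengths */

    static int term(char c) { return c == 'n' ? 0 : c == '+' ? 1 : 2; }

    int main(void) {
        const char *input = "n+n+n";          /* '$' is the implicit end marker */
        int stack[64], sp = 0, i = 0;
        stack[sp] = 0;                        /* start in state 0 */
        for (;;) {
            int a = term(input[i] ? input[i] : '$');
            struct act act = ACTION[stack[sp]][a];
            if (act.kind == SHIFT) {
                stack[++sp] = act.arg;        /* push the new state, consume token */
                i++;
            } else if (act.kind == REDUCE) {
                sp -= RHS_LEN[act.arg];       /* pop |rhs| states ...            */
                stack[sp + 1] = GOTO_E[stack[sp]];   /* ... then push GOTO[top, E] */
                sp++;
                printf("reduce by rule %d\n", act.arg);
            } else if (act.kind == ACCEPT) {
                puts("accept");
                return 0;
            } else {
                puts("syntax error");
                return 1;
            }
        }
    }

Run on "n+n+n" it prints the reductions in the order of a rightmost derivation in reverse (rule 2, then rule 1 twice), then accepts.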
  • Compiler Design
COMPILER DESIGN - PARSER
http://www.tutorialspoint.com/compiler_design/compiler_design_parser.htm Copyright © tutorialspoint.com

In the previous chapter, we understood the basic concepts involved in parsing. In this chapter, we will learn the various types of parser construction methods available. Parsing can be defined as top-down or bottom-up based on how the parse tree is constructed.

Top-Down Parsing
We have learnt in the last chapter that the top-down parsing technique parses the input and starts constructing a parse tree from the root node, gradually moving down to the leaf nodes. The types of top-down parsing are depicted below:

Recursive Descent Parsing
Recursive descent is a top-down parsing technique that constructs the parse tree from the top while the input is read from left to right. It uses procedures for every terminal and non-terminal entity. This parsing technique recursively parses the input to make a parse tree, which may or may not require back-tracking. But the grammar associated with it (if not left factored) cannot avoid back-tracking. A form of recursive-descent parsing that does not require any back-tracking is known as predictive parsing. This parsing technique is regarded as recursive since it uses a context-free grammar which is recursive in nature.

Back-tracking
Top-down parsers start from the root node (start symbol) and match the input string against the production rules to replace them (if matched). To understand this, take the following example of CFG:

    S → rXd | rZd
    X → oa | ea
    Z → ai

For an input string "read", a top-down parser will behave like this: it will start with S from the production rules and will match its yield to the left-most letter of the input, i.e.
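The excerpt breaks off here, but the behaviour it starts to describe for the input "read" can be reproduced with a small backtracking recursive-descent parser. The code below is our rendering of the chapter's example grammar, not taken from the tutorial.

    /* Backtracking recursive descent for the chapter's grammar:
     *     S -> r X d | r Z d      X -> o a | e a      Z -> a i
     * Each function takes an input position and returns the position after a
     * successful match, or -1 so the caller can backtrack and try the next rule. */
    #include <stdio.h>
    #include <string.h>

    static const char *in;

    static int match(int pos, char c) {
        return (pos >= 0 && in[pos] == c) ? pos + 1 : -1;
    }

    static int parse_X(int pos) {
        int q = match(match(pos, 'o'), 'a');          /* X -> o a */
        if (q >= 0) return q;
        return match(match(pos, 'e'), 'a');           /* backtrack, try X -> e a */
    }

    static int parse_Z(int pos) {
        return match(match(pos, 'a'), 'i');           /* Z -> a i */
    }

    static int parse_S(int pos) {
        int q = match(parse_X(match(pos, 'r')), 'd'); /* S -> r X d */
        if (q >= 0) return q;
        return match(parse_Z(match(pos, 'r')), 'd');  /* backtrack, try S -> r Z d */
    }

    int main(void) {
        in = "read";
        int end = parse_S(0);
        puts(end == (int)strlen(in) ? "accepted" : "rejected");
        return 0;
    }

On "read", X -> o a fails at 'e', the parser backtracks, X -> e a succeeds, and the whole sentence is accepted, exactly the trace the tutorial goes on to describe.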
  • A Simple, Possibly Correct LR Parser for C11 Jacques-Henri Jourdan, François Pottier
A Simple, Possibly Correct LR Parser for C11. Jacques-Henri Jourdan, François Pottier. ACM Transactions on Programming Languages and Systems (TOPLAS), ACM, 2017, 39(4), pp. 1-36. 10.1145/3064848. HAL Id: hal-01633123, https://hal.archives-ouvertes.fr/hal-01633123, submitted on 11 Nov 2017.

Jacques-Henri Jourdan, Inria Paris, MPI-SWS; François Pottier, Inria Paris. The syntax of the C programming language is described in the C11 standard by an ambiguous context-free grammar, accompanied with English prose that describes the concept of "scope" and indicates how certain ambiguous code fragments should be interpreted. Based on these elements, the problem of implementing a compliant C11 parser is not entirely trivial. We review the main sources of difficulty and describe a relatively simple solution to the problem. Our solution employs the well-known technique of combining an LALR(1) parser with a "lexical feedback" mechanism. It draws on folklore knowledge and adds several original aspects, including: a twist on lexical feedback that allows a smooth interaction with lookahead; a simplified and powerful treatment of scopes; and a few amendments in the grammar.
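The "lexical feedback" the abstract refers to is, in its simplest folklore form, a lexer that consults a table of typedef names filled in by the parser. The sketch below shows only that baseline idea with invented names; it is not the authors' refined treatment of scopes and lookahead.

    /* Minimal "lexer hack" sketch: the parser records typedef names, and the
     * lexer consults that table to decide between IDENTIFIER and TYPEDEF_NAME.
     * Folklore baseline only; all names are ours. */
    #include <stdio.h>
    #include <string.h>

    enum token { IDENTIFIER, TYPEDEF_NAME };

    #define MAX_TYPES 64
    static char typedef_names[MAX_TYPES][32];
    static int n_types = 0;

    /* Called by the parser when it reduces a typedef declaration. */
    static void declare_typedef(const char *name) {
        if (n_types < MAX_TYPES)
            strncpy(typedef_names[n_types++], name, 31);
    }

    /* Called by the lexer for every identifier-shaped lexeme (the feedback). */
    static enum token classify(const char *lexeme) {
        for (int i = 0; i < n_types; i++)
            if (strcmp(typedef_names[i], lexeme) == 0)
                return TYPEDEF_NAME;
        return IDENTIFIER;
    }

    int main(void) {
        printf("T is %s\n", classify("T") == TYPEDEF_NAME ? "a type" : "an identifier");
        declare_typedef("T");               /* e.g. after parsing "typedef int T;" */
        printf("T is %s\n", classify("T") == TYPEDEF_NAME ? "a type" : "an identifier");
        return 0;
    }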
  • CLR(1) and LALR(1) Parsers
CLR(1) and LALR(1) Parsers. C. Naga Raju, B.Tech(CSE), M.Tech(CSE), PhD(CSE), MIEEE, MCSI, MISTE, Professor, Department of CSE, YSR Engineering College of YVU, Proddatur.

Contents: Limitations of SLR(1) Parser; Introduction to CLR(1) Parser; Animated example; Limitation of CLR(1); LALR(1) Parser animated example; GATE Problems and Solutions.

Drawbacks of SLR(1): The SLR parser discussed in the earlier class has certain flaws. 1. On a single input, a state may include both a final item and a non-final item. This may result in a shift-reduce conflict. 2. A state may include two different final items. This might result in a reduce-reduce conflict. 3. The SLR(1) parser reduces only when the next token is in FOLLOW of the left-hand side of the production. 4. SLR(1) can resolve shift-reduce conflicts but not reduce-reduce conflicts. These conflicts are handled by the CLR(1) parser by keeping track of lookahead information in the states of the parser. This is also called an LR(1) grammar.

CLR(1) Parser: LR(1) parsing greatly increases the strength of the parser, but also the size of its parse tables. The LR(1) technique does not rely on FOLLOW sets, but keeps the specific lookahead with each item. LR(1) parsing configurations have the general form: A -> X1...Xi • Xi+1...Xj , a (the lookahead).
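The step from canonical LR(1) to LALR(1) can be shown compactly: states whose item cores coincide are merged, which amounts to unioning lookaheads item by item. The sketch below (our own toy representation, operating on a flat list of items rather than on whole automaton states) performs just that union step.

    /* Toy sketch of LALR(1) merging: LR(1) items with the same core
     * (rule, dot position) have their lookahead sets unioned.  Illustration only. */
    #include <stdio.h>

    struct item {
        int rule, dot;          /* the LR(0) core:  A -> X1..Xi . Xi+1..Xj */
        unsigned lookahead;     /* lookahead tokens as a bitmask           */
    };

    /* Merge items in place; returns the number of distinct cores kept. */
    static int merge_by_core(struct item *items, int n) {
        int kept = 0;
        for (int i = 0; i < n; i++) {
            int j;
            for (j = 0; j < kept; j++)
                if (items[j].rule == items[i].rule && items[j].dot == items[i].dot) {
                    items[j].lookahead |= items[i].lookahead;   /* union lookaheads */
                    break;
                }
            if (j == kept)
                items[kept++] = items[i];
        }
        return kept;
    }

    int main(void) {
        /* two LR(1) items sharing a core but not a lookahead, plus one other */
        struct item items[] = {
            { 3, 1, 0x1 },      /* [A -> X . Y, a] */
            { 3, 1, 0x2 },      /* [A -> X . Y, b] */
            { 4, 0, 0x4 },      /* [B -> . Z,   c] */
        };
        int n = merge_by_core(items, 3);
        for (int i = 0; i < n; i++)
            printf("core (rule %d, dot %d) lookaheads 0x%x\n",
                   items[i].rule, items[i].dot, items[i].lookahead);
        return 0;
    }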
  • Packrat Parsing: Simple, Powerful, Lazy, Linear Time
    Packrat Parsing: Simple, Powerful, Lazy, Linear Time Functional Pearl Bryan Ford Massachusetts Institute of Technology Cambridge, MA [email protected] Abstract 1 Introduction Packrat parsing is a novel technique for implementing parsers in a There are many ways to implement a parser in a functional pro- lazy functional programming language. A packrat parser provides gramming language. The simplest and most direct approach is the power and flexibility of top-down parsing with backtracking and top-down or recursive descent parsing, in which the components unlimited lookahead, but nevertheless guarantees linear parse time. of a language grammar are translated more-or-less directly into a Any language defined by an LL(k) or LR(k) grammar can be rec- set of mutually recursive functions. Top-down parsers can in turn ognized by a packrat parser, in addition to many languages that be divided into two categories. Predictive parsers attempt to pre- conventional linear-time algorithms do not support. This additional dict what type of language construct to expect at a given point by power simplifies the handling of common syntactic idioms such as “looking ahead” a limited number of symbols in the input stream. the widespread but troublesome longest-match rule, enables the use Backtracking parsers instead make decisions speculatively by try- of sophisticated disambiguation strategies such as syntactic and se- ing different alternatives in succession: if one alternative fails to mantic predicates, provides better grammar composition properties, match, then the parser “backtracks” to the original input position and allows lexical analysis to be integrated seamlessly into parsing. and tries another.
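Packrat parsing is usually presented in a lazy functional language, but the memoization idea translates directly: one memo cell per (rule, input position), so each rule is evaluated at most once at each position. A minimal C rendering for a two-rule PEG (grammar and names are ours, not the paper's Haskell code) might look as follows.

    /* Minimal packrat-style memoization: one memo cell per (rule, position)
     * storing FAIL or the end position of the match.  Invented PEG:
     *     sum  <-  num '+' sum  /  num          num  <-  [0-9]            */
    #include <stdio.h>
    #include <string.h>

    #define MAXLEN 64
    enum { UNKNOWN = -2, FAIL = -1 };

    static const char *in;
    static int memo_num[MAXLEN + 1], memo_sum[MAXLEN + 1];
    static int evaluations = 0;

    static int parse_num(int pos) {
        if (memo_num[pos] != UNKNOWN) return memo_num[pos];
        evaluations++;
        int r = (in[pos] >= '0' && in[pos] <= '9') ? pos + 1 : FAIL;
        return memo_num[pos] = r;
    }

    static int parse_sum(int pos) {
        if (memo_sum[pos] != UNKNOWN) return memo_sum[pos];
        evaluations++;
        int r = FAIL;
        int p = parse_num(pos);                   /* first alternative: num '+' sum */
        if (p != FAIL && in[p] == '+') {
            int q = parse_sum(p + 1);
            if (q != FAIL) r = q;
        }
        if (r == FAIL) r = parse_num(pos);        /* ordered choice: fall back to num,
                                                     answered from the memo table */
        return memo_sum[pos] = r;
    }

    int main(void) {
        in = "1+2+3";
        int n = (int)strlen(in);
        for (int i = 0; i <= n; i++) memo_num[i] = memo_sum[i] = UNKNOWN;
        int end = parse_sum(0);
        printf("%s, %d rule evaluations for %d characters\n",
               end == n ? "matched" : "no match", evaluations, n);
        return 0;
    }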
  • An Efficient All-paths Parsing Algorithm for Natural Languages (CMU-CS-84-163)
CMU-CS-84-163. An Efficient All-paths Parsing Algorithm for Natural Languages. Masaru Tomita, Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA 15213. 25 October 1984.

Abstract: An extended LR parsing algorithm is introduced and its application to natural language processing is discussed. Unlike the standard LR, our algorithm is capable of handling arbitrary context-free phrase structure grammars including ambiguous grammars, while most of the LR parsing efficiency is preserved. When an input sentence is ambiguous, it produces all possible parses in an efficient manner with the idea of a "graph-structured stack." Comparisons with other parsing methods are made.

This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory under Contract F33615-81-K-1539. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.

Table of Contents: 1. Introduction; 2. LR parsing (2.1. An example; 2.2. Problem in Applying to Natural Languages); 3. MLR parsing (3.1. With Stack List; 3.2. With a Tree-structured Stack; 3.3. With a Graph-structured Stack).
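The "graph-structured stack" that this report introduces (and that NLyacc builds on) can be pictured as stack nodes that may have several predecessors: a top node is shared when two branches reach the same state. The data-structure sketch below uses our own names and is far simpler than the report's full algorithm.

    /* Data-structure sketch of a graph-structured stack: each node holds an LR
     * state and may have several predecessors, so alternative parses share
     * structure instead of duplicating whole stacks. */
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_PREDS 8

    typedef struct gss_node {
        int state;                               /* LR parse state */
        int n_preds;
        struct gss_node *preds[MAX_PREDS];       /* links back down the stack */
    } gss_node;

    static gss_node *gss_push(int state, gss_node *pred) {
        gss_node *n = calloc(1, sizeof *n);
        n->state = state;
        if (pred) n->preds[n->n_preds++] = pred;
        return n;
    }

    /* Merging: if a top node for this state already exists, only a new
     * predecessor link is added instead of creating another node. */
    static gss_node *gss_merge_or_push(gss_node **tops, int *n_tops,
                                       int state, gss_node *pred) {
        for (int i = 0; i < *n_tops; i++)
            if (tops[i]->state == state) {
                if (tops[i]->n_preds < MAX_PREDS)
                    tops[i]->preds[tops[i]->n_preds++] = pred;
                return tops[i];
            }
        return tops[(*n_tops)++] = gss_push(state, pred);
    }

    int main(void) {
        gss_node *bottom = gss_push(0, NULL);
        gss_node *tops[16]; int n_tops = 0;

        /* a conflict splits the stack: two branches from the same bottom node */
        gss_node *a = gss_push(3, bottom);
        gss_node *b = gss_push(5, bottom);

        /* both branches later reach state 7: the tops are merged, not duplicated */
        gss_merge_or_push(tops, &n_tops, 7, a);
        gss_merge_or_push(tops, &n_tops, 7, b);

        printf("top nodes: %d, predecessors of the shared top: %d\n",
               n_tops, tops[0]->n_preds);        /* prints 1 and 2 */
        return 0;
    }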
  • Compiler Construction
Compiler construction. PDF generated using the open source mwlib toolkit (see http://code.pediapress.com/ for more information).

Contents
Introduction: Compiler construction; Compiler; Interpreter; History of compiler writing.
Lexical analysis: Lexical analysis; Regular expression; Regular expression examples; Finite-state machine; Preprocessor.
Syntactic analysis: Parsing; Lookahead; Symbol table; Abstract syntax; Abstract syntax tree; Context-free grammar; Terminal and nonterminal symbols; Left recursion; Backus–Naur Form; Extended Backus–Naur Form; TBNF; Top-down parsing; Recursive descent parser; Tail recursive parser; Parsing expression grammar; LL parser; LR parser; Parsing table; Simple LR parser; Canonical LR parser; GLR parser; LALR parser; Recursive ascent parser; Parser combinator; Bottom-up parsing; Chomsky normal form; CYK algorithm; Simple precedence grammar; Simple precedence parser; Operator-precedence grammar; Operator-precedence parser; Shunting-yard algorithm; Chart parser; Earley parser; The lexer hack; Scannerless parsing.
Semantic analysis: Attribute grammar; L-attributed grammar; LR-attributed grammar; S-attributed grammar; ECLR-attributed grammar; Intermediate language; Control flow graph; Basic block; Call graph; Data-flow analysis; Use-define chain; Live variable analysis; Reaching definition; Three address