Deterministic Parsing of Languages with Dynamic Operators

Deterministic Parsing of Languages with Dynamic Op erators Kjell Post Allen Van Gelder James Kerr Baskin Computer Science Center University of California Santa Cruz, CA 95064 email: [email protected] UCSC-CRL-93-15 July 27, 1993 Abstract Allowing the programmer to de ne op erators in a language makes for more readable co de but also complicates the job of parsing; standard parsing techniques cannot accommo date dynamic grammars. We presentan LR parsing metho dology, called deferreddecision parsing , that handles dynamic op erator declarations, that is, op erators that are declared at run time, are applicable only within a program or context, and are not in the underlying language or grammar. It uses a parser generator that takes pro duction rules as input, and generates a table-driven LR parser, much like yacc. Shift/reduce con icts that involve dynamic op erators are resolved at parse time rather than at table construction time. For an op erator-rich language, this technique reduces the size of the grammar needed and parse table pro duced. The added cost to the parser is minimal. Ambiguous op erator constructs can either b e detected by the parser as input is b eing read or avoided altogether by enforcing reasonable restrictions on op erator declarations. Wehave b een able to describ e the syntax of Prolog, a language known for its lib eral use of op erators, and Standard ML, which supp orts lo cal declarations of op erators. De nite clause grammars DCGs, a novel parsing feature of Prolog, can b e translated into ecient co de by our parser generator. The implementation has the advantage that the token stream need not b e completely acquired b eforehand, and the parsing, b eing deterministic, is approximately linear time. Conventional implementations based on backtracking parsers can require exp onential time. Categories and Sub ject Descriptors: D.3.1 [Programming Languages]: Formal De nitions and Theory| syntax ; D.3.2 [Programming Languages]: Language Classi cations|applicative languages; extensible languages ; D.3.4 [Programming Languages]: Pro cessors|translator writing systems and compiler generators; parsing General Terms: Grammars, Parsers, Op erators Additional Key Words and Phrases: Parser Generators, Ambiguous Grammars, LR Grammars, Attribute Grammars, Context-Free Languages, Logic Programming Originally app eared in \Logic Programming | Pro ceedings of the 1993 International Symp osium ILPS93" 1 1 Intro duction and Background Syntax often has a profound e ect on the usability of a language. Although b oth Prolog and Lisp are conceptually ro oted in recursive functions, Prolog's op erator-based syntax is generally p erceived as b eing easier to read than Lisp's pre x notation. Prolog intro duced two innovations in the use of op erators: 1 1. Users could de ne new op erators at run time with great exibility . 2. Expressions using op erators were semantically equivalent to pre x form expressions; they were a convenience rather than a necessity. Notably, new functional programming languages, suchasML[21] and Haskell [14], are following Prolog's lead in the use of op erators. Prop er use of op erators can make a program easier to read but also complicates the job of parsing the language. Example 1.1: The following Prolog rules make extensive use of op erators to improve readability. X requires Y :- X calls Y. X requires Y :- X calls Z, Z requires Y. Here \:-" and \," are supplied as standard op erators, while \requires" and \calls"have programmer de nitions not shown. Languages that supp ort dynamic op erators have some or all of the following syntactic features: 1. The name of an op erator can b e any legal identi er or symb ol. Perhaps, even \built-in" op erators like \+", \", and \" can b e rede ned. 2. The syntactic prop erties of an op erator can b e changed by the user during parsing of input. 3. Op erators can act as arguments to other op erators. For example, if \_", \^", and \" are in x op erators, the sentence \_ ^"may b e prop er. 4. Op erators can b e overloaded , in that a single identi er can b e used as the name of two or more syntactically di erent op erators. A familiar example is \", which can b e pre x or in x. This pap er addresses problems in parsing such languages, presents a solution based on a generalization of LR parsing and describ es a parser generator for this family of languages. After reviewing some de nitions, we give some background on the problem, and then summarize this pap er's contributions. Later sections discuss several of the issues in detail. 1.1 De nitions We brie y review some standard de nitions concerning op erators, and sp ecify particular terminology used in the pap er. An operator is normally a unary or binary function whose identi er can app ear b efore, after, or b etween its arguments, dep ending on whether its xity is pre x , post x ,orin x . An identi er that has b een declared to b e op erators of di erent xities is said to b e an overloadedoperator .We shall also have o ccasion to consider nul lary op erators, which take no arguments. More general op erator notations, some of which take more than two arguments, are not considered here. Besides xity , op erators havetwo other prop erties, precedence and associativity , whichgovern their use in the language. An op erator's precedence ,orscope , is represented by a p ositiveinteger. This rep ort uses the Prolog convention, which is that the larger precedence numb er means the wider \scop e" and the weaker binding strength. This is the reverse of many languages, hence the synonym scope serves as a reminder. Thus \+" normally has a larger precedence numb er than \*"by this convention. 1 Some earlier languages p ermitted very limited de nitions of new op erators; see section 1.2 2 An op erator's associativity is one of left , right ,ornon .For our purp oses, an expression is a term whose principal function is an op erator. The precedence of an expression is that of its principal op erator. A left right asso ciative op erator is p ermitted to have an expression of equal precedence as its left right argument. Otherwise, arguments of op erators must havelower precedence rememb ering Prolog's order. Non-expression terms have precedence 0; this includes parenthesized expressions, if they are de ned in the grammar. We assume in the rest of this pap er some acquaintance with Prolog [5, 6]. A basic knowledge of LR parsing [1, 3] and the parser generator yacc [15] is also helpful. 1.2 Background and Prior Work Most parsing techniques assume a static grammar and xed op erator priorities. Excellent metho ds have b een develop ed for generating ecient LL parsers and LR parsers from sp eci cations in the form of pro duction rules, sometimes augmented with asso ciativity and precedence declarations for in x op erators [1, 3, 4, 9, 15]. Parser generation metho ds enjoy several signi cant advantages over \hand co ding": 1. The language syntax can b e presented in a readable, nonpro cedural form, similar to pro duction rules. 2. Emb edded semantic actions can b e triggered by parsing situations. 3. For most programming languages, the tokenizer may b e separated from the grammar, b oth in co de and in sp eci cation. 4. Parsing runs in linear time and normally uses sublinear stack space. The price that is paid is that only the class of LR1 languages can b e treated, but this is normally a suciently large class in practice. Earley's algorithm [8] is a general context-free parser. It can handle any grammar but is more exp ensive than the LR-style techniques b ecause the con guration states of the LR0 automaton are computed and 3 manipulated as the parse pro ceeds. Parsing an input string of length n may require O n time although 2 LRk grammars only take linear time, O n space, and an input-bu er of size n.Tomita's algorithm [27] improves on Earley's algorithm by precompiling the grammar into a parse table, p ossibly with multiple entries. Still, the language is xed during the parse and it would not b e p ossible to intro duce or change prop erties of op erators on the y. Incremental parser generators [11, 13] can b e viewed as an application of Tomita's parsing metho d. They can handle mo di cations to the input grammar at the exp ense of recomputing parts of the parse table and having the LR0 automaton available at run time. Garbage collection also b ecomes an issue. To our knowledge these metho ds have never b een applied to parse languages with dynamically declared op erators. Op erator precedence parsing is another metho d with applications to dynamic op erators [18] but it can not handle overloaded op erators. Permitting user-de ned op erators was part of the design of several early pro cedural languages, suchas Algol 68 [28] and EL1 [12], but these designs avoided most of the technical diculties by placing severe restrictions on the de nable op erators. First, in x op erators were limited to 7 to 10 precedence levels. By comparison, C has 15 built-in precedence levels, and Prolog p ermits 1200. More signi cantly, pre x op erators always to ok precedence over in x, preventing certain combinations from b eing interpreted naturally.C de nes some in x to take precedence over some pre x, and Prolog p ermits this in user de nitions.

Deterministic Parsing of Languages with Dynamic Operators

A Declarative Extension of Parsing Expression Grammars for Recognizing Most Programming Languages

Efficient Recursive Parsing

Advanced Parsing Techniques

Verifiable Parse Table Composition for Deterministic Parsing*

Basics of Compiler Design

Context-Aware Scanning and Determinism-Preserving Grammar Composition, in Theory and Practice

An Efficient Context-Free Parsing Algorithm for Natural Languages1

Comp 411 Principles of Programming Languages Lecture 3 Parsing

Head-Driven Transition-Based Parsing with Top-Down Prediction

9 Deterministic Top-Down Parsing

The CYK Algorithm Informatics 2A: Lecture 20

LNCS 9031, Pp