Deterministic Parsing of Languages
with Dynamic Op erators
Kjell Post Allen Van Gelder James Kerr
Baskin Computer Science Center
University of California
Santa Cruz, CA 95064
email: [email protected]
UCSC-CRL-93-15
July 27, 1993
Abstract
Allowing the programmer to de ne op erators in a language makes for more readable co de but also complicates
the job of parsing; standard parsing techniques cannot accommo date dynamic grammars. We presentan
LR parsing metho dology, called deferreddecision parsing , that handles dynamic op erator declarations, that
is, op erators that are declared at run time, are applicable only within a program or context, and are not in
the underlying language or grammar. It uses a parser generator that takes pro duction rules as input, and
generates a table-driven LR parser, much like yacc. Shift/reduce con icts that involve dynamic op erators
are resolved at parse time rather than at table construction time.
For an op erator-rich language, this technique reduces the size of the grammar needed and parse table
pro duced. The added cost to the parser is minimal. Ambiguous op erator constructs can either b e detected
by the parser as input is b eing read or avoided altogether by enforcing reasonable restrictions on op erator
declarations. Wehave b een able to describ e the syntax of Prolog, a language known for its lib eral use of
op erators, and Standard ML, which supp orts lo cal declarations of op erators.
De nite clause grammars DCGs, a novel parsing feature of Prolog, can b e translated into ecient
co de by our parser generator. The implementation has the advantage that the token stream need not
b e completely acquired b eforehand, and the parsing, b eing deterministic, is approximately linear time.
Conventional implementations based on backtracking parsers can require exp onential time.
Categories and Sub ject Descriptors: D.3.1 [Programming Languages]: Formal De nitions and Theory|
syntax ; D.3.2 [Programming Languages]: Language Classi cations|applicative languages; extensible
languages ; D.3.4 [Programming Languages]: Pro cessors|translator writing systems and compiler
generators; parsing
General Terms: Grammars, Parsers, Op erators
Additional Key Words and Phrases: Parser Generators, Ambiguous Grammars, LR Grammars, Attribute
Grammars, Context-Free Languages, Logic Programming
Originally app eared in \Logic Programming | Pro ceedings of the 1993 International Symp osium ILPS93" 1
1 Intro duction and Background
Syntax often has a profound e ect on the usability of a language. Although b oth Prolog and Lisp are
conceptually ro oted in recursive functions, Prolog's op erator-based syntax is generally p erceived as b eing
easier to read than Lisp's pre x notation. Prolog intro duced two innovations in the use of op erators:
1
1. Users could de ne new op erators at run time with great exibility .
2. Expressions using op erators were semantically equivalent to pre x form expressions; they were a
convenience rather than a necessity.
Notably, new functional programming languages, suchasML[21] and Haskell [14], are following Prolog's
lead in the use of op erators. Prop er use of op erators can make a program easier to read but also complicates
the job of parsing the language.
Example 1.1: The following Prolog rules make extensive use of op erators to improve readability.
X requires Y :- X calls Y.
X requires Y :- X calls Z, Z requires Y.
Here \:-" and \," are supplied as standard op erators, while \requires" and \calls"have programmer
de nitions not shown.
Languages that supp ort dynamic op erators have some or all of the following syntactic features:
1. The name of an op erator can b e any legal identi er or symb ol. Perhaps, even \built-in" op erators like
\+", \ ", and \" can b e rede ned.
2. The syntactic prop erties of an op erator can b e changed by the user during parsing of input.
3. Op erators can act as arguments to other op erators. For example, if \_", \^", and \" are in x
op erators, the sentence \_ ^"may b e prop er.
4. Op erators can b e overloaded , in that a single identi er can b e used as the name of two or more
syntactically di erent op erators. A familiar example is \ ", which can b e pre x or in x.
This pap er addresses problems in parsing such languages, presents a solution based on a generalization of
LR parsing and describ es a parser generator for this family of languages. After reviewing some de nitions,
we give some background on the problem, and then summarize this pap er's contributions. Later sections
discuss several of the issues in detail.
1.1 De nitions
We brie y review some standard de nitions concerning op erators, and sp ecify particular terminology used
in the pap er. An operator is normally a unary or binary function whose identi er can app ear b efore, after,
or b etween its arguments, dep ending on whether its xity is pre x , post x ,orin x . An identi er that has
b een declared to b e op erators of di erent xities is said to b e an overloadedoperator .We shall also have
o ccasion to consider nul lary op erators, which take no arguments. More general op erator notations, some
of which take more than two arguments, are not considered here. Besides xity , op erators havetwo other
prop erties, precedence and associativity , whichgovern their use in the language.
An op erator's precedence ,orscope , is represented by a p ositiveinteger. This rep ort uses the Prolog
convention, which is that the larger precedence numb er means the wider \scop e" and the weaker binding
strength. This is the reverse of many languages, hence the synonym scope serves as a reminder. Thus \+"
normally has a larger precedence numb er than \*"by this convention.
1
Some earlier languages p ermitted very limited de nitions of new op erators; see section 1.2 2
An op erator's associativity is one of left , right ,ornon .For our purp oses, an expression is a term whose
principal function is an op erator. The precedence of an expression is that of its principal op erator. A
left right asso ciative op erator is p ermitted to have an expression of equal precedence as its left right
argument. Otherwise, arguments of op erators must havelower precedence rememb ering Prolog's order.
Non-expression terms have precedence 0; this includes parenthesized expressions, if they are de ned in the
grammar.
We assume in the rest of this pap er some acquaintance with Prolog [5, 6]. A basic knowledge of LR
parsing [1, 3] and the parser generator yacc [15] is also helpful.
1.2 Background and Prior Work
Most parsing techniques assume a static grammar and xed op erator priorities. Excellent metho ds have b een
develop ed for generating ecient LL parsers and LR parsers from sp eci cations in the form of pro duction
rules, sometimes augmented with asso ciativity and precedence declarations for in x op erators [1, 3, 4, 9, 15].
Parser generation metho ds enjoy several signi cant advantages over \hand co ding":
1. The language syntax can b e presented in a readable, nonpro cedural form, similar to pro duction rules.
2. Emb edded semantic actions can b e triggered by parsing situations.
3. For most programming languages, the tokenizer may b e separated from the grammar, b oth in co de
and in sp eci cation.
4. Parsing runs in linear time and normally uses sublinear stack space.
The price that is paid is that only the class of LR1 languages can b e treated, but this is normally a
suciently large class in practice.
Earley's algorithm [8] is a general context-free parser. It can handle any grammar but is more exp ensive
than the LR-style techniques b ecause the con guration states of the LR0 automaton are computed and
3
manipulated as the parse pro ceeds. Parsing an input string of length n may require O n time although
2
LRk grammars only take linear time, O n space, and an input-bu er of size n.Tomita's algorithm
[27] improves on Earley's algorithm by precompiling the grammar into a parse table, p ossibly with multiple
entries. Still, the language is xed during the parse and it would not b e p ossible to intro duce or change
prop erties of op erators on the y.
Incremental parser generators [11, 13] can b e viewed as an application of Tomita's parsing metho d. They
can handle mo di cations to the input grammar at the exp ense of recomputing parts of the parse table and
having the LR0 automaton available at run time. Garbage collection also b ecomes an issue.
To our knowledge these metho ds have never b een applied to parse languages with dynamically declared
op erators.
Op erator precedence parsing is another metho d with applications to dynamic op erators [18] but it can
not handle overloaded op erators.
Permitting user-de ned op erators was part of the design of several early pro cedural languages, suchas
Algol 68 [28] and EL1 [12], but these designs avoided most of the technical diculties by placing severe
restrictions on the de nable op erators. First, in x op erators were limited to 7 to 10 precedence levels. By
comparison, C has 15 built-in precedence levels, and Prolog p ermits 1200. More signi cantly, pre x op erators
always to ok precedence over in x, preventing certain combinations from b eing interpreted naturally.C
de nes some in x to take precedence over some pre x, and Prolog p ermits this in user de nitions. For
example, no declarations in the EL1 or Algol 68 framework p ermit not X=Y to b e parsed as not X=Y.
Scant details of the parsing metho ds can b e found in the literature, but it app ears that one implementation
of EL1 used LR parsing with a mechanism for changing the token of an identi er that had b een dynamically
declared as an op erator \based on lo cal context" b efore the token reached the parser [12]. Our approach
generalizes this technique, p ostp oning the decision until after the parser has seen the token, and even until
after additional tokens have b een read. 3
op decls
sentences
sentences
standard deferred
op decls
parser parser rules parser parser
rules
generator generator
parse tree parse tree
Figure 1: Standard Parser Generator and Deferred Decision Parser Generator.
Languages have b een designed to allow other forms of user-de ned syntax b esides unary and binary
op erators. Among them, EL1 included \bracket- x" op erators and other dynamic syntax; in some cases
a new parser would b e generated [12]. More recently,Peyton Jones [16] describ es a technique for parsing
programs that involves user-de ned dist x op erators e.g. if-then-else-fi, but without supp ort for
precedence and asso ciativity declarations.
With the advent of languages that p ermit dynamic op erators, numerous ad hoc parsers have b een
develop ed. The tokenizer is often emb edded in these parsers with attendant complications. It is frequently
dicult to tell precisely what language they parse; the parser co de literally de nes the language.
Example 1.2: Using standard op erator precedence s, pre x \ " binds less tightly than in x \" in p opular
versions of Prolog. However, - 3 * 4 was found to parse as -3 * 4 whereas -X*4was found
to parse as - X * 4. Numerous other anomalies can b e demonstrated.
Indeed, written descriptions of the \Edinburgh syntax" for Prolog are acknowledged to b e approximations,
and the \ultimate de nition" seems to b e the program co de, read.pl.
1.3 Summary of Contributions
A metho d called deferreddecision parsing has b een develop ed with the ob jective of building up on the
techniques of LR parsing and parser generation, and enjoying their advantages mentioned ab ove, while
extending the metho dology to incorp orate dynamic op erators. The metho d is an extension of earlier work
by Kerr [17]. It supp orts all four features that were listed ab ove as b eing needed by languages with dynamic
op erators. The resulting parsers are deterministic, and su er only a small time p enalty when compared to
LR parsing without dynamic op erators. They use substantially less time and space than Earley's algorithm.
In \standard" LR parser generation, as done by yacc and similar to ols, shift/reduce con icts, as
evidenced bymultiple entries in the initial parse table, are resolved by consulting declarations concerning
static op erator precedence and asso ciativity [15]. If the pro cess is successful, the nal parse table b ecomes
deterministic.
The main idea of deferred decision parsing is that shift/reduce con icts involving dynamic op erators
are left unresolved when the parser is generated. At run time the current declarations concerning op erator
precedence and asso ciativity are consulted to resolveanambiguity when it actually arises g 1. Another
imp ortant extension is the ability to handle a wider range of xities, as well as overloading, compared to
standard LR parser generators. These extensions are needed to parse several newer languages.
The parser generator, implemented in a standard dialect of Prolog, pro cesses DCG-style input consisting
of pro duction rules, normally with emb edded semantic actions. The output is Prolog source co de for the
parser, with appropriate calls to a scanner and an op erator mo dule.
The op erator mo dule provides a standard interface to the dynamic op erator table used by the parser.
Thus, the language designer decides what language constructs denote op erator declarations; when such 4
constructs are recognized during parsing, the asso ciated semantic actions mayinteract with the op erator
mo dule. Pro cedures are provided to query the state of the op erator table and to up date it. Interaction with
the op erator mo dule is shown in the example in g 4.
We assume that the grammar is designed in suchaway that semantic actions that up date dynamic
op erators cannot o ccur while a dynamic op erator is in the parser stack; otherwise a grammatical monstrosity
might b e created. In other words, it should b e imp ossible for a dynamic op erator to app ear in a parse
tree as an ancestor of a symb ol whose semantic action can change the op erator table. Such designs are
straightforward and natural, so we see no need for a sp ecial mechanism to enforce this constraint. For
example, if dynamic op erator prop erties can b e changed up on reducing a \sentence", then the grammar
should not p ermit dynamic op erators b etween \sentences", for that would place a dynamic op erator ab ove
\sentence" in some parse trees.
Standard op erator-con ict situations are easily handled by a run-time resolution mo dule. However, Prolog
o ers very general op erator-declaration capabilities, which in theory can intro duce signi cant diculties with
overloaded op erators. Actually, these complicated situations rarely arise in practice, as users refrain from
de ning an op erator structure that they cannot understand themselves. In Edinburgh Prolog [5] for instance,
the user can de ne a symb ol as an op erator of all three xities and then use the symb ol as an op erator,
or as an op erand nullary op erator, or as the functor of a comp ound term. As the Prolog standardization
committee recognized in a 1990 rep ort [25],
\These p ossibilities provide multiple opp ortunities for ambiguity and the draft standard has
therefore de ned restrictions on the syntax so that a expressions can still b e parsed from left to
right without needing signi cant backtracking or lo ok-ahead:::"
A preliminary rep ort on this work [24] showed that many of the prop osed restrictions were unnecessary. The
ISO committee has subsequently relaxed most of the restrictions, but at the exp ense of more complex rules
for terms [26].
The mo dular design of our system p ermits di erent con ict resolution strategies and op erator-restriction
p olicies to b e \plugged in", and thus may serveasavaluable to ol for investigating languages that are
well-de ned, yet havevery exible op erator capabilities.
So-called de nite clause grammars DCGs are a syntactic variant of Prolog, in which statements are
written in a pro duction-rule style, with emb edded semantic actions [5]. This style p ermits input-driven
programs to b e develop ed quickly and concisely. Our parser generator provides a foundation for DCG
compilation that overcomes some of the de ciencies of existing implementations. These de ciencies include
the need to have acquired the entire input string b efore parsing b egins, and the fact that backtracking o ccurs,
even in deterministic grammars.
Our p oint of view is to regard the DCG as a translation scheme in which the arguments of predicates
app earing as nonterminals in the DCG are attributes; semantic actions may manipulate these attributes.
Essentially, a parser is a DCG in which all attributes are synthesized, and each nonterminal has a parse tree
as its only attribute. Synthesis of attributes is accomplished naturally in LR parsing, as semantic actions are
executed after the asso ciated pro duction has b een reduced. The thrust of our research has b een to correctly
handle inherited attributes. This work is discussed in Section 5.
2 Deferred Decision Parsing
The parser generator yacc disambiguates con icts in grammars by consulting programmer-supplied
information ab out precedence and asso ciativity of certain tokens, which normally function as in x op erators.
Deferred decision parsing p ostp ones the resolution of con icts involving dynamic op erators until run time.
As a running example we will use a grammar for a subset of Prolog terms with op erators of all three
xities. At run time the name of each op erator, together with its precedence, xity, and asso ciativity,is
stored in the current op erator table e.g. g. 2. The parser has access to the current op erator table and is
resp onsible for converting tokens from the scanner so that an identi er with an entry in the current op erator
table is translated to the appropriate dynamic op erator token. 5
Pre x In x Post x
Name Prec Assoc Prec Assoc Prec Assoc
+ 300 right 500 left