University of Amsterdam Programming Research Group

Scannerless Generalized-LR

Eelco Visser

Report P9707 August 1997

University of Amsterdam Department of Computer Science

Programming Research Group

Scannerless generalized-lr parsing

Eelco Visser

Report P9707 August 1997 E. Visser

Programming Research Group Department of Computer Science University of Amsterdam

Kruislaan 403 NL-1098 SJ Amsterdam The Netherlands

tel. +31 20 525 7590

e-mail: [email protected]

Acknowledgments The author thanks Arie van Deursen Jan van Eijck Annius Gro enink and Paul Klint

for useful suggestions and comments on previous versions of this pap er

This research was supp orted by the Netherlands Computer Science Research Foundation SION with

nancial supp ort from the Netherlands Organisation for Scientic Research NWO Pro ject

Incremental parser generation and contextsensitive disambiguation a multidisciplinary p ersp ective

Universiteit van Amsterdam

Contents

Intro duction

Scannerless GeneralizedLR Parsing

Architecture

Contributions

Overview

Advantages

Problems Solutions

Grammar Normalization

Normal Form

Normalization

Semantics

Disambiguation

Disambiguation by Priorities

Lexical Disambiguation

Longest Match

Prefer Literals

Automatic Lexical Disambiguation

Parser Generation

ShiftReduce Parsing

First

Follow

Goto Table

Action Table

Remarks

Automatic Lexical Disambiguation

Prefer Literals

Longest Match

Reject Pro ductions

Semantics

Expressive Power

GeneralizedLR Parsing

Parse Forest

Graph Structured Stack

Reject Reductions

The Algorithm

Remarks

Implementation

Related Work

Conclusions

Scannerless GeneralizedLR Parsing

Eelco Visser

Current techniques have a numb er of problems These

include the limitations of parser generators for deterministic languages and the

complex interface b etween scanner and parser Scannerless parsing is a pars

ing technique in which lexical and contextfree syntax are integrated into one

grammar and are all handled by a single contextfree analysis phase This ap

proach has a numb er of advantages including discarding of the scanner and

lexical disambiguation by means of the context in which a lexical token o ccurs

Scannerless parsing generates a numb er of interesting problems as well Inte

grated grammars do not t the requirements of the conventional deterministic

parsing techniques A plain contextfree grammar formalism leads to unwieldy

grammars if all lexical information is included Lexical disambiguation needs

to b e reformulated for use in contextfree parsing

The scannerless generalizedLR parsing approach presented in this pap er

solves these problems Grammar normalization is used to supp ort an expressive

grammar formalism without complicating the underlying machinery Follow re

strictions are used to express longest match lexical disambiguation Reject pro

ductions are used to express the prefer keywords rule for lexical disambiguation

The SLR parser generation algorithm is adapted to implement disambigua

tion by general priority and asso ciativity declarations and to interpret follow

restrictions GeneralizedLR parsing is used to provide dynamic lo okahead and

to supp ort parsing of arbitrary contextfree grammars including ambiguous ones

An adaptation of the GLR algorithm supp orts the interpretation of grammars

with reject pro ductions

Intro duction

Parsing is one of the areas of computer science where program generation is a

routine technique that is successfully applied to generate parsers for program

ming languages given their formal denition by means of a contextfree gram

mar At least in theory In practice most parser generators accept only a lim

ited subset of the contextfree grammars such as LL or LALR grammars

Since most natural grammars for languages do not resp ect these limitations

the language designer or writer has to b end over backwards to t the

grammar into the restrictions p osed by the grammar formalism by rewriting

grammar rules intro ducing adho c solutions for parse table conicts or resort

ing to side eects in the parser Even if one succeeds in pro ducing a grammar

that resp ects the restrictions a small extension or mo dication of the language

might jeopardize the careful balance of tricks which makes maintenance of to ols

for the language troublesome

Another source of problems in generated parsers is the division b etween the

phase and the contextfree analysis phase and the corresp onding

division of the grammar into a regular grammar dening the lexical syntax and

a contextfree grammar dening the contextfree syntax A scanner divides the

character string into tokens according to the lexical syntax A parser structures

the token string into a tree according to the contextfree syntax

At the interface b etween scanner and parser the lexical tokens are passed

from the scanner to the parser In the most straightforward scenario the scan

ner pro duces a stream of tokens without intervention from the parser This

entails that no knowledge of the parsing context is available in the scanner and

thus no lexical analysis decisions can rely on such information It is dicult

to unambiguously dene the lexical syntax of a language by means of only reg

ular grammars Therefore lexical analysis and the interface with contextfree

analysis are usually extended First lexical disambiguation heuristics such as

prefer longest match and prefer keyword are applied to reduce the numb er of

readings If there remain ambiguities after application of these rules the scanner

might pro duce multiple streams of tokens representing all p ossible partitionings

of the string into tokens according to the regular grammar The parser should

then b e able to cop e with this nonlinear input It is also p ossible to supply feed

back from the parser to the scanner to reduce the numb er of applicable grammar

rules For instance by sp ecifying the lexical categories that are exp ected for

the next token

In all such schemes lexical analysis b ecomes more complicated than the

simple nite automaton mo del that motivated the use of regular grammars

Contextfree parsing functionality starts to app ear b oth inside the scanner and

at the interface b etween scanner and parser and often op erational elements

corrupt the declarativity of the language denition As a consequence many

grammars are ambiguous if only the pure regular and contextfree grammar are

considered as such and reasoning ab out the language b eing dened b ecomes

dicult

Scannerless GeneralizedLR Parsing

In this pap er we describ e an approach to syntax denition and parser gener

ation that overcomes many of these problems The approach is based on the

integration and improvement of scannerless parsing generalizedLR parsing and

grammar normalization Because of the integration of the former two the ap

proach is called scannerless generalizedLR parsing

Scannerless Parsing Scannerless parsing is a parsing technique that do es

not use a scanner to divide a string into lexical tokens Instead lexical analysis is

integrated in the contextfree analysis of the entire string It comes up naturally

when considering grammars that completely describ e the syntax of a language

The term scannerless parsing was coined by Salomon and Cormack

They use complete character level grammars describing the entire syntax of a

language down to the character level Since conventional LR parser generation

yields tables with to o many conicts they use an extension of SLR parser

generation called noncanonical SLR However even this extension makes it

hard to dene a grammar without conicts

GeneralizedLR Parsing The conventional LR parsing techniques and esp e

cially scannerless LR parsing suer from conicts in the parse table There are

two causes for conicts in LR parse tables ambiguities and lack of lo okahead

If a conict is caused by an ambiguity any of the p ossible actions will lead to a

successful parse If it was caused by a lo okahead problem one of the actions will

lead to success and the others will fail Which action will b e successful cannot

b e decided statically Since ambiguity of a contextfree grammar is undecidable

Floyd it is also undecidable whether a conict is due to an ambiguity or

to a lack of lo okahead Because complete character level grammars frequently

need arbitrary length lo okahead metho ds to solve conicts in the table will not

always succeed

GeneralizedLR parsing is an extension of LR parsing that interprets the

conicts in the parse table by forking o a parser from the main parser for each

p ossible action in case of a conict If such a conict turns out to lead to an

ambiguity the parser constructs a parse forest a compact representation of all

p ossible parse trees for a sentence But if the conict was caused by lack of lo oka

head the forked parsers for the wrong track will fail In this manner lo okahead

is handled dynamically Therefore generalizedLR parsing is an ideal technique

to solve the lo okahead problems of scannerless parsing GeneralizedLR parsing

was intro duced by Tomita building on the theoretical framework of Lang

It was improved by Rekers to handle all contextfree grammars

In this pap er we extend Rekers version of the algorithm with reject reductions

a facility needed for lexical disambiguation

Grammar Normalization An asp ect of the division b etween lexical and con

textfree syntax that aects the sp ecication of syntax is the denition of layout

ie the whitespace and comments that can o ccur at arbitrary places b etween

tokens In the conventional setting layout is analyzed by the scanner and then

thrown away The parser never sees the layout tokens Therefore layout can

also b e ignored in the sp ecication of contextfree syntax However in a com

plete character level grammar all asp ects of the syntax are completely dened

including the syntax and p ositions of layout This can lead to rather unwieldy

grammars that declare the o ccurrence of layout as separator b etween all gram

mar symb ols in contextfree pro ductions

Grammar normalization is a technique used to dene an expressive gram

mar formalism in terms of simple contextfree grammars An example of a

normalization pro cedure is the addition of layout symb ols b etween the sym

b ols in contextfree pro ductions Other examples are the denition of regular

expressions by means of pro ductions and the attening of mo dular grammars

An imp ortant asp ect of the scannerless generalizedLR approach is the use of

grammar normalization to keep grammars small and usable The syntax deni

tion formalism SDF used in the approach is a formalism for concise denition

of complete character level grammars SDF is a generalization of the syntax

denition formalism SDF of Heering et al The formalism and normal

ization pro cedure is dened in Visser c

Architecture

The typical architecture of an application of SDF is depicted in Figure A

program text pro cessor that transforms text into text is comp osed of a

parser frontend that analyzes the input text and pro duces a structured repre

sentation of the text in the form of a the actual pro cessor that

p erforms a transformation from a parse tree to another one and a pretty

printer backend that pro duces text corresp onding to a transformed parse tree

Pro cessors can b e for instance interpreters data ow analyzers or

program transformation to ols

The input language of a pro cessor is sp ecied in the syntax denition formal

ism SDF Given a language denition in SDF and a tree to tree pro cessor the

corresp onding text to text pro cessor is constructed using a grammar normalizer

a parser generator a parser and a prettyprinter generator

Grammar Normalizer A language denition in SDF is normalized to a

plain contextfree grammar extended with character classes priority rules fol

low restrictions and reject pro ductions Normalization is briey discussed in x

A full denition of SDF normalization can b e found in Visser c

Parser Generator From a normalized syntax denition a parse table is gen

erated using an extension of the standard SLR algorithm with character

classes priorities follow restrictions and reject pro ductions The parser gener

ator accepts arbitrary contextfree grammars The techniques used in the parser

generator are discussed in x

Parser A parse table is interpreted by a generic language indep endent

SGLR parser which reads a text and pro duces a parse tree At the heart of the

parser is an extension of the GLR algorithm of Rekers that reads char

acters directly without using a scanner The extension of the GLR algorithm

with reject reductions is discussed in x

PrettyPrinter A prettyprinter is used to translate the output tree of the

pro cessor to text The pretty printer itself can also b e generated from the

denition of the output language This is describ ed in Van den Brand and

Visser and is not further discussed here

Contributions

The scannerless generalizedLR parsing approach presented in this pap er is a

new p owerful parsing metho d that supp orts concise sp ecication of languages

The technical contributions the details of which will b e discussed later on of

the approach are

The normalization of grammars to eliminate features enhancing the ex

pressivity of the formalism in particular the integration of lexical and

contextfree syntax by means of normalization into a single grammar

The use of GLR parsing for scannerless parsing to deal with unb ounded

lo okahead

1

Here text denotes a linear representation of a program in some character co de eg ASCI I

or UniCo de

0

L denition L denition

in SDF in SDF

Grammar

Normalizer

L denition

in SDF

nf

Parser Prettyprinter

Generator Generator

L parse

table

0 0

L parse L parse L pretty

Parser

tree tree printer

0

L text Pro cessor L text

Figure Architecture of an SDF application

Static disambiguation by means of priorities by interpreting priority dec

larations in the parser generator Priorities are completely expressed in

the parse table

The use of character classes in grammars to compact the parse table

The use of follow restrictions to dene longest match disambiguation and

the interpretation of follow restrictions in the parse table

Prefer literals disambiguation by means of reject pro ductions Several

expressivity results ab out contextfree grammars with reject pro ductions

Implementation of parsers for such grammars in an extension of the GLR

algorithm

Overview

In the next section we will argue in more detail that scannerless parsing has

a numb er of denite advantages over parsing with scanners but that it has

not b een intro duced b efore b ecause of the limitations of conventional parsing

techniques In the rest of the pap er we present several techniques that overcome

these limitations and result in a combined approach encompassing grammar

formalism and parsing techniques that do es make scannerless parsing feasible

Scannerless Parsing

The term scannerless parsing was coined by Salomon and Cormack

to indicate parsing without a separate lexical analysis phase using a scanner

based on a deterministic nite automaton The parser directly reads the char

acters of a text This entails the integration of the denition of lexical and

contextfree syntax in one grammar

Consider the following SDF denition of a simple language of expressions

consisting of identiers additions and multiplications

sorts Id Exp

lexical syntax

az Id

tn LAYOUT

contextfree syntax

Id Exp

Exp Exp Exp left

Exp Exp Exp left

contextfree priorities

Exp Exp Exp

Exp Exp Exp

The rst line declares the sorts say the nonterminals of the grammar The

next three lines declare the lexical syntax of the language such that identiers

are lists of one or more lowercase letters and layout consists of spaces tabs and

newlines The next four lines declare the contextfree syntax of the language

An expression is either an identier or an addition or multiplication of two

expressions Observe that the grammar is ambiguous and that in order to

disambiguate it priority and asso ciativity declarations have b een added The

last three lines declare that multiplication has higher priority than addition

The left attribute declares addition and multiplication to b e leftasso ciative

The conventional way to interpret such a grammar to parse a string is as

follows Divide the string into tokens according to the lexical syntax in

all p ossible ways Apply lexical disambiguation rules to select the desired

division For instance given the string ab the rule prefer longest match

would prefer the division over ie the longest p ossible identier is

ab a b

selected Throw away layout tokens Parse the resulting token string

according to the contextfree syntax The result is a parse tree that contains as

leafs the tokens yielded by lexical analysis

In scannerless parsing we have the following sequence Combine the

denition of lexical and contextfree syntax into a single contextfree grammar

All tokens on the lefthand side of pro ductions in the contextfree syntax are

"+"

+

\32

<[a-z]+-LEX> \32 <[a-z]+-LEX>

<[a-z]+-LEX> <[a-z]+-LEX> \32 \t c

a b

Figure Parse tree for the string abtc

explicitly separated by layout All grammar symb ols are renamed such that

the symb ols o ccurring in the lexical syntax have the form h LEXi and those in

CFi This is done to keep the two levels the contextfree syntax have the form h

separated For instance the addition pro duction is transformed into

ExpCF LAYOUTCF LAYOUTCF ExpCF ExpCF

The symb ol hLAYOUTCFi represents the syntax of layout that can app ear b e

tween tokens In x this will b e explained in more detail The complete inte

grated grammar corresp onding to the denition ab ove is presented in Figure

on page Parse the characters of the string according to the normal

ized grammar The result is a parse tree that contains as leafs the characters

of the string The tokens are recognizable as subtrees For example consider

the parse tree in Figure Observe that the symb ols hLLEXi and hLCFi are

abbreviations for hLAYOUTLEXi and hLAYOUTCFi and denote layout no des

In a sense nothing is new In a conventional parser if we would instruct the

scanner to make each character into a corresp onding token the parser that reads

these tokens would in eect b e scannerless The reason that we distinguish

scannerless parsing from parsing with a real scanner is that the former generates

some sp ecial problems that are avoided by using a scanner

Advantages

Now that we have an understanding of what scannerless parsing is we might

ask why it is any go o d We will discuss the advantages one by one

No Scanner The obvious advantage of scannerless parsing is that no imple

mentation of a scanner and scanner generator is needed and that the complicated

interface b etween scanner and parser can b e eliminated

Integrated Uniform Grammar Formalism A language is completely dened

by means of one grammar All grammar rules are explicit and formally sp ecied

Lexical syntax and contextfree syntax are sp ecied with the same formalism

There is no longer a distinction b etween regular and contextfree grammars

This makes the formalism more uniform and orthogonal All features available

for lexical syntax are available for contextfree syntax and vice versa This

simplies use and implementation of the formalism

Disambiguation by Context Because of the integration of lexical and context

free syntax lexical analysis is guided by contextfree analysis If a token do es

not make sense at some p osition it will not even b e considered For instance

in the example ab ove the longest match rule do es not have to b e used to prefer

ab over a b b ecause the latter situationtwo adjacent identiersis never

syntactically correct

The paradigmatic example of contextdep endent lexical disambiguation is

the interplay b etween subrange typ es and oating p oint numb er constants in

Pascal Subrange typ es have the form kl where k and l are constants If

oating p oint numb er constants could have the form i and j with i and j

numb ers then ij could b e scanned either as i j ie the range from i to

j ie two adjacent oating p oint numb ers In scannerless parsing j or as

i

this ambiguity is solved automatically by context A scanner that has no access

to the context and applies the longest match rule would always cho ose the

second p ossibility two adjacent numb ers and fail Apparently for this reason

the syntax of Pascal only allows real numb ers of the form ij where i and j

are nonempty lists of digits Jensen and Wirth Similar examples can

b e found in many languages

Another example where the parsing context is relevant for making lexical

decisions is the syntax of lists of statements that can b e separated by semicolons

or newlines Consider the grammar

lexical syntax

tn LAYOUT

contextfree syntax

begin Stat n end Block

The lexical syntax denes newlines n to b e layout The contextfree syntax

denes blo cks as lists of zero or more statements starting with the keyword

begin and ending with the keyword end The list is declared by the construct

fA B g which declares a list of As separated by B s ie a list of the form

A B A B A In this case the separator is either a semicolon or a newline

This means that newlines are b oth layout and nonlayout If the disambiguation

rule prefer nonlayout is applied to the tokens of this language all newlines

even those not used as separator of statementswill b e wrongly characterized

as nonlayout A scannerless parser will recognize the newlines used as separator

simply by considering the parsing context

Conservation of Lexical Structure Scanners do usually not maintain the

phrase structure of the tokens they pro duce For example the grammar

lexical syntax

az Id

Id Path

denes the lexical syntax of path expressions as o ccur for instance in the

naming conventions of treestructured lesystems This syntax has to b e lexical

since no layout should o ccur b etween the identiers and separators of a path A

scanner would pro duce a string containing the characters of a path expression

without the structure assigned to it by the grammar ie the distinction b etween

identiers in the path is lost This entails that the semantic pro cessor must

reparse such tokens to deal with their internal structure

Conservation of Layout Scanners throw away the layout b etween tokens

of a phrase In this way the parser can ignore layout which simplies the

parsing problem However there are examples of op erations on programs that

require the structure of the program ie the parse tree but also the layout in

the source Examples are source to source translations transformations on the

source text and program do cumentation to ols Although a conventional parser

could b e instructed to add the layout to the parse tree via some detour this

would usually require a nonstandard extension of the metho d If the layout

would b e explicitly sp ecied in the grammar we would get an approach that is

very similar to scannerless parsing

Expressive Lexical Syntax Contextfree grammars provide a more expres

sive grammar formalism for lexical syntax than regular grammars This addi

tional expressive p ower op ens the way to concise denitions of nested comments

and syntactically correct expressions in comments For example consider the

following extension of the expression grammar ab ove that denes Clike com

ments as a list of comment words b etween and

sorts ComWord Comment

lexical syntax

tn ComWord

contextfree syntax

ComWord Comment

Comment LAYOUT

A comment word is a nonempty list of characters that are not whitespace or

Since the denition of comments is part of the contextfree syntax comment

words can b e separated by layout These comments are made into layout by the

last line of the grammar which is an injection of comment into layout Because

layout can o ccur b etween any two adjacent tokens comment can as well

According to this denition comments can b e nested b ecause comment

words are separated by layout which includes comments For instance the

string

h height

w width

d depth

is a syntactically correct expression over the grammar ab ove in which part of

an expression including comments is commented out This is a tedious job if

nested comments are not supp orted by the language

Moreover the following extension of the grammar ab ove denes that a com

ment word can also b e an expression b etween two s

contextfree syntax

Exp ComWord

This entails that comments can contain quoted expressions that must b e syn

tactically correct For instance the following sentence contains the expression

x y as part of a comment

a b an expression x y denotes

the addition of x and y

c

This is useful for typ esetting comments in literate programs and for generating

crossreferences

Problems Solutions

Now one might ask why scannerless parsing was not intro duced earlier if it

has so many advantages The answer is that there are several problems caused

by the integration of lexical and contextfree syntax as well In this pap er we

discuss solutions to these problems that make scannerless parsing feasible

Limitations of Parsing Techniques The main problem with scannerless pars

ing are the limitations of the conventional deterministic parsing techniques

Most complete character level grammars are not LR LL or even LRk

due to lo okahead needed for lexical elements When parsing with a scanner a

lo okahead of entails lo oking one token ahead In scannerless parsing a lo oka

head of entails only considering the next character Furthermore when layout

is skipp ed by the scanner this need not b e considered in the lo okahead The

solution used in the SDF implementation is to use the generalizedLR parsing

algorithm of Tomita and Rekers to get dynamic lo okahead

Grammar Size Another problem is the size of grammars Complete char

acter level grammars are large b ecause all constructs have to b e sp ecied down

to the character level Furthermore the placement of layout b etween tokens

should b e explicitly declared in pro ductions For maintenance and readability

of grammars this is problematic To supp ort the development of complete char

acter level grammars an expressive formalism is needed that hides the details

of the interface b etween lexical and contextfree syntax and of the placement

of layout In x we discuss the approach of grammar normalization in order to

provide an expressive formalism with a minimal semantic basis In x we discuss

the extension of contextfree grammars with various disambiguation constructs

to keep grammars concise

Lexical Disambiguation Although many lexical ambiguities are solved au

tomatically through the integration of lexical and contextfree syntax there are

still cases where disambiguation of lexical constructs needs to b e expressed

Since lexical analysis is now based on contextfree parsing familiar lexical dis

ambiguation rules such as prefer longest match and prefer keyword have to

b e redened and their implementation reconsidered In x we discuss two dis

ambiguation constructs for lexical disambiguation follow restriction and reject

pro ductions that suce to express all common lexical disambiguation rules

Interpretation of Disambiguation Rules There are a numb er of ways to

interpret disambiguation constructs One p ossibility is to implement them as

a lter on parse forests as prop osed in Klint and Visser However for

disambiguation of lexical constructs and contextfree expressions with priorities

this can lead to an exp onential size of the parse forest b efore ltering which

makes the metho d to o inecient In x we discuss the techniques used in parser

generation to enco de disambiguation rules in the parse tables such that decisions

are taken early In x an extension of the GLR parsing algorithm with reject

reductions is presented

Eciency The rst problem that comes to mind when considering scan

nerless parsing is eciency Since scanning with a nite automaton has a lower

complexity than parsing with a stack scannerless parsing ie replacing the

nite automaton part by a stack machine should b e less ecient The fol

lowing considerations led us to attempt scannerless parsing nonetheless

LR parsing is linear in particular for regular grammars Since lexical syntax

is traditionally formulated by means of regular grammars we should exp ect

linear b ehaviour for the lexical part of scannerless parsers The complete

complexity of the scannerparser setup should b e considered including lexical

disambiguation If lexical disambiguation rules cannot solve all ambiguities and

disambiguation has to b e deferred to the parser a kind of graph structured stack

has to b e maintained to keep track of the p ossible segmentation of the string in

tokens Such a setup is used in the ASFSDF MetaEnvironment Klint

that forms the background for the development of SDF It seems even more

ecient to maintain a single graph structured stack instead of two If more

complex grammars for lexical syntax are used we get into an area where scan

nerless parsing and parsing with scanners can no longer b e prop erly compared

b ecause such syntax is not expressible in the scanner framework Therefore the

worst case complexity of contextfree parsing should not b e taken as a reference

p oint for considering the complexity of scannerless parsing

Of course these considerations should b e veried by means of exp eriments

However exp eriments with scannerless parsing can only b e p erformed after

solutions have b een found for the other problems discussed ab ove It seems that

these problems are the cause for the late intro duction of scannerless parsing

rather than bad eciency of the metho d In x we will discuss a few simple

exp eriments that have b een p erformed with the scannerless parsing metho d

describ ed in this pap er and that seem to conrm our exp ectations

Grammar Normalization

We need an expressive grammar formalism in which lexical syntax and context

free syntax are integrated and that supp orts concise syntax denitions SDF

is such an expressive formalism It provides regular expressions lexical and

contextfree syntax character classes literals priorities mo dules renaming

and aliases The rst version of the formalism was describ ed in Visser b

The full denition is presented in Visser c Because it is exp ensive to

extend to ols to such an expressive formalism all features that are expressible

in more primitive features are eliminated by means of a normalization function

on grammars

Normal Form

The expressive p ower of the syntax denition formalism SDF can b e charac

terized by the equation

SDF contextfree grammars characterclasses priorities

reject pro ductions follow restrictions

That is any SDF denition is equivalent to a contextfree grammar making

use of character classes priorities reject pro ductions and follow restrictions All

other features are expressed in terms of these features The equivalence is such

that a denition is equivalent to a denition of the form

sorts s s

j

syntax p p

k

priorities pr pr

l

restrictions r r

m

where the s are sort symb ols the p are contextfree pro ductions of the form

i i

0

A the pr priority declarations of the form p R p with R a priority

i j j

relation and the r follow restrictions of the form cc with a list of

i

symb ols and cc a character class

A pro duction can have a numb er of attributes that may include the attribute

reject which makes the pro duction a reject pro duction A priority relation

is one of left right assoc nonassoc or A symb ol can b e a character

class or some other symb ol Only character classes are interpreted during parser

generation Other symb ols constructed using symb ol op erators are simply in

terpreted as a name For instance the symb ol A used to indicate the iteration

of symb ol A has no sp ecial meaning after normalization

Given a grammar G the following pro jection functions are dened

SG sorts of G

SymsG symb ols used in G

PG pro ductions of G

PrG priorities of G

RG restrictions of G

Normalization

As an example of the normal form consider the grammar in Figure It com

pletely describ es the lexical and contextfree syntax of expressions with identi

er multiplication and additionthe same language describ ed in the example

in x In fact this grammar is derived from that grammar by application of a

normalization pro cedure We briey discuss the elements of this normalization

that is formally sp ecied in Visser c Refer to Figure for examples of

the normalization rules

Lexical and Contextfree Syntax The most imp ortant asp ect of the normal

ization for this pap er is the integration of lexical and contextfree syntax The

pro ductions of lexical and contextfree syntax are merged In order to avoid

interference of lexical and contextfree syntax the symb ols in pro ductions are

renamed The symb ols in the lexical syntaxexcept for character classes and

literalsare renamed using the symb ol constructor h LEXi For instance Id

b ecomes hIdLEXi and b ecomes LEX Similarly

the symb ols in the contextfree syntax are renamed using h CFi Furthermore

the symb ols on the lefthand side of contextfree pro ductions are separated by

hLAYOUTCFi which entails that layout can o ccur at that p osition In this

way two disjunct sets of symb ols are created The interface b etween lexical and

contextfree syntax is now expressed by an injection hALEXi hACFi for each

symb ol A used b oth in the lexical and the contextfree syntax

Top Symbol A syntax denition denes a numb er of symb ols A text over

such a denition can b e one pro duced by any of its symb ols For contextfree

parsing we need a single start symb ol from which all strings are generated For

this purp ose for each sort A a pro duction

hLAYOUTCFi hACFi hLAYOUTCFi hSTARTi

is added to the grammar dening the start symb ol hSTARTi The pro duction

also denes that a string can start and end with layout Furthermore to express

the termination of a string the pro duction

hSTARTi EOF hStarti

denes that a string consists of a string generated by hSTARTi followed by the

end of le character

Character Classes Character classes are expressions of the form cr cr

n

0

where the cr are either characters or character ranges of the form c c Charac

i

ter classes are normalized to a unique normal form by translating the characters

to a numeric character co dethe ASCI I co deand by ordering and merging

the ranges such that they are in increasing order and do not overlap This

normalization is formally sp ecied and proven correct with resp ect to the set

interpretation of character classes in Visser b

Literals Literals are abbreviations for xed lists of characters Literals

are dened in terms of a pro duction with the literal as result and singleton

character classes corresp onding to the characters as arguments For example

the pro duction

let

denes the literal let as the sequence of characters l e and t in ASCI I

Regular Expressions An extensive set of regular expressions including op

tional alternative tupling several kinds of iteration and p ermutation are ex

pressed by means of dening pro ductions For instance consider the denition

of LEX in Figure which denes a list of one or more lowercase

letters

Priorities Priorities can b e declared using chains of declarations and

asso ciativities of pro ductions can b e declared using groups and attributes These

are all dened in terms of binary priority and asso ciativity declarations

sorts Id Exp

syntax

LAYOUTLEX

LAYOUTLEX LAYOUTCF

LAYOUTCF LAYOUTCF LAYOUTCF left

LAYOUTCF

LAYOUTCF LAYOUTCF

LEX

LEX LEX LEX

left

LEX IdLEX

IdLEX IdCF

IdCF ExpCF

ExpCF LAYOUTCF LAYOUTCF ExpCF ExpCF left

ExpCF LAYOUTCF LAYOUTCF ExpCF ExpCF left

LAYOUTCF IdCF LAYOUTCF START

LAYOUTCF ExpCF LAYOUTCF START

START EOF Start

priorities

LEX LEX LEX

left

LEX LEX LEX

LAYOUTCF LAYOUTCF LAYOUTCF left

LAYOUTCF LAYOUTCF LAYOUTCF

ExpCF LAYOUTCF LAYOUTCF ExpCF ExpCF

ExpCF LAYOUTCF LAYOUTCF ExpCF ExpCF

ExpCF LAYOUTCF LAYOUTCF ExpCF ExpCF left

ExpCF LAYOUTCF LAYOUTCF ExpCF ExpCF

ExpCF LAYOUTCF LAYOUTCF ExpCF ExpCF left

ExpCF LAYOUTCF LAYOUTCF ExpCF ExpCF

Figure Expression grammar in normal form The grammar contains no

restrictions or reject pro ductions

Modules An SDF denition can b e divided over a numb er of mo dules

Mo dules can imp ort other mo dules This is used to share common syntax de

nitions in several language denitions Renamings of symb ols and pro ductions

can b e used to adapt the denition in a mo dule to some application Fur

thermore symb ol aliases can b e used to abbreviate long regular expressions

Mo dular syntax denitions are completely expanded by the normalization func

tion

Semantics

A syntax denition denes a language ie a set of strings and the structure

that is assigned to those strings The strings of the language are imp ortant

to its users who write down programs The structure of those strings is im

p ortant for the denition of language pro cessors such as compilers interpreters

and typ echeckers The pro ductions of an SDF denition describ e b oth the

language and the structure assigned to strings in the language The semantics

of a syntax denition is a set of parse trees from which a set of strings can b e

derived The mapping from trees to strings is achieved by taking the yield of

a tree The reverse mapping from strings to trees is called parsing At this

p oint we formally dene the semantics of contextfree grammars without con

sidering disambiguation rules such as priorities reject pro ductions and follow

restrictions

A contextfree grammar G generates a family of sets of parse trees T G

T G X j X SymsG which contains the minimal sets T G X such that

c cc

Char

c T G cc

A A A PG t T G A t T G A

n n n

Pro d

t t A T G A

n

In rule Char c is a character and cc a character class We will write t for a

list t t of trees where is the list of symb ols X X and t T G X

n n i i

for i n Corresp ondingly we will denote the set of all lists of trees of typ e

as T G Using this notation t t A can b e written as t A

n

and the concatenation of two lists of trees t and t is written as t t and yields

a list of trees of typ e

The yield of a tree is the concatenation of its leafs The language dened by

a grammar G is the family LG LG X j X SymsG of sets of strings

that are yields of trees over the grammar ie LG X yieldT G X A

parser is a function that maps a string of characters to a set of parse trees A

parser accepts a string w if jw j A parser for a contextfree grammar

G that accepts exactly the sentences in LG is dened by

G w ft T G X j X SymsG yieldt w g

A parser is deterministic if jw j for all strings w A grammar is

ambiguous if there are strings with more than one parse tree ie jG w j

Disambiguation

Disambiguation metho ds are used to select the intendend tree from a set of p os

sible parse trees for an ambiguous string SDF provides three disambiguation

metho ds Priority and asso ciativity declarations are used to disambiguate con

cise expression grammars Follow restrictions and reject pro ductions are used

to express lexical disambiguation In this section we discuss these metho ds

Disambiguation by Priorities

By using priority and asso ciativity declarations fewer grammar symb ols have to

b e intro duced and a more compact abstract syntax can b e achieved Consider

the following grammar of expressions in a functional programming language

with binary function application and let binding

sorts Var Term

lexical syntax

az Var

tn LAYOUT

contextfree syntax

Var Term

Term Term Term left

let Var Term in Term Term

Term Term Term nonassoc

Term Term bracket

contextfree priorities

Term Term Term

Term Term Term

let Var Term in Term Term

An example term over this grammar is

let sum foldr plus zero in sum lst

The grammar is disambiguated by means of priorities The binary application

op erator is declared as leftasso ciative This entails that x y z should b e read as

x y z and not as x y z This is illustrated in Figure that shows the right

and leftasso ciative parse trees for three adjacent terms The priority declaration

denes applications to have higher priority than equalities Consider the trees

in Figure According to the priority declaration the rst tree has a priority

conict and therefore only the second tree is a correct parse tree The following

denition formally denes the notion of priority conicts

Denition Given some grammar G with priority declarations PrG the

set conictsG of priority conicts over grammar G is the smallest set of parse

tree patterns of the form B A such that

B A B PrG

CF

B A conictsG

B right nonassoc B A PrG

CF

B A conictsG

B left assoc nonassoc B A PrG

CF

B A conictsG

A parse tree over a grammar G has a priority conict if one of its no des matches

a pattern B A conictsG 2

Using the notion of priority conicts we can dene a lter on sets of parse

trees that selects the trees without a conict For example according to rule

CF and b ecause of the declaration of application as a leftasso ciative op era

tor the pattern

hTCFi hLCFi hTCFi hLCFi hTCFi hTCFi hTCFi

Figure Left and rightasso ciative parse trees for binary term application

"="

"="

Figure Two parse trees for application and equality

describ es a tree with a conict Term and LAYOUT are abbreviated to T and L

resp ectively Therefore the second tree in Figure has a conict and the rst

one is selected by the disambiguation metho d According to rule CF and

b ecause application has higher priority than equality the pattern

hTCFi hLCFi hLCFi hTCFi hTCFi hLCFi hTCFi hTCFi

is a memb er of the conicts generated by the functional language grammar

This means that the rst tree in Figure has a priority conict The second

tree has no conict

Lexical Disambiguation

If we consider the example of functional expressions again we see that it contains

two o ccurrences of lexical ambiguities

Longest Match In the rst place there is a longest match problem

caused by the syntaxless binary application op erator Two adjacent letters

could b e the concatenation of two letters forming a variable or it could b e the

application of two single letter variables Figure shows two parse trees for the

string fa In the rst tree the concatenation of letter lists is used to make them

into a single variable In the second tree each letter is interpreted as a variable

on its own We want to solve this ambiguity by means of the longest match

rule that prefers the longest p ossible lexical token In this case the string fa

as a single variable We dene the longest match notion formally by comparing

the lengths of tokens For this denition we rst need the notion of the token

stream asso ciated to a parse tree

Denition Token Stream The token stream asso ciated with a parse

tree is the list of subtrees that have as ro ot either an injection hALEXi hACFi

or a literal dening pro duction The length jtj of a token t is the numb er of

characters in its yield 2

According to this denition the token streams for the trees in Figure are

the single token

f azLEXa azLEX azLEX

VarLEX VarCF

and the tokens

f azLEX VarLEX VarLEX

a azLEX VarLEX VarLEX

The idea of longest match disambiguation is to compare two token streams from

left to right While the tokens have the same length the streams are similar

The rst token that diers in length solves the ambiguity by taking the tree

asso ciated with the longer token In the example ab ove the rst token stream

is larger b ecause its rst token has length while the rst token of the second

stream has length

<[a-z]+-LEX>

<[a-z]+-LEX> <[a-z]+-LEX> <[a-z]+-LEX> <[a-z]+-LEX>

f a f a

Figure Two parse trees for fa over the functional expression grammar

Formally we have the following denition of longest match disambiguation

Denition Longest Match Given the token streams t t asso ci

n

ated with the tree t and s s asso ciated with tree s tree t is larger in the

m

longest match ordering than s t s if there is some i minn m

lm lm

such that jt j js j for j i and jt j js j 2

j j i i

This denition can b e used as a metho d to lter parse forests by selecting

the largest trees according to the longest match ordering However b ecause

a longest match ambiguity causes an exp onential explosion of the parse forest

this is not feasible We need a metho d that can b e applied during parsing

if p ossible as a lter on parse tables A naive solution for the longest match

problem in the example ab ove is to require nonempty layout as a separator

b etween the two terms of an application In the example this would indeed solve

the problem b ecause the second tree would b e forbidden However this solution

is immediately refuted by considering the expression fa where brackets are

used around the argument

A metho d that works in all cases we have encountered so far is that of follow

restrictions A follow restriction of the form A A cc declares that the

n

symb ols A should not b e followed by any of the characters in the character

i

class cc In the example ab ove the restriction

lexical restrictions

Var az

forbids a variable to b e followed by a letter This entails that the second tree

in Figure violates the follow restrictions and the desired rst tree is selected

Prefer Literals The second problem in the functional expression gram

mar is the overlap b etween the literals let and in and variables This is

particularly problematic in combination with the op erator on terms A let

binding let x t in t can b e interpreted also as an equality let x

t in t where let and in are now read as variables We clearly want to

declare let and in as reserved words of the language that should not b e used

as variables This lexical disambiguation rule is called prefer literals and can

b e dened formally as follows

Denition Prefer Literals A tree violates the prefer literals rule if it

contains a subtree with function hALEXi hACFi and the yield of that tree

is also used as literal in the grammar 2

This rule can b e expressed by means of reject pro ductions A reject pro duc

tion is a pro duction A attributed with the attribute reject It declares

that a string is not of typ e A if it can also b e derived from For example to

disambiguate the grammar ab ove we add the following pro ductions

lexical syntax

let Var reject

in Var reject

This creates an ambiguity let can b e a variable in two ways via the lexical

denition or via the pro duction ab ove Because this is a reject pro duction b oth

derivations are forbidden ie let can only o ccur in the context of a let binding

We also need the restrictions

lexical restrictions

let in az

to prevent letter to b e interpreted as the literal let and the variable ter We

will further discuss some prop erties of reject pro ductions in x

Automatic Lexical Disambiguation We have dened two extensions of

contextfree grammars that enable us to express lexical disambiguation rules on

grammars for integrated lexical and contextfree syntax However it is desirable

to derive the rules for lexical disambiguation automatically from the grammar

In x we will discuss this issue after we have discussed parser generation

Parser Generation

We have discussed a grammar formalism with disambiguation metho ds for con

cise denition of lexical and contextfree syntax of languages Now we turn our

attention to deriving parsers from such syntax denitions In this section we

present the rules for the generation of parse tables for a shiftreduce parser The

rules constitute a mo dication of the well known SLR algorithm We rst

discuss shiftreduce parsing

ShiftReduce Parsing

A shiftreduce parser is a transition system that manipulates its state consisting

of a stack and an input stream by rep eatedly shifting a symb ol from the input to

the stack or reducing a numb er of elements on top of the stack to a single element

until it enters an accepting state The transitions b etween parse congurations

are determined by the functions actions and goto as dened by the following

transition rules

actionss a shifts

m i m

Shi

s t s t s a a s t s t s a s a a

m m i n m m i m i n

actionss a reducep k

mk i

s gotos p t treep t t

m m mk

Red

s t t s t s t s a a

m m m m mk mk i n

s t t s ts a a

m m i n

actionss EOF accept

Acc

s t s EOF acceptt

Here a conguration s t s t s a a consists of a stack on the left

m m i n

side of the and a list of input characters on the right side of the The stack is

lled alternatingly with states s and trees t Parsing starts in the conguration

C s a a where s is the initial state of the parser Parsing succeeds

n

if there is some sequence of steps C C acceptt that ends in the

accepting conguration acceptt

There are various ways to dene the actions and goto functions that drive

a shiftreduce parser The SLR algorithm of DeRemer and Anderson

et al is a simplication of the LRk parsing algorithm of Knuth

It works by rst constructing an LR parse table This involves no lo okahead

sets in the parse items The lo okahead of reductions is constrained to the follow

set of the nonterminal dened by the pro duction b eing reduced

In the rest of this section we describ e a mo dication of the SLR algo

rithm that incorp orates priorities and follow restrictions This mo dication is

based on the derivation in Visser a where starting with a schema for Ear

leys parsing algorithm a parsing schema is derived such that the parser do es

not build trees with priority conicts Other changes are the use of character

classes the use of pro ductions instead of symb ols in follow and goto and the

interpretation of follow restrictions to restrict the lo okahead set of reductions

First

The rst set for a symb ol contains those symb ols with which a phrase for the

symb ol can start Given some grammar G dene for each list of symb ols and

each character class cc the rst characters in followed by cc is the smallest

character class rst cc such that

rst cc cc Fi

0 0

rstcc cc cc Fi

A PG

Fi

rstA cc rst cc

The denition of the rst set can b e extended to the set of symb ols that starts

a sentence derived from a list of symb ols

A PG

Fi

rst A fAg rst

s

Follow

In the conventional SLR algorithm the follow set is computed for each non

terminal of the grammar It maps a nonterminal to the set of terminals that

can follow that nonterminal in a sentence ie



A hStarti

followA rst

This can b e computed as the closure of

A B PG

Fo

followA rst followB

that adds the characters in the rst set of to the follow set of A if follows

A in some pro duction The follow of B is added in case can also pro duce the

empty string

This notion can b e rened to the followset of pro ductions The rule

A B PG

Fo

follow A rst follow A B

denes the followset of pro duction A as those characters that can follow

A in some context In case of plain contextfree grammars rule Fo has the

same eect as rule Fo But if we consider priorities the rule is extended to

A B PG A B conictsG

Fo

follow A rst follow A B

Here the followset of a pro duction is restricted to those contexts where it can

actually b e used without causing a priority conict For instance in the expres

sion grammar of x the followset of the addition pro duction do es not contain

the character b ecause addition can not o ccur as a direct descendant of multi

plication

Finally if the grammar also denes follow restriction rules A cc the

followset of a pro duction for A can b e further restricted as

A B PG A B conictsG A cc RG

Fo

follow A rst follow A B n cc

The pro duction can b e followed by the dierence of the rst set of the right

context and the character class cc

To see the eect of the last rule consider the followset of the pro duction

az Var in the functional expression grammar of the previous section

Because of the application pro duction and the injection of variables into terms

the followset of az Var is EOFtn az The lexical re

striction Var az removes the character class az from this followset

resulting in EOFtn This entails that a variable cannot directly

b e followed by a letter

Goto Table

The states of an LR parser are formed by itemsets An item is an ob ject of the

form A ie a contextfree pro duction with a somewhere in b etween

the symb ols on the lefthand side Such an item indicates that a sentential form

of typ e has already b een recognized

The initial state of the parser for grammar G is the itemset initG dened

as

init G closurehSTARTi EOF hStarti

This state expresses that a sentence can b e recognized by recognizing a string

of sort hSTARTi followed by the sp ecial end of le character that indicates the

end of strings

The closure of a set of items adds all initial items to an item set for which

the result symb ol is predicted by one of the items in the set

I closureI Cl

B A closureI B PG

Cl

B closureI

In the presence of priorities the closure is restricted to those items that do not

cause a priority conict

B A closureI B PG

B A conictsG

Cl

B closureI

For example the item E E E is not added to the closure as a result of the

item E E E if this pro duction is leftasso ciative b ecause E E E

E E is a conict pattern

The parsing of a string starts with the parser in the initial state Up on

recognition of a symb ol either by reading a character or by completing a pro

duction the parser can enter other states as prescrib ed by the transitions of the

goto graph The goto function maps an itemset to another itemset given the

symb ol that has b een recognized The function goto is dened by

gotoX I closureshift X I

ie create a new itemset by shifting the over the symb ol X and pro duce the

closure of the resulting itemset In normal LR parsing a shift with a symb ol

B creates an itemset containing all items of the previous set that have the

b efore a B symb ol

B A I

Sh

B A shift B I

We rene the denition of shift to shifting with characters and shifting with

pro ductions Shifting an itemset with a character or characterclass is dened

by the rule

0 0

cc A I cc cc

Sh

0

cc A shift cc I

The characterclass cc induces a shift of each item that predicts a characterclass

0

cc that is a sup erset of cc

Shifting with nonterminals is rened to shifting with complete pro ductions

A shift is only successful if the pro duction do not cause a priority conict as a

direct descendant at the p osition of the predicted symb ol

B A I B A conictsG

Sh

B A shift B I

For example the pro duction E E E cannot b e used to shift the item

E E E if this pro duction is leftasso ciative b ecause E E E E

E is a conict This restriction of the closure and goto functions guarantees

that we can never enter a state where we have built a parse tree with a priority

conict

Action Table

The action table declares the actions to b e taken in each state Given an item

set the function actions maps a character to the set of actions that the parser

can take If the set of actions is empty the parser has reached an erroneous

state If the set contains more than one action there is more than one way to

pro ceed

cc A I c cc

Shi

actionsI c shiftgotoc I

A I c follow A

Red

actionsI c reduce A jj

hSTARTi EOF A I

Acc

actionsI EOF accept

Note that shiftI denotes the shift action to state I whereas shiftX I is

the application of the shift function dened ab ove

The following prop osition states that the actions and goto functions dened

ab ove constitute a correct shiftreduce parser

Prop osition Correctness Given the actions and goto functions for a



grammar G we have that init G w acceptt i t G w and t con

tains no priority conicts according to PrG and violates no fol low restrictions

in RG

Remarks

The transition rules for shiftreduce parsing are nondeterministic If more than

one action is p ossible in some conguration more than one transition is p ossible

If the actions function is deterministic at most one transition path is p ossible

for a string Traditional parsing techniques only accept grammars that have a

deterministic action function In x we will discuss an ecient implementation

for nondeterministic actions functions

The rules for parser generation ab ove ignore reject pro ductions ie they are

treated just like other pro ductions In x we will discuss how reject pro ductions

can b e interpreted by means of a lter on parse forests In x we will discuss how

reject pro ductions can b e interpreted during parsing by means of an adaptation

of the GLR algorithm For this purp ose an itemset I is marked as rejectable

if it can b e reached using a reject pro duction ie I is rejectable if there is an

0 0

I such that goto A I I and A is a reject pro duction

Automatic Lexical Disambiguation

In x we discussed the sp ecication of lexical disambiguation by means of follow

restrictions and reject pro ductions Although this is an eective way to express

lexical disambiguation it is rather tedious to write down the rules Therefore

it would b e desirable to derive lexical disambiguation rules automatically from

the other grammar rules such that the grammar is disambiguated according to

the longest match and prefer literals criteria Here we discuss some p ossibilities

The p erfect rules for longest match disambiguation have not b een found yet It

is a question whether this is p ossible at all since it is undecidable whether a

contextfree grammar is ambiguous

Prefer Literals

The prefer literals disambiguation rule can b e expressed by generating reject

pro ductions according to the following rule

c c LhALEXi

n

c c hALEXi frejectg PG

n

ie if the literal is a lexical phrase of sort hALEXithere is an overlapthe

reject rule is added to the grammar This implements the reserved keywords

rule The only implementation problem is that a parser is needed to recog

nize the literals as lexicals This can b e solved by rst generating a parser for

the grammar without reject rules and using that parser to determine overlap

b etween literals in the grammar and lexical categories Reject rules can then b e

added to the grammar accordingly and a new parser can b e generated for the

extended grammar

State Explosion A problem with reject pro ductions to exclude keywords as

lexicals is that it can add many items to itemsets For instance if a language

contains keywords that match with the identiers of the language each item

set containing an item hIdLEXi A would b e expanded with items

c c hIdLEXi frejectg and items c c c c

n n n

along with many extra transitions To prevent this expansion we dene the

rejection of literals in an indirect way as follows

c c LhALEXi

n

hALITi hALEXi frejectg PG

c c hALITi PG

n

where the symb ol denotes the empty phrase ie there is a pro duction

The sort hALITi is used to collect all literals to b e rejected from hALEXi The

pro duction hALITi hACFi frejectg denes the rejection for all literals

at once The eect of the empty symb ol in the second pro duction is that only

the item hALITi hALEXi is added when hALEXi is predicted This

will cause a reduction with the pro duction to an itemset where hALITi

is predicted This itemset is only computed once and is reused for all other

itemsets that predict hALEXi It is the initial state of a nite automaton for

the matching of literals

As an example consider how the prefer literals rule for our functional lan

guage example is expressed using this mo died rule

syntax

VarLIT VarLEX reject

let VarLIT

in VarLIT

Local Exclusion An alternative for the expression of the prefer literals rule

is the rule

fhALEXi hACFi c c c c g closureI

n n

c c hACFi frejectg closureI

n

that lo cally forbids predicted literals as lexicals by extending the parser gen

erator This do es not implement the reserved keywords rule in the sense of

forbidding the use of a keyword as a lexical in all p ositions Only when a lexical

and a literal can app ear in the same place the literal is preferred Therefore it

might still lead to ambiguities

Longest Match

It is less straightforward to nd a general rule to express longest match using

follow restrictions An attempt is the rule

hB LEXi follow hALEXi hACFi

s

FR

hALEXi rsthB LEXi lasthALEXi RG

This restricts the follow set of hALEXi by excluding the elements of the rst

set of hB LEXi that can also b e used at the end of hALEXi for those hB LEXis

that can follow the injection hALEXi hACFi Here follow is the extension

s

of the follow function to pro duce all symb ols that can follow a pro duction

This rule is adequate in many cases Consider for instance the functional

expression grammar The follow restriction for Var in x is derived exactly

using this rule However the rule is not general enough One counter example is

the following grammar of expressions with single character variables and implicit

multiplication op erator This describ es mathematical expressions such as xy

that denotes the multiplication of x and y

lexical syntax

az Var

tn LAYOUT

contextfree syntax

Var Exp

Exp Exp Exp left

Rule FR would forbid xy as an expression forcing the use of whitespace ie

x y Although this example shows that rule FR is unsound if considered as

an analytic rule one could also consider it as a normative rule forcing a clearer

style of language denition

Rule FR generates follow restrictions for lexicals We also need restrictions

for literals overlapping with lexicals For instance the restrictions

lexical restrictions

let in az

forbids the interpretation of letter as the literal let and the variable ter The

following rule adds restrictions to prevent this overlap

hALEXi followc c

n

c rsthALEXi c c c LhB LEXi

n

c c c RG

n

If the literal L c c followed by some character c from the rst set of

n

a lexical hALEXi that is a memb er of the follow set of L can form a lexical

hB LEXi there is a longer match than the literal L Therefore c is restricted

from the follow set of the literal

This rule is stronger than the longest match lter we formulated b efore

It can forbid sentences that have a single unambiguous interpretation For

instance consider the string let x l int Here int is forced to b e read as

a variable and not as the juxtap osition of the literal in and the variable t

It is clear that these rules are not the nal word ab out fully automatic lexical

disambiguation Further research is needed to decide what is sucient

Reject Pro ductions

In x we intro duced reject pro ductions to express prefer literals lexical dis

ambiguation The parser generator discussed in x treats reject pro ductions as

normal pro ductions This will cause ambiguous parses for those cases where

a normal pro duction and a reject pro duction overlap In this section we rst

dene the semantics of contextfree grammars with reject pro ductions then we

investigate several prop erties of such grammars including an interpretation of

rejects to solve such ambiguities The author thanks Jan van Eijck and Annius

Gro enink for the email discussion that led to the results in this section

Semantics

The semantics of reject pro ductions is obtained by rening the inductive de

nition of parse trees from x The inductive rule Pro d is restricted to exclude

the construction of parse trees that have a yield that could b e obtained via a

reject pro duction

A contextfree grammar G with reject pro ductions generates a family of sets

of parse trees T G T G X j X SymsG which contains the minimal

r r

sets T G X such that

r

c cc

CharR

c T G cc

r

A A A PG t T G A t T G A

n r n r n

A frejectg PG t T G yieldt yieldt t

r n

t t A T G A

n r

Pro dR

The second condition of Pro dR excludes from T G A those trees for which

r

an A tree with the same yield could b e built using a reject pro duction at its

ro ot This second condition is the only dierence with the denition of T G

ie we have T G T G Note that only trees in T G are excluded

r r

That is if there are nested reject pro ductions such that some tree in T G

is rejected and thus not part of T G then it is not used to exclude trees

r

using A frejectg

Unfortunately this denition is inconsistent for grammars with a cycle con

taining a reject pro duction For instance consider the grammar

syntax

a A

A B

B A reject

and consider whether the string a is a memb er of the language of this grammar

if a A T A then a A B T B and hence a A T A

r r r

Conversely if a A T A then a A B T B and hence

r r

a A T A For this reason we restrict the class of grammars that we

r

want to consider to grammars that do not contain a cycle disregarding the

rejects for which one of the transitions is via a reject pro duction

Expressive Power

In the rest of this section we explore some of the prop erties of reject pro ductions

n n n

a b c First of all contextfree grammars with reject pro ductions can b e

used to describ e some noncontextfree languages Consider for example the

n n n

language a b c with n which is a standard example of a noncontextfree

language The following grammar due to Van Eijck denes this language

  

using reject pro ductions The rst four pro ductions dene the language a b c

n n

The next four pro ductions dene the sorts D and E denoting resp ectively a b

n n

and b c The last four pro ductions exclude from sort S all strings for which

n m

one of the pairs x y have unequal numb ers of xs and y s

A B C S D D B C S reject

a A A D B D A D C S reject

b B E A B E S reject

c C B E C E A E C S reject

Dierence Given a contextfree grammar dening sorts A and B we can

dene the dierence of the languages of these sorts by adding the following

pro ductions

A AminB

B AminB reject

The rst adds all A trees to AminB the second excludes from this all A trees that

match with a B tree

Intersection Extending this result we can express the intersection b etween

sorts A and B by adding two new sorts AminB and AandB and by adding the

following pro ductions

A AminB A AandB

B AminB reject AminB AandB reject

This denes AminB as the dierence A B and AandB as the dierence A A B

ie the intersection of A and B

We can generalize the results ab ove Given two contextfree languages we

can express the dierence and intersection of those languages using contextfree

grammars with reject pro ductions Take the union of the contextfree grammars

for the two languages after renaming symb ols to prevent interference Then add

pro ductions for the sorts to b e intersected as explained ab ove

Weak Complement If we are only interested in the strings that can b e

generated from a grammar and not in their structure the complement of a

the language generated by sort A is dened by extending a grammar with the

following rules

NotA

A NotA reject

The rst pro duction denes the complement of A as a string of arbitrary char

acters The complement of the empty character class is the character class

with all characters The second pro duction excludes from this language all

strings in the language of A Using this complement we can of course also ex

press the weak intersection of two sorts

Decidable We have seen that contextfree grammars with reject pro ductions

are very expressive It is now appropriate to ask whether it is even decidable

whether a string is in the language of such a grammar The following theorem

states that this is indeed the case The pro of uses the notion of a parse forest

that will b e discussed in the next section For the pro of of the theorem we need

the following prop osition ab out generalizedLR parsers

Prop osition Let G be a contextfree grammar If t t T G A and

yieldt yieldt w then a GLR parse of w wil l result in an ambiguity

node with t and t as possibilities

Theorem The parsing problem for contextfree grammars with reject pro

ductions without rejects in cycles is decidable

"let" <[a-z]+-LEX>

l e t <[a-z]+-LEX> <[a-z]+-LEX>

<[a-z]+-LEX> <[a-z]+-LEX> t

l e

Figure Ambiguity no de caused by overlap b etween syntax for VarLEX and

reject pro duction let VarLEX

Pro of Given a contextfree grammar G with reject pro ductions without

rejects in cycles construct a generalizedLR parser for G ignoring the reject

annotations The result is a parser for a p ossibly ambiguous contextfree gram

mar Now given a string parse it with this parser If parsing fails the string

is also not in the language of the grammar with reject pro ductions Otherwise

the result of parsing is a parse forest Since cycles do not contain rejects these

can b e removed from the forest

Now if a tree t t t A should b e rejected according to the second

n

condition of rule Pro dR there is a reject pro duction A and trees t such

that yieldt yieldt t But then yieldt A yieldt t

n n

A and hence according to the prop osition ab ove the parse forest contains an

ambiguity no de on top of t also containing t A as p ossibility

Reject pro ductions are now interpreted by traversing the forest in a b ottom

up manner marking tree no des according to the following rules Leafs are

not marked A reduction no de is marked if any of its direct descendants is

marked An ambiguity no de is marked if either all its direct descendants are

marked or if it contains an unmarked tree with as ro ot lab el a reject pro duction

Since the parse forest is nite this pro cedure terminates

If the ro ot of the parse forest is marked after this pro cedure the string is

not accepted by the grammar otherwise it is accepted and the forest without

marked no des represents all parse trees for the string 2

The tree in Figure illustrates the pro of The overlap b etween the literal

let and the syntax for variables causes an ambiguity The ambiguity no de

is marked b ecause let VarLEX is a reject pro duction Therefore the

interpretation of let as a variable is dismissed

This shows that we can construct a complete implementation of parsers

for grammars with reject pro ductions In the next section we will discuss how

reject pro ductions can b e interpreted during parsing to inuence parse decisions

to prevent trees with rejected subtrees from b eing built at all

Expressive Power From the ab ove we can conclude that contextfree gram

mars with reject pro ductions are stronger than pure contextfree grammars but

have a decidable parsing problem This gives a lower b ound and upp er b ound

for the expressive p ower of the formalism but it is an op en question what class

of languages is describ ed by contextfree grammars with reject pro ductions

Regular Rejects We intro duced reject pro ductions in order to express the

prefer literals rule This means that only a regular language is excluded from a

contextfree one This gives us the guarantee that the resulting language is still

contextfree We could exploit this prop erty and restrict the formalism to such

regular reject pro ductions and implement these by means of a grammar trans

formation However such a grammar transformation would probably yield large

grammars Furthermore our implementation gives a general way to express the

prefer literals rule and it allows the expression of other interesting grammars

that have not b een in the reach of declarative sp ecication This feature can

give rise to as yet unforeseen applications

GeneralizedLR Parsing

In x we have dened the generation of shiftreduce parsers from contextfree

grammars with priority declarations and follow restrictions If the actions func

tion derived from a grammar is deterministic then the shiftreduce parser is also

deterministic and can b e implemented in a standard way

However since we do not restrict the class of grammars it is not guaranteed

that the actions function is deterministic This can have two causes The

lo okahead needed for the grammar is more than provided by the parser gener

ator The grammar is ambiguous In the case of scannerless parsing we will

frequently see grammars for which unb ounded lo okahead is needed This entails

that no variant of the LR parser generation algorithms will pro duce a determin

istic actions function Therefore we need a nondeterministic implementation

of the shiftreduce parsing algorithm When a conguration is reached where

more than one action is p ossible all p ossibilities should b e tried In case of

unb ounded lo okahead only one of the p ossible transitions leads to an accepting

conguration In case of an ambiguous string multiple accepting congurations

will b e reached giving all p ossible parse trees for the string

The advantage of such a nondeterministic approach is rst of all the un

b ounded lo okahead that it provides Furthermore a parser pro ducing all parse

trees for an ambiguous string can b e used as a frontend for a disambiguation

lter that selects the correct tree according to some disambiguation metho d

Finally it is undecidable whether a grammar is ambiguous or has lo okahead

problems Having a parser that yields all p ossible parses can help in detect

ing the ambiguities and resolve them in a much easier way than by insp ecting

conicts in a parse table

A naive way to implement such a nondeterministic parsing algorithm is

to copy the entire conguration at each p oint where two or more actions are

p ossible and to continue parsing with each those congurations This will not

b e very ecient b ecause of the memory requirements and b ecause it will not

reuse parses for substrings that are the same in two forked o congurations

GeneralizedLR parsing is an ecient implementation of nondeterministic shift

reduce parsing A GLR parser deals with conicts in the parse table by splitting

the parser into as many parsers as there are conicts If the conict was due to

a lack of lo okahead some of the parsers will not succeed in parsing the sentence

and will die If several parsers succeed in parsing the grammar was ambiguous

In that case parse trees for all p ossible parses are built

GeneralizedLR parsing was develop ed for natural language pro cessing by

Tomita It is a sp ecialization of the more general framework of Lang

later also describ ed in Billot and Lang for creating generalized

parsers The algorithm was improved by Rekers and applied to parsing of

programming languages The feasibility of GLR parsing for parsing of program

ming languages has b een shown by the exp erience with GLR in the ASFSDF

MetaEnvironment Klint More exp erience with GLR parsing of pro

gramming languages using an adaptation of Rekers algorithm is rep orted by

Wagner and Graham

Besides the nondeterminism in the parse table we also need to interpret the

reject pro ductions in the grammar In the previous section we showed how reject

pro ductions can b e interpreted as a disambiguation lter after parsing But we

would rather interpret them earlier In this section we explain GLR parsing and

present an adaptation of the algorithm to interpret reject pro ductions during

parsing

Parse Forest

A generalized parser deals with ambiguous grammars by pro ducing all p ossible

parse trees for an ambiguous string In GLR parsing the p ossible parse trees

are represented by means of a parse forest This is a compact representation of

a set of parse trees A parse tree is constructed using application and ambiguity

no des An application no de represents the application of a pro duction to a list

of subtrees An ambiguity no de represents a set of p ossible parse trees for a

substring By packing all trees for a substring into an ambiguity no de these

parses can b e shared in all trees for strings containing the substring

For example consider the following grammar of simple expressions with

ambiguous addition and multiplication op erator

sorts Exp

syntax

az Exp

Exp Exp Exp

Exp Exp Exp

Exp START

To keep the example small layout is not allowed b etween the tokens The

parse forest for the ambiguous string abc is shown in Figure The ellipse

represents an ambiguity no de Observe that various subtrees are shared in the

forest

Graph Structured Stack

A GLR parser deals with conicts in the parse table by maintaining a numb er of

stacks in parallel Each time a parse stack leads to n conicting actions n new

stacks are created that continue the parse with those actions These stacks are

not copies of the old stack The new top no des have p ointers to the old stack If

Exp

Exp Exp

Exp "*" Exp

Exp * "+" Exp "*" Exp

a + b * c

Figure Parse forest with sharing for ambiguous string abc

in a later stage two stacks get into the same state the stacks are merged again

In this manner a graph structured stack is built in which parses for ambiguous

substrings are shared

A graph structured stack no de consists of a state numb er and a list of links

Each link contains a reference to a no de in the parse forest and a reference to

the previous stack

As an example of the working of a GLR parser consider the sequence of stack

congurations during parsing the string abc in Figure This is the parse that

created the parse forest in Figure The gure shows the stacks during each

cycle of the parsing algorithm After shifting a character all p ossible reductions

are p erformed and then the next character is shifted The trees p ointed to by

the stack links are abbreviated by their yield using square brackets to show the

structure The symb ol after the colon denotes the main typ e of the tree at the

link We consider the congurations one by one

a The initial stack with state is created The character a is shifted

The character a reduces to an expression using the pro duction az

Exp The symb ol is shifted

b The character reduces to the literal The character b is shifted

The character b reduces to an expression The sequence ab reduces to

an expression resulting in a link from state to state From states

and a shift can b e p erformed with the next character Because b oth

shifts lead to a stack with state a single stack is created that has links

to the two stacks

c The character reduces to the literal The character c is shifted

EOF The character c reduces to an expression Now there are two p ossible

reductions from the stack with state First reduce bc and then

reduce abc or reduce abc Both reductions result in the

creation of a stack with state with a link to the initial stack These 0 a 6

a 6 0 a : Exp + 1 10

+ 10 a : Exp 0 1 + : "+" b 7 6

b 6

a : Exp + : "+" b : Exp 1 7 12 *

0 * 9 [a+b] : Exp 1

*

+ : "+" b : Exp * : "*" a : Exp 1 7 12 9

0 [a+b] : Exp * c 1 8 6

* : "*"

* : "*" c [a+b] : Exp 1 6

* : "*" 8 c : Exp a : Exp + : "+" b : Exp 1 7 12 13 [b*c] : Exp 0 [[a+b]*c]|[a+[b*c]] : Exp 1 12 [[a+b]*c]|[a+[b*c]] :

2

Figure Parse congurations for the parse of string abc

stacks are shared and an ambiguity no de is created that represents the

two p ossible parse trees At this p oint the entire string has b een read and

the next symb ol is EOF Therefore the expression is reduced to hSTARTi

and the string is accepted The stack with state is the accepting stack

and the tree p ointed to by its link is the parse tree for the entire string : 0 4 l 19

l 19 e 28 : 0 4 l : <[a-z]+-LEX> e 11 18

e t l 19 28 38

e 18

: l : <[a-z]+-LEX> e : <[a-z]+-LEX> 0 4 11 23 [le] : <[a-z]+-LEX>

t 11 18

l e 19 28 t 38

[le] : <[a-z]+-LEX> t 11 18 t : <[a-z]+-LEX>

[let] : "let" 12 23 \32

: [let] : 0 4 10 [[le]t] : <[a-z]+-LEX>

11 [let]|[[le]t] :

\32 9 6 [let]|[[le]t] : \32

14

Figure Parse of letx when reject is ignored

Reject Reductions

In x disambiguation with reject pro ductions was intro duced in order to express

the prefer literals rule In x we outlined a pro cedure for interpreting reject

pro ductions after parsing by pruning the parse forest We would rather interpret

reject pro ductions during parsing to prevent trees containing reject pro ductions

from b eing built

To understand how this can b e achieved recall the parse forest in Figure

that shows the ambiguity that is created when parsing the substring let in

the functional expression grammar dened in x It can b e interpreted us

ing the lexical pro ductions for variables or using the reject pro duction let

VarLEX frejectg Pruning this forest causes the ambiguity no de to

b e eliminated from the parse forest thereby rejecting the reading of let as a

variable

The parse congurations for this parse in Figure show how this ambiguity

is created In the rst three congurations the letters l e and t are read e t 28 38 l 19

t 18

[le] : <[a-z]+-LEX> t : <[a-z]+-LEX> 11 23

: [let] : "let" \32 0 4 12 6 [let] :

10 [[le]t] : <[a-z]+-LEX>

11

\32 6

: 2 : [let] : "let" 0 4 12 : 22 :

x

24 18

Figure Parse of letx with reject pro duction that forbids let as a

variable

The fourth conguration is the interesting one There are three parses for the

substring let as a variable constructed with azLEX as the literal let

of the reject pro duction let VarLEX and as the literal let as part

of the let construct The reduction of the literal results in a stack with state

The reduction of the lexical and the reject rule lead to a merged stack with

state from where another reduction rst leads to a stack with state and

then leads to a term and a stack with state From the states and

parsing continues with a shift of the space character to state

The idea for the implementation of the reject rule is to forbid further actions

with a state that has b een reached using a reject reduction The link that is

created when reducing with a reject pro duction is marked as rejected If all

links of a stack are marked as rejected all shifts and reductions from that state

are forbidden

In the last conguration of Figure this would entail that the link from the

stack with state to the stack with state is rejected Therefore the reduction

to state and the shift to state would b e forbidden This is exactly what

happ ens in the parse shown in Figure The dotted link is rejected and no

actions are p erformed from its stack The parse of let as variable is preempted

In the next conguration parsing continues only with the stack with state

corresp onding to a parse of let as a literal

The Algorithm

Below the complete SGLR algorithm is presented The dierences with the

GLR algorithm of Rekers are the use of pro ductions in the goto function

and the handling of reject reductions Furthermore the parser do es not make

use of a scanner but reads characters from a le or string This could of course

b e a stream of token co des and do es not make a dierence to the algorithm As

we discussed in the x character classes are handled in the parse table and are

thus transparent to the parser

Algorithm SGLR Given the parse table table for some grammar parse

the string of characters in le If the string is a sentence in the language de

scrib ed by the grammar return the parse forest for the string and an error

message otherwise

Parse The function parse reads the characters from a le and returns a parse

tree if the text is syntactically correct an error message otherwise The list of

active stacks is initialized to contain a single stack with the initial state of the

parse table as its state For each character in the input the parser handles all

actions for each active stack The shifts for each stack are stored and p erformed

by the shifter after all p ossible reductions have b een p erformed When all char

acters have b een read or when no more stacks are alive parsing terminates If

parsing succeeded the accepting stack has a direct link to the initial state This

link has a reference to the parse forest with all p ossible parse trees for the entire

string If parsing failed an error term is returned

PARSEtable le

global acceptingstack

global activestacks fnew stack with state inittable g

do

global currenttoken getnextcharle

PARSECHARACTER

SHIFTER

while currenttoken EOF activestacks

if acceptingstack contains a link to the initial stack with tree t then

return t

else

return parseerror

Parse Character The list of active stacks is moved to the list of stacks of the

actor that p erforms the actions for a stack unless the stack is rejected The list

of stacks for the actor is extended when reductions are p erformed If actions

for newly added stacks are p erformed b efore all links to it have b een created a

stack that b ecomes rejected might escap e Therefore new stacks are added to

foractordelayed if they are rejectable and are only considered when all stacks

on foractor are exhausted Then stacks are taken from the delayed list in order

of priority The op eration p op removes the stack with the highest priority from

a list of stacks

PARSECHARACTER

global foractor activestacks

global foractordelayed

global forshifter

while foractor foractordelayed do

if foractor then

foractor fp opforactordelayed g

for each stack st foractor do

if all links of stack st rejected then

ACTORst

Actor Handle the actions for stack st and the current token A reduce action

is immediately handled Shift actions are saved on forshifter for handling if

after all reductions have b een p erformed An accept action results in saving

the current stack as the accepting stack An error action is ignored b ecause the

current stack can b e a wrong attempt while other stacks are still alive The

entire parse fails if all stacks lead to error actions This will b ecome apparent

after shifting b ecause no more active stacks will b e alive

ACTORst

for each action a actionss currenttoken do

case a of

shifts forshifter fhst sig forshifter

reduce A DOREDUCTIONSst A

accept acceptingstack st

Reductions Function doreductions p erforms a reduction for stack st with pro

duction A For each path of length jj following the links from st to some

stack st the trees along the path are collected and the reducer is called to han

dle the reduction

DOREDUCTIONSst A

for each path from stack st to stack st of length jj do

kids the trees of the links which form the path from st to st

REDUCERst gotostatest A A kids

Reducer Given a stack st a state s a pro duction A and a list of trees

kids the reducer creates the application no de for the pro duction and the list

of direct descendants kids and creates a new stack with state s and a link to

stack st However b ecause there might already exist as stack with state s the

list of active stacks is searched If there is no such stack a new stack is created

else branch and added to the list of active stacks and the list of stacks for the

actor The new stack has state s and a link with a p ointer to the newly created

tree If a stack with state s already exists and there is a direct link nl from

st to st an ambiguity has b een found The tree t is added to the ambiguity

no de of the link If there is no direct link a new link is created from st to st

with t as parse tree Because this new link entails that new reductions from

already insp ected stacks might b e p ossible all active stacks are reconsidered

In all cases the link that is created or extended is marked as rejected if the

pro duction is a reject pro duction

REDUCERst s A kids

t application of A to kids

if st activestacks statest s

if a direct link nl from st to st then

add t to the p ossibilities of the ambiguity no de at treenl

if A is a reject pro duction then mark link nl as rejected

else

add a link nl from st to st with tree t

if A is a reject pro duction then mark link nl as rejected

for each st activestacks

such that all links of st rejected

st foractor st foractordelayed do

for each reduce A actionsstatest currenttoken

do

DOLIMITEDREDUCTIONSst A nl

else

st new stack with state s

add a link nl from st to st with tree t

activestacks fst g activestacks

if rejectablestatest then

foractordelayed pushst foractordelayed

else

foractor fst g foractordelayed

if A is a reject pro duction then mark link nl as rejected

Limited Reductions The function doreductions is used to do all reductions for

some state and pro duction that involve a path going trough link nl

DOLIMITEDREDUCTIONSst A l

for each path from stack st to stack st of length jj going through link l

do

kids the trees of the links that form the path from st to st

REDUCERst gotostatest A A kids

Shifter After all p ossible reductions have b een p erformed forshifter contains

a list of stacks that can do a shift Only these stack make it into the next cycle

of the parse The list of active stacks is reinitialized to the empty list For each

stack st in forshifter a new stack is created with a link to st with as tree the

current token That is if a stack with state s was already created only a link

from that stack to st is created

SHIFTER

activestacks

t currenttoken

for each hs st i forshifter do

if st activestacks statest s then

add a link from st to st with tree t

else

st new stack with state s

add a link from st to st with tree t

activestacks fst g activestacks

end

A AminB

A

A AandB

AandB Start

B

A AandB

B AminB frejectg

AminB AandB frejectg

A AminB

AminB

AminB AandB frejectg

B AminB frejectg

AandB

AandB Start

Figure Goto graph for grammar with nested reject pro ductions

Remarks

The algorithm ab ove do es not actually mark stacks as rejected but the link

from a stack that is created with a reject pro duction Further action on a stack

is forbidden if all links from that stacks are rejected This is done b ecause in

principle there could b e situations where two links are created from the same

stack that are not merged as is the case when the links are to the same stack

and only one is rejected It is not clear whether such a situation can o ccur But

there is no pro of of the contrary either

The ordering on states that is assumed in the priority p op op eration used in

pro cedure PARSECHARACTER is needed to ensure that nested reject pro

ductions are treated prop erly For example consider again a grammar extended

with pro ductions expressing the intersection of sorts A and B

A AminB A AandB

B AminB reject AminB AandB reject AandB Start

This gives rise to the goto graph in Figure States and are rejectable

b ecause they can b e reached with a reject pro duction When parsing a string

that is in A and in B state is reached using the reduction for B The next

reduce action with the reject pro duction B AminB reject leads a stack

with state which is rejected No further action is taken from that stack The

reduction of A AandB leads to a stack with state and then correctly to

acceptance of the string

Now consider the case where a string is in A but not in B Then there is no

reduction to state and hence state is not rejected but there is a reduction

to state using A AminB and a reduction to state using A AandB Now

there are two rejectable stacks on the foractordelayed list If the stack with

state is released rst a reduction with AminB AandB reject o ccurs and

the stack with state which is still on foractordelayed is rejected and parsing

fails as it should However if state is released rst parsing succeeds b efore

the stack with state is rejected It is clear that in this case state has higher

priority than state

It is not clear how the ordering on states should b e determined in general It

would seem that a state s with pro ductions that are reachable from the pro duc

tions in a state s has higher priority This is only a guess however and should

b e worked out more carefully For single ie nonnested reject pro ductions the

ordering plays no role Therefore the implementation of exclusion by means of

reject pro ductions of which prefer literals is a sp ecial case is not dep endent on

nding an ordering on states

Implementation

In the previous sections we have presented an approach to scannerless parsing

These techniques are implemented as part of the SDF to ols The to ols have

b een used to construct parsers for a numb er of languages including SDF it

self Although no detailed data on the p erformance of the implementation are

available at the time of this writing a couple of preliminary observations can

b e made nonetheless

Grammar Normalizer The syntax denition formalism SDF is completely

sp ecied in ASFSDF Part of the denition is the grammar normalizer dis

cussed in x This sp ecication has b een compiled to an executable term

rewriter which has a reasonable p erformance The literate sp ecication of SDF

and the normalization of syntax denitions is presented in Visser c The

sp ecication also denes the format of parse trees enco ded in the ATerm format

of Van den Brand et al

Parser Generator The parser generator describ ed in x has b een completely

sp ecied in ASFSDF The compiled sp ecication of the parser generator is

to o inecient It is probably necessary to implement this comp onent in an

imp erative language that allows direct access instead of lo okup in lists

There are several factors that make parser generation more dicult com

pared to normal SLR parser generation for contextfree grammars There

are more itemsets b ecause of the pro ductions for the lexical syntax Extra

pro ductions are added b ecause of the reject pro ductions expressing the prefer

literals rule this increases the numb er of items in itemsets The goto table

contains a transition for each pro duction instead of a transition for each non

terminal The last factor can b e reduced by sharing transitions to the same

state

Pro ductions and itemsets are enco ded by numb ers Character classes are

imp ortant for reducing the size of the parse table A set of actions that is

shared by several characters is stored eciently by means of a character class

ie actions is a mapping from itemsets and character classes to sets of actions

Parser The SGLR parsing algoritm has b een implemented in C The im

plementation makes use of the C implementation of ATerms Van den Brand

et al to represent stacks and trees

The parser includes visualization to ols for parse forests and graph structured

parse stacks that were used to pro duce the pictures in this pap er The forest

vizualization might b e used as basis for an interactive disambiguation to ol

The C implementation of the SGLR parsing algorithm seems reasonably

ecient although sharing of trees can b e improved Output of parse trees is

not optimal b ecause sharing of subtrees is completely lost when writing out a

parse forest in a linear term format This can solved by using a linear enco ding

of graphs such as the graph exchange language GEL of Kamp erman

Furthermore a markscan garbage collector for stack and tree no des is used

This entails that all stack and tree no des are visited on a collect which is to o

exp ensive since a large amount of the heap will not change status A reference

count garbage collector should make a dierence

Complexity of Lexical Analysis We have p erformed a few exp eriments to

get an idea of the complexity of lexical analysis with scannerless generalized

LR parsers The exp eriments were based on the simple expression grammar

in x The exp eriments that were p erformed were of the form a Parsing a

single identier of increasing length up to KB b Parsing an expression

consisting of ten additions with identier arguments of increasing length up

to KB c Parsing an expression consisting of an increasing numb er of

additions up to K arguments with length KB

For all these exp eriments we saw an almost linear b ehaviour for small les

deteriorating to square b ehaviour for the large les However when garbage

collection was turned o this b ehaviour changed into linear for all exp eriments

This conrms the observation ab out the inappropriateness of the garbage collec

tion algorithm It also conrms the idea that lexical analysis will b ehave linearly

for simple ie regular lexical syntax The prototyp e implementation should b e

further optimized b efore its p erformance can meaningfully b e compared to scan

nerparser combinations such as LEXYACC Nonetheless these exp eriments

show the feasibility of the scannerless generlizedLR parsing approach

Related Work

The syntax denition formalism SDF is formally sp ecied in Visser c

The sp ecication in ASFSDF comprises the syntax of the formalism the nor

malization pro cedure and the parse tree format dened by a grammar

The syntax denition formalism SDF of Heering et al was the start

ing p oint for the work discussed in this pap er The denition of SDF grew out

of the sp ecication of SDF in ASFSDF A numb er of generalizations where

applied to make the formalism more orthogonal and uniform and a numb er

of improvements and new features were added based on the exp erience with

SDF in the ASFSDF MetaEnvironment Klint SDF intro duced the

integration of lexical syntax and contextfree syntax but only at the formal

ism level In the implementation an SDF denition is mapp ed to a regular

grammar dening the lexical syntax and a contextfree grammar denining the

contextfree syntax The scanners pro duced for the lexical syntax yield a graph

structured stream of all p ossible tokenizations of the input ltered by a set

of lexical disambiguation rules Although this is a fairly advanced setup the

interface suers from several of the problems that we discussed in x and x

The generalizedLR parsing algorithm was rst develop ed by Tomita

for application in natural language pro cessing It was later improved by Rekers

and applied in the ASFSDF MetaEnvironment for parsing of program

ming languages The algorithm presented in x is based on Rekers version

Wagner and Graham describ e the use of GLR parsing in incremental

parsing of programming languages Earley describ ed the rst generalized

parsing algorithm that is closely related to the LR algorithm of Knuth

A more recent approach to parsing with dynamic lo okahead is the extension of

topdown parsing with syntactic predicates of Parr and Quong

Scannerless parsing was intro duced by Salomon and Cormack

They dene an extension of SLR parsing in which the lack of lo okahead

is repaired by extending itemsets if conicts are found This noncanonical

SLR parser generation works only for a limited set of grammars making

grammar development dicult The follow restrictions presented in this pap er

are a simplication of the adjacency restriction rule of the NSLR approach in

which arbitrary grammar symb ols can b e forbidden to b e adjacent Our reject

pro ductions are called exclusion rules by Salomon and Cormack

We have presented a complete implementation for follow restrictions and reject

pro ductions whereas the adjacency restrictions and exclusion rules are only

partially implemented in NSLR parsing

A similar approach using GLR parsing is tried in the area of natural language

pro cessing Tanaka et al discuss the integration of morphological and

syntactic analysis of Japanese sentences in a single GLR parser The morpholog

ical rules describ e how words can b e formed from characters Segmentation of

a string of characters into a string of words is guided by a connection matrix re

stricting the categories that can b e adjacent in a sentence These rules do usually

not suce to nd an unambiguous segmentation by integrating morphological

comp osition into the contextfree grammar of the syntactic phase such contex

tual ambiguities can b e avoided This creates the problem of disambiguating

the combined contextfree grammar usingthe morphological connection matrix

This is partly done as a lter on the generated LR table and partly dynamically

during parsing

Disambiguation by means of priority and asso ciativity declarations was in

tro duced simultaneously by Aho et al and Earley The former

describ e the solution of conicts in LR parse tables by means of a restricted form

of priorities Aasa describ es the solution of LR table conicts by

means of precedence declarations Thorup a b describ es the

solution of parse table conicts by means of a collection of excluded subtrees

The metho d is more expressive than the priorities of SDF but only succeeds if

all conicts are solved which is not guaranteed

In Klint and Visser logical disambiguation metho ds are formalized as

disambiguation lters on sets of parse trees Based on this approach an ecient

implementation of disambiguation by priorities is derived in Visser a from

the disambiguation lter for priorities This derivation forms the foundation for

the parser generator algorithm presented in this pap er

Conclusions

In this pap er we have presented a new approach to parsing that has several ad

vantages over conventional techniques It overcomes the drawbacks of the tra

ditional scannerparser interface by ab olishing the scanner completely hence

the name scannerless parsing The lexical and contextfree syntax of a lan

guage are describ ed in a single integrated uniform grammar formalism Lexical

ambiguities can frequently b e solved by means of the parsing context Lexical

structure and layout are preserved in the parse tree and thus accessible in se

mantic to ols A more expressive formalism for lexical syntax is obtained such

that for example nested comments can b e expressed

The approach encompasses an expressive syntax denition formalism A

grammar normalizer to reduce the complexity of the formalism by simplifying

syntax denitions to contextfree grammars with a few extensions An SLR

parser generator that deals with characterclasses follow restrictions and pri

ority and asso ciativity rules A generalizedLR parser that can b e used for

arbitrary contextfree grammars with reject pro ductions at least if they are not

nested In parsing unambiguous languages the GLR parser is used to dynami

cally handle lo okahead problems by forking o parsers in parallel

Reject pro ductions turn out to b e a very expressive device that brings us out

of the domain of contextfree languages It is as yet unclear how expressive this

formalism is exactly but we have a lower b oundstronger than contextfree

n n n

b ecause it describ es a b c and an upp er b ound b ecause the parsing problem

is decidable

Priorities are compiled into the parse table such that no parse trees with

priority conicts can b e pro duced by the parser This reduces the size of the

parse forest in case of ambiguous binary expressions the parse forest grows

exp onentially and decreases the numb er of paths in the graph structured stack

The technique is more general than conventional techniques for this kind of

disambiguation and works even if there remain conicts in the parse table due

to other causes For instance if the grammar requires more lo okahead than the

parser generator provides

An op en issue is the fully automatic derivation of lexical disambiguation

rules from the grammar that would make the metho d still easier to use Apart

from this minor p oint scannerless generalizedLR parsing is a feasible parsing

metho d that makes syntax denition more expressive and solves a numb er of

problems with conventional parsing approaches

Acknowledgments The author thanks Arie van Deursen Jan van Eijck An

nius Gro enink and Paul Klint for useful suggestions and comments on previous

versions of this pap er

This research was supp orted by the Netherlands Computer Science Research

Foundation SION with nancial supp ort from the Netherlands Organisation

for Scientic Research NWO Pro ject Incremental parser gener

ation and contextsensitive disambiguation a multidisciplinary p ersp ective

References

Aasa A Precedences in sp ecications and implementations of program

ming languages In J Maluzynski and M Wirsing editors Programming Lan

guage Implementation and Logic Programming volume of Lecture Notes

in Computer Science pages SpringerVerlag

Aasa A User Dened Syntax PhD thesis Department of Computer

Sciences Chalmers University of Technology and University of Goteb org S

Goteb org Sweden

Aho A V Johnson S C and Ullman J D Deterministic parsing of

ambiguous grammars Communications of the ACM

Anderson T Eve J and Horning J Ecient LR parsers Acta

Informatica

Billot S and Lang B The structure of shared forests in ambiguous pars

ing In Proceedings of the TwentySeventh Annual Meeting of the Association

for Computational Linguistics Asso ciation for Computational Linguistics

Van den Brand M G J and Visser E Generation of formatters

for contextfree languages ACM Transactions on Software Engineering and

Methodology

Van den Brand M G J Klint P Olivier P and Visser E ATerms

Representing structured data for exchange b etween heterogeneous to ols Tech

nical rep ort Programming Research Group University of Amsterdam

DeRemer F L Simple LRk grammars Communications of the ACM

Earley J An ecient contextfree parsing algorithm Communications

of the ACM

Earley J Ambiguity and precedence in syntax description Acta Infor

matica

Van Eijck J Email july

Floyd R W Syntactic analysis and op erator precedence Communica

tions of the ACM

Heering J Hendriks P R H Klint P and Rekers J The syntax

denition formalism SDF reference manual SIGPLAN Notices

Jensen K and Wirth N PASCAL User Manual and Report volume

of Lecture Notes in Computer Science SpringerVerlag Berlin second edition

edition

Kamp erman J F T GEL a graph exchange language Technical

Rep ort CSR CWI Amsterdam

Klint P A metaenvironment for generating programming environ

ments ACM Transactions on Software Engineering and Methodology

Klint P and Visser E Using lters for the disambiguation of context

free grammars In G Pighizzini and P San Pietro editors Proc ASMICS

Workshop on Parsing Theory pages Milano Italy Tech Rep

Dipartimento di Scienze dellInformazione Universita di Milano

Knuth D E On the translation of languages from left to right Infor

mation and Control

Lang B Deterministic techniques for ecient nondeterministic parsers

In J Lo eckx editor Proceedings of the Second Col loquium on Automata Lan

guages and Programming volume of Lecture Notes in Computer Science

pages SpringerVerlag

Parr T J and Quong R W Adding semantic and syntactic predicates

to LLk predLLk In P A Fritzson editor Compiler Construction

th International Conference CC volume of LNCS pages

Edinburgh UK SpringerVerlag

Rekers J Parser Generation for Interactive Environments PhD the

sis University of Amsterdam ftpftpcwinlpubgip erep ortsRekpsZ

Salomon D J and Cormack G V Scannerless NSLR parsing of

programming languages SIGPLAN Notices

Salomon D J and Cormack G V The disambiguation and scanner

less parsing of complete characterlevel grammars for programming languages

Technical Rep ort Department of Computer Science University of Man

itoba Winnip eg Canada

Tanaka H Tokunga T and Aizawa M Integration of morphological

and syntactical analysis based on GLR parsing In H C Bunt and M Tomita

editors Recent Advances in Parsing Technology pages Kluwer Aca

demic Publishers Dordrecht The Netherlands

Thorup M Ambiguity for incremental parsing and evaluation Technical

Rep ort PRGTR Program Research Group Oxford University Oxford

UK

Thorup M a Controlled grammatic ambiguity ACM Transactions on

Programming Languages and Systems

Thorup M b Disambiguating grammars by exclusion of subparse trees

Technical Rep ort Dept of Computer Science University of Cop en

hagen Denmark

Tomita M Ecient Parsing for Natural Languages A Fast Algorithm

for Practical Systems Kluwer Academic Publishers

Visser E a A case study in optimizing parsing schemata by disambigua

tion lters In S Fischer and M Trautwein editors Proceedings Accolade

pages Amsterdam The Dutch Graduate Scho ol in Logic

Visser E b A family of syntax denition formalisms In M G J

van den Brand et al editors ASFSDF A Workshop on Generating

Tools from Algebraic Specications pages Technical Rep ort P

Programming Research Group University of Amsterdam

Visser E a A case study in optimizing parsing schemata by disam

biguation lters In A Nijholt editor International Workshop on Parsing

Technology IWPT Boston USA To app ear

Visser E b Character classes Technical Rep ort P Programming

Research Group University of Amsterdam

Visser E c A family of syntax denition formalisms Technical Rep ort

P Programming Research Group University of Amsterdam

Wagner T A and Graham S L Incremental analysis for real program

ming languages SIGPLAN Notices Pro c of the ACM

SIGPLAN Conferene on Programming Language Design and Implementation

PLDI

Technical Rep orts of the Programming Research Group

Note These rep orts can b e obtained using the technical rep orts overview on

our WWW site URL httpwwwwinsuvanlresearchprogreports

or using anonymous ftp to ftpwinsuvanl directory

pubprogrammingresearchreports

P L Mo onen A Generic Architecture for Data Flow Analysis to Support

Reverse Engineering

P B Luttik and E Visser Specication of Rewriting Strategies

P JA Bergstra and MPA Sellink An Arithmetical Module for Ratio

nals and Reals

P E Visser Character Classes

P E Visser Scannerless generalizedLR parsing

P E Visser A family of syntax denition formalisms

P MGJ van den Brand MPA Sellink and C Verho ef Generation

of Components for Software Renovation Factories from Contextfree

Grammars

P PA Olivier Debugging Distributed Applications Using a Coordination

Architecture

P HP Korver and MPA Sellink A Formal Axiomatization for Alpha

bet Reasoning with Parametrized Processes

P MGJ van den Brand MPA Sellink and C Verho ef Reengineering

COBOL Software Implies Specication of the Underlying Dialects

P E Visser Polymorphic Syntax Denition

P MGJ van den Brand P Klint and C verho ef Reengineering needs

Generic Programming Language Technology

P PI Manuel ANSI Cobol III in SDF an ASF Denition of a YK

Tool

P PH Ro denburg A Complete System of Fourvalued Logic

P SP Luttik and PH Ro denburg Transformations of Reduction Sys

tems

P MGJ van den Brand P Klint and C Verho ef Core Technologies

for System Renovation

P L Mo onen Data Flow Analysis for Reverse Engineering

P JA Hillebrand Transforming an ASFSDF Specication into a Tool

Bus Application

P MPA Sellink On the conservativity of Leibniz Equality

P TB Dinesh and SM Uskudarl Specifying input and output of visual

languages

P TB Dinesh and SM Uskudarl The VAS formalism in VASE

P JA Hillebrand A smal l language for the specication of Grid Proto

cols

P JJ Brunekreef A transformation tool for pure Prolog programs the

algebraic specication

P E Visser Solving type equations in multilevel specications prelim

inary version

P PR DArgenio and C Verho ef A general conservative extension the

orem in process algebras with inequalities

Pb JA Bergstra and MPA Sellink Sequential data algebra primitives

revised version of P

P E Visser Multilevel specications

P MGJ van den Brand P Klint and C Verho ef Reverse engineering

and system renovation an annotated bibliography

P JA Bergstra and MPA Sellink Sequential data algebra primitives

P PA Olivier Embedded system simulation testdriving the ToolBus

P JJ Brunekreef TransLog an interactive tool for transformation of

logic programs

P JA Bergstra JA Hillebrand and A Ponse Grid protocols based on

synchronous communication specication and correctness

P PH Ro denburg Termination and conuence in innitary term

rewriting

P JA Bergstra and Gh Stefanescu Network algebra with demonic re

lation operators

P JA Bergstra CA Middelburg and Gh Stefanescu Network algebra

for synchronous and asynchronous dataow

P E Visser A case study in optimizing parsing schemata by disambigua

tion lters

P MGJ van den Brand and E Visser Generation of formatters for

contextfree languages

P JMT Romijn Automatic analysis of term rewriting systems proving

properties of term rewriting systems derived from AsfSdf specica

tions