Eindhoven University of Technology

MASTER

Disambiguation mechanisms and disambiguation strategies

ten Brink, A.P.

Award date: 2013

Link to publication


Disambiguation mechanisms and disambiguation strategies

Master’s Thesis
Section Model Driven Software Engineering

Supervisor: prof. dr. Mark van den Brand (TU/e)
Date: August 23, 2013

Alex P. ten Brink


Abstract

It is common practice for engineers to express artifacts such as domain specific languages using ambiguous context-free grammars along with separate disambiguation rules, rather than to use equivalent unambiguous context-free grammars, as these ambiguous grammars often capture the intent of the engineer more directly. In this thesis we give an overview of the landscape of these disambiguation rules as described in the literature. In order to make proper comparisons between publications, we define a parser-technology independent framework for describing work done on disambiguation, using the notions of a disambiguation mechanism and a disambiguation strategy.

We investigate for a number of disambiguation mechanisms what grammars they can disambiguate, both relative to each other and in absolute terms: for example, we prove that the LR shift-reduce conflict resolution mechanism (used in the well-known tools Yacc and Bison) correctly disambiguates expression grammars with unary and binary operators. To the best of our knowledge, previous work only considered the correctness of this mechanism on specific grammars.

We define six quality measures on disambiguation strategies and compare the strategies introduced in the literature to each other on these measures. Finally, we introduce our own strategy based on regular rewriting rules. Our strategy scores well on all but one of our quality measures. Its weakness is that it may not succeed on all grammars. Our strategy is very extensible however, and we give multiple options to improve its applicability in future work.


Acknowledgements

I would first like to thank my parents for their unconditional support, not only while I was working on my thesis but throughout my academic career (and before that) – I would not have gotten to where I am without them. I would also like to thank the friends I made over the past five years for making my student life fun rather than a chore. In particular I thank Sander and Quirijn, who made even the contents of courses fun to work through, and Ana, for making the time I spent at the TU working on this thesis that much more enjoyable.

I would like to thank Mark van den Brand for his guidance, feedback and insights, both on this thesis and on related work. The other major influence on this thesis and related work is Elisabeth Scott, whom I’d like to thank for her feedback, ideas and time, and in particular her remarks whenever I was not being precise enough – though I fear I may get a few more still. I would also like to thank Adrian Johnstone for reviewing part of this thesis.

Finally, I would like to thank Tom Verhoeff and Kevin Buchin for serving on my examination committee.

Alex P. ten Brink

Eindhoven, August 23, 2013


Contents

Abstract

Acknowledgements

1 Introduction

2 Preliminaries

3 Disambiguation Mechanisms
  3.1 Introduction
  3.2 Disambiguation Mechanisms Introduced in the Literature
  3.3 Comparison of Disambiguation Mechanisms

4 Applying Disambiguation Mechanisms to Expression Grammars
  4.1 Real-World Example of a Hard Disambiguation Problem
  4.2 Expression Grammars
  4.3 Defining Precedences and Associativities
  4.4 Disambiguating Expression Grammars with Binary Operators
  4.5 Adding Unary Operators

5 Disambiguation Strategies
  5.1 Introduction
  5.2 Disambiguation Strategies Introduced in the Literature

6 A rewriting-based disambiguation strategy
  6.1 Introduction
  6.2 Systems of Regular Expressions
  6.3 Rewriting Rules for Regular Expressions
    6.3.1 Defining Rewriters
    6.3.2 Our Rewriters
  6.4 Structured Regular Expressions
  6.5 Rewriting Rules for Structured Regular Expressions
    6.5.1 Defining Rewriters
    6.5.2 Enforcing Filters
    6.5.3 Our Rewriters

7 Implementation

8 Conclusion

Chapter 1

Introduction

Context-free grammars are widely used tools for describing the syntactic structure of programming languages. They are often used to describe the second step in the analysis of an input program, after (optional) tokenization and before semantic analysis. An advantage of this description is that for any context-free grammar it is possible to automatically generate a program, called a parser, that can compute this syntactic structure from an input, represented as a string. A disadvantage is that some context-free grammars contain ambiguities: a single input may be parsed to multiple parse trees.

For programming languages it is desirable that a program has only one interpretation. Therefore one would like to only use unambiguous context-free grammars. However, in some cases the most natural context-free description of a programming language is an ambiguous one. An example is arithmetic expressions: the most natural grammar for them leaves open whether 1 + 2 ∗ 3 should be parsed as (1 + 2) ∗ 3 or as 1 + (2 ∗ 3). Some additional form of disambiguation is then needed: in our example we would need to specify whether ∗ has priority over +. Often, an unambiguous grammar capturing this additional disambiguation can be found, but these grammars are often much larger, less intuitive and less maintainable, as they may not capture the intent of the language designer.
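The ambiguity of 1 + 2 ∗ 3 can be made concrete with a small sketch (an illustration of ours, not part of the thesis): encoding the two parse trees as nested tuples and evaluating them shows that the choice of tree changes the meaning of the input.

```python
# Two parse trees for "1 + 2 * 3" under the ambiguous grammar
# E -> E + E | E * E | number, encoded as nested tuples (op, left, right).
mul_binds_tighter = ("+", 1, ("*", 2, 3))   # 1 + (2 * 3)
add_binds_tighter = ("*", ("+", 1, 2), 3)   # (1 + 2) * 3

def evaluate(tree):
    """Evaluate a tuple-encoded expression tree."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    return evaluate(left) * evaluate(right)

print(evaluate(mul_binds_tighter))  # 7, the conventional interpretation
print(evaluate(add_binds_tighter))  # 9
```

Both trees yield the same input string, so the grammar alone cannot tell us which of the two values is intended.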

Mechanisms and Strategies

In practice it is therefore often preferable to use a syntactic description with two parts: an ambiguous context-free grammar along with disambiguation rules. There are many ways to achieve this separation, and many such methods have been described in the literature. We call everything involved in such a method – the allowed disambiguation rules, the way these disambiguation rules are applied in the parser, the way the language designer is helped with choosing these rules and with understanding the effects of the rules – a disambiguation strategy.

Most strategies have at their core a specific mechanism that specifies how inputs should be parsed (or in some cases, multiple mechanisms). Strategies (usually) detect ambiguities in the grammar and then find the right parameters for the mechanisms they employ so the inputs are parsed as specified. As a number of such disambiguation mechanisms are very common, it is useful to analyze them separately from the strategies that use them. When we investigate a strategy, we can use this knowledge of the mechanisms it uses to immediately say something about the strategy.

As an example of a strategy, consider the disambiguation strategy typical for users of Yacc or Bison. First, they create an initial grammar, generate the LR table and analyze the resulting conflicts by hand. They then specify precedence and associativity rules for their grammar that

Yacc or Bison then uses to resolve the conflicts in the table. The disambiguation mechanism involved in this strategy is the resolution of LR table conflicts. Note that this is not really a good strategy, as the user needs to do most of the work himself and the strategy may fail to work as not all ambiguities can be resolved through LR table conflict resolution.

While many disambiguation mechanisms and strategies have been proposed, not much work has gone into their evaluation and comparison. For example, to the best of our knowledge the correctness of the mechanism employed by Yacc and Bison has only been considered on examples. This thesis takes some initial steps in this area by defining our parser-technology independent framework consisting of the notions of disambiguation mechanisms and disambiguation strategies and by addressing the following questions:

• What are relevant quality criteria for disambiguation mechanisms?

• Which disambiguation mechanisms have been introduced in the literature?

• How good are the disambiguation mechanisms we have found in the literature?

• What are relevant quality criteria for disambiguation strategies?

• Which disambiguation strategies have been introduced in the literature?

• How good are the disambiguation strategies we have found in the literature?

Disambiguation Mechanisms

The two relevant quality criteria for disambiguation mechanisms are their applicability (how many grammars they correctly disambiguate) and whether they can be efficiently implemented. Applicability is not entirely straightforward. It is not enough for the mechanism to be able to disambiguate a grammar, as we also want the mechanism to return the desired parse tree for any input, and not just any parse tree. The difference is clear when looking at expression grammars: we want to be able to specify the priorities of the operators and have the mechanism return the parse tree that respects this.

We investigate the applicability of disambiguation mechanisms in two ways. The first considers relative applicability by investigating whether some mechanisms are strictly more powerful than others. The second considers absolute applicability by picking an interesting set of grammars and investigating which mechanisms are applicable to all these grammars. In this thesis, we choose a set of expression grammars with unary and binary operators for our investigation. We perform this evaluation for three of the mechanisms we describe.

Out of the mechanisms we list, we investigate the applicability of the following disambiguation mechanisms: precede and follow restrictions, LR shift-reduce conflict resolution and forbidden patterns. Their relative applicabilities turn out to be incomparable: for any selection of these mechanisms, there is a grammar for which none of the mechanisms in the selection work, but any one of the mechanisms that were not selected does work. In the category of absolute applicability, we prove that they can all disambiguate expression grammars with unary and binary operators, with the exception that forbidden patterns can only deal with binary operators and not (all) unary operators. The above considerations on disambiguation mechanisms can be found in Chapter 3.
The investigations of the absolute applicabilities of the mechanisms can be found in Chapter 4.

Disambiguation Strategies

For disambiguation strategies we propose three categories of criteria: applicability, ease of use and implementation efficiency. Applicability not only consists of how many grammars the strategy works on, but also whether the desired parse trees are returned. We divide ease of use into three parts: whether all ambiguities are detected, whether there is some guarantee that the language of the grammar is preserved and whether the user can easily understand the effects of the options that the strategy offers him (for example, is the user helped with understanding what happens if we remove a reduce option in an LR table). Note that a large part of the three criteria depends on the underlying mechanisms: for example, if a strategy uses a mechanism with limited applicability, it will itself have limited applicability as well.

We propose six quality measures, namely ‘does the strategy guarantee that the language of the grammar is preserved?’, ‘does the strategy guarantee no ambiguity remains in the grammar?’, ‘can the disambiguation rules be efficiently applied?’, ‘can the user choose between different disambiguations?’, ‘is it easy for the user to understand the consequences of his choices?’ and ‘in how many cases does the strategy work?’. We evaluate previously published disambiguation strategies based on these measures. The above considerations on disambiguation strategies are found in Chapter 5.

A Strategy Based on Rewriting Rules

Finally, we present our own strategy based on rewriting rules. We observe that the language described by an expression grammar with binary and unary operators is actually a regular language. We generalize the algorithm used to convert regular grammars to regular expressions, which is an algorithm that repeatedly applies rewriting rules, by adding more rewriting rules, allowing it to work on, for example, expression grammars. As many properties of regular expressions, such as ambiguity and language equality, admit decision procedures, we obtain two powerful tools for our strategy.

We define structured regular expressions, which preserve the structure of the original grammar. In a second algorithm we rewrite context-free grammars to such structured regular expressions. During this second rewriting, we introduce disambiguation rules and finally we check the expression for ambiguity. We end up with two regular expressions: the first preserves the language of the original grammar and the second is unambiguous and adheres to the disambiguation rules. We compute whether their languages are the same, in which case the disambiguation rules have fully disambiguated the grammar without removing words from its language.

Our strategy scores highly on all but one of our quality measures. Its only weakness is that it may not work on all ambiguous grammars encountered in practice. We describe several options for extending our strategy to overcome this problem. Firstly, we can introduce new rewriting rules to our rewriting algorithms, allowing them to rewrite more grammars. Secondly, we note that language equality is decidable on types of grammars other than regular grammars [7, 16, 17, 18, 29, 32]: even if our rewriters do not manage to rewrite the grammar into a regular expression, they may still detect and remove enough ambiguities for the grammar to become unambiguous and amenable to a language equality procedure.
Thirdly, we note that some filtering techniques have been developed [6, 24], which may help to remove parts of the grammar that do not cause ambiguity but obstruct our rewriters due to non-regularity. We describe a ‘basic’ version of our strategy in this thesis, describing rewriters that can turn any regular grammar or expression grammar into a regular expression and any regular grammar into a structured regular expression – defining a rewriter that turns expression grammars into

structured regular expressions turned out to be so complicated that we did not manage to include it in full generality in this thesis. We do define a rewriter that forces any structured regular expression to adhere to a set of (specific) disambiguation rules, which would be a useful component of such a rewriter for expression grammars. Our strategy can be found in Chapter 6.

Finally, we present our mostly finished proof-of-concept implementation of our strategy. The implementation does not completely implement our basic strategy as outlined in this thesis: it does not suggest disambiguation rules, nor does it enforce them in structured regular expressions. The proof-of-concept has been useful in finding the right rewriting rules for our strategy, and as part of future work it may be a useful basis to implement an advanced version of our strategy. The details of this implementation can be found in Chapter 7.

Chapter 2

Preliminaries

We assume basic familiarity with strings and context-free grammars. For a thorough introduction to these topics, see any basic book on formal language theory, for example [4]. In order to follow the proofs of our results on LR shift-reduce disambiguation, in particular Lemma 9, we assume the reader knows about the construction of LR automata, LR tables and the LR algorithm in general. This knowledge can be found, for example, in the aforementioned book [4], but is not needed to follow the rest of the thesis.

Context-free Grammars

A context-free grammar is a quadruple G = (T, N, S, P). T is a set of terminals, N a set of nonterminals (T ∩ N = ∅), S ∈ N the starting symbol, V = N ∪ T the set of vocabulary symbols and P ⊆ N × V* is a set of productions, where * is the Kleene star operation. If (A, α) ∈ P, we call A the left-hand side and α the right-hand side of the production.

We will use the standard notational convention that lowercase Latin letters a, b, c, ... are terminals, uppercase Latin letters A, B, C, ... are nonterminals and Greek letters α, β, γ, ... are strings of vocabulary symbols. If no confusion can arise, we sometimes implicitly use a variable such as i, j, k, l, m, for example in a_i – these variables should be understood as ranging over the allowed values at that position. We denote a production as A → α (where α may be replaced by a more specific string of vocabulary symbols). ε denotes the empty string in V*. We denote concatenation of strings by juxtaposition and denote the Kleene star operation by *.

We define the derives relation ⇒ as follows: if A → α ∈ P, then βAγ ⇒ βαγ. Let ⇒* be the transitive closure of ⇒. SF(G) = {α ∈ V* | S ⇒* α} is the set of sentential forms of the grammar and L(G) = T* ∩ SF(G) the language of the grammar.
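The quadruple definition and the derives relation can be encoded directly, as the following sketch shows (an illustration of ours; the grammar S → aSb | ε is a standard textbook example, not one from this thesis):

```python
# G = (T, N, S, P) for the grammar S -> aSb | epsilon.
# Sentential forms are tuples of vocabulary symbols; () encodes epsilon.
T = {"a", "b"}
N = {"S"}
S = "S"
P = [("S", ("a", "S", "b")), ("S", ())]

def derive_step(form):
    """All sentential forms reachable from `form` by one application of =>."""
    successors = []
    for i, symbol in enumerate(form):
        if symbol in N:
            for lhs, rhs in P:
                if lhs == symbol:
                    # Replace the nonterminal occurrence by the right-hand side.
                    successors.append(form[:i] + rhs + form[i + 1:])
    return successors

# S => aSb => aaSbb => aabb, so "aabb" is in L(G).
print(derive_step(("S",)))          # [('a', 'S', 'b'), ()]
print(derive_step(("a", "S", "b"))) # [('a', 'a', 'S', 'b', 'b'), ('a', 'b')]
```

Iterating `derive_step` from (S) and keeping the all-terminal forms enumerates L(G).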

Derivation Trees

We now define the notion of a derivation tree. PT_G will denote the set of derivation trees with respect to G – we drop the subscript if G is clear from the context. Derivation trees are ordered trees whose leaf nodes are labeled with a terminal or ε, whose internal nodes are labeled with a nonterminal and for which it holds that if an internal node n labeled A has children labeled x_1, ..., x_k (in that order), then (A, x_1 ... x_k) ∈ P – if x_1 ... x_k = ε, the node n has a single child labeled ε. We denote the root of t ∈ PT as root(t). Let t ∈ PT and let n be an internal node of t labeled A with children labeled x_1, ..., x_k (in that order), then head(n) = A and prod(n) = A → x_1 ... x_k. We require that for a derivation tree t, head(root(t)) = S, where S is the starting symbol of G.

We denote the set of nodes of a derivation tree t as ST(t). As derivation trees are ordered trees, we can use several well-known terms: in particular, we will talk about children and descendants. We will call a sequence n_1, ..., n_k of nodes of a derivation tree t a chain of direct descendants if n_i is a child of n_{i−1} for all 1 < i ≤ k.

Operations on Derivation Trees

Let t ∈ PT. For n ∈ ST(t) we define the yield y(n) as follows. If n is a leaf node, y(n) is the label of n. If n is an internal node with children x_1, ..., x_k (in that order) then y(n) = y(x_1) ... y(x_k). We define the yield of t to be the yield of root(t). Note that {y(t) | t ∈ PT} = L(G).

For every n ∈ ST(t) we define the extent e(n) = (i_n, j_n) of that node, which is a pair of integers. We order the leaf nodes of t as n_1, ..., n_l through an in-order traversal of the tree. We now define e(n) inductively. Set i_{n_1} := 0. For all 1 ≤ x ≤ l, if n_x is labeled ε, we set j_{n_x} := i_{n_x}, and we set j_{n_x} := i_{n_x} + 1 otherwise; for all 1 ≤ x < l we set i_{n_{x+1}} := j_{n_x}. For an internal node n, i_n is the i-value of its first leaf descendant and j_n the j-value of its last leaf descendant. Writing y(t) = w_1 ... w_m, we define First(n) = w_{i_n+1} and Last(n) = w_{j_n} when y(n) ≠ ε, and Precede(n) = w_{i_n} and Follow(n) = w_{j_n+1} (these values are undefined where the indices fall outside 1, ..., m).

For a production p ∈ P, we define the first set First(p) and last set Last(p) as the sets of First and Last values for all nodes n with prod(n) = p over all t ∈ PT. Note that there exist methods to compute these sets efficiently [27].

Tree Patterns

We define the set of tree patterns TP. Tree patterns are also ordered trees whose leaf nodes are labeled with either a terminal, ε or a nonterminal (making them different from derivation trees), whose internal nodes are labeled with a nonterminal and for which it holds that if an internal node labeled A has children x_1, ..., x_k (in that order), then (A, x_1 ... x_k) ∈ P. We say that p ∈ TP matches t ∈ PT if p has a label-preserving homomorphism to t – that is, a function h from ST(p) to ST(t) exists such that if a node n in p has children n_1, ..., n_k (in that order), then h(n) also has exactly the children h(n_1), ..., h(n_k) (in that order).

Let w ∈ L(G). We define T(w) = {t ∈ PT | y(t) = w}. If |T(w)| > 1 we say that w has multiple parses. If some w ∈ L(G) has multiple parses then G is said to be ambiguous.
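The notions of yield and pattern matching can be sketched concretely (the tuple encoding, with `None` children marking a nonterminal leaf of a pattern, is our own, not the thesis's formalization):

```python
# Derivation trees and tree patterns as ("Label", children) pairs, where
# children is a tuple; leaves are plain strings ("" encodes an epsilon leaf).
# In a pattern, children=None marks a nonterminal leaf, which matches any
# node carrying that label.
def yield_of(node):
    """y(n): concatenation of the leaf labels below n, left to right."""
    if isinstance(node, str):
        return node
    _label, children = node
    return "".join(yield_of(child) for child in children)

def matches_at(pattern, tree):
    """Label-preserving homomorphism from the pattern into the root of `tree`."""
    if isinstance(pattern, str):              # terminal or epsilon leaf
        return pattern == tree
    plabel, pchildren = pattern
    if isinstance(tree, str) or plabel != tree[0]:
        return False
    if pchildren is None:                     # nonterminal leaf of the pattern
        return True
    tchildren = tree[1]
    return len(pchildren) == len(tchildren) and all(
        matches_at(p, c) for p, c in zip(pchildren, tchildren))

# A derivation tree for "a+a" under E -> E + E and E -> a:
t = ("E", (("E", ("a",)), "+", ("E", ("a",))))
print(yield_of(t))                                            # a+a
print(matches_at(("E", (("E", None), "+", ("E", None))), t))  # True
```

Matching anywhere in a tree (as the definition of "matches" requires) is then a recursion of `matches_at` over all nodes.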

Tree Rotations

Finally, we define the well-known concept of a tree rotation for a derivation tree [9, p. 315-317]. There are two tree rotation operations: the left rotation and the right rotation. They work as follows: a right rotation replaces a node labeled A whose children are a node labeled A with children α A′ followed by β, by a node labeled A with children α followed by a node labeled A with children A′ β. A left rotation is the inverse operation. In these patterns, A is some nonterminal, α and β are sequences of children and A′ is a node labeled A that is allowed to have further children. The patterns may occur anywhere in the derivation tree, so the roots of these patterns may themselves be children of other nodes in the derivation tree. Note that rotations are yield-preserving.
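One reading of the rotation operations – a right rotation turning A[A[α A′] β] into A[α A[A′ β]] – can be sketched in code (our own tuple encoding and reading; a sketch, not the thesis's formalization):

```python
# Trees as ("Label", children-tuple); leaves are strings. A right rotation,
# read as A[ A[alpha..., A'], beta... ]  =>  A[ alpha..., A[A', beta...] ].
def yield_of(node):
    return node if isinstance(node, str) else "".join(yield_of(c) for c in node[1])

def right_rotate(node):
    label, children = node
    first = children[0]
    assert not isinstance(first, str) and first[0] == label, \
        "the pattern requires the first child to be labeled like the root"
    alpha, a_prime = first[1][:-1], first[1][-1]
    beta = children[1:]
    return (label, alpha + ((label, (a_prime,) + beta),))

# Rotating the left-nested tree for "a+b+c" produces a right-nested one;
# the yield is preserved.
left_nested = ("A", (("A", (("A", ("a",)), "+", ("A", ("b",)))), "+", ("A", ("c",))))
rotated = right_rotate(left_nested)
print(yield_of(left_nested) == yield_of(rotated))  # True
```

Repeated rotations are exactly how a disambiguation step can move between the differently associated trees of an expression without changing the parsed input.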


Chapter 3

Disambiguation Mechanisms

3.1 Introduction

Throughout this section G shall denote a context-free grammar. We now introduce the concept of a disambiguation filter, which is an adapted version of the original definition by Visser [33]. A disambiguation filter F is a function from sets of derivation trees to sets of derivation trees with the property that if Φ is a set of derivation trees, then F(Φ) ⊆ Φ. Given a set of derivation trees Φ we define F(Φ, w) = {t ∈ F(Φ) | y(t) = w}. We say a filter is completely disambiguating if for all w ∈ L(G) we have |F(PT, w)| ≤ 1. We say a filter is correctly disambiguating if for all w ∈ L(G) we have |F(PT, w)| ≥ 1.

We say a function D from L(G) to PT is a goal if y(D(w)) = w for all w ∈ L(G). Given a goal D we say that a filter F implements D if for all w ∈ L(G) we have F(PT, w) = {D(w)}. Note that this means that F is correctly and completely disambiguating.

A disambiguation mechanism is a function that takes a grammar and auxiliary data relating to the grammar and results in a filter for that grammar. We do not specify the nature of the auxiliary data on purpose, to allow maximal flexibility. Note that the disambiguation mechanisms and the resulting filters that we will define in this thesis are almost all parser-technology independent: we only define mathematically how to disambiguate, but we leave open how to implement the filter. We do consider possible implementations to assess the efficiency of the mechanisms.

The two interesting criteria for disambiguation mechanisms are applicability and efficiency. A disambiguation mechanism is efficient if we can generate fast parsers that implement the filters it generates. The applicability of a mechanism corresponds with whether it can implement many interesting goals for interesting grammars. We can investigate this in two different ways. The first way is to find a grammar G such that mechanism A can completely and correctly disambiguate G, while mechanism B cannot, or prove that no such grammar exists. This gives insight into the relative applicability of mechanisms A and B. The second way is to pick an interesting set of grammars and interesting sets of goals for these grammars and investigate whether a mechanism A can implement these goals for these grammars. We will look at expression grammars with binary and unary operators, and goals corresponding to all possible combinations of precedences and associativities. This gives insight into the absolute applicability of a mechanism A.

Note that for absolute applicability, we are not satisfied if our mechanism is correct and complete: we pick specific goals that we want our mechanisms to implement. This models the wish of grammar engineers to have their inputs be disambiguated to the interpretation they have in mind. We can immediately rule out the possibility that a computable mechanism can implement all possible goals for even a single grammar G such that there are infinitely many w ∈ L(G) with multiple parses:

Lemma 1 For any grammar G with the property that there are infinitely many w ∈ L(G) with multiple parses, there exists a goal D for this grammar such that D is uncomputable.

Proof. By a counting argument. The set of words that have multiple parses is infinite and countable, and for every such word there are at least two choices to make, so the set of goals is uncountable. As there are only countably many Turing machines, there is a goal that is uncomputable. 

We should therefore consider a set of goals that would be useful in practice and then consider whether our mechanisms can implement those goals.
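The filter definitions above can be made concrete with a small sketch (an illustration of ours; trees are nested tuples and the infinite set PT is cut down to the finite set of trees for one word):

```python
# A disambiguation filter maps a set of trees to a subset of it. This one
# keeps only left-nested trees of E -> E + E, E -> a (encoded as
# ("E", children) tuples with string leaves).
def yield_of(node):
    return node if isinstance(node, str) else "".join(yield_of(c) for c in node[1])

def right_nested(tree):
    """Is the right operand of some + node itself a + node?"""
    if isinstance(tree, str) or len(tree[1]) != 3:
        return False
    _, (_left, _plus, right) = tree
    if not isinstance(right, str) and len(right[1]) == 3:
        return True
    return any(right_nested(c) for c in tree[1])

def left_assoc_filter(trees):
    return {t for t in trees if not right_nested(t)}

a = ("E", ("a",))
left = ("E", (("E", (a, "+", a)), "+", a))      # (a+a)+a
right = ("E", (a, "+", ("E", (a, "+", a))))     # a+(a+a)

surviving = left_assoc_filter({left, right})
# Restricted to "a+a+a", the filter is completely and correctly
# disambiguating: exactly one tree with that yield survives.
trees_for_word = {t for t in surviving if yield_of(t) == "a+a+a"}
print(len(trees_for_word))  # 1
```

A goal D would map each word to its unique left-nested tree; the filter above implements it for this grammar.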

3.2 Disambiguation Mechanisms Introduced in the Literature

We will now list some commonly used mechanisms described in the literature. We will also make a note of their efficiency. We note in advance that, with the exception of avoid and prefer rules, these mechanisms may not always result in filters that retain the language of the grammar.

Forbidden Patterns

The first disambiguation mechanism we will discuss is the mechanism of forbidden patterns [30]. It is often used to disambiguate expression grammars, even though, as we will show later, it is actually not a powerful enough tool to do that in all cases. Afroozeh gives several examples of forbidden patterns in his master thesis [1]. The auxiliary data is a set of tree patterns Q. The returned filter takes a set of derivation trees Φ and returns {t ∈ Φ | ¬∃p ∈ Q : p matches t}. For example, consider the following grammar, using T = {+, } and N = {E}:

E → E + E
E → 

Using as Q the single tree pattern whose root is labeled E with children E + E, and whose rightmost child E itself has children E + E, the forbidden pattern disambiguation mechanism completely and correctly disambiguates the grammar. The derivation trees it does not remove are precisely the left-associative ones. As it is unclear how to implement this mechanism if Q is allowed to contain infinitely many patterns, we restrict Q to be finite. In some cases, one may find an efficient implementation for specific infinite sets Q, but in this case it may be more prudent to consider this a new mechanism rather than an efficient implementation of a special case of the forbidden pattern mechanism.

If Q is finite, then it can be implemented in multiple ways. It is easy to do a post-processing pass. Algorithms also exist that rewrite the grammar to adhere to forbidden patterns, see for example Thorup [31], but these algorithms may exponentially increase the size of the grammar in the worst case. Finally, many parsing algorithms can be modified to never create the parse trees violating the patterns in the first place. For example, when generating an LR automaton, one can incorporate patterns with exactly two non-leaf nodes (one of which is the root) into the ‘closure’ step. This step adds items of the form A → •α to the item set if an item of the form B → β • Aγ is present in the set. The modification is then to not add A → •α if the tree pattern with root B having children β A γ, in which the node labeled A has children α, is in Q. Most patterns, such as the ones we will give for expression grammars with binary operators, are of this form. Hence, we will consider forbidden patterns to be efficient.
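The post-processing reading of the filter can be sketched as follows (our own encoding; since the second production of the example grammar is elided in the text above, we use E → a as a stand-in):

```python
# A post-processing implementation of the forbidden pattern filter:
# {t in Phi | not exists p in Q: p matches t}. Trees and patterns are
# ("Label", children) tuples; in a pattern, children=None is a nonterminal
# leaf matching any node with that label.
def matches_at(pat, tree):
    if isinstance(pat, str):
        return pat == tree
    plabel, pchildren = pat
    if isinstance(tree, str) or plabel != tree[0]:
        return False
    if pchildren is None:
        return True
    return len(pchildren) == len(tree[1]) and all(
        matches_at(p, c) for p, c in zip(pchildren, tree[1]))

def matches(pat, tree):
    """p matches t if the homomorphism exists at any node of t."""
    if matches_at(pat, tree):
        return True
    return not isinstance(tree, str) and any(matches(pat, c) for c in tree[1])

def forbidden_pattern_filter(Q):
    return lambda trees: {t for t in trees if not any(matches(p, t) for p in Q)}

# Forbid E + E as the right operand of another E + E (right-nested trees):
pattern = ("E", (("E", None), "+", ("E", (("E", None), "+", ("E", None)))))
a = ("E", ("a",))
left = ("E", (("E", (a, "+", a)), "+", a))
right = ("E", (a, "+", ("E", (a, "+", a))))
print(forbidden_pattern_filter({pattern})({left, right}) == {left})  # True
```

The surviving trees are exactly the left-associative ones, matching the example in the text.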

Reject Rules

Our second disambiguation mechanism is that of reject rules [23]. It is often used to filter keywords from identifier nonterminals. The auxiliary data is a set of nonterminal-string pairs Q (note that this is therefore a different Q from the one used for the previous mechanism). The returned filter takes a set of derivation trees Φ and returns {t ∈ Φ | ¬∃(A, w) ∈ Q, s ∈ ST(t) : A = head(s) ∧ y(s) = w}. Often, if Q is allowed to be infinite, Q is described by a set of rules (A, B), where B is a regular expression or even a context-free grammar (usually a nonterminal of the same grammar), forbidding A from producing any string in the language of B. If B is regular then the result will remain context-free, but if B is only context-free the resulting recognized language may no longer be context-free. As this mechanism is usually used to disambiguate lexical ambiguities and because it is not always clear how to implement it efficiently, it is less interesting for us, and so we will not study it in detail.
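A post-processing reading of the reject-rule filter can be sketched as follows (tuple encoding and the keyword example are our own illustration):

```python
# Reject rules: remove any tree containing a node labeled A whose yield is w,
# for some (A, w) in Q. Trees are ("Label", children) tuples, leaves strings.
def yield_of(node):
    return node if isinstance(node, str) else "".join(yield_of(c) for c in node[1])

def violates(tree, Q):
    if isinstance(tree, str):
        return False
    label, children = tree
    if any(label == A and yield_of(tree) == w for A, w in Q):
        return True
    return any(violates(child, Q) for child in children)

def reject_filter(Q):
    return lambda trees: {t for t in trees if not violates(t, Q)}

# Filter the keyword "if" out of the Id nonterminal:
keyword_tree = ("Id", ("i", "f"))
name_tree = ("Id", ("i", "d"))
print(reject_filter({("Id", "if")})({keyword_tree, name_tree}) == {name_tree})  # True
```

For an infinite Q given by a regular expression B, the membership test `yield_of(tree) == w` would become a match against B instead.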

Precede-follow restrictions

The third mechanism that can be described easily with our definitions is that of the precede and follow restrictions. A similar mechanism is the adjacency restriction [23], which is used to deal with lexical ambiguities and is therefore not treated here. Adjacency restrictions inspired follow restrictions [33], also a mechanism used for lexical ambiguities, which in turn inspired precede and follow restrictions. This mechanism has not been researched extensively in the literature, but will turn out to be quite powerful. The auxiliary data Q for this mechanism is a pair (P, F), where P and F are sets of production-terminal pairs. The returned filter takes a set of derivation trees Φ and returns the set of derivation trees such that for pairs (p, t) ∈ P, no node corresponding to the production p has t as its precede value, and similarly for pairs in F and follow values. Formally, it returns the set {t ∈ Φ | ¬∃(A → α, p) ∈ P, (B → β, f) ∈ F, s ∈ ST(t) : (A → α = prod(s) ∧ Precede(s) = p) ∨ (B → β = prod(s) ∧ Follow(s) = f)}. We will denote this mechanism as PrecedeFollow.

Precede-follow restrictions [8] are again amenable to a post-processing stage and to grammar rewriting, although this method of rewriting grammars has not been published yet to the best of our knowledge and current algorithms do not always manage to find a rewritten grammar. Precede restrictions are easily implemented in algorithms that ‘predict’ new occurrences of

production rules while parsing, such as Earley [11] or GLL [25, 26]: we simply do not perform the predict if the preceding input symbol is forbidden. Follow restrictions are also easy in Earley or GLL: we simply do not ‘reduce’ or ‘pop’ if the next terminal is forbidden. LR-based algorithms can also do this by removing reduce actions on certain lookaheads. Precede restrictions are harder in LR: we can sometimes achieve this by not adding an item A → •α in the ‘closure’ step if the incoming arrow of the LR state is a forbidden preceding terminal for A → α. Our conclusion is that we can consider precede-follow restrictions to be efficient.
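As a post-processing sketch, the Precede and Follow values of each node can be computed from the extents of its yield and checked against the restrictions (tuple encoding and helper names are our own):

```python
# Trees are ("Label", children) tuples; leaves are one-character terminals
# (or "" for epsilon). We collect, per internal node, its production and the
# extent (start, end) of its yield within the input word.
def annotate(tree, pos=0, out=None):
    if out is None:
        out = []
    if isinstance(tree, str):
        return pos + len(tree), out
    label, children = tree
    start = pos
    for child in children:
        pos, _ = annotate(child, pos, out)
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    out.append(((label, rhs), start, pos))
    return pos, out

def violates(tree, word, precede_q, follow_q):
    """Does any node violate a restriction in P (precede) or F (follow)?"""
    _, nodes = annotate(tree)
    for prod, start, end in nodes:
        precede = word[start - 1] if start > 0 else None
        follow = word[end] if end < len(word) else None
        if (prod, precede) in precede_q or (prod, follow) in follow_q:
            return True
    return False

a = ("E", ("a",))
t = ("E", (a, "+", a))
# Forbid a node built with E -> a from being followed by '+':
follow_q = {(("E", ("a",)), "+")}
print(violates(t, "a+a", set(), follow_q))  # True
print(violates(t, "a+a", set(), set()))     # False
```

The filter then keeps exactly the trees for which `violates` is false.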

LR shift-reduce conflict resolution

An example of a parser-specific disambiguation mechanism is the mechanism that involves resolving conflicts from an LR parsing table [2, 3, 28]. For every conflict in an LR parsing table we specify one of the alternative actions involved in the conflict in Q and then prune the table of all other actions involved in the conflicts. Although simple to explain informally, a formal description of this mechanism would be tedious and uninformatively complicated so we will not give one. Note that the well-known tools Yacc and Bison use this mechanism, but the method they use to specify how conflicts should be resolved is more coarse-grained and cannot specify all possible ways of removing conflicts from an LR parsing table. This mechanism of resolving conflicts will turn out to be surprisingly powerful despite its simplicity. We will denote this mechanism as ShiftReduce. Efficiency is easily seen, as a fully disambiguated LR table results in a linear time parser – except in certain corner cases where the parser does not halt on certain inputs, but this is detectable [28].
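The coarse-grained way in which Yacc and Bison derive a conflict resolution from precedence declarations can be sketched as follows (a simplification of ours; the precedence table is an invented example):

```python
# Precedence and associativity declarations, in the spirit of Yacc's
# %left / %right directives: a higher number binds tighter.
PREC = {"+": (1, "left"), "*": (2, "left"), "^": (3, "right")}

def resolve(lookahead, reduce_terminal):
    """Resolve a shift-reduce conflict: shift the `lookahead` terminal, or
    reduce by a production whose precedence comes from `reduce_terminal`
    (conventionally its last terminal)."""
    shift_prec, _ = PREC[lookahead]
    reduce_prec, reduce_assoc = PREC[reduce_terminal]
    if shift_prec != reduce_prec:
        return "shift" if shift_prec > reduce_prec else "reduce"
    # Equal precedence: associativity breaks the tie.
    return "reduce" if reduce_assoc == "left" else "shift"

print(resolve("*", "+"))  # shift: in "a + a * a", parse the * first
print(resolve("+", "+"))  # reduce: + is left-associative
print(resolve("^", "^"))  # shift: ^ is right-associative
```

The full mechanism is strictly more fine-grained: it may resolve each conflict cell of the LR table individually rather than deriving all resolutions from one precedence table.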

Avoid and Prefer Rules

Next, there is the disambiguation mechanism of avoid and prefer rules [8]. It specifies what to do if a particular substring of an input can be matched to the same nonterminal in multiple ways: avoided productions are removed if an alternative is present, and prefer rules remove all non-preferred alternatives. There can be cases where, for example, multiple alternatives are preferred; in these cases some tie-breaking mechanism is usually applied. Afroozeh gives several examples of this mechanism in his master's thesis [1]. Unfortunately, there is no known method that implements this mechanism efficiently. Usually the parse trees that violate these rules are removed in a post-parse phase. This may be very inefficient if the size of the final derivation tree is much smaller than the result of the initial parse. Consider for example the (not realistic) grammar S → aS, S → SS, S → a, preferring S → aS over S → SS: there are exponentially (in n) many derivation trees with yield a^n (possibly compacted to an O(n^3)-sized SPPF [26]), while the mechanism leaves only one derivation tree.
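The blow-up in this example can be verified with a short computation; the following recurrence (a sketch of ours) counts the derivation trees of a^n:

```python
# Count the derivation trees of a^n in the grammar S -> aS | SS | a,
# illustrating the exponential blow-up discussed above.
from functools import lru_cache

@lru_cache(maxsize=None)
def trees(n):
    if n == 1:
        return 1                                    # only S -> a
    total = trees(n - 1)                            # S -> aS
    total += sum(trees(k) * trees(n - k) for k in range(1, n))  # S -> SS
    return total
```

For instance, trees(2) = 2 and trees(3) = 6, and the counts keep growing by more than a constant factor per step, while the prefer rule leaves a single tree for every n.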

Other Mechanisms

The disambiguation mechanism of higher-order forbidden patterns [33] is similar to the forbidden pattern mechanism, except the patterns are allowed to be ‘of higher order’: in a tree pattern, all specified nodes are direct children of each other, but higher-order patterns allow indirect children to be specified in these patterns. Although this is a powerful mechanism, it is not clear how to implement it efficiently, so we do not investigate this mechanism in detail. Lastly, there are some more mechanisms that address types of ambiguity we are not interested in. These are the offside rule, type-information based mechanisms and heuristic mechanisms. There are also some mechanisms that load their auxiliary data from the input at parse time:

this is needed for example for programming languages that allow users to specify their own operators.

3.3 Comparison of Disambiguation Mechanisms

We will now have a look at the relative applicability of precede-follow restrictions, forbidden patterns and shift-reduce conflict resolution. For each of these mechanisms, we present a grammar on which it does not work, while the other two do work. By composing these grammars – that is, we take their union, renaming all terminals and nonterminals to unique ones per grammar and introducing a new starting symbol S that has a production S → Si for every start symbol Si of grammar Gi – we can get grammars on which any combination of these mechanisms works, while the remaining mechanisms do not.

Grammar 1: S → cAcAc, S → cacAc, A → a, A → b

This first grammar cannot be disambiguated by precede-follow restrictions. The problem is that A is used in multiple locations with the same precede and follow terminals. If we therefore try to apply a precede or follow restriction to either A production, that production would also be forbidden from being a child of the A in the second half of the S productions, thus removing some words from the language. It is easy to check that the other two mechanisms do work.

Grammar 2: S → AC, C → Cb, C → B, A → ε, A → a, B → ε, B → a

This second grammar cannot be disambiguated by forbidden patterns. There is an ambiguity on strings of the form ab*. The problem is that there can be a chain of C → Cb productions of unbounded size before we get to the ‘bottom’ B, so a finite set of forbidden patterns cannot suffice. It is easy to check that the other two mechanisms do work: we can resolve the conflict by always shifting, and we can forbid B → ε from being preceded by an a.

Grammar 3: S → xAa, S → yAaa, S → zAD, A → BC, B → ε, C → a, C → aa, D → ε, D → Da

The LR automaton for this grammar is as follows:

State 0: S → •xAa, S → •yAaa, S → •zAD
State 1 (from 0 on x): S → x•Aa, A → •BC, B → •
State 2 (from 1 on A): S → xA•a
State 3 (from 2 on a): S → xAa•
State 4 (from 0 on y): S → y•Aaa, A → •BC, B → •
State 5 (from 4 on A): S → yA•aa
State 6 (from 5 on a): S → yAa•a
State 7 (from 6 on a): S → yAaa•
State 8 (from 0 on z): S → z•AD, A → •BC, B → •
State 9 (from 8 on A): S → zA•D, D → •, D → •Da
State 10 (from 9 on D): S → zAD•, D → D•a
State 11 (from 10 on a): D → Da•
State 12 (from 1, 4 or 8 on B): A → B•C, C → •a, C → •aa
State 13 (from 12 on C): A → BC•
State 14 (from 12 on a): C → a•, C → a•a
State 15 (from 14 on a): C → aa•

We have a conflict in state 14. However, the LR automaton ‘forgets’ whether we came from state 1, 4 or 8, and so it has ‘forgotten’ whether we just saw an x, a y or a z. This happens because of the nullable B in this grammar. No matter how we resolve the conflict in state 14,

some inputs will be rejected even though they are in the language of the grammar. Precede restrictions do work, as they can forbid the z preceding either production of C. Forbidden patterns can forbid either production of C in the expansion of A in S → zAD. For the first grammar, a simple augmentation of the precede-follow restriction mechanism would allow it to disambiguate the grammar correctly. We could let the sets of terminals that are not allowed to follow or precede a given production be dependent on the position in the grammar of the parent production: in the above grammar, we could forbid A → a only in the case that A → a is in the position of the first A in S → cAcAc. This essentially turns precede-follow restrictions into a generalization of forbidden patterns: if we take the augmentation far enough we obtain forbidden patterns as a submechanism. However, the following grammar shows that this augmentation is not good enough in general to achieve the disambiguating power of shift-reduce conflict resolution:

Grammar 4: S → A, S → C, A → abA, A → xA, A → B, B → y, C → aC, C → bC, C → D, D → y

Inputs of the form (ab)^n y are ambiguous for this grammar. The grammar is easy to disambiguate with shift-reduce conflict resolution. However, precede-follow restrictions, forbidden patterns and any combination of the two cannot disambiguate this properly, as one cannot distinguish (ab)^n y from x(ab)^n y by using the information of a fixed number of preceding terminals or a fixed amount of information from the derivation tree around the leaf node labeled y. Our conclusion is that the three mechanisms are incomparable in their power: for any combination of mechanisms, we can create a grammar that can be disambiguated by these mechanisms but not by the remaining mechanisms.


Chapter 4

Applying Disambiguation Mechanisms to Expression Grammars

We will now turn our attention to the absolute applicability of these mechanisms. To this end, we will look at expression grammars with unary and binary operators and investigate whether the mechanisms can disambiguate them. We will first give an introductory example of an expression grammar and a hard, interesting goal for this grammar. We will show that forbidden patterns cannot implement this goal. This example occurs in several programming languages. This shows that the mechanism of forbidden patterns has a limit on its absolute applicability. We then prove that the forbidden patterns mechanism can disambiguate expression grammars with binary operators and that the precede-follow mechanism and the LR shift-reduce conflict resolution mechanism can disambiguate expression grammars with unary and binary operators.

4.1 Real-World Example of a Hard Disambiguation Problem

Consider the following grammar: E → aE, E → EbE, E → cE, E → ⊥.

We now want a < (left b) < c.

Table 4.1: Operators adhering to our grammar

Programming language                           a    b    c
Python 3.3.2 [21]                              not  |    ~
Python 2.7.5 [20]                              not  |    ~
Microsoft Transact-SQL (SQL Server 2012) [14]  not  |    ~
MySQL 5.6 [15]                                 not  |    ~
Ruby 2.0 [22]                                  not  ||   !
Perl 5 version 16.3 [19]                       not  ||   !

Table 4.2: Results of evaluating our expression

Programming language                           output
Python 3.3.2 [21]                              type error
Python 2.7.5 [20]                              type error
Microsoft Transact-SQL (SQL Server 2012) [14]  type error
MySQL 5.6 [15]                                 syntax error
Ruby 2.0 [22]                                  syntax error
Perl 5 version 16.3 [19]                       false

A third reason is that ab should be disambiguated to a(b), which makes it seem logical to disambiguate our original input to c(a(b)). We have not been able to come up with any reason to prefer the other disambiguation. The above grammar and corresponding priorities occur in several ‘mainstream’ programming languages; we list several in Table 4.1. These languages include a low-priority unary prefix operator for logical negation, not. Most also include a ‘standard’ high-priority negation operator !. The idea is that if one wishes to negate a complicated boolean expression, one need only prefix it with not, without having to bracket the expression depending on the priorities of the operators in the expression. Consider Perl for the moment. The expression ! ! not false || true can be disambiguated in (at least) two ways: (! ! not false) || true and ! ! not (false || true). Note that (! ! not false) || true evaluates to true and ! ! not (false || true) evaluates to false, so we can deduce how the programming languages parse that input. We list our findings in Table 4.2. We note that only Perl allows the construct at all, and it also disambiguates it correctly. Interestingly, both Python versions happily evaluate ~ ~ (not False | True), suggesting that the type error on the unparenthesized expression is specific to that situation. We will give evidence that this low-priority operator is indeed difficult to disambiguate, thus explaining the difficulty that the parsers and interpreters of these programming languages have with this construct. We will prove the following lemma:
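The two readings can be checked by bracketing them explicitly; a small sketch of ours using Python booleans, where Python's not stands in for both ! and not, and or stands in for ||:

```python
# The two disambiguations of "! ! not false || true", written out with explicit
# brackets; Python's `not` plays both `!` and `not`, and `or` plays `||`.
not_binds_loosest = not not (not (False or True))    # ! ! not (false || true)
not_binds_left    = (not not (not False)) or True    # (! ! not false) || true
```

The first reading evaluates to False and the second to True; Perl's output of false in Table 4.2 shows it chooses the first.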

Lemma 2 The forbidden patterns mechanism (using a finite set of patterns Q) cannot implement our goal of a < (left b) < c.

Proof. Consider the input c^n a^n ⊥ b ⊥. For every n there are two derivation trees, corresponding to c^n a^n (⊥ b ⊥) and (c^n a^n ⊥) b ⊥. The derivation trees for the case n = 3 are shown in Figure 4.1. Note that any subtree with at most n internal nodes (and with a root labeled E) of either of the derivation trees (shown in Figure 4.1) is a valid derivation tree for some word in L(G).

[Figure 4.1: the two derivation trees for n = 3: on the left, an E → EbE node at the root whose left operand is the chain of c- and a-nodes ending in ⊥; on the right, the chain of c- and a-nodes at the root with the E → EbE node at its bottom.]

Figure 4.1: The two derivation trees for n = 3 according to a < (left b) < c.

4.2 Expression Grammars

We will now introduce two sets of grammars: the set of expression grammars with binary operators and the set of expression grammars with binary and unary operators. We will then define goals for these grammars for any combination of precedences and associativities. Finally, we will investigate which mechanisms can implement these goals.

Given a finite set A = {a1, ..., ak} of binary operators, a finite set B = {b1, ..., bl} of unary prefix operators and a finite set C = {c1, ..., cm} of unary postfix operators, where these three sets are pairwise disjoint, we define the expression grammar (with binary and unary operators) for these sets as follows. We set T = A ∪ B ∪ C ∪ {⊥}, N = {E}, S = E and P as follows (spread out over three sets of productions):

E → E a1 E
  ...
E → E ak E

E → b1 E
  ...
E → bl E

E → E c1
  ...
E → E cm

E → ⊥

We gather all expression grammars for all possible A, B and C and call the resulting set the set of expression grammars. If we take B = C = ∅ we get the set of expression grammars with binary operators. We note that ⊥ represents some sort of ‘bottom’ terminal such as a string of digits. There are several grammars that are usually understood as ‘expression grammars’ that do not fit our definitions, so our definition is quite narrow. First, we do not include parentheses: as parenthesis productions such as E → (E) are unambiguously recognizable, the ambiguities present in an expression do not change if we replace a parenthesized expression (e) by a terminal ⊥ and consider e separately. Second, we have chosen some constraints that reduce the ambiguities that can arise so we can focus on the problems caused by priorities and associativities. For example, if we remove the requirement that all operators be distinct, then the following ambiguity can occur: if we take + as a prefix, postfix and binary operator, then ⊥ + + + ⊥ can be interpreted as ((⊥+)+) + ⊥, as (⊥+) + (+⊥) or as ⊥ + (+(+⊥)). If operators are allowed to be strings of terminals, even more ambiguities become possible: if we take ++ and +++ as postfix operators, then ⊥ +++++ is ambiguous. We now assume a total order < on A ∪ B ∪ C. This order will represent the priorities of the operators: higher in the order binds tighter. We also assume a (total) function f from A to {left, right} representing an associativity assignment. Note that we do not allow two different operators to have the same priority, which does occur in practice. It is unclear what it would mean for two operators that are in different sets (among A, B and C) to have equal priority.

This leaves us with two operators that are in the same set and that have equal priority, and these can simply be replaced by a single operator to fit our framework. We define the operator of an internal node of a derivation tree of an expression grammar as the terminal among its children – we will say for a node n having operator o that n ‘is a’ o. We lift the total order < to internal nodes by defining n1 < n2 whenever the operator of n1 is smaller than the operator of n2 according to <, and n1 ∼ n2 whenever their operators are equal.

4.3 Defining Precedences and Associativities

Surprisingly, there is no canonical formal definition of priority and associativity. Whenever priorities and associativities are defined in the literature, an appeal is made to common mathematical knowledge and a few examples are given, rather than a precise definition. Alternatively, an equivalent unambiguous grammar is given. The only thing close to a canonical definition is the Shunting Yard algorithm by Dijkstra [10]. This is a well-known algorithm for parsing expression grammars with binary operators. The correct parse is then defined operationally as the result of this algorithm. We are not aware of a canonical extension to unary operators other than some ad-hoc methods to deal with the unary minus. It is easy to see that on expression grammars with binary operators the algorithm performs the same steps as an LR parser modified by shift-reduce conflict removal: the operator stack corresponds to the LR state stack, a reduce action in the LR algorithm corresponds to popping the operator stack and a shift action corresponds to pushing an operator onto the operator stack. Our analysis of LR shift-reduce conflict resolution therefore also applies to the Shunting Yard algorithm. We will therefore give our own definition of priority and associativity. Our definition does not give an algorithm and proclaim its result as the correct parse, but rather restricts which derivation trees are allowed, making it more combinatorial and less algorithmic. There are three reasons why our definition makes sense. First, it results in a unique correct derivation tree for any grammar. Second, derivation trees that are forbidden intuitively violate priority or associativity. Third, it corresponds directly to what our disambiguation mechanisms are capable of implementing, including the Shunting Yard algorithm. We will first give the definition, present some results and repeat why our definitions make sense given our results. We then prove the results.
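Since the Shunting Yard algorithm serves as our reference point for binary operators, here is a sketch of it specialized to our expression grammars; the operator tables and the tuple encoding of derivation trees are illustrative choices of ours, not taken from [10]:

```python
# A sketch of Dijkstra's Shunting Yard algorithm for the expression grammars
# with binary operators of this chapter. PRIO and ASSOC are illustrative
# auxiliary data; "_" plays the role of the bottom terminal.

PRIO = {"+": 1, "*": 2}
ASSOC = {"+": "left", "*": "left"}

def shunting_yard(tokens):
    """Return the canonical derivation tree as nested (op, left, right) tuples."""
    operands, operators = [], []

    def reduce_top():
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for tok in tokens:
        if tok not in PRIO:        # the bottom terminal
            operands.append(tok)
        else:
            # pop operators that bind tighter, and equal ones when
            # the operator is left-associative
            while operators and (PRIO[operators[-1]] > PRIO[tok]
                    or (PRIO[operators[-1]] == PRIO[tok]
                        and ASSOC[tok] == "left")):
                reduce_top()
            operators.append(tok)
    while operators:
        reduce_top()
    return operands[0]
```

For example, shunting_yard(["_", "+", "_", "*", "_"]) yields ("+", "_", ("*", "_", "_")): the tighter-binding * ends up below +, as the definitions below demand.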
We will first look at binary operators only: in this case, a simple and intuitive definition will suffice. Let r ∈ PT be a derivation tree. If there is an s1 ∈ ST(r) (recall that ST returns the set of all nodes of r) and an s2 ∈ ST(s1) with s2 < s1, or if there is an s1 ∈ ST(r) with f(s1) = left and an s2 ∈ ST(right(s1)) with s2 ∼ s1, or with f(s1) = right and an s2 ∈ ST(left(s1)) with s2 ∼ s1, then we call (s1, s2) a violation for r. If r has no violations, we will call it canonical.

Lemma 3 For any expression grammar G with binary operators, for any w ∈ L(G),thereisa unique canonical r ∈ PT with y(r)=w.

[Figure 4.2: three trees with indirect violations: a *-node with a +-node in the chain below it (priority violation), a ˆ-node with another ˆ-node in the chain below its left operand, and a +-node with another +-node in the chain below its right operand (associativity violations).]

Figure 4.2: An example of every type of indirect violation using standard mathematical operators. The dots indicate that many nodes are allowed to be present in a chain here.

Using this lemma, we define a goal Db for this grammar that maps every w ∈ L to this unique canonical r ∈ PT. We will prove the following theorem.

Theorem 4 For any expression grammar with binary operators, the forbidden patterns mech- anism, the precede-follow restrictions mechanism and the shift-reduce conflict resolution mech- anism can all implement Db using some set of auxiliary data Q.

The auxiliary data used for these three mechanisms are exactly those that are used in practice. As our definition itself is intuitive – tightly-binding operators are never ancestors of loosely-binding operators – and current practices agree on what goal we should implement, we are content with our goal Db. Unfortunately, when unary operators are introduced, it is not immediately obvious how we should disambiguate certain inputs. We have already seen a problematic example in Section 4.1. We need to decide how to disambiguate cab given a < (left b) < c.

Given an r ∈ PT, we define an indirect violation as follows. If there is an s1 ∈ ST(r) and an s2 ∈ ST(s1) with s2 < s1, then we call (s1, s2) an indirect violation for r, unless one of the following exceptions applies:

1. s1 is a binary operator and s2 is a prefix operator and s2 ∈ ST(right(s1)), or s1 is a binary operator and s2 is a postfix operator and s2 ∈ ST(left(s1)).

2. s1 and s2 are both unary operators with the same type as each other.

3. s1 is a unary operator and there is a chain of direct descendants of s1, one of which has a strictly lower priority than s1. All unary operators in this chain must have the same type as s1, and for all binary nodes in this chain the next node in the chain must be its left child if s1 is a postfix operator and its right child if s1 is a prefix operator.

If there is a binary operator s1 ∈ ST(r) with f(s1) = left and there is an s2 ∈ ST(right(s1)) with s2 ∼ s1, or if f(s1) = right and there is an s2 ∈ ST(left(s1)) with s2 ∼ s1, and exception 3 above does not apply, then we also call (s1, s2) an indirect violation for r. If r has no indirect violations, we will call it canonical.

[Figure 4.3: three trees, one per exception: a +-node with a not-node inside its right operand (exception 1), a not-node with another not-node below it (exception 2), and a !-node whose descendant chain contains a lower-priority +-node (exception 3).]

Figure 4.3: An example for every exception to an indirect violation, corresponding to exceptions 1, 2 and 3 counting from left to right. We assume not has a very low precedence. The dots indicate that many nodes are allowed to be present in a chain here.

This definition results in the following:

Lemma 5 For any expression grammar G with binary and unary operators, for any w ∈ L(G), there is a unique canonical r ∈ PT with y(r)=w.

Using this lemma, we again define a goal Dub for this grammar that maps w ∈ L to this unique r. Note that this includes the goal for grammars with binary operators as a special case.

Theorem 6 For any expression grammar with binary and unary operators, the precede-follow restrictions mechanism and the shift-reduce conflict resolution mechanism can both implement Dub using some set of auxiliary data Q. The forbidden pattern mechanism cannot implement Dub using a finite set of forbidden patterns Q.

The auxiliary data we will use to prove this theorem are again precisely those used in practice. We again justify our choice of definition by noting that we have a unique solution and that current practices implement exactly this definition. As an aside, note that Lemma 2 proves the last claim about forbidden patterns.

4.4 Disambiguating Expression Grammars with Binary Operators

In this section we will first prove Lemma 3. We then show that forbidden patterns can implement our goals for expression grammars with binary operators through the concept of a direct violation.

Proof of Lemma 3

We will now prove Lemma 3, asserting the existence of a unique canonical derivation tree for every w ∈ L(G) for expression grammars with binary operators.

Proof. Let w be a word in the language of G. We induce on the number of operators in w. If this number is 0, then w = ⊥, and ⊥ is its only derivation tree, which is unique and canonical.

Assume therefore that we have at least one operator. Let ai be the smallest operator present in w with respect to <. Assume f(ai) = left – the other case is analogous. As w ∈ L(G), there must be an r1 ∈ PT with y(r1) = w. We perform rotations in r1 so that the internal node which is the parent of the rightmost leaf node labeled ai in r1 becomes the root of the tree. Name this new derivation tree r2. As the yield of the tree is preserved by these rotations, we have y(r1) = y(r2) = w.

Note that y(left(r2)) ∈ L(G) and y(right(r2)) ∈ L(G): the roots of left(r2) and right(r2) are both E and these are therefore derivation trees. By the induction hypothesis we can replace left(r2) and right(r2) by the unique canonical derivation trees corresponding to y(left(r2)) and y(right(r2)): name this new derivation tree r3. We now claim that r3 is the unique canonical r ∈ PT with y(r) = w.

As ai was minimal according to <, as we picked the rightmost ai and as our subtrees are canonical by the induction hypothesis, it is easily seen that r3 is canonical. For uniqueness, we first note that by the induction hypothesis, r3 is the only canonical tree with this rightmost ai as root: if there were a different one, then either of its subtrees would have been different but still canonical, contradicting their uniqueness. Any other canonical tree must therefore have a different root. However, if we pick a root with an operator with higher priority than ai, we would get a violation with ai. Finally, if we pick an occurrence of ai that is not the rightmost one, then this would contradict the left associativity of ai, causing a violation as well. This proves the lemma. □

We will now prove that forbidden patterns can disambiguate expression grammars with binary operators. This proves a part of Theorem 4; we conclude the remainder of this theorem as a special case of Theorem 6 after proving that theorem. To prove this initial part, we introduce the concept of a direct violation and show how it relates to indirect violations. Let r ∈ PT. If there is an s ∈ ST(r) with left(s) < s or right(s) < s, or with left(s) ∼ s and f(s) = right, or with right(s) ∼ s and f(s) = left, then we call this a direct violation.

Lemma 7 Let r ∈ PT. r has an indirect violation if and only if it has a direct violation.

Proof. A direct violation is obviously an indirect violation, proving the ‘if’ part, so we now look at the ‘only if’ part. Let (t1, t2) be an indirect violation. First consider the case t1 > t2. Consider the chain of direct descendants x1, x2, ..., xk in the tree with x1 = t1 and xk = t2 and xi+1 being a child of xi for all i. If we have xi+1 < xi for some i, then (xi, xi+1) is a direct violation and we are done. Otherwise xi+1 ≥ xi for all i, but then t2 = xk ≥ x1 = t1, contradicting t2 < t1.

Assume therefore that t1 ∼ t2, and note that by a similar argument as in the previous case we may assume xi+1 ≥ xi for all i, and hence t1 = x1 ≤ xk = t2, which implies that all xi on the path from t1 to t2 must be equivalent according to ∼. It is then easy to see that the left or right child (depending on the type of the assumed indirect violation) of t1 immediately gives a direct violation. This proves the lemma. □

We can now enumerate all possible tree patterns with a binary operator as root and a binary operator as either the left or the right child (and no other children) and add all such patterns that fit the criterion of a direct violation to Q. Q now forbids exactly the derivation trees containing a direct violation, or equivalently by Lemma 7 the derivation trees containing an indirect violation

and therefore by definition exactly the non-canonical derivation trees. We conclude that the forbidden pattern mechanism can implement our goal Db for expression grammars with binary operators. This proves the first part of Theorem 4.
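The enumeration of patterns described above can be sketched as follows; the operator tables and the (parent, side, child) encoding of a depth-one tree pattern are illustrative assumptions of ours:

```python
# A sketch of enumerating the forbidden patterns Q for a binary-operator
# expression grammar: a pattern (parent, side, child) forbids the production
# of operator `child` as the given child of the production of `parent`.

PRIO = {"+": 1, "*": 2}
ASSOC = {"+": "left", "*": "left"}

def forbidden_patterns(ops):
    Q = set()
    for parent in ops:
        for child in ops:
            for side in ("left", "right"):
                if PRIO[child] < PRIO[parent]:
                    Q.add((parent, side, child))      # priority: direct violation
                elif PRIO[child] == PRIO[parent]:
                    # associativity: an equivalent operator on the wrong side
                    if ASSOC[parent] == "left" and side == "right":
                        Q.add((parent, side, child))
                    if ASSOC[parent] == "right" and side == "left":
                        Q.add((parent, side, child))
    return Q
```

With the tables above, the set contains, for instance, the pattern forbidding + as either child of *, and the pattern forbidding + as the right child of the left-associative +.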

4.5 Adding Unary Operators

This entire section deals with the proof of Theorem 6. The high-level argument is as follows. We first show the existence part of Lemma 5, thus showing that the set C(w) = {r ∈ PT | r is canonical, y(r) = w} is nonempty for every w ∈ L(G). We then give precede and follow restrictions Q1 for G and show that a violation of these restrictions implies an indirect violation, thus showing C(w) ⊆ PrecedeFollow(G, Q1)(PT, w). We then give the shift-reduce conflict resolutions Q2 for G and show that a removal of a shift or reduce option removes only derivation trees that violate our precede and follow restrictions, thus showing PrecedeFollow(G, Q1)(PT, w) ⊆ ShiftReduce(G, Q2)(PT, w). As Q2 will make the LR table fully deterministic, |ShiftReduce(G, Q2)(PT, w)| ≤ 1. As C(w) ≠ ∅, we conclude C(w) = PrecedeFollow(G, Q1)(PT, w) = ShiftReduce(G, Q2)(PT, w) = {D(w)}, thus proving Theorem 6.

Proof of Lemma 5

We will now prove the first part of Lemma 5, asserting the existence of a canonical derivation tree for every w ∈ L(G) for expression grammars with binary and unary operators.

Proof. Let w be a word in the language of G. We induce on the number of operators in w. If this number is 0, then w = ⊥, and ⊥ is its only derivation tree and it is canonical.

Assume therefore that we have at least one operator. Let xi be the smallest operator present in w with respect to <. If xi ∈ A, we apply the argument used in the proof of Lemma 3, with the only exception that rather than doing only rotations to get xi to the root, we sometimes have to ‘swap’ xi with a unary parent, which can again be done without affecting the yield of the derivation tree. Recall that the argument was that, once we have rotated the left- or rightmost xi to the root, we can replace its left and right child by canonical subtrees by our induction hypothesis, making the entire tree canonical.

Assume therefore xi ∈ B (the case that xi ∈ C is analogous). We fix xi to be the leftmost occurrence of this operator in w. As w ∈ L(G), there is some derivation tree r1 whose yield is w. We identify xi with the internal node in the derivation tree under consideration that corresponds to xi. We will now again make local changes to the tree and ‘swap’ xi towards the root.

We claim we can perform local, yield-respecting swaps to get a derivation tree r2 where the path from the root y1 to yk = xi, denoted as y1, ..., yk, consists entirely of binary and prefix operators, and if yi is a binary operator then yi+1 is the right child of yi. See Figure 4.4 for an example. If xi is the left child of a binary operator or the child of a postfix operator, we can exchange the two to move xi towards the root. If we consider the chain from the root to our xi and consider the internal node closest to xi that is either a postfix operator or a binary operator whose left child is the next node in the chain, then we can exchange this node downwards with its child, as its child is either a binary or a prefix operator. We continue doing this until we arrive at r2. Let w″ = y(xi); then w = w′w″ for some w′ by the above construction. If we replace the subtree rooted at xi in r2 by ⊥, we see that w′⊥ ∈ L(G). By the induction hypothesis, there are canonical derivation

[Figure 4.4: three derivation trees r1 (left), r2 (middle) and r3 (right) for the same word, built from the operators *, ˆ, !, not and +, showing not being swapped towards the root and the subtrees above and below it then being made canonical.]

Figure 4.4: An example of the modification described in the proof. Note that xi = not. We obtain r2 (middle) from r1 (left) by moving not upwards until it cannot be moved further up and ∗ downwards until it is below not. We then obtain r3 (right) from r2 (middle) by using the induction hypothesis to fix the violations above and below not.

trees t1 and t2 for w′⊥ and for the subtree that is the child of xi respectively. We take t1, replace the rightmost ⊥ leaf by xi, and then replace the child of xi by t2, resulting in a derivation tree r3 with yield w. See Figure 4.4 for an example.

We now claim r3 is canonical. We first note that since t1 and t2 are canonical, for any indirect violation (a, b) we have that a = xi or a is an ancestor of xi, and b = xi or b is a descendant of xi. By exceptions 1 and 2 of the definition of indirect violations, violations involving b = xi are not possible. Violations involving a = xi are not possible either, as we picked xi to be the minimal leftmost operator. All remaining possibilities for violations are covered by exception 3, as xi has lower priority than any of its ancestors and its ancestors have the chain required by exception 3 to xi. This proves the lemma. □

We will now define Q1 = (P, F) as follows. Essentially, productions E → Eα may not be preceded by higher-priority operators, and productions E → βE may not be followed by higher-priority operators, in addition to restrictions for associativities.

P = {(E → E ai E, t) | ai ∈ A, t ∈ A ∪ B : ai < t ∨ (ai = t ∧ f(ai) = left)} ∪ {(E → E ci, t) | ci ∈ C, t ∈ A ∪ B : ci < t}
F = {(E → E ai E, t) | ai ∈ A, t ∈ A ∪ C : ai < t ∨ (ai = t ∧ f(ai) = right)} ∪ {(E → bi E, t) | bi ∈ B, t ∈ A ∪ C : bi < t}

Lemma 8 For any expression grammar G with binary and unary operators, if r ∈ PT violates Q1, then it has an indirect violation.

Proof. We prove the lemma for violations of precede restrictions: the case of follow restrictions is analogous. We note that to obtain the value of a Precede operation for some node x in r with prod(x) of the form E → Eα, we first ignore a chain of ancestors of x that also have a

production of the form E → Eα (and for which x is in the subtree of this leftmost E) until we get to a node y with a production of the form E → βE with x in the subtree of this rightmost E (unless no such y exists, in which case the precede value is $, which cannot cause violations of Q1). Figure 4.5 shows an example of this.

[Figure 4.5: a chain of nodes with productions of the form E → Eβi nested below a node with production E → αE.]

Figure 4.5: An example showing that the precede value for E → Eβ4 is defined by α.

Assume we have a node x whose precede value violates Q1. Let y be the node that is responsible for Precede(x) as above. Let d be the first character of α and e the last character of β. We have that Precede(x) = e. Assume d < e: then x < y and y is an ancestor of x, and it is easy to check that none of the exceptions apply, so (y, x) is an indirect violation. If instead d = e with f(d) = left, then x ∼ y and x is in the subtree of the right child of y, so (y, x) is again an indirect violation. This proves the lemma. □

We will now consider shift-reduce conflicts in the LR table for G. We first note that reduce-reduce conflicts are impossible for G, as can be seen as follows. If A → α and B → β have a reduce-reduce conflict, we must have α = γβ or β = γα, which can be seen by following incoming arrows until the labels followed spell out α or β (whichever is shorter) in reverse, which means A → •α or B → •β is present in the item set, as well as B → γ • α or A → γ • β respectively. But α = γβ or β = γα is impossible, as every production contains a unique operator terminal, so no reduce-reduce conflicts are possible. Secondly, we note that the only shift-reduce conflicts that can occur are between E → αE• and E → E • β productions: postfix operators are unique and hence cannot be in conflict with other operators when they are about to be reduced, so we are left with E → αE• productions that can take the reduce role in the conflicts. As α contains a unique terminal, the only potential conflicts can arise from items predicted at the LR item set containing E → α • E. The only items that can then cause a conflict therefore look like E → E • β.

We now fix Q2 by deciding to shift or reduce for all the above potential conflicts. Consider a conflict between E → αE• and E → E • β. Let d be the last character of α and e the first character of β. We decide to shift if d < e and to reduce if d > e; if d = e then d ∈ A: if f(d) = left we decide to reduce, and we shift otherwise, so if f(d) = right.
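This decision rule can be sketched directly; prio encodes < and assoc encodes f, and the names are ours:

```python
# A sketch of the Q2 decision rule: given the last terminal d of α in the item
# E -> αE• and the first terminal e of β in E -> E•β, pick the surviving action.
def resolve(d, e, prio, assoc):
    if prio[d] < prio[e]:
        return "shift"
    if prio[d] > prio[e]:
        return "reduce"
    return "reduce" if assoc[d] == "left" else "shift"
```

For example, with prio giving * a higher priority than +, the conflict after parsing an operand of + with lookahead * is resolved as a shift, so * ends up lower in the derivation tree.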

[States X and Y: state X contains E → α • E and E → •Eβ and has an E-transition to state Y, which contains E → αE• and E → E • β; the incoming transitions of state X are labeled d.]

Figure 4.6: A part of an LR automaton of an expression grammar.

It is well known that the LR algorithm, if we assume our parser resolves conflicts by nondeterministically picking an option every time, can parse all context-free grammars (if we only consider successful parses). There is therefore a nonempty set of all possible successful LR parses for a given input, and we can match the derivation tree generated by a successful LR parse to the successful LR parse itself, thereby giving a one-to-one correspondence between successful LR parses and derivation trees. If the filter generated by Q2 removes a derivation tree r, then the corresponding successful LR parse therefore executed a shift or reduce action at some point that was removed by Q2. We will say that r violates Q2.

Lemma 9 For any expression grammar G with binary and unary operators, if r ∈ PT violates Q2, then it also violates Q1.

Proof. Assume the LR parse corresponding to r executed a reduce action that is forbidden according to Q2. A reduce action in LR generates an internal node x in the derivation tree corresponding to the production E → αE that was reduced on. The lookahead when making this decision was e, and so x is followed by e. It is easy to check that (E → αE, e) ∈ F, and so r violates Q1.

Now assume the LR parse of r executed a shift action on e that is forbidden according to Q2. As there is only one production containing e, which also only contains e once, say E → Eβ, this shift on e must have resulted in a reduce on E → Eβ later on in the LR parse. This generates an internal node x corresponding to the production E → Eβ. We now note that in the LR state s that contains the shift-reduce conflict, both E → αE• and E → E • β are present. This means that for all LR states that have an E-transition to s, both E → α • E and E → •Eβ are present. As E → α • E is present and by standard facts about LR automata, we conclude that all incoming transitions to all these states must be labeled d (where d is the last terminal in α). This is shown in Figure 4.6. This means that when the node x was predicted by the LR automaton, the previous terminal must have been d. It is easy to check that (E → Eβ, d) ∈ P, and so r violates Q1. This proves the lemma. □

We now note that Q2 resolves all possible shift-reduce conflicts. The resulting table is deterministic, so running LR with this table on any input can result in at most one derivation tree. By the reasoning at the start of this section, we conclude Theorem 6.

Chapter 5

Disambiguation Strategies

5.1 Introduction

We define a disambiguation strategy as a method to analyze a given context-free grammar for ambiguities, either resulting in auxiliary data for one or more disambiguation mechanisms, or resulting in some form of error. It need not be fully automatic: it may involve user input and it may generate auxiliary data that cause the mechanisms to generate filters that are not correct or complete. We shall define six quality measures for disambiguation strategies. If a strategy generates faulty filters, the strategy will simply rank poorly on some quality measure. Note that if a strategy uses multiple filters, it will have to specify how these interact, as this need not be straightforward. We propose the following quality measures on disambiguation strategies. For every measure we introduce a term for that measure in brackets. We will refer to these terms in the rest of this thesis.

1. Does the strategy give guarantees that all ambiguities have been pinpointed and dealt with? (completeness)

2. Does the parser accept all words in the language of the original grammar? (correctness)

3. Can the user influence the derivation tree that is returned? (tunability)

4. How easy is it for the user to understand how the derivation trees are chosen? (transparency)

5. On how many grammars does the strategy succeed in producing a correct and complete filter? (applicability)

6. How efficient is the resulting parser and how expensive is it to perform the strategy? (efficiency)

Note that most of these measures are subjective. In particular, the completeness measure is a bit fuzzy. We would like to call a strategy complete if it has ‘consciously’ disambiguated all ambiguities and knows for sure that it did not miss any ambiguity: after all, we are aiming to give every input a single well-defined meaning. For example, the strategy “Try to come up with forbidden patterns and modify the LR automaton so it respects them until the LR table is deterministic” [30] is a complete strategy: the lack of conflicts in the table guarantees that all ambiguities are gone, and the set of forbidden patterns is the ‘solution’ to the ambiguities. Note that it would not be a problem if the strategy finds extra forbidden patterns that do

not remove ambiguity: this would be a correctness issue, not a completeness issue. As it is undecidable to test a context-free grammar for ambiguity, a strategy cannot be both complete and universally applicable. We want to avoid that any strategy can be made complete using the trick “if multiple parse trees remain at the end of disambiguation, just pick one and discard the rest” and can be made correct using the trick “don’t remove the last parse tree even if the filter forbids it”, as we feel that if these tricks are needed the strategy should just be considered faulty. We would like to be able to guarantee that the filters have been applied: for example, if we use forbidden patterns, we would like the language engineer to be able to rely on the fact that these patterns really are removed and will never be part of the returned derivation tree. It seems hard to formalize this, but the intuition should be clear.

As an example of our quality measures, we consider the strategy in which we use a general parsing algorithm such as GLL to compute all possible derivation trees, after which we simply return an arbitrary computed derivation tree. It is easy to see this strategy is correct: by the correctness of the general parsing algorithm we accept if and only if the input was in the language of the grammar. The strategy is incomplete, for the strategy does not even attempt to look for ambiguities in the grammar and so tells the user nothing about whether his grammar is ambiguous. It is not tunable as described, for the user has no way to influence what derivation tree is returned, though it may be possible to allow the user some measure of control over the returned parse tree using some heuristic. It may be transparent, depending entirely on whether the procedure that decides what derivation tree to return is easy to understand. It is universally applicable.
Depending on the parse tree that is returned, it is not very efficient: on the simple grammar S → SS|a most general parsing algorithms will spend Θ(n³) time creating a representation of all (exponentially many) derivation trees, while a strategy that disambiguates to the grammar S → aA, A → aA|ε, which is parsable in linear time by an LL parser. Note that there may be some derivation trees that require Ω(n³) time to be found, in which case the strategy is efficient – the LL grammar forces a particular derivation tree on the user.

Our measures are intended to evaluate strategies that try to tackle ambiguities such as the expression grammar ambiguities in a ‘static’ manner only, as opposed to strategies that read disambiguation information from an input, such as user-defined operators. Furthermore, some strategies aim to disambiguate grammars for scannerless parsers or to resolve ambiguities based on type information. We do not take such aims into account in our quality measures, so strategies that have these aims may rank poorly on some measures.
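The exponential number of derivation trees can be made concrete with a short sketch (ours, not from the thesis): counting the trees the ambiguous grammar S → SS|a admits for the input aⁿ. The counts are the Catalan numbers, so they grow exponentially in n, which is why general parsers must share subtrees in a packed forest.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def tree_count(n: int) -> int:
    """Number of derivation trees of a^n under the ambiguous grammar S -> S S | a.

    A tree for a^n is either the single production S -> a (n == 1), or a root
    S -> S S whose children derive a^k and a^(n-k) for some split 1 <= k < n.
    """
    if n == 1:
        return 1  # only S -> a applies
    return sum(tree_count(k) * tree_count(n - k) for k in range(1, n))

if __name__ == "__main__":
    for n in range(1, 8):
        print(n, tree_count(n))  # 1, 1, 2, 5, 14, 42, 132: the Catalan numbers
```

The memoized recursion mirrors the CYK-style dynamic program a general parser would fill in; without sharing, materializing every tree separately is hopeless for large n.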

5.2 Disambiguation Strategies Introduced in the Literature

We will now discuss the various disambiguation strategies that have been introduced in the literature by evaluating them according to our quality measures. Our evaluation is based on our own assessment of the methods and on the assessment of the authors of the papers themselves. We use the results of our investigation of disambiguation mechanisms multiple times. Our results are summarized in Table 5.1.

The papers by Aho and Ullman [2] and Johnson [3] introduce two disambiguation strategies. The first strategy is the widely used strategy of resolving shift-reduce conflicts in an LR table, as employed by tools such as Yacc and Bison. The second strategy is a relatively unknown strategy of pruning conflict-containing LL tables in much the same manner. They give a proof of correctness of the second strategy, but only examples for the first strategy. Indeed, Soisalon-Soininen and Tarhio [28] give an example where a resolution of a conflict (using the first strategy) leads to a parser that does not terminate on all inputs.

Both their strategies are complete and efficient, as they essentially result in deterministic LR and LL parsers. From Chapter 4 we know that the LR strategy works on expression grammars with unary and binary operators. The LR strategy has proven to work in more cases in practice, but the LL strategy only seems to work on the dangling-else ambiguity. They have proven their LL strategy correct, but their LR strategy is incorrect. Their LR strategy is tunable as the user can decide to either shift or reduce, but it is not transparent because the effects of these decisions are hard to understand. Their LL strategy is not tunable, but this makes it easy to understand and hence transparent.

Earley [12] introduced a similar strategy at around the same time. While he specifies clearly how the user might denote precedences, he does not consider the completeness of these precedences. If applied to the LR parsing technique, the resulting strategy is effectively identical to that of Aho, Johnson and Ullman [3] and hence scores the same on the quality measures.

Wharton [34] describes a strategy that orders the parses produced by bottom-up and top-down parsers and returns the least derivation tree according to this ordering, called the canonical derivation. This approach is generally applicable (assuming all potential ambiguities are dealt with by the ordering, which is not proven) and correct. As described it is not tunable or transparent, but we can make it somewhat tunable by varying the ordering used. The efficiency of the strategy is poor, as backtracking is used, with potentially exponential running times.

LaLonde and Des Rivieres [13] describe a strategy that works on many expression grammars with unary and binary operators. They first parse expressions in a left associative form, ignoring the associativity and precedence of the operators involved.
They then use standard binary tree rotations to transform the tree into an expression tree that does take into account associativity and precedence. As this phase is a post-processing step, the information about the allowed operators and their precedences can be loaded entirely from the input, which is an advantage. A disadvantage is that some common combinations of operators cannot be disambiguated in the natural way: for example, a unary operator binding less tightly than a binary operator, such as the unary minus and member selection (-x.myvalue), cannot be expressed in their formalism. More generally, their strategy is effectively based on forbidden patterns and hence cannot disambiguate all expression grammars naturally. Their strategy is therefore correct, complete, tunable, transparent and efficient, but unfortunately it is only applicable on a limited subset of operator expressions.

Salomon and Cormack [23] propose a disambiguation strategy aimed at resolving the ambiguities that arise if one does not use the traditional split between scanner and parser and instead lets the parser do the scanning. This strategy is therefore not aimed at expression languages and indeed cannot disambiguate them at all. We still treat it here as an example. They generate an efficient NSLR(1) parser (which is a more powerful version of an SLR(1) parser – see the paper for details) from the user-supplied disambiguation rules, making the strategy tunable. Any untreated ambiguities will cause parser generation to fail and hence the strategy is complete. The mechanisms involved are somewhat transparent, although the full effects of a disambiguation rule may not be readily understood. Unfortunately, they give no guarantees on the correctness of their method, and their method is applicable only for ambiguities caused by scannerless parsing.
Thorup [30] describes a strategy involving an algorithm that computes whether there exists a finite set of forbidden patterns such that the canonical LR(1) parser (among other parsers) is fully disambiguated if it is pruned to respect this set of forbidden patterns. It guarantees that the resulting parser still accepts all inputs. The algorithm can even try to find such a set of patterns given an initial set of forbidden patterns, which can be used to steer the disambiguation. This strategy is both correct and complete. It is also tunable by tweaking

the initial set. Unfortunately, it is not transparent, as it is not really clear how inputs will be disambiguated. Although the resulting parser will be efficient, the computation of these patterns is a potentially expensive operation, as already mentioned by the author. Finally, the algorithm is not always applicable, as it uses finite sets of forbidden patterns as its disambiguation mechanism and is based on LR machinery.

Thorup [31] also describes a strategy that involves rewriting a grammar given a finite set of forbidden patterns so that the resulting grammar indeed no longer produces these patterns. We could use the techniques from [30] to generate these patterns, but these techniques already give a fully functional parser, so we look at the strategy where the user supplies the forbidden patterns, avoiding the potentially expensive step of running the algorithm from [30]. Using a general parser on the resulting grammar would render the strategy incomplete (as it may miss ambiguities) and somewhat inefficient, but more widely applicable. Using a deterministic parser would give completeness and parsing efficiency, but some applicability is lost. The strategy is not correct or transparent and suffers from the same applicability problems as [30], though it is tunable. Finally, although the grammar rewriting tends to give nice grammars on small examples, it can result in exponentially sized grammars in the worst case, which is an efficiency concern.

Visser [33] describes a very general framework of disambiguation filters in his PhD thesis; our definition of a filter (mostly) matches his. Specific filters he describes (amongst others) are forbidden patterns and higher-order forbidden patterns. These higher-order patterns can disambiguate all expression grammars with unary and binary operators as well as the dangling-else ambiguity. While highly tunable and widely applicable, it is not clear how these filters should be implemented efficiently.
Higher-order forbidden patterns are not very transparent. Furthermore, the disambiguation strategy offers no help with finding and resolving the ambiguities or with proving them correct, and is hence incorrect and incomplete.

Van den Brand, Scheerder, Vinju and Visser [8] propose a varied set of disambiguation mechanisms which is used in SDF. Rather than using shift-reduce conflict resolution, they influence the generation of the LR automaton in an attempt to adhere to precedences and associativities. Unfortunately, their method comes down to a combination of follow restrictions and forbidden patterns; the forbidden patterns attempt to emulate precede restrictions, but as we have proven this will not work in all cases. The strategy also uses reject, prefer and avoid mechanisms. As the prefer and avoid mechanism is not incorporated into the parse table (the general parsing algorithm GLR is used and the resulting parse forests are pruned in a post-processing step to implement this mechanism), table generation does not give information about whether all ambiguities are resolved, so the strategy is incomplete. The various mechanisms make the strategy tunable and quite widely applicable. Efficiency suffers somewhat, as some mechanisms have implementations that may require more than linear time. Transparency is partially achieved, as the mechanisms are understandable (the forbidden patterns and precede-follow restrictions are generated from precedence and associativity information), though the full consequences of the mechanisms are not always directly clear. No mechanism is given to guarantee correctness.

Bas Basten [6] offers a program, Dr. Ambiguity, that automatically searches for ambiguities in a grammar and proposes solutions in terms of auxiliary data for the mechanisms implemented in SDF. This helps the average user a lot and gives a very transparent strategy, particularly because example ambiguous statements are automatically generated.
Tunability is retained from SDF, as the user still decides how to resolve the problems. Applicability is quite good, as is apparent in their experiments. Efficiency is almost linear except for the avoid and prefer mechanism, similar to SDF. The main missing point in this strategy is correctness, as no guarantees are given in this regard. Also, the tool expects the user to somehow come up with an ambiguous sentence before offering help, so completeness is as good as the ambiguity detection used.

Table 5.1: Ranking of disambiguation strategies on our quality criteria. Question marks signify that there is some doubt about how that strategy ranks on that criterion. An explanation for every question mark can be found in the list of strategies. Correctness has been abbreviated to corr.

             completeness  corr.  tunability  transparency  applicability  efficiency
[2, 3], LR   +             –      +           –             +?             ++
Example      –             +      –           +?            ++             –
[12]         +             –      +           –             +?             ++
[3], LL      +             +      –           +             –?             ++
[34]         –             +      ?           –             +              ––
[13]         +             +      +           +             ––             ++
[23]         +             –      +           ±             ?              ++
[30]         +             +      +           –             ±              +
[31]         –/+           –      +           –             ±              +
[33]         –             –      +           ±             +              –?
[8]          –             –      +           ±             +              +
[6]          ?             –      +           +             +              +

Chapter 6

A rewriting-based disambiguation strategy

6.1 Introduction

We will now discuss our proposed disambiguation strategy. It is summarized schematically in Figure 6.1. It has several useful properties: using the terminology of Chapter 5, it is correct, complete, tunable, transparent and efficient. This means it guarantees that the full language of the grammar is preserved, that all ambiguities present in the grammar are found, presented to the user and disambiguated, that the user can make meaningful choices in the method of disambiguation, that the user can understand the presented ambiguities and the consequences of the chosen disambiguation method, and finally that the resulting disambiguation can be implemented efficiently in parsers. The only criterion of Chapter 5 on which our strategy may not excel is applicability: the number of grammars for which the strategy successfully finds a disambiguation. An extended version of the basic strategy we present should at least be applicable to expression grammars. Although the strategy may not work on any and all grammars encountered in practice, there is no inherent restriction of its applicability to some specific set of grammars. The basic strategy we present here can be extended to be applicable to more grammars in three ways, which we will describe later.

The basic strategy is based on two observations – though extended versions of the strategy may generalize beyond these observations. The main observation is that languages generated by expression grammars are in fact regular languages. For regular expressions, queries such as language equality and ambiguity are decidable, in contrast with general context-free languages. This gives us powerful tools to help guarantee the correctness and completeness of our strategy. The second observation is related to the disambiguation mechanism we use in our strategy: the precede-follow restriction mechanism. We have examined this mechanism in Chapter 4 and have shown it is effective in disambiguating expression grammars.
It can also disambiguate other constructs such as the dangling else. The observation is that the ambiguities our strategy finds can be naturally disambiguated by this mechanism.

Our strategy generalizes the well-known algorithm (see e.g. [27]) that turns any regular grammar into a regular expression. This algorithm is essentially a set of rewriting rules: it starts from a system of productions and then uses rewriting rules to combine these productions, to rewrite them into a certain normal form and to remove self-recursion using Arden’s lemma. If the original grammar is a regular grammar, this operation will eventually succeed, but on

general grammars, it may not. Our strategy adds additional rewriting rules that allow the algorithm to succeed on more grammars than the basic algorithm. Some of our rewriting rules will remove ambiguities from the grammar. When this is the case, the ambiguity is first presented to the user along with options for removing the ambiguity: the user can choose between (potentially several) precede and follow restrictions. Once the user has made his choice, the rewriting rule is applied and the ambiguity is removed. The new disambiguation rule will need to be taken into account for the rest of the algorithm: it may be the case that the same production occurs elsewhere in the grammar, so this occurrence may need to be removed if it is a violation of the disambiguation rule. At the end of this process, we can calculate whether the remaining regular expression has any ambiguities left in it [5]. The user can then add extra disambiguation rules to remove any ambiguity that is left.

At several points in this process we need to know something about how the current system of regular expressions corresponds to applications of productions of the original grammar. In order to maintain this information we introduce the concept of a structured regular expression. A situation where this information is needed is when we wish to introduce new disambiguation rules: these rules are attached to productions, so we need to know what productions are involved in the ambiguity. As this information needs to be maintained, we are somewhat constrained in what rewriting rules we can use: the rules need to preserve the derivations represented by the structured regular expression and may only throw away applications of productions that violate newly introduced disambiguation rules.

There is one remaining problem: the disambiguation rules we introduce may forbid applications of productions that are not involved in the ambiguity that was removed by introducing this rule.
These unrelated applications of productions may not be part of an ambiguity and may therefore remove words from the language represented by the system of regular expressions. To detect this, we perform the above process twice: the first time, we introduce and enforce the disambiguation rules as specified by the user, but the second time, we only remove the ‘local’ ambiguity detected by the rewriting rules, without globally introducing a disambiguation rule. This second process need not use structured regular expressions and may therefore use additional rewriting rules that simplify the regular expressions: the only requirement is that the resulting regular expression has the same language as the original grammar. After both processes have finished we end up with two regular expressions: the first adheres to the disambiguation rules and the second preserves the original language. We can then compute whether these two regular expressions represent the same language, in which case the disambiguation rules did not remove any strings from the language, thus proving correctness of the strategy. If the computation returns that the two were not the same, the strategy fails.

We see three options to make our strategy more powerful than the basic strategy presented in this thesis. Firstly, we think ambiguity filtering [6, 24] may help in discovering the ‘ambiguous part’ of the grammar. For example, consider an otherwise unambiguous programming language that contains an expression grammar as a subgrammar: if we can detect that the only potentially ambiguous part is this subgrammar, we can use our strategy on just this subgrammar. This would allow us to disambiguate the full grammar, even if the rest of the grammar is for example not regular.
A second potential improvement is to add an extra step if the rewriter gets stuck, which checks whether the current partially rewritten grammar belongs to some class of unambiguous grammars, such as the LR(1), LL(1) or simple grammar classes. For quite a number of such classes, language equality is decidable [16, 17, 18, 29, 32], and in the case of simple grammars, it is even feasible in practice [7]. This may allow the strategy to work on entire programming languages, which contain non-regular constructs such as brackets.

[Figure 6.1: Schematic overview of our disambiguation strategy. The grammar is converted into a structured regular expression system and a plain regular expression system; both are rewritten (the former using user-supplied disambiguation rules), after which LR tests, an ambiguity test and an equality test are applied.]

Thirdly, adding more rewriting rules will (up to problems with confluence) also increase the applicability of the strategy. In particular, the basic strategy as presented here only has rewriting rules for converting expression grammars to regular expressions and not to structured regular expressions, as it turned out that such a rewriter was much more complicated to define than we expected, so we did not manage to give a full definition in this thesis in time. Adding this rewriter will extend the applicability of the strategy to expression grammars. Additional rewriting rules may help in rewriting even more grammars.
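The equality test at the end of the strategy compares the languages of two regular expressions. Exact equivalence is decidable, for instance by comparing minimal DFAs; as a lightweight illustration (not the decision procedure the strategy would use), the following sketch compares two Python `re` patterns on all strings over a given alphabet up to a length bound:

```python
import re
from itertools import product

def words_upto(alphabet, max_len):
    """Yield every string over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for tup in product(alphabet, repeat=n):
            yield "".join(tup)

def bounded_equal(regex1, regex2, alphabet, max_len=6):
    """Check whether two regexes accept exactly the same strings up to `max_len`.

    This is only a necessary condition for language equality, but a useful
    sanity check; a real implementation would compare minimal DFAs instead.
    """
    p1, p2 = re.compile(regex1), re.compile(regex2)
    return all((p1.fullmatch(w) is None) == (p2.fullmatch(w) is None)
               for w in words_upto(alphabet, max_len))

# Example: a|aa* and a+ denote the same language; a* does not (it accepts ε).
print(bounded_equal("a|aa*", "a+", "a"))  # True
print(bounded_equal("a*", "a+", "a"))     # False
```

A bounded check like this can only refute equality, never prove it; it is useful as a quick test while developing the rewriting rules.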

6.2 Systems of Regular Expressions

We will use the standard definition of regular expressions over a set T of terminals (see e.g. [27]), using | for the operation of finite union, juxtaposition for concatenation, ∗ for closure, ε for the empty string and ∅ for the empty language. | has lowest priority, ∗ has highest priority, and | and juxtaposition are both associative. We define the relation is a subexpression of (or alternatively named contains) as follows: if r1 and r2 are regular expressions, then they are subexpressions of r1r2, of r1|r2 and of r1∗. We will liberally use the associativity and commutativity of alternation operators in reordering their operands.

A system of regular equations is a quadruple G = (T, N, S, P). T is a set of terminals, N a set of nonterminals (T ∩ N = ∅), S ∈ N the starting symbol, V = N ∪ T the set of vocabulary symbols, and P is the production function that associates a regular expression over V with every nonterminal. If for some A ∈ N, P(A) contains A, we say that A is self-recursive. We say that a system of regular equations G is atomic if P(S) does not contain any nonterminals.

We can easily convert any context-free grammar to a system of regular equations. T, N and S stay the same. For a given A ∈ N, let w1, ..., wk be the distinct right-hand sides of all (A, w) ∈ P. We define P(A) = w1| ...|wk.

In order to define the language of a system of regular equations, we define an equation L(A) = L(P(A)) for every A ∈ N. As elements of N may occur in P(A), the right-hand sides of these equations may not be properly defined. An assignment of a language to L(A) for each A is a solution to the system if the equations hold for these assignments. It is easily seen that there is a unique minimal solution to the system, so we let L(A) have the value of this minimal solution. We then define the language of the entire system as L(S).
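The conversion from a context-free grammar to a system of regular equations can be sketched directly. The encoding below is hypothetical (a grammar as a list of (nonterminal, right-hand side) pairs, regular expressions as plain strings); only the grouping of distinct right-hand sides into one union per nonterminal comes from the text:

```python
def to_regular_equations(productions):
    """Convert a CFG, given as (nonterminal, rhs-string) pairs, into the
    production function P of a system of regular equations: P(A) is the
    finite union of the distinct right-hand sides of A."""
    P = {}
    for nonterminal, rhs in productions:
        P.setdefault(nonterminal, [])
        if rhs not in P[nonterminal]:  # keep only distinct right-hand sides
            P[nonterminal].append(rhs)
    # Join the alternatives with |; the empty string is rendered as ε.
    return {A: "|".join(w if w else "ε" for w in ws) for A, ws in P.items()}

# E -> E+E | E*E | a becomes the single equation E = E+E|E*E|a.
eqs = to_regular_equations([("E", "E+E"), ("E", "E*E"), ("E", "a")])
print(eqs["E"])  # E+E|E*E|a
```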

6.3 Rewriting Rules for Regular Expressions

6.3.1 Defining Rewriters

A regular equation rewriter (or just ‘rewriter’ in this section) is a function from systems of regular equations G = (T, N, S, P) to systems of regular equations G′ = (T′, N′, S′, P′). A rewriter r is language preserving if L(G) = L(r(G)) for every G. Our goal is to find a language preserving rewriter that results in atomic systems of regular equations for as many grammars as possible.

Given a sequence of rewriters r1, ..., rk, we define the first-that-works rewriter r as follows: r(G) = ri(G) for the smallest i such that ri(G) ≠ G; if no such i exists, then r(G) = G. We call a rewriter r iterable if for every system G, repeated application of r eventually reaches a system that r no longer changes; the iterated rewriter of r maps G to this fixed point.
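The two combinators can be written down generically. The sketch below is ours (illustrative names; a toy integer domain stands in for systems of regular equations, since any function on a value with equality fits the definitions):

```python
def first_that_works(*rewriters):
    """Return the rewriter that applies the first rewriter in the sequence
    that changes the system; if none changes it, return the system as-is."""
    def r(system):
        for rewriter in rewriters:
            result = rewriter(system)
            if result != system:
                return result
        return system
    return r

def iterated(rewriter):
    """Apply `rewriter` until a fixed point is reached (assumes iterability)."""
    def r(system):
        while True:
            result = rewriter(system)
            if result == system:
                return system
            system = result
    return r

# Toy example on integers: halve even numbers, otherwise subtract one;
# iterating the first-that-works rewriter drives everything down to 1.
halve = lambda n: n // 2 if n % 2 == 0 and n > 1 else n
decrement = lambda n: n - 1 if n > 1 else n
print(iterated(first_that_works(halve, decrement))(12))  # 1
```

Termination of `iterated` is exactly the iterability property proven for each rewriter in the lemmas below: each application either returns its input or decreases some well-founded measure.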

6.3.2 Our Rewriters

We will now define a sequence of language-preserving, iterable rewriters. The rewriter that we use in our strategy will be the iterated rewriter of the first-that-works rewriter of this sequence.

The first rewriter r1 we use takes nonterminals other than S that are not self-recursive and removes these nonterminals from the grammar, by replacing all occurrences of such a nonterminal A by P(A) in P(X) for all X ∈ N.

Lemma 10 r1 is language preserving and iterable.

Proof. Language preservation is immediate, as we perform a substitution with a regular expression with the same language. r1 either decreases the number of nonterminals by 1 or returns the input system, so it is iterable. □
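A minimal sketch of r1, assuming for illustration single-character nonterminal names and the string representation of P used earlier (so substitution is plain text replacement; a real implementation would work on expression trees):

```python
def inline_nonrecursive(P, start="S"):
    """One application of r1: remove a nonterminal other than `start` that is
    not self-recursive, substituting its definition everywhere it occurs."""
    for A, rhs in P.items():
        if A != start and A not in rhs:  # A is not self-recursive
            replacement = "(" + rhs + ")"  # parenthesize to keep precedence
            return {X: other.replace(A, replacement)
                    for X, other in P.items() if X != A}
    return P  # every other nonterminal is self-recursive: nothing to do

# S = aB, B = b|c  becomes  S = a(b|c)
print(inline_nonrecursive({"S": "aB", "B": "b|c"}))  # {'S': 'a(b|c)'}
```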

The second rewriter r2 is a generalized version of Arden’s lemma. It uses the following pattern replacement rules:

X = d∗(aX|b)c∗ ⇒ X = (a|d)∗bc∗
X = c∗(Xa|b)d∗ ⇒ X = c∗b(a|d)∗

In these patterns, a, b, c, d are regular expressions. Note that if c = d = ∅, we recover Arden’s lemma.

Lemma 11 r2 is language preserving and iterable.

Proof. We only prove the lemma for the first pattern, as the second pattern is just the first pattern in reverse, and so the proof for that pattern is analogous. Iterability is easily seen, as the number of self-references goes down by 1 after every application of the rewriter.

We will use two systems of regular expressions: G is the input and G′ = r2(G) is the output. We wish to prove that L(G′) = L(G). r2 preserves the nonterminals, so let X′ be the nonterminal of G′ corresponding to the nonterminal X in the pattern. We will prove L(X′) = L(X), which implies L(G′) = L(G), as r2 does not affect other variables. We will show L(X′) ⊆ L(X) by showing that X′ satisfies X′ = d∗(aX′|b)c∗. We will then show that L(X) ⊆ L(X′) by taking a word w ∈ L(X) and showing w ∈ L((a|d)∗bc∗).

We want to show that X′ = (a|d)∗bc∗ = d∗(a(a|d)∗bc∗|b)c∗ = d∗(aX′|b)c∗. We will use the well-known rules L(a∗) = L(aa∗|ε), L(a∗a∗) = L(a∗) and L((a|b)∗) = L((a∗b)∗a∗).

d∗(a(a|d)∗bc∗|b)c∗ = d∗a(a|d)∗bc∗c∗ | d∗bc∗
                   = d∗a(a|d)∗bc∗ | d∗bc∗
                   = d∗a(d∗a)∗d∗bc∗ | d∗bc∗
                   = (d∗a(d∗a)∗|ε)d∗bc∗
                   = (d∗a)∗d∗bc∗
                   = (a|d)∗bc∗

Now let w ∈ L(X). As X = d∗(aX|b)c∗ = d∗aXc∗|d∗bc∗, there exists some n such that w ∈ L((d∗a)^n d∗bc∗(c∗)^n). This means that w = d^{k1} a d^{k2} a ... a d^{kp} d^l b c^m c^{n1} c^{n2} ... c^{nq} for p, q, k1, ..., kp, l, m, n1, ..., nq ∈ ℕ. Simplifying, we get w = d^{k1} a d^{k2} a ... a d^{kp} d^l b c^{m + n1 + ... + nq}. It should be clear that w ∈ L((a|d)∗bc∗). This proves the lemma. □
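The identity at the heart of Lemma 11 can also be sanity-checked empirically: with concrete single-letter choices for a, b, c and d, the closed form (a|d)∗bc∗ and its unfolding d∗(a(a|d)∗bc∗|b)c∗ should match exactly the same strings. A small bounded check using Python's re module (an illustration only, not part of the proof):

```python
import re
from itertools import product

X = "(?:a|d)*b(?:c)*"                 # proposed closed form (a|d)*bc*
unfolded = f"(?:d)*(?:a{X}|b)(?:c)*"  # d*(aX|b)c* with X substituted in

# Compare membership over all strings of length < 6 over {a, b, c, d}.
same = all(
    (re.fullmatch(X, w) is None) == (re.fullmatch(unfolded, w) is None)
    for n in range(6)
    for w in ("".join(t) for t in product("abcd", repeat=n))
)
print(same)  # True
```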

The third rewriter r3 removes superfluous self-recursion. It uses the following pattern replacement rules:

X = c(aX|b)∗ ⇒ X = c(ac|b)∗
X = (Xa|b)∗c ⇒ X = (ca|b)∗c

In these patterns, a, b, c are regular expressions.

Lemma 12 r3 is language preserving and iterable.

Proof. We again only prove the lemma for the first pattern, as the second pattern is just the first pattern in reverse. Iterability is easily seen, as the number of self-references goes down by 1 after every application of the rewriter.

We will use two systems of regular expressions: G is the input and G′ = r3(G) is the output. We wish to prove that L(G′) = L(G). r3 preserves the nonterminals, so let X′ be the nonterminal of G′ corresponding to the nonterminal X in the pattern. We will prove L(X′) = L(X), which implies L(G′) = L(G), as r3 does not affect other variables.

L(X′) ⊆ L(X) is easily seen: c ∈ L(X), so L(c(ac|b)∗) ⊆ L(c(aX|b)∗). We will now show that L(X) ⊆ L(X′), so let w ∈ L(X). We will show that w ∈ L(c(ac|b)∗). We induce over the number of times X is expanded in c(aX|b)∗ to derive w. If X is never expanded, then b is always picked as the alternative, so w = cb^n for some n ∈ ℕ and w ∈ L(c(ac|b)∗) is immediate. Otherwise, the induction hypothesis gives us that w ∈ L(c(ac(ac|b)∗|b)∗). It is easily seen that w starts with a c which is followed by some concatenation of ac and b strings, so w ∈ L(c(ac|b)∗), which proves the lemma. □
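The first pattern of r3 admits the same kind of empirical sanity check as r2: substituting the proposed solution c(ac|b)∗ back into c(aX|b)∗ should not change the set of matched strings (an illustration with single letters, not a proof):

```python
import re
from itertools import product

X = "c(?:ac|b)*"            # proposed solution c(ac|b)*
unfolded = f"c(?:a{X}|b)*"  # c(aX|b)* with X substituted in

# Compare membership over all strings of length < 7 over {a, b, c}.
same = all(
    (re.fullmatch(X, w) is None) == (re.fullmatch(unfolded, w) is None)
    for n in range(7)
    for w in ("".join(t) for t in product("abc", repeat=n))
)
print(same)  # True
```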

The fourth rewriter r4 uses the fact that finite union distributes over concatenation to transform the regular expressions produced by P to a normal form:

a(b|c)d ⇒ abd|acd

In this pattern, a, b, c, d are regular expressions. First note that r4 is confluent. r4 is then easily seen to be language preserving and iterable (eventually no concatenation operators will have finite unions as their operands). Note that we apply r4 after r2: the standard algorithm for turning regular grammars into regular expressions first performs r4 and then (a simplified version of) r2, but as our r2 uses a slight generalization of Arden’s lemma, it may be able to take advantage of unexpanded finite unions.
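A sketch of r4 on a small expression representation (a hypothetical encoding with 'alt'/'cat' nodes and string leaves; closures and other operators would appear as opaque leaves in a fuller version) distributes unions over concatenations into the sum-of-products normal form:

```python
def distribute(expr):
    """Return the list of alternatives of `expr`, each alternative being a
    list of leaf factors, i.e. the sum-of-products normal form of r4."""
    if isinstance(expr, str):
        return [[expr]]  # a leaf is a single one-factor alternative
    op, args = expr
    if op == "alt":
        # Union: concatenate the operands' alternative lists.
        return [alt for arg in args for alt in distribute(arg)]
    if op == "cat":
        # Concatenation: cartesian product of the operands' alternatives.
        result = [[]]
        for arg in args:
            result = [pre + alt for pre in result for alt in distribute(arg)]
        return result
    raise ValueError(f"unknown operator {op!r}")

# a(b|c)d  =>  abd|acd
print(distribute(("cat", ["a", ("alt", ["b", "c"]), "d"])))
# [['a', 'b', 'd'], ['a', 'c', 'd']]
```

Confluence is visible in this formulation: the result depends only on the multiset of alternatives reachable in each operand, not on the order in which nested unions are expanded.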

Let r be the iterated rewriter of the first-that-works rewriter of r1, r2, r3, r4. r is language preserving by earlier remarks and by Lemmas 10, 11 and 12. Recall that a rewriter ‘succeeds’ if the result is atomic – the language of the grammar is then regular. We have the following results:

Lemma 13 If G is a system of regular expressions that corresponds to a regular grammar, then r(G) is atomic.

Proof. We assume our grammar is right-regular (that is, its rules are of the form A → aB or A → ε, where a is a concatenation of one or more terminals) – the proof for left-regular grammars is analogous. Unfortunately, r does not behave in the same way as the standard algorithm for converting regular grammars into regular expressions, because r2 is somewhat aggressive in rewriting. For the rest of this proof, we denote the iterated application of the pattern used by r4 to a regular expression a as r4∗(a). We will prove that the following property is maintained throughout the rewriting process: for all A, r4∗(P(A)) will have the form a1X1| ...|akXk|b, where a1, ..., ak, b are regular expressions over T, so they do not contain any nonterminals. Note that our property implies that r3 will never be applicable, for r4 does not move or remove nonterminals inside arguments of a closure operator.

First, we will see that r1 preserves the property. Suppose A is removed by r1 and P(B) contains a reference to A that is rewritten to P(A). By assumption, r4^i(P(A)) = a1X1|...|akXk|b and r4^i(P(B)) = c1A|...|clA|a1′X1′|...|am′Xm′|b′ (after reordering the operands of the | operators). Let w = P(B) after the system is rewritten by r1. Using the confluence of r4 and the associativity of |, it follows that r4^i(w) = c1a1X1|...|c1akXk|c2a1X1|...|clakXk|a1′X1′|...|am′Xm′|b′, which still has our property.

Now we will prove that r2 preserves this property. Suppose the pattern d∗(aX|b)c∗ occurs anywhere in P(A) for some A (the case for the pattern c∗(Xa|b)d∗ is analogous). As our property holds for P(A), as X is a nonterminal and as d∗(aX|b)c∗ can be turned into d∗aXc∗|d∗bc∗ by using the pattern of r4, we must have that d and a do not contain any nonterminals and that c = ∅. This means (a|d)∗ does not contain any nonterminal either. b may contain a nonterminal, but r4^i(b) must have our property. Prepending (a|d)∗ to b cannot cause our property to be violated, and so we conclude that (a|d)∗bc∗ also has our property, thus proving that r2 indeed preserves our property.

r3 can never change the system as noted above. r4 obviously preserves our property as well. As the property holds initially in the right-regular grammar, we therefore conclude that our property holds throughout the rewriting process of r.

As r iteratively applies r1, r2, r3 and r4, r(G) is a system such that none of r1, r2, r3 or r4 would change it if applied to it. We will show that if this happens, the system must be atomical (in our setting, so when G is a regular grammar).

Firstly, as r4 is no longer applicable, by the property proven above P(A) must have the form a1X1|...|akXk|b for all A. Now note that if any of X1,...,Xk were equal to A, say Xi, then r2 would be applicable with c = d = ∅, a = ai and b = a1X1|...|ai−1Xi−1|ai+1Xi+1|...|akXk|b. As r2 is no longer applicable, we therefore conclude that A is not self-recursive. This means that r1 may be applicable: the only case in which it is not is if A = S, but this means that S is not self-recursive, which by definition means that the system is atomical, as required. This proves the lemma.

Theorem 14 If G is a system of regular expressions that corresponds to an expression grammar, then r(G) is atomical.

Proof. The system of regular expressions corresponding to an expression grammar has a single nonterminal E and P(E) = Ea1E|...|EakE|b1E|...|blE|Ec1|...|Ecm|ε. Applying r1 will never do anything on this system, as we only have one nonterminal and none of our rewriters introduce new nonterminals. r2 can be applied k + l + m times, resulting (up to reordering of finite unions and choices made in how to resolve the EaiE rules) in P(E) = (b1|...|bl|Ea1|...|Eak)∗(c1|...|cm)∗. Applying r3 k times then gives P(E) = (b1|...|bl|(c1|...|cm)∗a1|...|(c1|...|cm)∗ak)∗(c1|...|cm)∗. At this point, none of our rewriting rules are applicable anymore. The resulting system is atomical, as required.
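As an informal sanity check of this result (not part of the thesis), consider the instance k = l = m = 1 with a single binary operator a, prefix operator b and postfix operator c. The rewriting above yields the regular expression (b|c∗a)∗c∗, which can be compared by brute force against the grammar E → EaE | bE | Ec | ε on all short strings:

```python
# Cross-check (b|c*a)*c* against E -> EaE | bE | Ec | epsilon on all
# strings over {a, b, c} of length at most 4.
import re
from functools import lru_cache
from itertools import product

@lru_cache(maxsize=None)
def derivable(w):
    """Membership in L(E), decomposing on the production applied at the
    root of a derivation tree."""
    if w == "":
        return True                              # E -> epsilon
    if w.startswith("b") and derivable(w[1:]):
        return True                              # E -> bE
    if w.endswith("c") and derivable(w[:-1]):
        return True                              # E -> Ec
    return any(w[i] == "a" and derivable(w[:i]) and derivable(w[i+1:])
               for i in range(len(w)))           # E -> EaE

words = ["".join(p) for n in range(5) for p in product("abc", repeat=n)]
assert all(derivable(w) == bool(re.fullmatch(r"(b|c*a)*c*", w))
           for w in words)
```

The check passes: both descriptions accept exactly the words with no occurrence of the factor cb, i.e. words in which every c is eventually followed by an a or ends the word.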

6.4 Structured Regular Expressions

We now define the concept of structured regular expressions over some set of vocabulary symbols V and with respect to a numbered set P′ of productions. Structured regular expressions are ordered trees with five types of nodes. The first type of node is the leaf node, labeled by an element of V. The second type of node is an application, labeled with an index i into P′ and with children a1,...,ak (in that order), where ai may be any type of node. Let α be the right-hand side of the production in P′ indexed by i. We require that α = a1...ak. Applications are denoted as [i : α].

The third type of node is an alternation, having two children r1 and r2 which are not allowed to be leaf nodes. Alternations are denoted as r1|r2. Finally, the fourth and fifth types of nodes are the left iterations and right iterations. These both have two children r1 and r2 (in that order) which are not allowed to be leaf nodes. Left iterations are denoted as [(r2)∗ : r1] and right iterations as [r1 : (r2)∗]. The root of a structured regular expression may not be a leaf node. We will often identify an internal node with a structured regular expression by taking the subtree rooted at that node.

Just like with regular expressions, we can define a system of structured regular equations with respect to a numbered set P′ of productions. The only difference in the definition is that P returns structured regular expressions with respect to P′. We use the same definitions for self-recursive nonterminals (a leaf node of P(A) is labeled A) and the atomicness of a system (P(S) does not have leaf nodes labeled with a nonterminal).

To convert from a context-free grammar to a system of structured regular equations, we first assume some indexing on the elements of P. We let T, N and S stay the same as in the grammar. For a given A ∈ N, let w1,...,wk be the distinct right-hand sides of all (A, wi) ∈ P, and let i1,...,ik be the indices of these productions in P′. We define P(A) = [i1 : w1]|...|[ik : wk]. We can also convert systems of structured regular equations into context-free grammars. Let G = (T,N,S,P) be a system of structured regular equations with respect to a set of productions P′, then G′ = (T,N,S,P′) is the grammar induced by G.

We will now define for a system of structured regular equations G and a given nonterminal A ∈ N the set of derivation trees D(A) corresponding to A.
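One possible encoding of these five node types is sketched below; the class and field names are our own illustrative assumptions, not notation from the thesis:

```python
# A sketch of the five node types of a structured regular expression.
# `index` refers into the numbered production set P'.
from dataclasses import dataclass
from typing import List, Union

Node = Union["Leaf", "Application", "Alternation",
             "LeftIteration", "RightIteration"]

@dataclass
class Leaf:                 # labeled by a vocabulary symbol in V
    symbol: str

@dataclass
class Application:          # [i : a1 ... ak]; children may be any node type
    index: int              # index i into the numbered production set P'
    children: List[Node]

@dataclass
class Alternation:          # r1 | r2; children must not be leaf nodes
    left: Node
    right: Node

@dataclass
class LeftIteration:        # [(r2)* : r1]
    body: Node              # r1, the non-iterated child
    iterated: Node          # r2

@dataclass
class RightIteration:       # [r1 : (r2)*]
    body: Node              # r1, the non-iterated child
    iterated: Node          # r2

# For instance, with productions numbered 1: E -> aE, 2: E -> Eb, 3: E -> ε,
# the structured regular expression [[1 : aE] | [3 : ε] : ([2 : b])*] is
expr = RightIteration(
    body=Alternation(Application(1, [Leaf("a"), Leaf("E")]),
                     Application(3, [])),
    iterated=Application(2, [Leaf("b")]))
```

Note that the children of an iteration are stored in the order (r1, r2) regardless of which side the iterated part appears on in the written notation, matching the definition above.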
We define for every node r in every structured regular expression in D a set D(r), whose definition depends on the sets given by D of its children, and in the case of a leaf node labeled by an element A ∈ N, on the set D(A). We then take the least fixed point over these definitions to fix the value of all these sets simultaneously. In order for this definition to make sense, we make assumptions about G in our definition.

If G satisfies these assumptions, we call it consistent with P′ (or just consistent). We will also define the notion of the head h(r) of a node of a structured regular expression, which we will use to define consistency. Similarly, we will define the recursedness of a node. This property can have the values left-recursed on X, right-recursed on X (both for every X ∈ N) and non-recursed. If a node r is something other than non-recursed, the trees we define for D(r) will have precisely one dummy leaf node labeled $ ($ ∉ V).

For every internal node of every derivation tree in our sets defined by D, we will associate some application node of a structured regular expression defined by P. In our definition of D we sometimes create derivation trees by using already defined derivation trees as subtrees of the new tree. For the nodes of these subtrees of this new tree, we define the associations to be equal to the associations of the nodes of the old derivation trees. This way, we know for every internal node of a derivation tree which application node ‘created’ that internal node.

If r is a leaf node labeled x, then h(r) = x, r is non-recursed and D(r) is the singleton set containing a leaf node labeled x if x ∈ T, and D(P(x)) otherwise.

Suppose r = [i : α]. Let X → β be the ith production in P′. We define h(r) = X. If β = ε, then the ordered tree consisting of a node labeled X (this node is associated with r) with a single child labeled ε is in D(r).

Now assume β ≠ ε. Let α = r1...rk be the children of r (in that order). If β = h(r1)...h(rk) and r1,...,rk are all non-recursed, then r is non-recursed. If β = h(r1)...h(rk), r1,...,rk−1 are non-recursed and rk is right-recursed on Y, then r is right-recursed on Y. If β = h(r1)...h(rk), r2,...,rk are non-recursed and r1 is left-recursed on Y, then r is left-recursed on Y. In any of these three cases, for every t1 ∈ D(r1),...,tk ∈ D(rk), we define that for the ordered tree t whose root node is labeled X (and associated with r) and whose children are t1,...,tk in that order, t ∈ D(r).

If β = h(r1)...h(rk)Y and r1,...,rk are non-recursed, then r is right-recursed on Y, and for every t1 ∈ D(r1),...,tk ∈ D(rk), we define that for the ordered tree t whose root node is labeled X (and associated with r) and whose children are t1,...,tk, $ in that order, t ∈ D(r). If β = Y h(r1)...h(rk) and r1,...,rk are non-recursed, then r is left-recursed on Y, and for every t1 ∈ D(r1),...,tk ∈ D(rk), we define that for the ordered tree t whose root node is labeled X (and associated with r) and whose children are $, t1,...,tk in that order, t ∈ D(r).

For consistency of G, we assume that one of the above cases holds.

Now suppose r = r1|r2. Assume for consistency that h(r1) = h(r2) and that the recursednesses of r1 and r2 are the same. Define D(r) as D(r1) ∪ D(r2), h(r) = h(r1) and the recursedness of r as the recursedness of r1.

Finally, suppose r = [(r2)∗ : r1] (the case that r = [r1 : (r2)∗] is analogous). For consistency of G, assume r2 is right-recursed on h(r1) and h(r2) = h(r1). Define h(r) = h(r1) and the recursedness of r as the recursedness of r1. If t ∈ D(r1), then t ∈ D(r). If t1 ∈ D(r) and t2 ∈ D(r2), then the tree t obtained by replacing the dummy node $ in t2 by t1 is also in D(r).

Our final assumption for consistency of G is that P(A) is non-recursed for every A ∈ N. Define D(G) = D(S); then it is easy to see that D(G) is equal to the set of derivation trees of G′, PT_G′, if G is the system obtained by transforming a context-free grammar G′ into G. We define for every A ∈ N the language of A as L(A) = {y(t) | t ∈ D(A)}, and the language of the system as L(G) = L(S).

6.5 Rewriting Rules for Structured Regular Expressions

6.5.1 Defining Rewriters

A structured regular equation rewriter (or just ‘rewriter’ in this section) is a function that takes a system of structured regular equations G = (T,N,S,P) with respect to P′ and results in a pair (G′, F), where G′ is a system of structured regular equations with respect to P′ and F is a filter, as defined in Chapter 3. A rewriter is correct if for every consistent G, G′ is consistent and F(D(G′)) = D(G).

In order to be able to compose rewriters, we define intermediate rewriters for every disambiguation mechanism M. Intermediate rewriters take a system of structured regular equations G and a set of auxiliary data Q for M and return a pair (G′, Q′), where G′ is a system of structured regular equations and Q′ a set of auxiliary data. Let H be the grammar induced by G and H′ the grammar induced by G′. We say an intermediate rewriter is correct if, assuming G is consistent and D(G) = M(H, Q)(PT_H), it holds that G′ is consistent and D(G′) = M(H′, Q′)(PT_H′).

We can turn an intermediate rewriter ri into a rewriter r as follows. Let Q be a set of auxiliary data for M such that M(H, Q) is the identity function (such a Q exists for all our disambiguation mechanisms, usually by setting Q = ∅), and let ri(G, Q) = (G′, Q′); then r(G) = (G′, M(H′, Q′)). Note that r is correct if ri is correct.

Given a sequence of intermediate rewriters r1,...,rk, we define the first-that-works intermediate rewriter r as follows: r(G, Q) = ri(G, Q) for the smallest i such that ri(G, Q) ≠ (G, Q), and r(G, Q) = (G, Q) if no such i exists.

6.5.2 Enforcing Filters

We will first define a function yielding a correct intermediate rewriter that forces a consistent system of structured regular equations to adhere to a set of precede and follow restrictions Q = (P, F). We name this function R. The remainder of this subsection defines G′, after which we will define R(Q)(G, F) = (G′, F). The first phase of the intermediate rewriter is to mark nodes in the structured regular expressions as ‘needs to be removed’. The second phase then removes these nodes. Let r = [i : α] be an internal node. We define

Precede(r) = {a ∈ T | ∃t ∈ D(S), n ∈ t, n is associated with r, a ∈ Precede(n)}
Follow(r) = {a ∈ T | ∃t ∈ D(S), n ∈ t, n is associated with r, a ∈ Follow(n)}

Let p be the ith production in P′. We mark r as ‘needs to be removed’ if (p, a) ∈ P for some a ∈ Precede(r) or if (p, a) ∈ F for some a ∈ Follow(r). Note that this measure, while correct, may remove too much from the system: if for example (p, a) ∈ P for some but not all a ∈ Precede(r), then L(G′) need not equal L(G). It may be possible to make clever

modifications to G to remove r only if forbidden terminals precede it, but we did not research this further.

We will now propagate the ‘needs to be removed’ flag upwards. If r = [i : α] is an internal node and one of its children is marked, we mark r as well. If r = r1|r2 and both r1 and r2 are marked, we mark r as well. If r = [(r2)∗ : r1] or r = [r1 : (r2)∗] and r1 is marked, we mark r as well. Finally, if P(A) is marked for some A ∈ N, then we mark all leaf nodes labeled A of P(B) for all B ∈ N.

We now define the second phase of the rewriter. It first removes every A ∈ N and its associated value P(A) if P(A) is marked as ‘needs to be removed’. This phase performs a pre-order walk over the tree P(A) for all A ∈ N, possibly modifying the nodes that are visited. If a node is modified, we visit the node again. If a node did not change during a visit, we continue the pre-order walk on its most recent set of children (so not on the children the node had before it was visited).

Let r be the node we consider in the pre-order walk. If r = [(r2)∗ : r1] or r = [r1 : (r2)∗] and r2 is marked, we replace r by r1 (or equivalently, we modify the type and properties of r to match r1, and replace its children by the children of r1). If r = r1|r2 and either r1 or r2 is marked, we replace r by the node that is not marked among r1 and r2 (or again equivalently, we modify the type and properties of the node r to match the replacement node, and replace its children by the children of the replacement node).

Note that this rewriter may end up making N = ∅. We could fix this by introducing some way of allowing P(S) to be a structured regular expression denoting the empty set, but we decided to ignore this problem rather than making our definitions even more complex. It is straightforward to verify that R indeed works as advertised.
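The two phases can be sketched as follows, under an assumed tuple encoding of structured regular expressions (our own labels, not from the thesis); `bad` stands for the set of production indices whose applications violate some precede or follow restriction:

```python
# Sketch of the 'needs to be removed' mark and the pruning phase.
# Nodes: ("app", i, children), ("alt", r1, r2), ("liter", r1, r2),
# ("riter", r1, r2) (r1 is the non-iterated child), ("leaf", symbol).

def marked(node, bad):
    kind = node[0]
    if kind == "leaf":
        return False
    if kind == "app":
        # marked if the production is forbidden or any child is marked
        return node[1] in bad or any(marked(c, bad) for c in node[2])
    if kind == "alt":
        # an alternation is marked only if both alternatives are
        return marked(node[1], bad) and marked(node[2], bad)
    # iterations: marked iff the non-iterated child r1 is marked
    return marked(node[1], bad)

def prune(node, bad):
    kind = node[0]
    if kind == "leaf":
        return node
    if kind == "app":
        return ("app", node[1], [prune(c, bad) for c in node[2]])
    if kind == "alt":
        left, right = prune(node[1], bad), prune(node[2], bad)
        if marked(node[1], bad):
            return right
        if marked(node[2], bad):
            return left
        return ("alt", left, right)
    # iteration: drop the iterated part r2 when it is marked
    body = prune(node[1], bad)
    if marked(node[2], bad):
        return body
    return (kind, body, prune(node[2], bad))

# For [[1 : aE] | [3 : ε] : ([2 : b])*] with production 2 forbidden, the
# iterated part is dropped and only the alternation remains:
expr = ("riter",
        ("alt", ("app", 1, [("leaf", "a"), ("leaf", "E")]), ("app", 3, [])),
        ("app", 2, [("leaf", "b")]))
assert prune(expr, {2})[0] == "alt"
```

This sketch omits the computation of Precede(r) and Follow(r) (so `bad` is taken as given) and the removal of whole nonterminals, but it shows how marks flow upward through the five node types.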

6.5.3 Our Rewriters

We will now give a sequence of correct iterable intermediate rewriters for the precede-follow restriction disambiguation mechanism that correspond to the rewriters of Section 6.3. The rewriter that we use in our strategy will be the non-intermediate rewriter of the iterated inter- mediate rewriter of the first-that-works intermediate rewriter of this sequence.

The first intermediate rewriter r1 we use takes nonterminals other than S that are not self-recursive and removes these nonterminals from the grammar, by replacing all occurrences of such a nonterminal A by P(A) in all results of P. r1 does not change Q and so returns Q′ = Q.

Lemma 15 r1 is correct and iterable.

Proof. It is easily verified that r1 transforms consistent inputs to consistent outputs. Correct- ness is then immediate as we defined D(A)asD(P (A)) for every A ∈ N. r1 either decreases the number of nonterminals by 1 or returns the input system, so it is iterable. 

When trying to make a rewriter corresponding to the r2 of the previous section, we discovered that this was a lot more complicated than we initially thought. The r2 we define here therefore only attempts to rewrite regular grammars. We do not give an r3 either, as this is not needed to rewrite regular grammars. A new r2 should be introduced if one wishes to extend our strategy to expression grammars. One might try to create a pattern of alternating left and right iteration nodes, each corresponding to a set of pre- or postfix operators, but then derivation trees like the ones in Section 4.1 are no longer allowed, so a much more complicated pattern is needed.

We claimed earlier that rewriting naturally gives rise to auxiliary data for disambiguation mechanisms. Consider E → aE|Eb|ε. We get [[1 : aE]|[3 : ε] : ([2 : b])∗]. At this point we have all the information we need to either forbid a preceding Eb or forbid b following aE, as claimed. Unfortunately, we have no example of a rewriter that introduces auxiliary data.

We now define the intermediate rewriter r2. We only define how to deal with right recursion: r2 should also rewrite left recursion, whose definition is just the (left-right) reverse of right recursion removal. It attempts to find an X that satisfies the following. We write P(X) = a1|...|ak or P(X) = [(a0)∗ : a1|...|ak]. We walk down every ai, deciding which child to visit next based on the following rules. For application nodes, we follow the rightmost child, if one exists. For left iteration nodes, we follow the left child (so the non-iterated child). For right iteration nodes and alternation nodes, we halt our walk. We attempt to find such a path down the tree that ends in a leaf node labeled X. Let its parent be [i : αX] (recall that leaf nodes are children of only application nodes). We change this parent to [i : α]. If P(X) = a1|...|ak, we modify P(X) to [(ai)∗ : a1|...|ai−1|ai+1|...|ak]. If P(X) = [(a0)∗ : a1|...|ak], we modify P(X) to [(a0|ai)∗ : a1|...|ai−1|ai+1|...|ak]. r2 does not change Q.
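The walk described above can be sketched as follows, again under our own assumed tuple encoding (the labels and the function name are illustrative, not from the thesis):

```python
# Sketch of the walk r2 performs on one alternative of P(X): follow the
# rightmost child of applications and the non-iterated child of left
# iterations, halting at right iterations and alternations, and report
# whether the walk ends in a leaf labeled X under an application node.
# Nodes: ("app", i, children), ("alt", r1, r2), ("liter", r1, r2),
# ("riter", r1, r2), ("leaf", symbol).

def ends_in_self_reference(node, X, parent=None):
    kind = node[0]
    if kind == "leaf":
        # leaf nodes are children of application nodes only
        return node[1] == X and parent is not None and parent[0] == "app"
    if kind == "app" and node[2]:
        return ends_in_self_reference(node[2][-1], X, node)
    if kind == "liter":
        return ends_in_self_reference(node[1], X, node)
    # right iterations, alternations and childless applications halt the walk
    return False

# The alternative [1 : aE] of P(E) ends in a right self-reference on E:
assert ends_in_self_reference(("app", 1, [("leaf", "a"), ("leaf", "E")]), "E")
```

In the full rewriter, a successful walk identifies the parent [i : αX] to change to [i : α] and the alternative ai to move under the right iteration.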

Lemma 16 r2 is correct and iterable.

Proof. We first note that iterability is easily seen, as the number of self-references goes down by 1 after every application of the rewriter. Furthermore, it is easy to see that consistent systems are mapped to consistent systems by r2. Finally, as r2 is basically Kleene’s lemma, it is easy to check that D(G′) = D(G), as required.

The fourth rewriter r4 uses the fact that finite union distributes over concatenation to transform the structured regular expressions produced by P into a normal form:

[i : a1 ...al−1(b|c)al+1 ...ak] ⇒ [i : a1 ...al−1bal+1 ...ak]|[i : a1 ...al−1cal+1 ...ak]

In this pattern, b, c, a1,...,al−1,al+1,...,ak are structured regular expressions. r4 does not change Q. As r4 is again confluent, r4 is easily seen to be iterable (eventually no concatenation operator will have a finite union as an operand). Correctness is also easily verified.

Let r be the non-intermediate version of the iterated intermediate rewriter of the first-that-works intermediate rewriter of r1, r2, r4. r is correct by earlier remarks and by Lemmas 15 and 16. We have the following result:

Lemma 17 If G is a system of structured regular expressions that corresponds to a regular grammar, then r(G) is atomical. Furthermore, L(G) = L(r(G)).

Proof. The proof of this lemma is essentially the same as that of Lemma 13; we again consider only right-regular grammars. The property we use (the structured equivalent of the property used in the proof of Lemma 13) is that for all A ∈ N, r4^i(P(A)) will have the form a1|...|ak|b or [(b1|...|bl)∗ : a1|...|ak|b], where b has no leaf nodes labeled by a nonterminal, and if we consider the chain of nodes in ai or bj obtained by starting at the root and repeatedly following the rightmost child, then the last node of this chain is a nonterminal leaf node and all other nodes are either applications or left iterations, and in either case their non-rightmost children do not have leaf node descendants labeled by nonterminals.

By an argument similar to the one used in the proof of Lemma 13, we conclude that the property is maintained by r1, r2 and r4. By correctness of r1, r2 and r4 and because they do not introduce precede-follow restrictions, we conclude PT_G = D(r(G)) and hence L(G) = L(r(G)). Finally, by an argument similar to the one used in the proof of Lemma 13, we conclude that r(G) is atomical.


Chapter 7

Implementation

We have made a proof-of-concept implementation of our rewriting-based disambiguation strategy. It implements most of the steps of our strategy. First, the user enters a grammar. This grammar is converted to a system of regular equations and to a system of structured regular equations. These are then rewritten by our rewriters. Unfortunately, we did not manage to include all rewriters described in the previous section in our implementation, so the program usually gets stuck, in particular on structured regular expressions. We did not implement rewriters that suggest disambiguation rules, nor did we implement a framework that can handle such rules. If the rewriting succeeds, we can test the structured regular expression for any remaining ambiguities. We can then compare the structured regular expression to the regular expression for language equality. Most of the framework needed for our strategy is therefore present. The algorithms behind these tests are based on relational expressions and the efficient algorithms for them [27].

The proof-of-concept has been useful in trying out proposed rewriters on larger expression grammars, to see whether they work in the general case. Small examples can be tried on a whiteboard, but when the examples are large enough our implementation was useful to quickly and precisely test whether the rewriters worked. It also provides a nice basis for an implementation of a more advanced version of our strategy.

Specifying the grammar

The user can specify a set of productions. Terminals and nonterminals are represented as consecutive characters without spaces. Juxtaposition is interpreted as concatenation, | as alternation, ( and ) as parenthesized expressions and ∗ as the Kleene star operation. Productions may therefore have regular expressions as right-hand sides. If any right-hand side uses a Kleene star operation or an alternation, the conversion to a system of structured regular expressions fails (as applications cannot be mapped to original productions), but the conversion to a system of regular expressions still works; this is useful to test rewriters. A screenshot of this screen can be found in Figure 7.1.

Left-hand sides of productions must be consecutive characters without spaces, and these are assumed to be exactly the nonterminals. All other identifiers are assumed to be terminals. The nonterminal on the left-hand side of the first production is considered the starting symbol. Productions are numbered consecutively from 0. Productions are automatically parsed by our handwritten parser. If a production has an invalid left- or right-hand side, we use the last correct interpretation if there is one. The interpreted version is displayed, where nonterminals are prefixed by an & symbol.

Figure 7.1: A screenshot of the grammar input screen.

Rewriting

In the rewriting screen the user can press a button to let the rewriters perform a single change to the equation. This way, the user can see how his grammar is rewritten. A reset button is also available that removes all current productions and restores the initial system of regular equations obtained from the input grammar. A screenshot of this screen can be found in Figure 7.2.

Figure 7.2: Two screenshots of the rewriting screen. The ‘rewrite’ button has been pressed four times in the first screen. For the second screen the rewriter has finished simplifying the expression.

Structured Rewriting

The structured rewriting screen is almost the same as the rewriting screen. Just like the rewriting screen, it has a button that makes the rewriter take a single step, and a button to reset the screen. A screenshot of this screen can be found in Figure 7.3. If the rewriter can make no more steps and the system of structured regular equations has been reduced to a single structured regular expression, the ‘rewrite’ button is relabeled ‘perform ambiguity test’. Pressing the button calculates whether the structured regular expression is still ambiguous.

Figure 7.3: Two screenshots of the structured rewriting screen. The ‘rewrite’ button has been pressed four times in the first screen. The second screen shows a different grammar E → aE, E → x, after the ‘rewrite’ button has been pressed twice.

Finally, there is a button to test whether the regular expression obtained from the regular rewriting screen represents the same language as the structured regular expression obtained from the structured regular rewriting screen. For this button to do anything, both rewriters need to be fully finished.

Figure 7.4: A screenshot showing the result of a language equality test.

Future work

As mentioned before, our program does not implement all rewriters described in Chapter 6. It also has no options to suggest disambiguation rules, or to enforce these rules in the rest of the grammar. As this is the very reason we introduced this strategy, implementing these features in future work would be required for the program to become useful. Of the suggested improvements of our strategy, we implemented only a single rewriting option, which replaces α∗α∗ by α∗. Implementing a number of the improvements seems needed in order for our strategy to work on complete programming languages. It would therefore be interesting to try the program on examples to see how well the strategy works and where it breaks down. Given this information, one could then implement some of our suggested improvements and try these examples again. This would either show the limits of our strategy or improve it sufficiently to take on full programming languages.


Chapter 8

Conclusion

We have introduced a parser-independent framework for describing work done on disambiguation. We have classified methods that describe how to disambiguate grammars as disambiguation strategies. In these strategies there are some common patterns, in particular in how strategies decide which parse trees are the ‘right’ ones: we have classified these common patterns as disambiguation mechanisms.

Once we recognize a method as a strategy, we can rank it on the six quality criteria we have presented to determine the strengths and weaknesses of the strategy. By comparing disambiguation mechanisms on the two quality criteria we have presented, we can prove facts about all the strategies that use them, and if a new strategy uses a known disambiguation mechanism, this immediately gives us some information about how the strategy ranks on our quality criteria. We note that our quality criteria are designed to rank strategies and mechanisms that aim to disambiguate ‘grammar-level’ constructs, rather than lexical or semantic ambiguities: one will probably need different criteria for such ambiguities.

We have presented a list of disambiguation mechanisms that have been introduced in the literature. We have evaluated a number of relevant mechanisms on their relative and absolute applicability. We found that their applicabilities are incomparable: no combination of mechanisms is strictly more powerful than any other mechanism. We also investigated their absolute applicability by investigating whether they could disambiguate expression grammars with unary and binary operators. One of these results is that the disambiguation mechanisms used by Yacc and Bison work as advertised on these grammars.

We have also presented a list of disambiguation strategies that have been introduced in the literature. We have evaluated all these disambiguation strategies using our quality criteria.
We have used the results proven about disambiguation mechanisms in this ranking, showing that this concept is indeed useful when ranking strategies.

Finally, we have presented our own disambiguation strategy, based on rewriting rules. To this end, we introduced systems of regular equations, rewriters of such systems, structured regular expressions, systems of structured regular equations and rewriters of such systems. We described several rewriters and showed that they work on regular grammars and (partially) on expression grammars. Our strategy ranks very well on all but one of our quality criteria. We have suggested three options to improve the strategy on this final quality criterion as part of future work.

We have given a brief overview of a proof-of-concept implementation of our strategy. This implementation covers most, but not all, steps of our disambiguation strategy. It has helped us with testing our rewriters on large examples and could serve as a good basis for an implementation of a more advanced version of our strategy as part of future work.


Bibliography

[1] A. Afroozeh. Gtext. 2012.

[2] A. V. Aho and S. C. Johnson. LR parsing. ACM Comput. Surv., 6(2):99–124, June 1974.

[3] A. V. Aho, S. C. Johnson, and J. D. Ullman. Deterministic parsing of ambiguous grammars. Commun. ACM, 18(8):441–452, August 1975.

[4] A. V. Aho and J. D. Ullman. The theory of parsing, translation, and compiling, volume I. Prentice Hall, Upper Saddle River, NJ, USA, 1972.

[5] C. Allauzen, M. Mohri, and A. Rastogi. General Algorithms for Testing the Ambiguity of Finite Automata and the Double-Tape Ambiguity of Finite-State Transducers. Int. J. Found. Comput. Sci., pages 883–904, 2011.

[6] H. J. Basten and J. J. Vinju. Parse Forest Diagnostics with Dr. Ambiguity. In A. Sloane and U. Aßmann, editors, Software Language Engineering, volume 6940 of Lecture Notes in Computer Science, pages 283–302. Springer Berlin Heidelberg, 2012.

[7] C. Bastien, J. Czyzowicz, W. Fraczak, and W. Rytter. Prime normal form and equivalence of simple grammars. In Proceedings of the 10th international conference on Implementation and Application of Automata, CIAA’05, pages 78–89, Berlin, Heidelberg, 2006. Springer- Verlag.

[8] M. G. J. van den Brand, J. Scheerder, J. J. Vinju, and E. Visser. Disambiguation Filters for Scannerless Generalized LR Parsers. In R. N. Horspool, editor, Compiler Construction, volume 2304 of Lecture Notes in Computer Science, pages 143–158. Springer Berlin Heidelberg, 2002.

[9] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001.

[10] E. W. Dijkstra. Algol 60 translation: An Algol 60 translator for the X1 and Making a translator for Algol 60. Technical Report 35, Mathematisch Centrum, Amsterdam, 1961.

[11] J. Earley. An efficient context-free parsing algorithm. Commun. ACM, 13(2):94–102, February 1970.

[12] J. Earley. Ambiguity and precedence in syntax description. Acta Informatica, 4(2):183–192, 1975.

[13] W. R. LaLonde and J. des Rivieres. Handling operator precedence in arithmetic expressions with tree transformations. ACM Trans. Program. Lang. Syst., 3(1):83–103, January 1981.

[14] Microsoft Transact-SQL (SQL Server 2012). Operator precedence table. http://msdn.microsoft.com/en-us/library/ms190276.aspx. Accessed: 2013-06-15.

[15] MySQL 5.6. Operator precedence table. http://dev.mysql.com/doc/refman/5.6/en/operator-precedence.html. Accessed: 2013-06-15.

[16] A. Nijholt. The Equivalence Problem for LL- and LR-Regular Grammars. In Proceedings of the 1981 International FCT-Conference on Fundamentals of Computation Theory, FCT ’81, pages 291–300, London, UK, 1981. Springer-Verlag.

[17] T. Olshansky and A. Pnueli. A direct algorithm for checking equivalence of LL(k) gram- mars. Theoretical Computer Science, 4(3):321–349, 1977.

[18] M. Oyamaguchi, N. Honda, and Y. Inagaki. The equivalence problem for real-time strict deterministic languages. Information and Control, 45(1):90–115, 1980.

[19] Perl 5 version 16.3. Operator precedence table. http://perldoc.perl.org/perlop.html#Operator-Precedence-and-Associativity. Accessed: 2013-06-15.

[20] Python 2.7.5. Operator precedence table. http://docs.python.org/2.7/reference/expressions.html#operator-precedence. Accessed: 2013-06-15.

[21] Python 3.3.2. Operator precedence table. http://docs.python.org/3.3/reference/expressions.html#operator-precedence. Accessed: 2013-06-15.

[22] Ruby 2.0. Operator precedence table. http://www.ruby-doc.org/core-2.0/doc/syntax/precedence_rdoc.html. Accessed: 2013-06-15.

[23] D. J. Salomon and G. V. Cormack. Scannerless NSLR(1) parsing of programming languages. SIGPLAN Not., 24(7):170–178, June 1989.

[24] S. Schmitz. An experimental ambiguity detection tool. Science of Computer Programming, 2:71–84, 2010.

[25] E. Scott and A. Johnstone. GLL parsing. In 9th Workshop on Language Descriptions Tools and Applications (LDTA), volume 253 of Electronic Notes in Theoretical Computer Science, pages 177–189. Elsevier, September 2010.

[26] E. Scott and A. Johnstone. GLL parse-tree generation. Science of Computer Programming, April 2012.

[27] S. Sippu and E. Soisalon-Soininen. Parsing theory. Vol. 1: languages and parsing. Springer- Verlag New York, Inc., New York, NY, USA, 1988.

[28] E. Soisalon-Soininen and J. Tarhio. Looping LR parsers. Information processing letters, 26(5):251–253, 1988.

[29] C. Stirling. Deciding DPDA Equivalence Is Primitive Recursive. In P. Widmayer, S. Eiden- benz, F. Triguero, R. Morales, R. Conejo, and M. Hennessy, editors, Automata, Languages and Programming, volume 2380 of Lecture Notes in Computer Science, pages 821–832. Springer Berlin Heidelberg, 2002.

[30] M. Thorup. Controlled grammatic ambiguity. ACM Trans. Program. Lang. Syst., 16(3):1024–1050, May 1994.

[31] M. Thorup. Disambiguating grammars by exclusion of sub-parse trees. Acta Informatica, 33(6):511–522, 1996.

[32] E. Ukkonen. The equivalence problem for some non-real-time deterministic pushdown automata. J. ACM, 29(4):1166–1181, October 1982.

[33] E. Visser. Syntax definition for language prototyping. PhD thesis, University of Amsterdam, 1997.

[34] R. M. Wharton. Resolution of ambiguity in parsing. Acta Informatica, 6(4):387–395, 1976.
