One Parser to Rule Them All

Ali Afroozeh and Anastasia Izmaylova
Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
{ali.afroozeh, anastasia.izmaylova}@cwi.nl

Abstract

Despite the long history of research in parsing, constructing parsers for real programming languages remains a difficult and painful task. In the last decades, different parser generators emerged to allow the construction of parsers from a BNF-like specification. However, still today, many parsers are handwritten, or are only partly generated, and include various hacks to deal with different peculiarities in programming languages. The main problem is that current declarative syntax definition techniques are based on pure context-free grammars, while many constructs found in programming languages require context information.

In this paper we propose a parsing framework that embraces context information in its core. Our framework is based on data-dependent grammars, which extend context-free grammars with arbitrary computation, variable binding and constraints. We present an implementation of our framework on top of the Generalized LL (GLL) parsing algorithm, and show how common idioms in the syntax of programming languages such as (1) lexical disambiguation filters, (2) operator precedence, (3) indentation-sensitive rules, and (4) conditional preprocessor directives can be mapped to data-dependent grammars. We demonstrate the initial experience with our framework by parsing more than 20 000 Java, C#, Haskell, and OCaml source files.

Categories and Subject Descriptors D.3.1 [Programming Languages]: Formal Definitions and Theory—Syntax; D.3.4 [Programming Languages]: Processors—Parsing
Keywords Parsing, data-dependent grammars, GLL, disambiguation, operator precedence, offside rule, preprocessor directives, scannerless parsing, context-aware scanning

Copyright © ACM, 2015. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Onward!'15, October 25-30, 2015, Pittsburgh, PA, USA. http://dx.doi.org/10.1145/2814228.2814242

1. Introduction

Parsing is a well-researched topic in computer science, and it is common to hear from fellow researchers in the field of programming languages that parsing is a solved problem. This statement mostly originates from the success of Yacc [18] and its underlying theory, developed in the 70s. Since Knuth's seminal paper on LR parsing [25], and DeRemer's work on practical LR parsing (LALR) [6], there is a linear parsing technique that covers most syntactic constructs in programming languages. Yacc, and its various ports to other languages, enabled the generation of efficient parsers from a BNF specification. Still, research papers and tools on parsing in the last four decades show an ongoing effort to develop new parsing techniques.

A central goal in research on parsing has been to enable language engineers (i.e., language designers and tool builders) to declaratively build a parser for a real programming language from an (E)BNF-like specification. Nevertheless, still today, many parsers are hand-written or are only partially generated, and include many hacks to deal with peculiarities in programming languages. The reason is that grammars of programming languages in their simple and readable form are often not deterministic and also often ambiguous. Moreover, many constructs found in programming languages are not context-free, e.g., indentation rules in Haskell. Parser generators based on pure context-free grammars cannot natively deal with such constructs, and require ad-hoc extensions or hacks in the lexer. Therefore, additional means outside the power of context-free grammars are necessary to address these issues.

General parsing algorithms [7, 33, 35] support all context-free grammars, so the language engineer is not limited to a specific deterministic class, and there are known declarative disambiguation constructs to address the problem of ambiguity in general parsing [12, 36, 38]. However, implementing disambiguation constructs is notoriously hard and requires thorough knowledge of the underlying parsing technology. This means that it is costly to declaratively build a parser for a given programming language in the wild if the required disambiguation constructs are not already supported. Perhaps surprisingly, examples of such languages are not only legacy languages, but also modern languages such as Haskell, Python, OCaml and C#.

In this paper we propose a parsing framework that is able to deal with many challenges in parsing existing and new programming languages. We embrace the need for context information at runtime, and base our framework on data-dependent grammars [16]. Data-dependent grammars are an extension of context-free grammars that allow arbitrary computation, variable binding and constraints. These features allow us to simulate hand-written parsers and to implement disambiguation constructs.

To demonstrate the concept of data-dependent grammars we use the IMAP protocol [29]. In network protocol messages it is common to send the length of data before the actual data. In IMAP, these messages are called literals, and are described by the following (simplified) context-free rule:

L8 ::= '~{' Number '}' Octets

Here Octets recognizes a list of octet (any 8-bit) values. An example of L8 is ~{6}aaaaaa. As can be seen, there is no data dependency in this context-free grammar, but the IMAP specification says that the number of Octets is determined by the value parsed by Number. Using data-dependent grammars, we can specify such a data dependency as:

L8 ::= '~{' nm:Number {n=toInt(nm.yield)} '}' Octets(n)
Octets(n) ::= [n > 0] Octets(n - 1) Octet
            | [n == 0] ε

In the data-dependent version, nm provides access to the value parsed by Number. We retrieve the substring of the input parsed by Number via nm.yield, which is converted to an integer using toInt. This integer value is bound to variable n, and is passed to Octets. Octets takes an argument that specifies the number of iterations. Conditions [n > 0] and [n == 0] specify which alternative is selected at each iteration.
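For comparison, the following is a minimal sketch of how a hand-written parser typically enforces the same length dependency imperatively (illustrative Java code with hypothetical helper names; it is not part of IMAP or of our framework):

// Minimal sketch: the value parsed for Number determines how many octets
// are consumed. Illustrative only.
class ImapLiteral {
    // Returns the index just past the literal starting at pos,
    // e.g., for "~{6}aaaaaa" it returns 10.
    static int parseLiteral(String input, int pos) {
        if (!input.startsWith("~{", pos))
            throw new RuntimeException("expected '~{' at " + pos);
        pos += 2;
        int n = 0;
        while (Character.isDigit(input.charAt(pos)))     // Number
            n = 10 * n + (input.charAt(pos++) - '0');
        if (input.charAt(pos) != '}')
            throw new RuntimeException("expected '}' at " + pos);
        return pos + 1 + n;                              // consume exactly n octets
    }
}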
It is possible to parse IMAP using a general parser, and then remove the derivations that violate data dependencies post-parse. However, such an approach would be slow. Without enforcing the dependency on the length of Octets during parsing, given the nondeterministic nature of general parsing, all possible lengths of Octets will be tried.

There are many common grammar and disambiguation idioms that can be desugared into data-dependent grammars. Examples of these idioms are operator precedence, longest match, and the offside rule. Expecting the language engineer to write low-level data-dependent grammars for such cases would be wasteful. Instead, we describe a number of such idioms, and provide high-level notation for them and their desugaring to data-dependent grammars. For example, using our high-level notation the indentation rules in Haskell can be expressed as follows:

Decls ::= align (offside Decl)*
        | ignore('{' Decl (';' Decl)* '}')

This definition clearly and concisely specifies that either all declarations in the list are aligned, and each Decl is offsided with regard to its first token (first alternative), or indentation is ignored inside curly braces (second alternative).

Our vision is a parsing framework that provides the right level of abstraction for both the language engineer, who designs new languages, and the tool builder, who needs a parsing technology as part of her toolset. From the language engineer's perspective, our parsing framework provides an out-of-the-box set of high-level constructs for the most common idioms in the syntax of programming languages. The language engineer can also always express her needs directly using data-dependent grammars. From the tool builder's perspective, our framework provides open, extensible means to define higher-level syntactic notation, without requiring knowledge of the internal workings of a parsing technology.

The contributions of this paper are:

• We provide a unified perspective on many important challenges in parsing real programming languages.
• We present several high-level syntactic constructs, and their mappings to data-dependent grammars.
• We provide an implementation of data-dependent grammars on top of Generalized LL (GLL) parsing [33] that runs over ATN grammars [40]. The implementation is part of the Iguana parsing framework [2] (https://github.com/iguana-parser).
• We demonstrate the initial results of our parsing framework, by parsing 20 363 real source files of Java, C#, Haskell (91% success rate), and excerpts from OCaml.

The rest of this paper is organized as follows. In Section 2 we describe the landscape of parsing programming languages. In Section 3 we present data-dependent grammars, our high-level syntactic notation, and the mapping to data-dependent grammars. Section 4 discusses the extension of GLL parsing with data dependency. In Section 5 we demonstrate the initial results of our parsing framework using grammars of real programming languages. We discuss related work in Section 6. A conclusion and discussion of future work is given in Section 7.

2. The Landscape of Parsing Programming Languages

In this section we discuss well-known features of programming languages that make them hard to parse. These features motivate our design decisions.

2.1 General Parsing for Programming Languages

Grammars of programming languages in their natural form are not deterministic, and are often ambiguous. A well-known example is the if-then-else construct found in many programming languages. This construct, when written in its natural form, is ambiguous (the dangling-else ambiguity), and therefore cannot be deterministic. Some nondeterministic (and ambiguous) syntactic constructs, such as if-then-else, can be rewritten to be deterministic and unambiguous.

However, such grammar rewriting is in general not trivial, the resulting grammar is hard to read, maintain and evolve, and the resulting parse trees are different from the original ones the grammar writer had in mind.

Instead of rewriting a grammar, it is common to use an ambiguous grammar, and rely on some implicit behavior of a parsing technology for disambiguation. For example, the dangling-else ambiguity is often resolved using a longest match scheme provided by the underlying parsing technology. Relying on implicit behavior of a parsing technology to achieve determinism can make it quite difficult to reason about the accepted language. Seemingly correct sentences may be rejected by the parser because, at a nondeterministic point, a wrong path was chosen. For example, Yacc is an LALR parser generator, but can accept any context-free grammar by automatically resolving all shift/reduce and reduce/reduce conflicts. Using Yacc, the language engineer should manually check the resolved conflicts in case of unexpected behavior.

A common theme in research on parsing has been to increase the recognition power of deterministic parsing techniques such as LL(k) or LR(k). One of the widely used general parsing techniques for programming languages is the Generalized LR (GLR) algorithm [35]. GLR parsers support all context-free grammars and can produce a parse forest containing all derivation trees, in the form of a Shared Packed Parse Forest (SPPF), in cubic time and space [34]. Note that the cubic bound is for worst-case, highly ambiguous grammars. As GLR is a generalization of LR, a GLR parser runs linearly on LR parts of the grammar, and as the grammars of real programming languages are in most parts near-deterministic, one can expect near-linear performance using GLR for parsing programming languages. GLR parsing has successfully been used in source code analysis and in developing domain-specific languages [12, 22].

General parsing enables the language engineer to use the most natural version of a grammar, but leaves open the problem of ambiguity. In declarative syntax definition [12, 23], it is common to use declarative disambiguation constructs, e.g., for operator precedence or longest match. As a general parser is able to return all ambiguities in the form of a parse forest, it is possible to apply the disambiguation rules post-parse, removing the undesired derivations from the parse forest. However, such post-parse disambiguation is not practical in cases where the grammar is highly ambiguous. For example, parsing expression grammars without applying operator precedence during parsing is only feasible for small inputs. Therefore, it is necessary to resolve ambiguity while parsing to achieve near-linear performance.

Implementing disambiguation mechanisms that are executed during parsing is difficult. This is because the implementation of such disambiguation mechanisms requires knowledge of the internal workings of a parsing technology. Therefore, the choice of the general parsing technology becomes very important when considering parsing programming languages. For example, GLR parsers operate on LR automata, and have a rather complicated execution model, as a parsing state corresponds to multiple grammar positions.

The Generalized LL (GLL) parsing algorithm [33] is a new generalization of recursive-descent parsing that supports all context-free grammars, including left-recursive ones. GLL parsers produce a parse forest in cubic time and space in the worst case, and are linear on LL parts of the grammar. GLL parsers are attractive because they have the close relationship with the grammar that recursive-descent parsers have. From the end user's perspective, GLL parsers can produce better error messages, and can be debugged in a programming language IDE.

To deal with left-recursive rules and to keep the cubic bound, a GLL parser uses a graph-structured stack (GSS) to handle multiple call stacks. While the execution model of a GLL parser is close to recursive-descent parsing, the underlying machinery is much more complicated, and an in-depth knowledge of GLL is still required to implement disambiguation constructs. In this paper, we propose a parser-independent framework for parsing programming languages based on data-dependent grammars. We use GLL parsing as the basis for our data-dependent parsing framework, as it allows an intuitive way to implement components of data-dependent grammars, such as environment threading, and enables an implementation that is very close to the stack-evaluation based semantics of data-dependent grammars [16].

2.2 On the Interaction between Lexer and Parser

Conventional parsing techniques use a separate lexing phase before parsing to transform a stream of characters into a stream of tokens. In particular, whitespace and comments are discarded by the lexer to reduce the amount of lookahead in the parsing phase, and to enable deterministic parsing.

The main problem with a separate lexing phase is that, without having access to the parsing context, i.e., the applicable grammar rules, the lexer cannot unambiguously determine the type of some tokens. An example is >>, which can either be parsed as a right shift operator, or as two closing angle brackets of a generic type, e.g., List<List<T>> in Java. Some handwritten parsers deal with this issue by rewriting the token stream. For example, when the javac parser reads a >> token and is in a parsing state that expects only one >, e.g., when matching the closing angle bracket of a generic type, it only consumes the first > and puts the second one back to prevent a parse error when matching the next angle bracket.
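Such token-stream rewriting can be sketched as follows (a simplified, self-contained Java illustration with hypothetical names; javac's actual parser differs in detail):

// Sketch of splitting '>>' into '>' '>' when the parser expects a single '>'.
class AngleBrackets {
    enum Kind { GT, GTGT, OTHER }

    Kind lookahead = Kind.GTGT;   // e.g., the lexer produced '>>' in List<List<T>>
    boolean pendingGT = false;    // a '>' split off from a previous '>>'

    void expectGT() {
        if (pendingGT) { pendingGT = false; return; }  // use the pushed-back '>'
        if (lookahead == Kind.GT) { advance(); return; }
        if (lookahead == Kind.GTGT) {                  // split '>>' into '>' '>'
            pendingGT = true;
            advance();
            return;
        }
        throw new RuntimeException("'>' expected");
    }

    void advance() { lookahead = Kind.OTHER; }         // stub for the real lexer
}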

To resolve the problems of a separate lexing phase, we need to expose the parsing context to the lexer. To achieve this, the separate lexing phase is abandoned, and lexing is effectively integrated into the parsing phase. We call this model single-phase parsing. There are two options to achieve single-phase parsing. The first option is called scannerless parsing [32, 38], where lexical definitions are treated as context-free rules. In scannerless parsing, grammars are defined down to the level of characters. The second option is context-aware scanning [37], where the parser calls the lexer on demand. At each parsing state, the lexer is called with the set of terminals expected at that state.

In almost all modern programming languages longest match (maximal munch) is applied, and keywords are excluded from being recognized as identifiers. These disambiguation rules are conventionally embedded in the lexer. In single-phase parsing—scannerless or context-aware—longest match and keyword exclusion have to be applied during parsing, by using lexical disambiguation filters such as follow restrictions [32, 36]. These disambiguation filters have parser-specific implementations [2, 38]. In Section 3 we show how these filters can be mapped to data-dependent grammars. Also note that although a context-aware scanner employs longest match, for example by implementing the Kleene star (*) as a greedy operator, in some cases we still need explicit disambiguation filters, see Section 3.2.

2.3 Operator Precedence

Expressions are an integral part of virtually every programming language. In reference manuals of programming languages it is common to specify the semantics of expressions using the priority and associativity of operators. However, the implementation of expression grammars can considerably deviate from such a precedence specification.

It is possible to encode operator precedence by rewriting the grammar: a new nonterminal is created for each precedence level. The rewriting is not trivial for real programming languages, and the resulting grammar becomes large. This rewriting is particularly problematic in parsing techniques that do not support left recursion. The left-recursion removal transformation disfigures the grammar and adds extra complexity in transforming the resulting trees to the intended ones. Figure 1 shows three versions of the same expression grammar.

E ::= '-' E        E ::= E '+' T        E ::= T E1
    | E '*' E          | T              E1 ::= '+' T E1 | ε
    | E '+' E      T ::= T '*' F        T ::= F T1
    | 'a'              | F              T1 ::= '*' F T1 | ε
                   F ::= '-' F          F ::= '-' F
                       | 'a'                | 'a'

Figure 1. Three grammars that accept the same language: the natural, ambiguous grammar (left), the grammar with precedence encoding (middle), and the grammar after left-recursion removal (right).

In the 70s, Aho et al. [4] presented a technique in which a parser is constructed from an ambiguous expression grammar accompanied by a set of precedence rules. This work can be seen as the starting point for declarative disambiguation using operator precedence rules. Aho et al.'s approach is implemented by modifying LALR parse tables to resolve shift/reduce conflicts based on the operator precedence. However, the semantics of operator precedence in this approach is bound to the internal workings of LR parsing. There have been other solutions to build parsers from declarative operator precedence, which we discuss in Section 6. In Section 3.4 we provide a mapping from operator precedence rules to data-dependent grammars.

2.4 Offside Rule

In most programming languages, indentation of code blocks does not play a role in the syntactic structure. Rather, explicit delimiters, such as begin and end or { and }, are used to specify blocks of statements. Landin introduced the offside rule [26], which serves as a basis for indentation-sensitive languages. The offside rule says that all the tokens of an expression should be indented to the right of the first token. Haskell and Python are two examples of popular programming languages that use a variation of the offside rule.

Figure 2 shows two examples of the offside rule in Haskell. The keywords do, let, of, and where signal the start of a block where the starting tokens of the statements should be aligned, and each statement should be offsided with regard to its first token. In Figure 2 (left), case has two alternatives which are aligned, and the second alternative, which spans several lines, is offsided with regard to its first token, i.e., _. Figure 2 (right) shows two examples that look the same, but the indentation of the last part, + 4, is different. In the top declaration + 4 belongs to the last alternative, but in the bottom declaration + 4 belongs to the expression on the right-hand side of =.

Figure 2. Examples of indentation rules in Haskell (listing not reproduced: the left example is a case expression containing a do block and a where clause; the right shows two otherwise identical declarations that differ only in the indentation of a trailing + 4).

Indentation sensitivity in programming languages cannot be expressed by pure context-free grammars, and has often been implemented by hacks in the lexer. For example, in Haskell and Python, indentation is dealt with in the lexing phase, and the context-free part is written as if no indentation sensitivity exists. Both GHC and CPython, the popular implementations of Haskell and Python, use LALR parser generators. In Python, the lexer maintains a stack and emits INDENT and DEDENT tokens when indentation changes. In Haskell, the lexer translates indentation information into curly braces and semicolons based on the rules specified by the L function [28].
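As an illustration of such lexer-level handling, the following minimal Java sketch mimics the stack-based scheme described above for Python's lexer (it is not CPython's actual code):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of indentation handling in the lexer: a stack of indentation
// levels, with INDENT/DEDENT tokens emitted when the level changes.
class IndentLexer {
    List<String> tokens = new ArrayList<>();
    Deque<Integer> indents = new ArrayDeque<>();

    IndentLexer() { indents.push(0); }

    // Called at the start of each logical line with its indentation width.
    void handleLineStart(int col) {
        if (col > indents.peek()) {
            indents.push(col);
            tokens.add("INDENT");
        } else {
            while (col < indents.peek()) {   // close blocks until levels match
                indents.pop();
                tokens.add("DEDENT");
            }
        }
    }
}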
In Section 3.5 we show how data-dependent grammars can be used for single-phase parsing of indentation-sensitive programming languages in a declarative way. As data-dependent grammars are rather low-level for such solutions, we introduce three high-level constructs, align, offside, and ignore, which are desugared to data-dependent grammars.

Figure 3. Problematic cases of using C# directives [30] (listing not reproduced: on the left, a test method whose closing bracket appears inside a conditional directive; on the right, an #if X directive whose if-branch contains an unclosed multi-line comment).

Figure 4. C# multi-line string containing directives [30] (listing not reproduced: a Main method printing a multi-line @"..." string that contains #if Debug, #else and #endif lines).

2.5 Conditional Directives

Many programming languages allow pragmas that specify how the compiler (or the interpreter) processes parts of the input. The C family of programming languages, i.e., C, C++ and C#, allows preprocessor directives such as #if and #define. GHC also allows various compiler pragmas [28, §12.3]. For example, it is possible to enable C preprocessor directives in Haskell using {-# LANGUAGE CPP #-}.

Preprocessor directives pose considerable difficulty in parsing programming languages. The main reason is that they are not part of the grammar of a language, but can appear anywhere in the source code. In this regard, preprocessor directives are similar to whitespace and comments. However, conditional directives may affect the syntactic structure of a program, and cannot simply be ignored as a special kind of whitespace. This is especially important if we consider single-phase parsing, where no lexing/preprocessing phase is available. We need a mechanism that allows the parser to switch between the preprocessor mode and the main grammar, and to evaluate conditional directives to select the right branch.

Figure 3 (left) shows a C# example where ignoring the conditional directive will lead to a parse error, as the closing bracket of the test method is in the conditional directive, and one of the branches should be included in the input. The example in Figure 3 (right) shows another aspect of directives in C#. If X is true, "/* #else /**/ class Q {}" will be considered part of the source code. If X is false, only the else-part will be considered: "/**/ class Q {}". Note that when X is false, the if-part does not have to be syntactically correct, in this case an unclosed multi-line comment.

Among the family of C languages we selected C#, as parsing C# is more manageable. The problematic part of parsing C with directives is textual macros. Without a preprocessor to expand macros before parsing, we need to deal with macros at runtime. Parsing C without a preprocessor is future work. In C#, #define does not define a macro; it only sets a boolean variable. It should also be noted that C# supports multi-line strings, in which directives should not be processed. Figure 4 shows a C# example that uses conditional directives in a multi-line string. In single-phase parsing, however, multi-line strings are not a problem, as we effectively parse each terminal in the context where it appears.

2.6 Miscellaneous Features

There are many other peculiarities in programming languages and data formats that cannot be expressed by context-free grammars. There has been considerable effort to build declarative parsers for data formats, e.g., PADS [9], and one of the main motivations for data-dependent grammars [16] is indeed to enable parsing data formats.

Examples of languages that require data-dependent parsing are data protocols, such as IMAP and HTTP, and tag-based languages such as XML. In programming languages, data-dependent grammars can be used to implement language-specific disambiguation mechanisms, for example, to maintain a table of type definitions in C for resolving the infamous typedef ambiguity, e.g., in x * y, which can be interpreted either as declaring a variable of pointer type x or as a multiplication, depending on the type of x. We give an example of parsing XML and resolving the typedef ambiguity in C in Section 3.7.

3. Parsing Programming Languages with Data-dependent Grammars

In this section we describe data-dependent grammars [16], discuss our single-phase parsing strategy, and demonstrate how various high-level, declarative syntax definition constructs can be desugared into data-dependent grammars.

3.1 Data-dependent Grammars

Data-dependent grammars are defined as an extension of context-free grammars (CFGs), where a CFG is, as usual, a tuple (N, T, P, S) where

• N is a finite set of nonterminals;
• T is a finite set of terminals;
• P is a finite set of rules. A rule (production) is written as A ::= α, where A (the head) is a nonterminal, and α (the body) is a string in (N ∪ T)*;
• S ∈ N is the start symbol of the grammar.

We use A, B, C to range over nonterminals, and a, b, c to range over terminals. We use α, β, γ for possibly empty sequences of terminals and nonterminals, and ε represents the empty sequence. It is common to group rules with the same head and write them as A ::= α1 | α2 | ... | αn. In this representation, each αi is an alternative of A.

Data-dependent grammars introduce parametrized nonterminals, arbitrary computation via an expression language, constraints, and variable binding. Here, we assume that the expression language e is a simple functional programming language with immutable values and no side effects. In a data-dependent grammar a rule is of the form A(p) ::= α, where p is a formal parameter of A. Here, for simplicity of presentation and without loss of generality, we assume that a nonterminal can have at most one parameter. The body of a rule, α, can now contain the following additional symbols:

• x = l:A(e) is a labeled call to A with argument e, label l, and variable x bound to the value returned by A(e);
• l:a is a labeled terminal a with label l;
• [e] is a constraint;
• {x = e} is a variable binding;
• {e} is a return expression (only as the last symbol in α);
• e ? α : β is a conditional selection.

The symbols above are presented in their general forms. For example, labels, variables to hold return values, and return expressions are optional.

Our data-dependent grammars are very similar to the ones introduced in [16], with four additions. First, terminals and nonterminals can be labeled, and labels refer to properties associated with the result of parsing a terminal or nonterminal. These properties are the start input index (or left extent), the end input index (or right extent), and the parsed substring. Properties can be accessed using dot notation, e.g., for a labeled nonterminal b:B, b.l gives the left extent, b.r the right extent, and b.yield the substring.

Second, nonterminals can return arbitrary values (return expressions), which can be bound to variables. In several cases, we found this feature very practical, as we could express data dependency without changing the shape of the original specification grammar: specifically, cases where a global table needs to be maintained along a parse (C# conditional directives, discussed in Section 3.6, and C typedef declarations, in Section 3.7), or where semantic information needs to be propagated upwards from a complicated syntactic structure (Declarator of the C grammar in Section 3.7). In some cases a data-dependent grammar that uses return values can be rewritten to one without return values. However, whether return values in general enlarge the class of languages expressible with the original data-dependent grammars is an open question for future work.

Third, we support regular expression operators (EBNF constructs) *, +, and ?, by desugaring them to data-dependent rules as follows: A* ::= A+ | ε; A+ ::= A+ A | A; and A? ::= A | ε. In the data-dependent setting, this translation must also account for variable binding. For example, if the symbol ([e] A)* appears in a rule, and x is a free variable in e, captured from the scope of the rule, our translation lifts this variable, introducing a parameter x to the new nonterminal. In addition, EBNF constructs introduce new scopes: variables declared inside an EBNF construct, e.g., (l:A [e])*, are not visible outside, e.g., in the rule that uses it.

Finally, we also introduce a conditional selection symbol, e ? α : β, which selects α if e evaluates to true, and β otherwise, i.e., it introduces deterministic choice. Similar to EBNF constructs, we implement conditional selection by desugaring it into a data-dependent grammar. For example, A ::= α e ? X : Y β is translated to A ::= α C(e) β, where C(b) ::= [b] X | [!b] Y. We illustrate the use of conditional selection when discussing C# directives in Section 3.6.

3.2 Single-phase Parsing Strategy

We implement our data-dependent grammars on top of the generalized LL (GLL) parsing algorithm [33]. As general parsers can deal with any context-free grammar, lexical definitions can be specified down to the level of characters. For example, comment in the C# specification [30] is defined as:

Comment ::= SingleLineComment | DelimitedComment
SingleLineComment ::= "//" InputCharacter*
InputCharacter ::= ![\r \n]
DelimitedComment ::= "/*" DelimitedCommentSect* [*]+ "/"
DelimitedCommentSect ::= "/" | [*]* NotSlashOrAsterisk
NotSlashOrAsterisk ::= ![/ *]

Such character-level grammars, however, lead to very large parse forests. These parse forests reflect the full structure of lexical definitions, which is not needed in most cases. We therefore provide the option to use an on-demand context-aware scanner, where terminals are defined using regular expressions. For example, Comment in C# can be compiled to a regular expression. In cases where the structure is needed, or it is not possible to use a regular expression, e.g., recursive definitions of nested comments, the user can use character-level grammars.

Our support for context-aware scanning borrows many ideas from the original work by Van Wyk and Schwerdfeger [37], but because of the top-down nature of GLL parsing, there are some differences. The original context-aware scanning approach [37] is based on LR parsing, and as each LR state corresponds to multiple grammar rules, there may be several terminals that are valid at a state. The set of valid terminals in a parsing state is called the valid lookahead set [37]. In GLL parsing, in contrast, the parser is at a single grammar position at each point in time, i.e., either before a nonterminal or before a terminal in a single grammar rule. Therefore, in GLL parsing, the valid lookahead set of a terminal grammar position contains only one element, which allows us to directly call the matcher of the regular expression of that terminal.
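In Java, such on-demand matching of a single terminal at the current input position can be sketched as follows (an illustrative helper, not Iguana's actual API):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of on-demand, context-aware matching: at a terminal grammar position,
// only that terminal's regular expression is tried at the current position.
class OnDemandScanner {
    // Returns the end index of the match, or -1 if the terminal does not match
    // at position pos.
    static int match(Pattern terminal, String input, int pos) {
        Matcher m = terminal.matcher(input);
        m.region(pos, input.length());
        return m.lookingAt() ? m.end() : -1;
    }
}

// Example: matching a C# single-line comment terminal at position pos.
// Pattern comment = Pattern.compile("//[^\\r\\n]*");
// int end = OnDemandScanner.match(comment, input, pos);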

We use our simple context-aware scanning model for better performance, see Section 5.1. The implementation of the context-aware scanner in [37] is more sophisticated: the scanner is composed of all terminal definitions, as a composite DFA. This enables a longest match scheme across terminals in the same context, for example in programming languages where one terminal is a prefix of another, e.g., 'fun' and 'function' in OCaml. To enforce longest match across terminals we use follow/precede restrictions, in this case a follow restriction on 'fun' or a precede restriction on identifiers. Moreover, keyword reservation in [37] is done by giving priority to keywords at matching states of the composite DFA. In our model, keyword exclusion should be explicitly applied in the grammar rules using an exclude disambiguation filter. We explain follow/precede restrictions and keyword reservation in Section 3.3.

In single-phase parsing, layout (whitespace and comments) is treated the same way as other lexical definitions. Because layout is almost always needed in parsing programming languages, we support automatic layout insertion into the rules. There are two approaches to layout insertion: a layout nonterminal can be inserted exactly before or after each terminal node [20, 37], or layout can be inserted between the symbols in a rule, as in SDF [12]. We use SDF-style layout insertion: if X ::= x1 x2 ... xn is a rule, and L is a nonterminal defining layout, after layout insertion the rule becomes X ::= x1 L x2 L ... L xn. A benefit of SDF-style layout insertion is that no symbol definition accidentally ends or starts with layout, provided that the layout is defined greedily (see Section 3.3). This is helpful when defining the offside rule (see Section 3.5).

3.3 Lexical Disambiguation Filters

Common lexical disambiguation filters [36], such as follow restrictions, precede restrictions and keyword exclusion, can be mapped to data-dependent grammars without further extensions to the parser generator or parsing algorithm. These disambiguation filters are common in scannerless parsing [32] and have been implemented for various generalized parsers [2, 38].

A follow restriction (!>>) specifies which characters cannot immediately appear after a symbol in a rule. This restriction is used to locally define longest match (as opposed to a global longest match in the lexer). For example, to enforce longest match on identifiers we write Id ::= [A-Za-z]+ !>> [A-Za-z]. A precede restriction (!<<) is similar to a follow restriction, but specifies the characters that cannot immediately precede a symbol in a rule. Precede restrictions can be used to implement longest match on keywords. For example, [A-Za-z] !<< Id disallows an identifier to start immediately after a letter. This prevents, for example, recognizing intx as the keyword 'int' followed by the identifier x. Finally, exclusion (\) is usually used to implement keyword reservation. For example, Id \ 'int' excludes the keyword int from being recognized as Id.

A ::= α B !>> c      A ::= α b:B [input.at(b.r) != c]
A ::= α c !<< B      A ::= α b:B [input.at(b.l-1) != c]
A ::= α B \ s        A ::= α b:B [input.sub(b.l,b.r) != s]

Figure 5. Mapping of lexical disambiguation filters.

Figure 5 shows the mapping from character-level disambiguation filters to data-dependent grammars. The mapping is straightforward: each restriction is translated into a condition that operates on the input. A note should be made regarding the condition implementing precede restrictions. This condition only depends on the left extent, b.l, which permits its application before parsing B. We exploit this optimization in the implementation of our parsing framework, permitting the application of such conditions before parsing labeled nonterminals or terminals.

The restrictions of Figure 5 are just examples and can be extended in many ways. For example, instead of defining a restriction using a single character, we can use regular expressions or character classes. One can also define similar restrictions for related disambiguation purposes. For example, consider the cast expression in C#:

cast-exp ::= '(' type ')' unary-exp

An expression such as (x)-y is ambiguous, and can be interpreted either as a type cast of -y to the type x, or as a subtraction of y from (x). The C# language specification states that this ambiguity is resolved during parsing based on the character that comes after the closing parenthesis: if the character following the closing parenthesis is '~', '!', '(', an identifier, a literal, or a keyword, the expression should be interpreted as a cast. We can implement this rule as follows:

cast-exp ::= '(' type ')' >>> [~!(A-Za-z0-9] unary-exp

The >>> notation specifies that the next character after the closing parenthesis should be an element of the specified character class. The implementation of >>> is similar to that of a follow restriction (!>>), with one additional aspect: it adds the condition on the automatically inserted layout nonterminal after ')' instead.

These examples show how more syntactic sugar can be added to the existing framework for various common lexical disambiguation tasks in programming languages without changes to the underlying parsing technology.

3.4 Operator Precedence and Associativity

Expression grammars in their natural form are often ambiguous. Consider the expression grammar in Figure 6 (left). For this grammar, the input string a+a*a is ambiguous, with two derivation trees that correspond to the following groupings: (a+(a*a)) and ((a+a)*a). Given that * normally has higher precedence than +, the first derivation tree is desirable. We use >, left, and right to define priority and left- and right-associativity, respectively [3]. Figure 6 (right) shows the disambiguated version of this grammar, obtained by specifying > and left, where - has the highest precedence, and * and + are left-associative.

E ::= '-' E                       E ::= '-' E
    | E '*' E                         > E '*' E   left
    | E '+' E                         > E '+' E   left
    | 'if' E 'then' E 'else' E        > 'if' E 'then' E 'else' E
    | 'a'                             | 'a'

Figure 6. An ambiguous expression grammar (left), and the same grammar disambiguated with > and left (right).

E ::= E '*' E  left         E(p) ::= [2 >= p] E(2) '*' E(3)   //2
    > E '+' E  left                | [1 >= p] E(0) '+' E(2)   //1
    | '(' E ')'                    | '(' E(0) ')'
    | 'a'                          | 'a'

Figure 7. An expression grammar with > and left (left), and its translation to data-dependent grammars (right).

Ambiguity in expression grammars is caused by derivations from the left- or right-recursive ends in a rule, i.e., E ::= αE and E ::= Eβ. We use >, left, and right to specify which derivations from the left- and right-recursive ends are not valid with respect to operator precedence. For example, E ::= '-' E > E '*' E specifies that E in the '-'-rule (the parent) should not derive the '*'-rule (the child). The > construct only restricts the right-recursive end of a parent rule when the child rule is left-recursive, and vice versa. For example, in Figure 6 (right) the right E in the '+'-rule is not restricted, because the 'if'-rule is not left-recursive. This is to avoid parse errors on inputs that are not ambiguous, e.g., a + if a then a else a. Note that 'if' E 'then' E 'else' in the 'if'-rule acts as a unary operator. In addition, the > operator is transitive over all the alternatives of an expression nonterminal. Finally, left and right only affect binary recursive rules and only at the left- and right-recursive ends.

Although > is defined as a relationship between a parent rule and a child rule, its application may need to reach arbitrarily deep in a derivation tree. For example, consider the input string a * if a then a else a + a for the grammar in Figure 6 (right). This sentence is ambiguous, with two derivation trees that correspond to the following groupings:

(a * (if a then a else a)) + a
a * (if a then a else (a + a))

The first grouping is not valid: in it, 'if' binds stronger than '+', but we defined '+' to have higher priority than 'if'. This example shows that restricting derivations only one level deep cannot disambiguate such cases. A correct implementation of > thus also restricts the derivation of the 'if'-rule from the right-recursive end of the '*'-rule if the '*'-rule is derived from the left-recursive end of the '+'-rule.

We now show how to implement an operator precedence disambiguation scheme using data-dependent grammars. We first demonstrate the basic translation scheme using binary operators only, and then discuss the translation of the example in Figure 6. Figure 7 (left) shows a simple example of an expression grammar that defines two left-associative binary operators * and +, where * has higher precedence than +. Figure 7 (right) shows the result of the translation into the data-dependent counterpart. The basic idea behind the translation is to assign a number, a precedence level, to each left- and/or right-recursive rule of nonterminal E, to parameterize E with a precedence level, and, based on the precedence level passed to E, to exclude alternatives that would lead to derivation trees that violate the operator precedence.

In Figure 7 (right) each left- and right-recursive rule in the grammar gets a precedence level (shown in comments), which is the reverse of the alternative number in the definition of E. The precedence counter starts from 1 and increments for each > encountered in the definition. The number 0 is reserved for the unrestricted use of E, illustrated by the round bracket rule. Nonterminal E gets a parameter p to pass the precedence level, and for each left- and right-recursive rule, a predicate is added at the beginning of the rule to exclude rules by comparing the precedence level of the rule with the precedence level passed to the parent E. Finally, for each use of E in a rule, an argument is passed.

In the '*'-rule, its precedence level (2) is passed to the left E, and its precedence level plus one (3) is passed to the right E. This excludes the rules of lower precedence from the left E, and excludes the rules of lower precedence and the '*'-rule itself from the right E. Excluding the '*'-rule itself allows only the left-associative derivations, e.g., (a*a)*a, as specified by left. In the '+'-rule, its precedence level plus one (2) is passed to the right E, excluding the '+'-rule. The value 0 is passed to the left E, permitting any rule. Note that passing 0 instead of 1 to the left E of the '+'-rule achieves the same effect but enables better sharing of calls to E, as the sharing of calls (using the GSS) is done based on the name of the nonterminal and the list of arguments. In the round bracket rule, 0 is passed to E, as the use of E is neither left- nor right-recursive and hence precedence does not apply.

E(l,r) ::= [4 >= l] '-' E(l,4)                               //4
         | [3 >= r, 3 >= l] E(3,3) '*' E(l,4)                //3
         | [2 >= r, 2 >= l] E(2,2) '+' E(l,3)                //2
         | [1 >= l] 'if' E(0,0) 'then' E(0,0) 'else' E(0,0)  //1
         | 'a'

Figure 8. Operator precedence with data-dependent grammars (binary and unary operators).

Now we discuss the translation of the example shown in Figure 6, which contains both binary and unary operators. For this we need to distinguish between the rules that should be excluded from the left and from the right E. This is achieved as follows. First, E gets two parameters, l and r (Figure 8), to distinguish between the precedence levels passed from the left and from the right, respectively. Second, a separate condition on l is added to a rule when the rule can be excluded from the right E (i.e., rules for binary operators and unary postfix operators), and a separate condition on r is added to a rule when the rule can be excluded from the left E (i.e., rules for binary operators and unary prefix operators). Third, l- and r-arguments are determined for the left and right E's as follows. An l-argument to the left E and an r-argument to the right E are determined as in the example of Figure 7; for example, E(3,_) '*' E(_,4), where 3 is the precedence level of the '*'-rule, and 4 is the precedence level plus one. Note that r=4 does not exclude the unary operators of E. An l-argument to the right E's is propagated from the parent E. This effectively excludes a unary prefix rule from the right E's when the parent E is the left E of a rule of higher precedence than the unary operator. Finally, given that there are no unary postfix operators, an r-argument to the left E's is not propagated from the parent E and can be the same as the respective l-argument.

We have also extended this approach to grammars that allow rules of the same precedence and/or associativity groups. For example, binary + and - operators have the same precedence, but are left-associative with respect to each other.

Our translation of operator precedence to data-dependent grammars resembles the precedence climbing technique [5, 31]. In contrast to precedence climbing, which requires a non-left-recursive grammar, our approach works in the presence of both left- and right-recursive rules.

3.5 Indentation-sensitive Constructs

In this section we show how the offside rule can be translated into data-dependent grammars. We use Haskell as the running example, but our approach is also applicable to other programming languages that implement the offside rule.

In Haskell, one can write a where clause consisting of a block of declarations, where the structure of the block is defined using either explicit delimiters or indentation (column numbers). For example, the structure of the following two blocks, one written with explicit delimiters, i.e., curly braces and semicolons (left), and the other written using indentation (right), is the same:

{ x = 1 * 2 + 3; y = x + 4 }        x = 1 * 2
                                      + 3
                                    y = x + 4

Figure 9 shows a simplified excerpt of the Haskell grammar, defined using our parsing framework:

Decls ::= align (offside Decl)*
        | ignore('{' Decl (';' Decl)* '}')
Decl ::= FunLHS RHS
RHS ::= '=' Exp 'where' Decls

Figure 9. Simplified version of Haskell's Decls.

The first alternative explicitly enforces indentation constraints on a declaration block. First, it requires that all declarations of a block are aligned (align) with respect to each other, i.e., each declaration starts with the same indentation. Second, it requires that the offside rule applies to each declaration, i.e., all non-whitespace tokens of a declaration are strictly indented to the right of its first non-whitespace token. In contrast, the second alternative of Decls enforces the use of curly braces and semicolons, and explicitly ignores (ignore) indentation constraints, even when imposed by an outer scope.

In our meta-notation, align only affects regular definitions (EBNF constructs) such as lists and sequences, offside affects nonterminals, and ignore applies to a sequence of symbols. The translation of these high-level constructs into data-dependent grammars is illustrated in Figures 10 and 11.

The basic idea of translating align is to use the start index of a declaration list, and to constrain the start index of each declaration in the list by an equality check on the indentation at the respective indices. Figure 10 shows the result after first desugaring align and then translating EBNF constructs (Section 3.1). Desugaring align alone results in:

Decls ::= a0:(offside a1:Decl [col(a1.l) == col(a0.l)])*

Labels a0 and a1 are introduced to refer to the start index of the declaration list, a0.l, and to the start index of each declaration in the list, a1.l, respectively, and the constraint checks whether the respective column numbers (given tabs of 8 characters) are equal. As in the case of precede restrictions in Section 3.3, this constraint only depends on the start indices and can be applied before parsing Decl. The EBNF translation introduces nonterminals for each EBNF construct, where Star1 and Plus1 also get a parameter v, as the use of a0 has to be lifted during the translation.

Decls ::= a0:Star1(a0.l)
        | ignore('{' Decl Star2 '}')
Decl ::= FunLHS RHS
RHS ::= '=' Exp 'where' Decls
Star1(v) ::= Plus1(v) | ε
Plus1(v) ::= offside a1:Decl [col(a1.l) == col(v)]
           | Plus1(v) offside a1:Decl [col(a1.l) == col(v)]
Star2 ::= Plus2 | ε
Plus2 ::= Plus2 Seq2 | Seq2
Seq2 ::= ';' Decl

Figure 10. Desugaring of align.

Figure 11 shows the result of desugaring offside and ignore from Figure 10. The basic idea is to pass down Decl's start index and to constrain the indentation of any non-whitespace terminal that can appear under the Decl node, except for the leftmost one, to be greater than the indentation at Decl's start index. Two parameters, i and fst, are introduced to Decl and to all nonterminals reachable from it. The first parameter is used to pass Decl's start index, calculated at the offside application site (a1.l), to any nonterminal reachable from Decl, and to constrain terminals reachable from Decl.

Decls(i,fst) ::= a0:Star1(a0.l,i,fst)
              | '{' Decl(-1,0) Star2 '}'
Decl(i,fst) ::= FunLHS(i,fst) RHS(i,0)
RHS(i,fst) ::= o0:'=' [f(i,fst,o0.l)] Exp(i,0)
               o1:'where' [f(i,0,o1.l)] Decls(i,0)
Star1(v,i,fst) ::= Plus1(v,i,fst) | ε
Plus1(v,i,fst) ::= Plus1(v,i,fst) a1:Decl(a1.l,1)
                     [col(a1.l) == col(v), f(i,0,a1.l)]
                 | a1:Decl(a1.l,1) [col(a1.l) == col(v), f(i,fst,a1.l)]
Star2 ::= Plus2 | ε
Plus2 ::= Plus2 Seq2 | Seq2
Seq2 ::= ';' Decl(-1,0)

f(i,fst,l) = i == -1 || fst == 1 || col(l) > col(i);

Figure 11. Desugaring of offside and ignore.

The second parameter, fst, which is either 0 or 1, is used to identify and skip the leftmost terminal, which should not be constrained. The value 1 is passed at the application site of offside and propagated down to the first nonterminal of each reachable rule if the rule starts with a nonterminal. The value 0 is passed to any other nonterminal of a reachable rule when the first symbol of the rule is not nullable. Our translation also accounts for nullable nonterminals (not shown here); in such cases the value of fst also depends on a dynamic check whether the left and right extents of the node corresponding to a nullable nonterminal are equal.

Finally, each terminal reachable from Decl gets a label (labels starting with o), to refer to its start index, and a constraint, encoded as a call to the boolean function f. Note that in the definition of f, the condition i == -1 corresponds to the case when Decl appears in a context where the offside rule does not apply or is ignored, and the condition fst == 1 corresponds to the case of the leftmost terminal.
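To make the constraint concrete, the column computation and the function f of Figure 11 can be written in Java roughly as follows (a minimal sketch that simply counts a tab as 8 columns; the helper names are ours):

// Sketch of col and f from Figure 11: col computes the column of an input
// index (tab counted as 8 columns, a simplification), and f encodes the
// offside check.
class Offside {
    static int col(String input, int index) {
        int c = 1;
        for (int i = index - 1; i >= 0 && input.charAt(i) != '\n'; i--)
            c += input.charAt(i) == '\t' ? 8 : 1;
        return c;
    }

    // i   : start index of the enclosing Decl, or -1 if offside does not apply
    // fst : 1 for the leftmost terminal of the Decl, 0 otherwise
    // l   : start index of the terminal being checked
    static boolean f(String input, int i, int fst, int l) {
        return i == -1 || fst == 1 || col(input, l) > col(input, i);
    }
}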

The offside, align and ignore constructs are examples of reasonably complex desugarings to data-dependent grammars. Their existence, and their aptness for describing the syntax of Haskell, is a witness to the power of data-dependent grammars and the parsing architecture we propose.

3.6 Conditional Directives

In this section we present our solution for parsing conditional directives in C#. As discussed in Section 2.5, most directives can be regarded as comments, but conditional directives have to be evaluated during parsing, as they may affect the syntactic structure of a program.

Conditional directives can appear anywhere in a program. Therefore, it is natural to define them as part of the layout nonterminal. Figure 12 shows the relevant parts of the layout definition (Layout) we used to parse C# (follow restrictions enforce longest match). In addition to whitespace characters (Whitespace) and comments (starting with '/*' or '//'), the layout consists of declaration directives (Decl) and conditional directives (If).

global defs = {}

Layout ::= (Whitespace | Comment | Decl | If | Gbg)*
           !>> [\ \t\n\r\f] !>> '/*' !>> '//' !>> '#'

Decl ::= '#' 'define' id:Id {defs=put(defs,id.yield,true)} PpNL
       | '#' 'undef' id:Id {defs=put(defs,id.yield,false)} PpNL

If ::= '#' 'if' v=Exp(defs) [v] ? Layout
                                : (Skipped (Elif|Else|PpEndif))
Elif ::= '#' 'elif' v=Exp(defs) [v] ? Layout
                                    : (Skipped (Elif|Else|PpEndif))
Else ::= '#' 'else' Layout
Gbg ::= GbgElif* GbgElse? '#' 'endif'
GbgElif ::= '#' 'elif' Skipped
GbgElse ::= '#' 'else' Skipped

Skipped ::= Part+
Part ::= PpCond | PpLine | ... // etc.
PpCond ::= PpIf PpElif* PpElse? PpEndif
PpIf ::= '#' 'if' PpExp PpNL Skipped?
PpElif ::= '#' 'elif' PpExp PpNL Skipped?
PpElse ::= '#' 'else' PpNL Skipped?
PpEndif ::= '#' 'endif' PpNL

Figure 12. The grammar of conditional directives in C#. (For readability, uses of Whitespace? before and after the terminal '#', and uses of Whitespace after the terminals 'define', 'undef', 'if' and 'elif', are omitted.)

According to the C# language specification, the scope of symbols introduced by the declaration directives #define and #undef is the file they appear in. Therefore, we need to maintain a global symbol table defs to declare (see Decl) and access (see If and Elif) symbol definitions while parsing. In C# one can define and undefine a symbol, but a value cannot be assigned to a symbol; thus, the symbol table needs to associate a boolean value with each symbol.

To enable global definitions, our parsing framework supports global variables, which can be declared using the global keyword, e.g., defs in Figure 12. In our parsing framework, a global variable is implemented by using parameters and return values to thread a value through a parse. In this case, each nonterminal that directly or indirectly accesses a global variable gets an extra parameter, and each nonterminal that can directly or indirectly update a global variable returns the new value of the variable if the variable is used after an occurrence of the nonterminal in a rule. Note that, assuming immutable values, such an implementation of global variables properly accounts for the nondeterministic nature of generalized parsing: updates to a variable made along one parse do not interfere with updates made along an alternative parse.

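The following Java sketch illustrates the essence of this threading scheme: the "global" is an immutable value that is passed in and returned, so alternative parses never observe each other's updates (illustrative only, not Iguana's API):

import java.util.HashMap;
import java.util.Map;

// Sketch of an immutable environment threaded through a nondeterministic
// parse: updates produce a new environment instead of mutating a shared one.
class Env {
    private final Map<String, Boolean> defs;

    Env(Map<String, Boolean> defs) { this.defs = defs; }

    Env define(String symbol, boolean value) {       // returns a new environment
        Map<String, Boolean> copy = new HashMap<>(defs);
        copy.put(symbol, value);
        return new Env(copy);
    }

    boolean lookup(String symbol) { return defs.getOrDefault(symbol, false); }
}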
while parsing, Exp uses data dependency and extends the PpExp rules, found in the C# specification, with return values and boolean computation. If the expression evaluates to true (note the use of conditional selection), the parser first continues consuming layout, including the nested directives, and then, after no layout can be consumed, the parser returns to the next symbol in the alternative.

If the expression evaluates to false, the parser consumes part of the input as a list of valid C# tokens (Skipped) until it finds the corresponding #elif-, #else- or #endif-part. Note that Skipped also consumes nested #if-directives (PpCond), if any, but in this case, conditions are not evaluated. The definition of Skipped also allows consuming invalid C# structure (valid only token-wise) when the condition is false, see Figure 3 (right). Finally, when all #if, #elif and #else directives are present, there will be dangling #elif, #else, and #endif parts remaining if one of the conditions evaluates to true. These dangling parts should also be consumed by the layout. The Gbg (garbage) nonterminal, defined as part of the layout, does exactly this.

  Element ::= STag Content ETag
  STag    ::= '<' Name Attribute* '>'
  ETag    ::= '</' Name '>'

  Element ::= s=STag Content ETag(s)
  STag    ::= '<' n:Name Attribute* '>' { n.yield }
  ETag(s) ::= '</' n:Name [n.yield == s] '>'

Figure 13. Context-free grammar of XML elements (top) and the data-dependent version (bottom).

  global defs = [{}]

  Declaration    ::= x=Specifiers Declarators(x)
  Specifiers     ::= x=Specifier y=Specifiers {x || y}
                   | x=Specifier {x}
  Specifier      ::= "typedef" {true} | ...
  Declarators(x) ::= s=Declarator {h=put(head(defs),s,x); defs=list(h,tail(defs))}
                     ("," Declarators(x))*
  Declarator     ::= id:Identifier {id.yield}
                   | x=Declarator "(" ParameterTypeList ")" {x}
                   | ...
  Expr           ::= Expr "-" Expr
                   | "(" n:TypeName [isType(defs,n.yield)] ")" Expr
                   | "(" Expr ")"
                   | ...
                   | n:Identifier [!isType(defs,n.yield)]

Figure 14. Resolving the typedef ambiguity in C.

3.7 Miscellaneous Features

In this section we discuss the use of data-dependent grammars for parsing XML and resolving the infamous typedef ambiguity in C. XML has a relatively straightforward syntax. Figure 13 (top) shows the context-free definition of Element in XML, where Content allows a list of nested elements. The problem with this definition is that it can recognize inputs with unbalanced start and end tags, for example:

  <from> Bob </to>
  <to> Alice </from>

Using data-dependent grammars, the solution to match start and end tags is very intuitive. Figure 13 (bottom) shows a data-dependent grammar for XML elements. As can be seen, inside a starting tag, STag, the result of parsing Name is bound to n, and the respective substring, n.yield, is returned from the rule. The returned value is assigned to s in the Element rule, and is passed to the end tag, ETag. Finally, in ETag, the name of the end tag is checked against the name of the starting tag. If the name of the end tag is not equal to the name of the starting tag, i.e., n.yield == s does not hold, the parsing path dies.

Now, we consider the problem of typedef ambiguity in C. For example, the expression (T)+1 can have two meanings, depending on the meaning of T in the context: a cast to type T with +1 being a subexpression, or an addition with two operands (T) and 1. If T is a type, declared using typedef, the first parse is valid, otherwise the second one.

To resolve the typedef ambiguity, type names should be distinguished from other identifiers, such as variables and function names, during parsing. In addition, the scoping rules of C should be taken into account. For example, consider the following C program:

  typedef int T;
  main() {
    int T = 0, n = (T)+1;
  }

In this example, T is first declared as a type alias to int and then redeclared as a variable of type int in the inner scope introduced by the main function.

Figure 14 shows a simplified excerpt of our data-dependent C grammar. The excerpt shows the declaration and expression parts of the C grammar. As can be seen, a C declaration consists of a list of specifiers followed by a list of declarators. Each declarator declares one identifier. The keyword typedef can appear in the list of specifiers, for example, along with the declared type. A declarator can be either a simple identifier or a more complicated syntactic structure, e.g., array and function declarators, nesting the identifier. It is important to note that an identifier should enter the current scope when its declarator is complete. The expression part of Figure 14 shows the cast expression rule (the second rule from the top), and the primary expression rule (the last one). Note that to resolve the typedef ambiguity, illustrated in our running example, an identifier should be accepted as an expression if it is not declared as a type name.

To distinguish between type names and other identifiers, we record names, encountered as part of declarators, and associate a boolean value with each name: true for type names and false otherwise.
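To make this bookkeeping concrete before the detailed walk-through in the next paragraphs, the following Java sketch models the scope-aware table that defs represents. It is only an illustration under our own names (TypeNameTable, declare, and so on); the actual grammar actions are written in the expression language of our framework, as shown in Figure 14.

import java.util.*;

// A minimal sketch of the scoped type-name table maintained by defs:
// a list of maps, one per open scope, with the innermost scope at the head.
final class TypeNameTable {
  private final Deque<Map<String, Boolean>> defs = new ArrayDeque<>();

  TypeNameTable() { defs.push(new HashMap<>()); }       // defs = [{}] at the start of parsing

  void enterScope() { defs.push(new HashMap<>()); }     // on "{"
  void exitScope()  { defs.pop(); }                     // on "}"

  // Record a declared name; isTypedef is the boolean computed from the specifiers.
  void declare(String name, boolean isTypedef) { defs.peek().put(name, isTypedef); }

  // isType: search the scopes from innermost to outermost; false if the name is unknown.
  boolean isType(String name) {
    for (Map<String, Boolean> scope : defs) {
      Boolean b = scope.get(name);
      if (b != null) return b;
    }
    return false;
  }
}

For the running example, declaring T as a typedef in the outer scope and then redeclaring it as a variable inside main makes isType("T") return false at the cast-expression constraint, as described next.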

To maintain this information during parsing, we introduce a global variable defs, holding a list of maps to properly account for scoping. At the beginning of parsing, defs is a list containing a single, empty map. At the beginning of a new scope, i.e., when "{" is encountered, an empty map is prepended to the current list, resulting in a new list which is assigned to defs (not shown in the figure). At the end of the current scope, i.e., when "}" is encountered, the head of the current list is dropped by taking the tail of the list and assigning it to defs.

To communicate the presence of typedef in a list of specifiers, we extend each rule of Specifier to return a boolean value: the "typedef" rule returns true, and the other rules return false. Specifiers computes the disjunction of the values associated with the specifiers in the resulting list. This information is passed via variable x to Declarators. We also extend the rules of Declarator to return the declared name, id.yield. After a declarator is parsed, the declared name can be stored in defs: the pair (s,x) is added to the map taken from the head of the current list, and a new list, with the resulting map as its head, is created and assigned to defs.

Finally, the isType function is used to check whether the current identifier is a type name in the current scope or not: isType iterates over the elements of defs, starting from the first element, to look up the given name. If the name is not found in the current map, isType continues the search with the next element, representing the outer scope. If the name is found, isType returns the boolean value associated with the name. If none of the maps contains the name, isType returns false.

In our running example, after parsing the second declaration of T, appearing in the scope of the main function, the pair ("T",false) will be added to the map in the head of defs, effectively shadowing the previous typedef declaration of T, and causing the condition in the cast expression rule to fail.

4. Implementation

In this section we present our extension of the GLL parsing algorithm [33] to support data-dependent grammars. GLL parsing is a generalization of recursive-descent parsing that supports all context-free grammars, and produces a binarized Shared Packed Parse Forest (SPPF) in cubic time and space. GLL uses a Graph-Structured Stack (GSS) [35] to handle multiple function calls in recursive-descent parsing. The problem of left recursion is solved by allowing cycles in the GSS. As GLL parsers are recursive-descent like, the handling of parameters and environments is intuitive, and the implementation remains very close to the stack-based semantics, which eases reasoning about the runtime behavior of the parser. More information on GLL parsing over ATNs, GSS, and SPPF is provided in Appendix A.

We use a variation of GLL parsing that uses a more efficient GSS [2]. GLL parsing can be seen as a grammar traversal process that is guided by the input. At each point during parsing, a GLL parser is at a grammar slot (a grammar position before or after a symbol in a rule) and executes the code corresponding to this slot. Because of the nondeterministic nature of general parsing, a GLL parser needs to record all possible paths and process them later, and at the same time eliminate duplicate jobs. The unit of work in GLL parsing is a descriptor, which captures a parsing state. Descriptors allow a serialized, discrete view of the tasks performed during parsing. GLL parsing has a main loop, in a trampolined style, that executes the descriptors one at a time until no more descriptors are left.

The standard way of implementing a GLL parser is to generate code for each grammar slot [33]. Such an implementation relies on dynamic gotos to allow arbitrary jumps to the main loop or other grammar slots. In our GLL implementation, a grammar is modeled as a connected graph of grammar slots. This model of context-free grammars resembles Woods' Recursive Augmented Transition Network (ATN) [40] grammars. As such, our implementation of GLL over ATN grammars (Appendix A) provides an interpreter version of GLL parsing.

4.1 ATN Grammars

ATN grammars are an automaton formalism developed in the 70s to parse natural languages, and are similar to nondeterministic finite automata. An ATN grammar is a tuple (Q, F, →) where

• Q is a finite set of states representing grammar slots;
• F ⊆ Q is a finite set of states representing final grammar slots; and
• → is a transition relation of the form →A (nonterminal), →t (terminal), or →ε (epsilon).

[Figure 15. ATN grammar for E ::= E + E | - E | a; state diagram not reproduced.]

For example, the ATN grammar for E ::= E + E | - E | a is shown in Figure 15. In an ATN, there is a one-to-many relation, S ⊆ String × Q, from a nonterminal name to a set of start states, each representing the initial state of an alternative.

Constructing an ATN grammar from a CFG is straightforward. For each nonterminal in the grammar, and for each alternative of the nonterminal, a pair consisting of the nonterminal's name and a state representing the start state of the alternative is added to S. Finally, for each symbol in the alternative, a next state is created, and a transition, labeled with the symbol, from the previous state to this state is added. The last state of the alternative is marked as a final grammar slot.
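As an illustration of this construction, the following Java sketch builds such a state graph from the alternatives of a nonterminal. The representation (Atn, addAlternative) is our own simplification, not Iguana's internal model, and terminals and nonterminals are not distinguished here.

import java.util.*;

// A minimal sketch: one new state and one transition per symbol of an alternative.
final class Atn {
  static final class State {
    final int id;
    boolean isFinal;                                   // final grammar slot
    final Map<String, State> transitions = new LinkedHashMap<>();
    State(int id) { this.id = id; }
  }

  private int nextId = 0;
  // S : nonterminal name -> start states, one per alternative
  final Map<String, List<State>> startStates = new HashMap<>();

  private State newState() { return new State(nextId++); }

  // Adds one alternative of a nonterminal, e.g. addAlternative("E", List.of("E", "+", "E")).
  void addAlternative(String nonterminal, List<String> symbols) {
    State start = newState();
    startStates.computeIfAbsent(nonterminal, k -> new ArrayList<>()).add(start);
    State current = start;
    for (String symbol : symbols) {
      State next = newState();
      current.transitions.put(symbol, next);
      current = next;
    }
    current.isFinal = true;                            // last state of the alternative
  }
}

For the grammar E ::= E + E | - E | a, three calls to addAlternative("E", ...) populate S with three start states for E, one per alternative.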

[Figure 16. Data-dependent ATN grammar for E ::= E + E > - E | a after desugaring operator precedence; state diagram with argument and constraint annotations not reproduced.]

4.2 Data-dependent ATN Grammars

To support data-dependent grammars, we extend ATN grammars with the following forms of transitions:

• →x=l:A(e) (parameterized, labeled nonterminals) and →l:t (labeled terminals);
• →x=e (variable binding), →[e] (constraint) and →e (return expression).

Two additional mappings are maintained: L, X : Q → String map a state, representing a grammar slot after a labeled nonterminal, to the nonterminal's label (l) and to the nonterminal's variable (x), respectively. Here, as in Section 3.1, for simplicity of presentation and without loss of generality, we assume that nonterminals can have at most one parameter. We also only consider cases of labeled terminals and nonterminals, and when a return expression is present. Finally, we assume that expression language e is a simple functional programming language with immutable values and no side-effects, that labels and variables are scoped to the rules they are introduced in, and that labels and variables introduced by desugaring have unique names in their scopes.

An example of a data-dependent ATN is shown in Figure 16. This ATN grammar is the disambiguated version of the grammar shown in Figure 15 after desugaring operator precedence.

4.3 Data Dependency in GLL Parsing

In the following, p, q, s represent ATN states in Q, i is an input index, u, u' represent GSS nodes, and w, n, y represent SPPF nodes. To support data-dependent grammars, we introduce an environment, E, into GLL parsing. Here, we assume that E is an immutable map of variable names to values. In the data-dependent setting, a descriptor, the unit of work in GLL parsing, is of the form (p, i, E, u, w). Now, a descriptor contains an environment E that has to be stored and later used whenever the parser selects the descriptor to continue from this point. GSS is also extended to store additional data. A GSS node and a GSS edge are now of the forms (A, i, v) and (u, p, w, E, u'), respectively. That is, in addition to the current input index i, a GSS node stores an argument v, passed to a nonterminal A, to fully identify the call. A GSS edge additionally stores an environment E, to capture the state of the parser before a call to a nonterminal is made.

Finally, a GLL parser constructs a binarized SPPF (Appendix A.1), creating terminal nodes (nodeT), and nonterminal and intermediate nodes (nodeP). In GLL parsing intermediate nodes are essential. In particular, they allow the parser to carry a single node at each time by grouping the symbols of a rule in a left-associative manner. Nonterminal and intermediate nodes can be ambiguous. To properly handle ambiguities under nonterminal and intermediate nodes, we include environment and return values into the SPPF construction. Specifically, arguments to nonterminals and return values are part of nonterminal nodes, and environment is part of intermediate nodes.

Figure 17 presents the semantics of GLL parsing over ATN, defining it as a transition relation on configurations (R, U, G, P), where the elements are the four main structures maintained by a GLL parser:

• R is a set of pending descriptors to be processed;
• U is a set of descriptors created during parsing. This set is maintained to eliminate duplicate descriptors;
• G is a GSS, represented by a set of GSS edges;
• P is a set of parsing results (SPPF nodes created for nonterminals) associated with GSS nodes, i.e., a set of elements of the form (u, w).

During parsing, a descriptor is selected and removed from R, represented as {(p, i, E, u, w)} ∪ R, and given the rules, a deterministic choice is made based on the next transition in the ATN. The simplest rules are Eps, Cond-1, Cond-2 and Bind. Eps creates the ε-node (via a call to nodeT) and an intermediate node (via a call to nodeP), and adds a descriptor for the next grammar slot. Cond-1 and Cond-2 depend on the evaluation of expression e in a constraint. If the expression evaluates to true, a new descriptor is added to continue with the next symbol in the rule (Cond-1), otherwise no descriptor is added (Cond-2). Bind evaluates the expression in an assignment and creates a new environment containing the respective binding. This environment is used to create the new descriptor added to R.

Term-1 and Term-2 deal with labeled terminals. If terminal t matches (Term-1) the input string (represented by an array I) starting from input position i, a terminal node is created (assuming t is of length 1). Then, the properties, i.e., the left and right extents and the respective substring, are computed from the resulting node (props(y)). Finally, a new environment, containing the binding [l = props(y)], is created and used to construct an intermediate node and a new descriptor. If the terminal does not match (Term-2), no descriptor is added.
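As a simplified illustration of the extended structures described in this section, the following Java records sketch the shape of descriptors, GSS nodes, and GSS edges in the data-dependent setting. The type names are ours, environments are modeled as maps, and this is not the Iguana implementation.

import java.util.Map;

// Placeholders for the parser's real types.
interface Value {}
interface AtnState {}
interface SppfNode {}

// Descriptor (p, i, E, u, w): grammar slot, input index, environment, GSS node, SPPF node.
record Descriptor(AtnState p, int i, Map<String, Value> env, GssNode u, SppfNode w) {}

// GSS node (A, i, v): nonterminal name, input index, and the argument passed to the
// nonterminal, which together fully identify the call.
record GssNode(String nonterminal, int inputIndex, Value argument) {}

// GSS edge (u, p, w, E, u'): additionally stores the environment E that was current
// before the call was made.
record GssEdge(GssNode source, AtnState returnSlot, SppfNode w,
               Map<String, Value> env, GssNode target) {}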

Call-1 and Call-2 deal with labeled calls to nonterminals. First, the argument e is evaluated, where E1 allows the use of the left extent in e (lprop constructs properties with only the left extent). If a GSS node representing this call already exists (Call-1), the parsing results associated with this GSS node are reused, and a possibly empty set of new descriptors (D) is created. Each descriptor in the set corresponds to a result, a nonterminal node y, retrieved from P, so that the index of the descriptor is the right extent of y (rext), its environment contains the bindings [l = props(y)] and [x = val(y)] (val retrieves the value from y), and its SPPF node is a new intermediate node. Note that d ∉ U ensures that no duplicate descriptors are added at this point. If the corresponding GSS node does not exist, Call-2 creates one descriptor for each start state of the nonterminal (s ∈ S(A)). Each descriptor gets a new environment with the binding [p0 = v], where p0 is the nonterminal's parameter, which we assume to have a unique name in the scope of a rule. Both Call-1 and Call-2 add a new GSS edge, capturing the previous environment, to G.

Finally, in the Ret rule, the return expression is evaluated, and a nonterminal node is created which stores both the argument of the current GSS node (arg(u)) and the return value. This node is recorded in P as a result associated with the GSS node. For each GSS edge directly reachable from the current GSS node, a new descriptor is created. Note that labels and variables at call sites, represented by the current GSS node, are retrieved via the mappings L and X, respectively.

[Figure 17. GLL for data-dependent ATN grammars: the transition rules Eps, Term-1/2, Call-1/2, Ret, Cond-1/2, and Bind; inference rules not reproduced.]

5. Evaluation

Our data-dependent parsing framework is implemented as an extension of the Iguana parsing framework [2]. The addition of data dependency is at the moment a prototype and most of the effort was put into correctness, rather than performance optimization. As a frontend to write data-dependent grammars, we extended the syntax definition of Rascal [24], a programming language for meta-programming and source code analysis, and provided a mapping to Iguana's internal representation of data-dependent grammars.

In Section 2 we enumerated a number of challenges in parsing programming languages, and in Section 3 we provided solutions based on data-dependent grammars (directly or via desugaring) that address these challenges. For each challenge we selected a programming language that exhibits it, and wrote a data-dependent grammar (available at https://github.com/iguana-parser/grammars), derived from the specification grammar of the language. For evaluation, we parsed real source files from the source distribution of the language and some popular open source libraries, see Table 2. Table 1 summarizes the evaluation results. In the following we discuss these results in detail, and provide an analysis of the expected performance in practice.

Java  To evaluate the correctness of our declarative operator precedence solution using data-dependent grammars, we used the grammar of Java 7 from the main part of the Java language specification [11]. This grammar contains an unambiguous left-recursive expression grammar, in a similar style to the expression grammar in Figure 1 (middle). We replaced the expression part (consisting of about 30 nonterminals) of the Java specification grammar with a single Expression nonterminal that declaratively expresses operator precedence using >, left and right. The resulting grammar, which we refer to as the natural grammar, is much more concise and readable, see Table 1. The resulting parser parsed all 8067 files successfully and without ambiguity.

The natural grammar of Java produces different parse trees compared to the original specification grammar, and therefore it is not possible to directly compare the parse trees. To test the correctness of parsers resulting from the desugaring of >, left, and right to data-dependent grammars, we tested their resulting parse trees against a GLL parser for the same natural grammar of Java, using our previous work on rewriting operator precedence rules [3]. Both parsers, using desugaring to data-dependent grammars and rewriting operator precedence rules, produced the same parse trees for all Java files, providing evidence that our desugaring of operator precedence to data-dependent grammars implements the same semantics as the rewriting in [3].

Table 1. Summary of the results of parsing with character-level data-dependent grammars of programming languages.

Language | Challenge               | Solution                             | Spec. Grammar (# Nont. / # Rules) | Data-dep. Grammar (# Nont. / # Rules) | # Files | % Success
Java     | Operator precedence     | >, left and right                    | 200 / 485                         | 169 / 435                             | 8067    | 100% (8067)
C#       | Conditional directives  | global variables and dynamic layout  | 387 / 1000                        | 395 / 1013                            | 5839    | 99% (5838)
Haskell  | Indentation sensitivity | align, offside and ignore            | 143 / 431                         | 152 / 452                             | 6457    | 72% (4657)

Table 2. Summary of the projects used in the evaluation.

Lang.   | Project       | Version       | Description
Java    | JDK           | 1.7.0_60-b19  | Java Development Kit
Java    | JUnit         | 4.12          | Unit testing framework
Java    | SLF4J         | 1.7.12        | A Java logging framework
C#      | Roslyn        | build-preview | .NET Compiler Platform
C#      | MVC           | 6.0.0-beta5   | ASP.NET MVC Framework
C#      | EntityFramew. | 7.0.0-beta5   | Data access for .NET
Haskell | GHC           | 7.8           | Glasgow Haskell Compiler
Haskell | Cabal         | 1.22.4.0      | Build System for Haskell
Haskell | Git-annex     | 5.20150710    | File manager based on Git
Haskell | Fay           | 0.23.1.6      | Haskell to JavaScript compiler

Despite its prototype status, the data-dependent parser is at the moment on average only 25% slower than the rewritten one. The main reason for the performance difference is that in the rewriting technique [3] the precedence information is statically encoded in the grammar, and therefore there is no runtime overhead, while in the data-dependent version passing arguments and handling the environment is done at runtime. The problem with the rewriting technique is that the rewriting process itself is rather slow and the resulting grammar is very large.

C#  To evaluate our data-dependent framework on parsing conditional directives, we used the grammar of C# 5 from the C# language specification [30]. As mentioned in Section 2.5, existing C# parsers resolve preprocessor directives in the lexing phase, and the parser is not aware of directives. However, the C# language specification has context-free rules that describe the syntax of directives. Our solution to parsing conditional directives (Section 3.6) leverages layout that is automatically inserted between symbols in grammar rules. We used the context-free syntax of directives in C# as the starting point. We extended the layout definition to include directives. Then, the conditional directive rules were modified to allow parse-time evaluation of conditions and selection of the corresponding path.

The resulting data-dependent grammar differs from the specification grammar only in the layout definition, and the difference is minimal. As can be seen in Table 1, there are only 8 additional nonterminals and 13 additional rules (about 1.3% of the whole grammar). Using the character-level grammar of C# we could parse 5838 files out of 5839. The parser timed out after 30 seconds on a very large source file from the Roslyn framework. The file, which appears to be automatically generated, contains 156033 lines of code and is 4.8 MB in size.

Although the grammar of C# is near deterministic, the reason for the time out is that character-level grammars generate very large parse trees, a node for each character. Nevertheless, this file could be parsed using a context-aware parser of C#. We discuss the performance gain of using a context-aware scanner in Section 5.1.

Haskell  To evaluate our parsing framework for indentation-sensitive programming languages, we used the grammar of Haskell [28]. The specification grammar of Haskell is written using explicit blocks, as if no indentation sensitivity exists, and the lexer translates indentation to physical block delimiters. We took the Haskell grammar as written in the specification as the starting point and added extra rules that specify layout sensitivity using the align, offside and ignore constructs. As shown in Table 1, the data-dependent version has only 21 additional rules (about 4% of the whole grammar). From the total number of 6457 Haskell files, we could successfully parse 4657 files (72%). The reason for the parse errors in the other files is that they contain some syntactic constructs from GHC extensions that we do not support yet.

Besides numerous undocumented GHC extensions we found in the source files, many Haskell files contained CPP directives, which were resolved by running the C preprocessor, cpp, before parsing. In the future, we plan to deal with C directives during parsing, the same way we did for C#. One last issue about parsing Haskell is that indentation rules alone are not sufficient to unambiguously parse Haskell, and there is a need for a syntactic longest match that uses indentation information. For example, the following input string is ambiguous, where both derivations are correct regarding the indentation rules:

  f x = do print x
        + 1

In the first derivation, the right-hand side is an infix plus-expression, consisting of a do-expression and 1. The second derivation consists of only a do-expression that has print x + 1 as its subexpression. According to the Haskell language specification the second interpretation is valid, as in do expressions longest match should be applied. We resolved this issue by defining a special kind of follow restriction, similar to Section 3.3, that bypasses the layout and checks the indentation level of the next non-whitespace token when the token is not a keyword or a delimiter.
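To illustrate the kind of check such a follow restriction performs, the following Java sketch computes the column of the next non-whitespace character after skipping layout and compares it with the indentation of the enclosing block. This is our own simplification; the exact comparison used by the framework's follow restriction may differ.

// A minimal sketch: skip layout from position i and require the next
// non-whitespace character to be indented deeper than the block's column.
final class IndentationCheck {

  static boolean followsDeeperIndentation(String input, int i, int blockColumn) {
    int pos = i;
    int column = columnOf(input, i);
    while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
      column = (input.charAt(pos) == '\n') ? 0 : column + 1;
      pos++;
    }
    return pos < input.length() && column > blockColumn;
  }

  // Column (0-based) of position i, counted from the preceding newline.
  private static int columnOf(String input, int i) {
    int lastNewline = input.lastIndexOf('\n', i - 1);
    return i - (lastNewline + 1);
  }

  public static void main(String[] args) {
    String input = "f x = do print x\n      + 1\n";
    // The '+' starts at column 6, deeper than the start of the equation (column 0),
    // so the do-expression may continue onto the next line.
    System.out.println(followsDeeperIndentation(input, 16, 0));   // prints true
  }
}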

[Figure 18. Running time of the character-level parsers for Java, C#, and Haskell against the input size (number of characters) plotted as log-log base 10. The red line is the linear regression fit, and the goodness of each fit is indicated by the adjusted R² value in each log-log plot (Java: y = 1.212x - 3.181, R² = 0.9395; C#: y = 1.098x - 3, R² = 0.9821; Haskell: y = 0.95x - 1.616, R² = 0.95). The equation in each plot describes a power relation with the original data, and as all the coefficients (1.212, 1.098, 0.950) are close to one, we can conclude the running time is near-linear on these grammars.]

OCaml  We used excerpts of OCaml source files to test our operator precedence translation against deep and problematic operator precedence cases. OCaml, in contrast to the other three programming languages we used for the evaluation, uses a natural, ambiguous expression grammar in its language specification. The data-dependent grammar of OCaml is basically the same as the reference manual, where the alternative operator in the expression part is replaced with > and additional left and right operators are added. We are not yet able to unambiguously parse full OCaml programs, as they contain operator precedence ambiguities across indirect nonterminals. An example is pattern-matching, which can derive expr on its right-most end:

  expr ::= expr '+' expr
         | 'function' pattern-matching
  pattern-matching ::= pattern '->' expr

For example, the input string function x -> x + 1 is ambiguous with the following derivations: (function x -> x) + 1 or function x -> (x + 1). As the function alternative has pattern-matching and not expr on its right-most end, the operator precedence rules do not apply in our current scheme. The translation of operator precedence in the presence of indirect nonterminals to data-dependent grammars seems possible with an additional analysis of indirect nonterminals, but is left as future work.

5.1 Running Time and Performance

Data-dependent grammars [16] provide a pay-as-you-go model. If a pure context-free grammar is specified, the worst-case complexity of the underlying parsing technology is retained. However, in the general case no guarantees can be made. Our data-dependent parsers, implemented on top of GLL parsing, are worst-case O(n³) on pure context-free grammars [33]. The more practical question, however, is how data dependency affects the runtime performance of parsing real programming languages.

[Figure 19. The relative speedup using context-aware scanning instead of character-level grammars (box plots per language for Java, C#, and Haskell).]

In this section, we provide empirical results showing that parsers for data-dependent grammars can behave nearly linearly on grammars of real programming languages. We ran the experiments on a machine with a quad-core Intel Core i7 2.6 GHz CPU and 16 GB of physical memory running Mac OS X 10.10.4. We used a 64-bit Oracle HotSpot™ JVM version 1.8.0_25. Each file was executed 10 times and the mean running time (CPU user time) was reported. The first three runs of each file were skipped to allow for JIT optimizations.

Figure 18 shows the log-log plots (log base 10) of running time (ms) against file size (number of characters) for all the files we parsed (see Tables 1 and 2). For showing the linear behavior we used the character-level grammars, as they exhibit the relationship between the running time and the number of characters better than the context-aware version. As can be seen, all three parsers exhibit near-linear behavior on grammars of programming languages.

To compare the performance difference between character-level and context-aware parsing, we ran both context-aware and character-level parsers on all the source files.
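The measurement protocol above (10 runs per file, the first three discarded for JIT warm-up, mean CPU user time reported) can be sketched as a small Java harness. This is a hypothetical stand-in for our benchmarking driver; parseFile is a placeholder for the actual parser invocation.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class ParserBenchmark {

  // Runs the parser 'runs' times, discards the first 'warmup' runs (JIT),
  // and returns the mean CPU user time of the remaining runs in milliseconds.
  static double meanCpuTimeMillis(Runnable parseFile, int runs, int warmup) {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    double totalNanos = 0;
    for (int run = 0; run < runs; run++) {
      long before = bean.getCurrentThreadUserTime();
      parseFile.run();
      long after = bean.getCurrentThreadUserTime();
      if (run >= warmup) {
        totalNanos += (after - before);
      }
    }
    return totalNanos / (runs - warmup) / 1_000_000.0;
  }

  public static void main(String[] args) {
    // Hypothetical usage: replace the lambda body with a call to the real parser.
    double millis = meanCpuTimeMillis(() -> { /* parser.parse(file) */ }, 10, 3);
    System.out.printf("mean CPU user time: %.3f ms%n", millis);
  }
}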

Figure 19 shows the relative performance gain (speedup) of using a context-aware parser compared to a character-level parser for each file. For better visualization we omitted the outliers from the box plots. The median and maximum speedup for Java is (2.45, 15.1), for C# (2.45, 4) and for Haskell (1.9, 3). The precise impact of context-aware scanning on general parsing and data-dependent grammars is future work, but our preliminary investigation revealed that using character-level grammars for parsing layout is very expensive, as it is a very common operation, see Section 3.2.

6. Related Work

Parsing is a well-researched topic, and many features of our parsing framework are related in one way or another to other existing systems. Throughout this paper we have discussed some related work, which we do not repeat here. In this section we discuss directly related work and our inspirations.

Data dependency implementation  Data-dependent grammars have many similarities with attribute grammars [27] and attribute-directed parsing [39]. A detailed discussion of related systems is provided by Jim et al. [16]. From the implementation perspective, Jim et al. present the Yakker parser generator [16], which is based on Earley's algorithm [7], but we have a GLL-based interpretation of data-dependent grammars. We also extend the SPPF creation functionality of GLL parsing (taking environments into account), while SPPF creation is not discussed in Jim et al.'s approach. Another difference between our implementation and Yakker is that Yakker directly supports regular operators, by applying longest match. We, however, believe that all ambiguities should be returned by the parser, and avoid such implicit heuristics. Therefore, we desugar regular operators to data-dependent BNF rules.

We use an interpretative model of parsing based on Woods' ATN grammars [40]. Woods used an explicit stack to run ATN grammars, similar to a pushdown automaton. However, as with any top-down parser, such execution of ATN grammars does not terminate in the presence of left recursion. Jim et al.'s data-dependent framework operates on a data-dependent automaton [15], which is a variation of ATN grammars interpreted with Earley's algorithm.

Indentation-sensitive parsing  Besides modifications to the lexer, which have been used in GHC and CPython, there are a number of other systems that provide a solution for indentation-sensitive parsing. Parser combinators [13] are higher-order functions that are used to define grammars in terms of constructs such as alternation and sequence. This approach has been used in parsing indentation-sensitive languages [14]. Traditional parser combinators do not support left recursion and can have exponential runtime. Another main difference between parser combinators and our approach is that we do not give the end user access to the internal workings of the parser. Since parser combinators are normal functions, the user can modify them. Our approach provides an external DSL for defining parsers, while parser combinators provide an internal DSL. Therefore, our approach, compared to parser combinators, provides more control over the syntax definition.

Erdweg et al. present an extension of SDF to define layout constraints on grammar rules [8]. This constraint-based approach is implemented by modifying the underlying SGLR [38] parser. Most constraints can be solved during parsing. Constraints that are not resolved will lead to ambiguity, which can be removed by post-parse filtering. Adams presents the notion of indentation-sensitive grammars [1], where symbols in a rule are annotated with their relative position to the immediate parents. This technique is implemented for LR(k) parsing.

We do not offer a customized solution for indentation-sensitivity for a specific parsing technology; rather we use the general data-dependent grammars framework, and map indentation rules to it. In addition, we define high-level constructs such as align, offside, and ignore which are desugared to lower-level data-dependent grammars. This enables a syntax definition model that is closer to what the user has in mind. We think the use of high-level constructs leads to cleaner, more maintainable grammars.

Operator precedence  SDF2 uses a parser-independent semantics of operator precedence which is based on a parent-child relationship on derivation trees [38]. This semantics is implemented in SGLR parsing [38] by modifying parse tables. Although the SDF2 semantics for operator precedence works for most cases, in some cases it is too strong, i.e., rejecting valid sentences, and in some cases it cannot disambiguate the expression grammar.

In earlier work [3] we discussed the precedence ambiguity, and proposed a grammar rewriting that takes an ambiguous grammar with a set of precedence rules and produces a grammar that does not allow precedence-invalid derivations. Our current solution has the same semantics: it does not remove sentences when there is no precedence ambiguity, and can deal with corner cases found in programming languages such as OCaml. In addition, our operator precedence solution is desugared to data-dependent grammars, thus it is independent of the underlying parsing technology.

Conditional directives  Recent work on parsing conditional directives targets all variations [10, 21]. Gazzillo and Grimm [10] give an extensive overview of related work in this area. However, to the best of our knowledge, none of the existing systems employ a single-phase parsing scheme; rather they use a separate scanner and annotate the tokens based on the conditional directives they appear in. Our approach of using data-dependent grammars to evaluate the conditional directives is new. The treatment of other features of preprocessors, such as macros, is future work.

7. Conclusion

We have presented our vision of a parsing framework that is able to address many challenges of declarative parsing of real programming languages. We have built an implementation of data-dependent grammars based on the GLL parsing algorithm. We have also shown how to map common idioms in the syntax of programming languages, such as lexical disambiguation filters, operator precedence, indentation-sensitivity, and conditional directives, to data-dependent grammars. These mappings provide the language engineer with a set of out-of-the-box constructs, while at the same time, new high-level constructs can be added. The preliminary experiments with our parsing framework show that it can be efficient and practical. To fully realize our vision we will explore more syntactic features, and further optimize the implementation of our framework.

Acknowledgments

We are thankful to Jurgen Vinju, Paul Klint, and the anonymous reviewers for their constructive feedback on earlier versions of this paper.

References

[1] M. D. Adams. Principled Parsing for Indentation-sensitive Languages: Revisiting Landin's Offside Rule. In Principles of Programming Languages, POPL '13, pages 511–522. ACM, 2013.
[2] A. Afroozeh and A. Izmaylova. Faster, Practical GLL Parsing. In Compiler Construction, 24th International Conference, CC '15, pages 89–108. Springer, 2015.
[3] A. Afroozeh, M. van den Brand, A. Johnstone, E. Scott, and J. J. Vinju. Safe Specification of Operator Precedence Rules. In Software Language Engineering, SLE '13, pages 137–156. Springer, 2013.
[4] A. V. Aho, S. C. Johnson, and J. D. Ullman. Deterministic Parsing of Ambiguous Grammars. In Principles of Programming Languages, POPL '73, pages 1–21, 1973.
[5] K. Clarke. The Top-down Parsing of Expressions. Technical report, Dept. of Computer Science and Statistics, Queen Mary College, London, June 1986.
[6] F. L. DeRemer. Practical Translators for LR(k) Languages. PhD thesis, Massachusetts Institute of Technology, 1969.
[7] J. Earley. An Efficient Context-free Parsing Algorithm. Commun. ACM, 13(2):94–102, Feb. 1970.
[8] S. Erdweg, T. Rendel, C. Kästner, and K. Ostermann. Layout-Sensitive Generalized Parsing. In Software Language Engineering, SLE '12, pages 244–263. Springer, 2012.
[9] K. Fisher and R. Gruber. PADS: A Domain-specific Language for Processing Ad Hoc Data. In Programming Language Design and Implementation, PLDI '05, pages 295–304. ACM, 2005.
[10] P. Gazzillo and R. Grimm. SuperC: Parsing All of C by Taming the Preprocessor. In Programming Language Design and Implementation, PLDI '12, pages 323–334. ACM, 2012.
[11] J. Gosling, B. Joy, G. Steele, G. Bracha, and A. Buckley. The Java Language Specification, Java SE 7 Edition, 2013.
[12] J. Heering, P. R. H. Hendriks, P. Klint, and J. Rekers. The Syntax Definition Formalism SDF, Reference Manual. SIGPLAN Not., 24(11):43–75, Nov. 1989.
[13] G. Hutton. Higher-order Functions for Parsing. Journal of Functional Programming, 2(3):323–343, July 1992.
[14] G. Hutton and E. Meijer. Monadic Parsing in Haskell. J. Funct. Program., 8(4):437–444, 1998.
[15] T. Jim and Y. Mandelbaum. Efficient Earley Parsing with Regular Right-hand Sides. 253(7):135–148, 2010. LDTA '09.
[16] T. Jim, Y. Mandelbaum, and D. Walker. Semantics and Algorithms for Data-dependent Grammars. In Principles of Programming Languages, POPL '10, pages 417–430. ACM, 2010.
[17] M. Johnson. The Computational Complexity of GLR Parsing. In Generalized LR Parsing, pages 35–42. Springer US, 1991.
[18] S. C. Johnson. Yacc: Yet Another Compiler-Compiler. Technical report, AT&T Bell Laboratories, 1979.
[19] A. Johnstone and E. Scott. Modelling GLL Parser Implementations. In Software Language Engineering, 3rd International Conference, SLE '10, pages 42–61, 2010.
[20] A. Johnstone, E. Scott, and M. van den Brand. Modular Grammar Specification. Sci. Comput. Program., 87:23–43, 2014.
[21] C. Kästner, P. G. Giarrusso, T. Rendel, S. Erdweg, K. Ostermann, and T. Berger. Variability-aware Parsing in the Presence of Lexical Macros and Conditional Compilation. In Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 805–824, 2011.
[22] L. C. Kats and E. Visser. The Spoofax Language Workbench: Rules for Declarative Specification of Languages and IDEs. In Object Oriented Programming Systems Languages and Applications, OOPSLA '10, pages 444–463. ACM, 2010.
[23] L. C. Kats, E. Visser, and G. Wachsmuth. Pure and Declarative Syntax Definition: Paradise Lost and Regained. In Object Oriented Programming Systems Languages and Applications, OOPSLA '10, pages 918–932. ACM, 2010.
[24] P. Klint, T. van der Storm, and J. Vinju. RASCAL: A Domain Specific Language for Source Code Analysis and Manipulation. In Source Code Analysis and Manipulation, SCAM '09. IEEE, 2009.
[25] D. E. Knuth. On the Translation of Languages from Left to Right. Information and Control, 8(6):607–639, 1965.
[26] P. J. Landin. The Next 700 Programming Languages. Commun. ACM, 9(3):157–166, Mar. 1966.
[27] P. M. Lewis, D. J. Rosenkrantz, and R. E. Stearns. Attributed Translations. J. Comput. Syst. Sci., 9(3):279–307, 1974.
[28] S. Marlow. Haskell 2010 Language Report, 2010.
[29] A. Melnikov. Collected Extensions to IMAP4 ABNF, 2006.
[30] Microsoft Corp. C# Language Specification 5.0, 2013.
[31] T. Parr, S. Harwell, and K. Fisher. Adaptive LL(*) Parsing: The Power of Dynamic Analysis. In Object Oriented Programming Systems Languages and Applications, OOPSLA '14, pages 579–598. ACM, 2014.

[32] D. J. Salomon and G. V. Cormack. Scannerless NSLR(1) Parsing of Programming Languages. In Programming Language Design and Implementation, PLDI '89, pages 170–178, 1989.
[33] E. Scott and A. Johnstone. GLL Parse-tree Generation. Science of Computer Programming, 78(10):1828–1844, Oct. 2013.
[34] E. Scott, A. Johnstone, and R. Economopoulos. BRNGLR: A Cubic Tomita-style GLR Parsing Algorithm. Acta Informatica, 44(6):427–461, 2007.
[35] M. Tomita. Efficient Parsing for Natural Language. Kluwer Academic Publishers, USA, 1985.
[36] M. G. J. van den Brand, J. Scheerder, J. J. Vinju, and E. Visser. Disambiguation Filters for Scannerless Generalized LR Parsers. In Compiler Construction, CC '02, pages 143–158. Springer, 2002.
[37] E. R. Van Wyk and A. C. Schwerdfeger. Context-aware Scanning for Parsing Extensible Languages. In Generative Programming and Component Engineering, GPCE '07, pages 63–72. ACM, 2007.
[38] E. Visser. Syntax Definition for Language Prototyping. PhD thesis, University of Amsterdam, 1997.
[39] D. A. Watt. Rule Splitting and Attribute-directed Parsing. In Semantics-Directed Compiler Generation, pages 363–392, 1980.
[40] W. A. Woods. Transition Network Grammars for Natural Language Analysis. Commun. ACM, 13(10):591–606, 1970.

A. GLL Parsing

In this section we first describe GLL parsing, SPPF construction and the GSS. Then, we define the semantics of GLL parsing over ATN grammars as a transition relation.

A.1 SPPF

It is known that any parsing algorithm that constructs a Tomita-style SPPF is of unbounded polynomial complexity [17]. To achieve parsing in cubic time and space, GLL uses a binarized SPPF [33] format, which has additional intermediate nodes. Intermediate nodes allow grouping of the symbols of a rule in a left-associative manner, thus allowing the parser to always carry a single node at each time, instead of a list of nodes. This is the key to preserving the cubic bound. The use of intermediate nodes effectively achieves the same as restricting a grammar to have rules of length at most two, but without requiring rewriting the original grammar, and transforming back the resulting derivation trees to the ones of the original grammar.

Definition 1. A binarized SPPF is a compact representation of a parse forest that has the following types of nodes:

• nonterminal nodes of the form (A, i, j) where A is a nonterminal, and i and j are the left and right extents;
• terminal nodes of the form (t, i, j) where t is a terminal, and i and j are the left and right extents;
• packed nodes of the form (L, k) where L is a grammar slot and k is the pivot of the node; and
• intermediate nodes of the form (L, i, j) where L is the grammar slot, and i and j are the left and right extents.

[Figure 20. SPPF (left) and GSS (right) for the input -a+a; diagrams not reproduced.]

The left and right extents of a node represent the substring in the input associated with the node. As GLL parsing is context-free, nodes with the same label, the same left and the same right extents can be shared. Nonterminal and intermediate nodes have packed nodes as their children. Packed nodes represent a derivation, and can have at most two children, which are non-packed nodes. If a non-packed node is ambiguous, it will have more than one packed node. The pivot of a packed node is the right extent of its left child, and is used to distinguish between packed nodes under a non-packed node.

The binarized SPPF resulting from parsing the input string -a+a with the grammar E ::= - E | E + E | a is shown in Figure 20 (left), where packed nodes are depicted with small circles. For better visualization, we have omitted the labels of packed nodes. The input is ambiguous and has the following two derivations: (-(a+a)) or ((-a)+a). This can be observed by the presence of two packed nodes under the root node. The left and right packed nodes under the root node correspond to the first and second alternatives, respectively.

SPPF construction is delegated to two functions, nodeT and nodeP. The nodeT(t, i, j) function takes terminal t and two integer values i and j (left and right extents) and returns an existing node with these properties, otherwise a new node. nodeP(L, w, z) takes a grammar slot L and two non-packed nodes w and z. nodeP returns an existing non-packed node labeled L with two children w and z. If no such node exists, then a non-packed node labeled L will be created, and w and z are connected to the newly created non-packed node via a packed node. The details of SPPF construction in GLL are discussed in [33], and implementation techniques for efficient sharing of nodes are presented in [2, 19].
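To make the roles of nodeT and nodeP concrete, the following Java sketch shows binarized SPPF nodes with shared non-packed nodes; ambiguity appears as more than one packed child under a non-packed node. The class names are ours, node labels are plain strings instead of grammar slots, and deduplication of identical packed nodes is omitted; this is not the Iguana implementation.

import java.util.*;

// A minimal sketch of binarized SPPF construction with node sharing.
final class Sppf {

  static class Node {                       // non-packed node: (label, leftExtent, rightExtent)
    final String label;
    final int left, right;
    final List<PackedNode> children = new ArrayList<>();
    Node(String label, int left, int right) { this.label = label; this.left = left; this.right = right; }
    boolean isAmbiguous() { return children.size() > 1; }
  }

  static class PackedNode {                 // packed node: groups at most two non-packed children
    final Node leftChild, rightChild;       // leftChild may be null
    PackedNode(Node leftChild, Node rightChild) { this.leftChild = leftChild; this.rightChild = rightChild; }
  }

  private final Map<String, Node> nodes = new HashMap<>();

  // nodeT(t, i, j): return the existing terminal node with these properties, or create it.
  Node nodeT(String terminal, int i, int j) {
    return nodes.computeIfAbsent(terminal + ":" + i + ":" + j, k -> new Node(terminal, i, j));
  }

  // nodeP(L, w, z): return the shared non-packed node labeled L covering w and z,
  // adding a packed node for this particular grouping of children.
  Node nodeP(String label, Node w, Node z) {
    int left = (w != null) ? w.left : z.left;
    Node parent = nodes.computeIfAbsent(label + ":" + left + ":" + z.right,
                                        k -> new Node(label, left, z.right));
    parent.children.add(new PackedNode(w, z));
    return parent;
  }
}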

A.2 GSS

At the core of GLL parsing is the Graph-Structured Stack (GSS) data structure. We use a variation of GLL parsing that uses a more efficient GSS [2].

Definition 2. A Graph-Structured Stack (GSS) in GLL parsing is a directed graph where

• nodes are of the form (A, i), where A is a nonterminal and i is an input position; and
• edges are of the form (u, L, w, v), where u and v are GSS nodes, L is a grammar slot, and w is an SPPF node recorded on the edge.

GSS was originally developed by Tomita [35] for GLR parsing to merge different LR stacks. Although GLL parsing uses the same term, there are two main differences between the GSS in GLL parsing and in GLR. First, in GLL parsing the GSS represents function calls in recursive-descent parsing, similar to memoization of functions in functional programming, and therefore a GSS node has the input position at which the nonterminal is called. Second, in GLL parsing the GSS allows cycles in the graph, which solve the problem of left recursion in recursive-descent parsing.

The GSS resulting from parsing -a+a using the grammar E ::= - E | E + E | a is shown in Figure 20 (right). As can be seen, there is a cycle on all nodes, as they represent the left-recursive calls to E at different input positions. In case of indirect left recursion, there will be a cycle in the GSS involving multiple nodes.

A.3 GLL Parsing over ATN Grammars

In this section, we define GLL parsing over ATN grammars as a transition relation. In contrast to the imperative style used in [2, 33], we use the declarative rules of Figure 21.

[Figure 21. GLL parsing over ATN grammars: the transition rules Eps, Term-1, Term-2, Call-1, Call-2, and Ret; inference rules not reproduced.]

Such a GLL formulation is concise and easy to extend to support data-dependent grammars. The rules in Figure 21 use notation similar to the one in [2, 33].

The unit of work of a GLL parser is a descriptor. A descriptor is of the form (p, i, u, w), where p is an ATN state representing a grammar slot, u is a GSS node, i is an input position, and w is an SPPF (non-packed) node. A GLL parser maintains a set U that holds descriptors created during parsing and is used to eliminate duplicate descriptors. In addition to U, a set R is used to hold pending descriptors that are to be processed. Note that GLL parsing does not impose any order in which the descriptors in R are processed. Figure 21 defines the semantics of GLL parsing over ATN grammars as a transition relation on configurations (R, U, G, P), where G represents the GSS (a set of GSS edges), such that N(G) gives the set of GSS nodes, and P is a set of parsing results that are associated with GSS nodes, i.e., a set of elements of the form (u, w).

During parsing, a descriptor is selected and removed from R, represented as {(p, i, u, w)} ∪ R, and given the rules, a deterministic choice is made based on the next transition in the ATN. The first three rules of Figure 21 are straightforward. An ε transition creates an ε-node (via a call to nodeT) and an intermediate node (via a call to nodeP), and adds a descriptor for the next grammar slot. (In fact, when the next state is an end state, nodeP creates a nonterminal node instead of an intermediate node; however, this is not essential for the current discussion, so we always refer to the result of nodeP as an intermediate node.) The terminal rules (Term-1 and Term-2) try to match terminal t at the current input position, where I is an array representing the input string. If there is a match (Term-1), a terminal node (via nodeT) and an intermediate node (via nodeP) are created, and a descriptor for the next grammar slot is added. If there is no match (Term-2), no descriptor is added.

Call-1 and Call-2 correspond to nonterminal transitions →A. Similar to calling a memoized function, a GLL parser first checks whether a GSS node (A, i) exists. If such a node exists (Call-1), the parsing results associated with this GSS node are reused. These results are retrieved from P, and for each result, a nonterminal node y, a descriptor d is created (rext returns the right extent of y), and if the same descriptor has not been processed before (d ∉ U), it is added to R. If the GSS node does not exist (Call-2), the call to the nonterminal is made, i.e., for each start state of the nonterminal (s ∈ S(A)), a descriptor is added. Both Call-1 and Call-2 add a new GSS edge to G.

Finally, Ret corresponds to a final grammar slot (a final state in the ATN) in which the parser returns from the current nonterminal call. First, the tuple with the current SPPF node and the current GSS node is added to P. Second, for each outgoing GSS edge of the current GSS node, a descriptor is created and, if the same descriptor has not been processed before (d ∉ U), it is added to R.
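The descriptor-driven main loop described above can be sketched as follows in Java. This is an illustrative skeleton under our own simplified types, not the actual Iguana code; the per-transition handling (Eps, Term, Call, Ret) is only hinted at in process.

import java.util.*;

// A minimal sketch of the GLL main loop: process descriptors until none are left,
// using the set U to eliminate duplicate descriptors.
final class GllLoop {

  record Descriptor(String slot, int inputIndex) {}            // simplified descriptor

  private final Deque<Descriptor> pending = new ArrayDeque<>(); // R: pending descriptors
  private final Set<Descriptor> seen = new HashSet<>();         // U: descriptors ever created

  void schedule(Descriptor d) {
    if (seen.add(d)) {        // only descriptors that were never created before are added
      pending.push(d);
    }
  }

  void parse(Descriptor start) {
    schedule(start);
    while (!pending.isEmpty()) {
      Descriptor d = pending.pop();
      process(d);             // dispatch on the next ATN transition of d.slot()
    }
  }

  private void process(Descriptor d) {
    // In the real parser this inspects the transition at d.slot(): terminals call
    // nodeT/nodeP and schedule the next slot, nonterminals create or reuse GSS
    // nodes (Call-1/Call-2), and final slots pop the GSS (Ret).
  }
}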
