One Parser to Rule Them All
Total Page:16
File Type:pdf, Size:1020Kb
One Parser to Rule Them All Ali Afroozeh Anastasia Izmaylova Centrum Wiskunde & Informatica, Amsterdam, The Netherlands {ali.afroozeh, anastasia.izmaylova}@cwi.nl Abstract 1. Introduction Despite the long history of research in parsing, constructing Parsing is a well-researched topic in computer science, and parsers for real programming languages remains a difficult it is common to hear from fellow researchers in the field and painful task. In the last decades, different parser gen- of programming languages that parsing is a solved prob- erators emerged to allow the construction of parsers from a lem. This statement mostly originates from the success of BNF-like specification. However, still today, many parsers Yacc [18] and its underlying theory that has been developed are handwritten, or are only partly generated, and include in the 70s. Since Knuth’s seminal paper on LR parsing [25], various hacks to deal with different peculiarities in program- and DeRemer’s work on practical LR parsing (LALR) [6], ming languages. The main problem is that current declara- there is a linear parsing technique that covers most syntactic tive syntax definition techniques are based on pure context- constructs in programming languages. Yacc, and its various free grammars, while many constructs found in program- ports to other languages, enabled the generation of efficient ming languages require context information. parsers from a BNF specification. Still, research papers and In this paper we propose a parsing framework that em- tools on parsing in the last four decades show an ongoing braces context information in its core. Our framework is effort to develop new parsing techniques. based on data-dependent grammars, which extend context- A central goal in research in parsing has been to en- free grammars with arbitrary computation, variable binding able language engineers (i.e., language designers and tool and constraints. We present an implementation of our frame- builders) to declaratively build a parser for a real program- work on top of the Generalized LL (GLL) parsing algorithm, ming language from an (E)BNF-like specification. Never- and show how common idioms in syntax of programming theless, still today, many parsers are hand-written or are languages such as (1) lexical disambiguation filters, (2) op- only partially generated and include many hacks to deal erator precedence, (3) indentation-sensitive rules, and (4) with peculiarities in programming languages. The reason conditional preprocessor directives can be mapped to data- is that grammars of programming languages in their sim- dependent grammars. We demonstrate the initial experience ple and readable form are often not deterministic and also with our framework, by parsing more than 20 000 Java, C#, often ambiguous. Moreover, many constructs found in pro- Haskell, and OCaml source files. gramming languages are not context-free, e.g., indentation rules in Haskell. Parser generators based on pure context- Categories and Subject Descriptors D.3.1 [Programming free grammars cannot natively deal with such constructs, Languages]: Formal Definitions and Theory—Syntax; D.3.4 and require ad-hoc extensions or hacks in the lexer. There- [Programming Languages]: Processors—Parsing fore, additional means are necessary outside of the power of context-free grammars to address these issues. Keywords Parsing, data-dependent grammars, GLL, dis- General parsing algorithms [7, 33, 35] support all context- ambiguation, operator precedence, offside rule, preprocessor free grammars, therefore the language engineer is not lim- directives, scannerless parsing, context-aware scanning ited by a specific deterministic class, and there are known declarative disambiguation constructs to address the prob- lem of ambiguity in general parsing [12, 36, 38]. However, implementing disambiguation constructs is notoriously hard and requires thorough knowledge of the underlying parsing technology. This means that it is costly to declaratively build a parser for a given programming language in the wild if the required disambiguation constructs are not already sup- Copyright ©ACM, 2015. This is the author’s version of the work. It is posted here ported. Perhaps surprisingly, examples of such languages are by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Onward!’15, October 25–30, 2015, Pittsburgh, PA, USA. not only the legacy languages, but also modern languages http://dx.doi.org/10.1145/2814228.2814242. such as Haskell, Python, OCaml and C#. 1 In this paper we propose a parsing framework that is able Our vision is a parsing framework that provides the right to deal with many challenges in parsing existing and new level of abstraction for both the language engineer, who de- programming languages. We embrace the need for context signs new languages, and the tool builder, who needs a pars- information at runtime, and base our framework on data- ing technology as part of her toolset. From the language en- dependent grammars [16]. Data-dependent grammars are an gineer’s perspective, our parsing framework provides an out extension of context-free grammars that allow arbitrary com- of the box set of high-level constructs for most common id- putation, variable binding and constraints. These features al- ioms in syntax of programming languages. The language en- low us to simulate hand-written parsers and to implement gineer can also always express her needs directly using data- disambiguation constructs. dependent grammars. From the tool builder’s perspective, To demonstrate the concept of data-dependent grammars our framework provides open, extensible means to define we use the IMAP protocol [29]. In network protocol mes- higher-level syntactic notation, without requiring knowledge sages it is common to send the length of data before the ac- of the internal workings of a parsing technology. tual data. In IMAP, these messages are called literals, and The contributions of this paper are: are described by the following (simplified) context-free rule: L8 ::= ' {' Number '}' Octets • We provide a unified perspective on many important chal- ⇠ lenges in parsing real programming languages. Here Octets recognizes a list of octet (any 8-bit) values. An example of L8 is ~{6}aaaaaa. As can be seen, there is no • We present several high-level syntactical constructs, and data dependency in this context-free grammar, but the IMAP their mappings to data-dependent grammars. specification says that the number of Octets is determined by • We provide an implementation of data-dependent gram- the value parsed by Number. Using data-dependent grammars, mars on top of Generalized LL (GLL) parsing [33] that we can specify such a data-dependency as: runs over ATN grammars [40]. The implementation is 1 L8 ::= ' {' nm:Number {n=toInt(nm.yield)} '}' Octets(n) part of the Iguana parsing framework [2]. ⇠ Octets(n) ::= [n > 0] Octets(n - 1) Octet • | [n == 0] ✏ We demonstrate the initial results of our parsing frame- work, by parsing 20363 real source files of Java, C#, In the data-dependent version, nm provides access to the Haskell (91% success rate), and excerpts from OCaml. value parsed by Number. We retrieve the substring of the input parsed by Number via nm.yield which is converted to integer The rest of this paper is organized as follows. In Section 2 we using toInt. This integer value is bound to variable n, and describe the landscape of parsing programming languages. is passed to Octets. Octets takes an argument that specifies In Section 3 we present data-dependent grammars, our high- the number of iterations. Conditions [n > 0] and [n == 0] level syntactic notation, and the mapping to data-dependent specify which alternative is selected at each iteration. grammars. Section 4 discusses the extension of GLL parsing It is possible to parse IMAP using a general parser, and with data-dependency. In Section 5 we demonstrate the ini- then remove the derivations that violate data dependencies tial results of our parsing framework using grammars of real post parse. However, such an approach would be slow. With- programming languages. We discuss related work in Sec- out enforcing the dependency on the length of Octets during tion 6. A conclusion and discussion of future work is given parsing, given the nondeterministic nature of general pars- in Section 7. ing, all possible lengths of Octets will be tried. There are many common grammar and disambiguation 2. The Landscape of Parsing Programming idioms that can be desugared into data-dependent grammars. Languages Examples of these idioms are operator precedence, longest In this section we discuss well-known features of program- match, and the offside rule. Expecting the language engi- ming languages that make them hard to parse. These features neer to write low-level data-dependent grammars for such motivate our design decisions. cases would be wasteful. Instead, we describe a number of such idioms, provide high-level notation for them and their 2.1 General Parsing for Programming Languages desugaring to data-dependent grammars. For example, using Grammars of programming languages in their natural form our high-level notation the indentation rules in Haskell can are not deterministic, and are often ambiguous. A well- be expressed as follows: known example is the if-then-else construct found in many Decls ::= align (offside Decl)* programming languages. This construct, when written in | ignore('{' Decl (';' Decl) '}') * its natural form, is ambiguous (the dangling-else ambigu- This definition clearly and concisely specifies that either all ity), and therefore cannot be deterministic. Some nonde- declarations in the list are aligned, and each Decl is offsided terministic (and ambiguous) syntactic constructs, such as with regard to its first token (first alternative), or indentation if-then-else, can be rewritten to be deterministic and un- is ignored inside curly braces (second alternative). 1 https://github.com/iguana-parser 2 ambiguous. However, such grammar rewriting in general is comes very important when considering parsing program- not trivial, the resulting grammar is hard to read, maintain ming languages.