
A Typed, Algebraic Approach to Parsing

Neel Krishnaswami, University of Cambridge, [email protected]
Jeremy Yallop, University of Cambridge, [email protected]

Abstract

In this paper, we recall the definition of the context-free expressions (or µ-regular expressions), an algebraic presentation of the context-free languages. Then, we define a core type system for the context-free expressions which gives a compositional criterion for identifying those context-free expressions which can be parsed unambiguously by predictive parsing in the style of recursive descent or LL(1).

Next, we show how these typed expressions can be used to derive a parser combinator library which both guarantees linear-time parsing with no backtracking and single-token lookahead, and which respects the natural denotational semantics of context-free expressions. Finally, we show how to exploit the type information to write a staged version of this library, which produces dramatic increases in performance, even outperforming code generated by the standard parser generator tool ocamlyacc.

1 Introduction

The theory of parsing is one of the oldest and most well-developed areas in computer science: the bibliography to Grune and Jacobs's Parsing Techniques: A Practical Guide lists over 1700 references! Nevertheless, the foundations of the field have remained remarkably stable: context-free languages are specified in Backus–Naur form, and parsers for these specifications are implemented using algorithms derived from automata theory. This integration of theory and practice has yielded many benefits: we have linear-time algorithms for parsing unambiguous grammars efficiently [Knuth 1965; Lewis and Stearns 1968] and with excellent error reporting for bad input strings [Jeffery 2003].

However, in languages with good support for higher-order functions (such as ML, Haskell, and Scala) it is very popular to use parser combinators [Hutton 1992] instead. These libraries typically build backtracking recursive descent parsers by composing higher-order functions. This allows building up parsers with the ordinary abstraction features of the programming language (variables, data structures and functions), which is sufficiently beneficial that these libraries are popular in spite of the lack of a clear declarative reading of the accepted language and bad worst-case performance (often exponential in the length of the input).

Naturally, we want to have our cake and eat it too: we want to combine the benefits of parser generators (efficiency and a clear declarative reading) with the benefits of parser combinators (ease of building high-level abstractions). The two main difficulties are that the binding structure of BNF (a collection of globally-visible, mutually-recursive nonterminals) is at odds with the structure of programming languages (variables are lexically scoped), and that generating efficient parsers requires static analysis of the input grammar.

This paper begins by recalling the old observation that the context-free languages can be understood as the extension of regular expressions with variables and a least-fixed-point operator (aka the "µ-regular expressions" [Leiß 1991]). These features replace the concept of nonterminal from BNF, and so facilitate embedding context-free grammars into programming languages. Our contributions are then:

• First, we extend the framework of µ-regular expressions by giving them a semantic notion of type, building on Brüggemann-Klein and Wood's characterization of unambiguous regular expressions [1992]. We then use this notion of type as the basis for a syntactic type system which checks whether a context-free expression is suitable for predictive (or recursive descent) parsing—i.e., we give a type system for left-factored, non-left-recursive, unambiguous grammars. We prove that this type system is well-behaved (i.e., syntactic substitution preserves types, and typing is sound with respect to the denotational semantics), and that all well-typed grammars are unambiguous.

• Next, we describe a parser combinator library for OCaml based upon this type system. This library exposes basically the standard applicative-style API with a higher-order fixed point operator, but internally it builds a first-order representation of well-typed, binding-safe grammars. Having a first-order representation available lets us reject untypeable grammars. When a parser function is built from this AST, the resulting parsers have extremely predictable performance: they are guaranteed to be linear-time and non-backtracking, using a single token of lookahead.

In addition, our API has no purely-operational features to block backtracking, so the simple reading of disjunction as union of languages and sequential composition as concatenation of languages remains valid (unlike in other formalisms such as packrat parsers). However, while the performance is predictable, it remains worse than the performance of code generated by conventional parser generators such as ocamlyacc.

• Finally, we build a staged version of this library using MetaOCaml. Staging lets us entirely eliminate the abstraction overhead of parser combinator libraries. The grammar type system proves beneficial, as the types give very useful information in guiding the generation of staged code (indeed, the output of staging looks very much like handwritten code). Our resulting parsers are very fast, typically outperforming even ocamlyacc. Table-driven parsers can be viewed as interpreters for a state machine, and staging lets us eliminate this interpretive overhead!

2 API Overview and Examples

Before getting into the technical details, we show the user-level API in our OCaml implementation, and give both examples and non-examples of its use.

2.1 The High Level Interface

From a user perspective, we offer a very conventional combinator parsing API. We have an abstract type 'a t representing parsers which read a stream of tokens and produce a value of type 'a. Elements of this abstract type can be constructed with the following set of constants and functions:

type 'a t
val eps : unit t
val chr : char -> char t
val seq : 'a t -> 'b t -> ('a * 'b) t
val bot : 'a t
val alt : 'a t -> 'a t -> 'a t
val fix : ('b t -> 'b t) -> 'b t
val map : ('a -> 'b) -> 'a t -> 'b t

Here eps is a parser matching the empty string; chr c is a parser matching the single character c (the API in this paper is slightly simplified: our real implementation is parameterized over token types, rather than baking in the char type); and seq l r is a parser that parses strings with a prefix parsed by l and a suffix parsed by r. The value bot is a parser that never succeeds, and alt l r is a parser that parses strings parsed either by l or by r. Finally, we have a fixed point operator fix for constructing recursive grammars (we are considering extending fix in the style of Yallop and Kiselyov [2019] to make defining mutually-recursive grammars more convenient). The API is also functorial: the expression map f p applies a user-supplied function f to the result of the parser p.

This API is basically the same as the one introduced by Swierstra and Duponcheel [1996], with the main difference that our API is in the "monoidal" style (we give sequencing the type seq : 'a t -> 'b t -> ('a * 'b) t), while theirs is in the "applicative" style (i.e., seq : ('a -> 'b) t -> 'a t -> 'b t). However, as McBride and Paterson [2008] observe, the two styles are equivalent.

These combinators build an abstract grammar, which can be converted into an actual parser (i.e., a function from streams to values, raising an exception on strings not in the grammar) using the parser function.

exception TypeError of string
val parser : 'a t -> (char Stream.t -> 'a)

Critically, however, the parser function does not succeed on all abstract grammars: instead, it raises a TypeError exception if the grammar is ill-formed (i.e., ambiguous or left-recursive). As a result, parser is able to provide a stronger guarantee when it does succeed: if it returns a parsing function, that function is guaranteed to parse in linear time with single-token lookahead.

2.2 Examples

General Utilities One benefit of combinator parsing is that it permits programmers to build up a library of abstractions over the basic parsing primitives. Before showing how this works, we define a few utility functions and operators to make the examples more readable. The expression always x is a constant function that always returns x:

let always x = fun _ -> x

We define (++) as an infix version of the seq function, and (==>) as an infix map operator to resemble semantic actions in traditional parser generators:

let (++) = seq
let (==>) p f = map f p

We also define any, an n-ary version of the binary alternative alt, using a list fold:

val any : 'a t list -> 'a t
let any gs = List.fold_left alt bot gs

The parser option r parses either the empty string or alternatively anything that r parses:

val option : 'a t -> 'a option t
let option r = any [eps ==> always None;
                    r ==> (fun x -> Some x)]

These operations can be combined with the fixed point operator to define the Kleene star function. The parser star g parses a list of the input parsed by g: either the empty list, or an instance of g (the head of the list) and star g (the tail) again recursively. Naturally, we can also define the transitive closure plus as well.

val star : 'a t -> 'a list t
let star g = fix (fun rest -> any [
    eps ==> always [];
    g ++ rest ==> (fun (x, xs) -> x :: xs)])

val plus : 'a t -> 'a list t
let plus g = g ++ star g ==> (fun (x, xs) -> x :: xs)
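As a quick illustration of how these pieces fit together (our own sketch, not an example from the paper, assuming Stream.of_string from the OCaml standard library to supply the character stream), the combinators above can be used to build and run a tiny parser:

(* Count a run of 'a' characters: star (chr 'a') collects the matched
   characters into a list, and ==> maps List.length over the result. *)
let count_as : int t = star (chr 'a') ==> List.length

(* parser typechecks the grammar and returns a parsing function. *)
let run_count (s : string) : int =
  parser count_as (Stream.of_string s)

(* run_count "aaaa" is expected to evaluate to 4. *)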

Since this gives us the API of regular expressions, we can use these operations to define, for example, lexers and tokenizers:

type token = SYMBOL of string | LPAREN | RPAREN

val charset : string -> char t
let charset s = any (List.map chr (list_of_string s))

let lower = charset "abcdefghijklmnopqrstuvwxyz"
let upper = charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

let word = upper ++ star lower
           ==> (fun (c, cs) -> string_of_list (c :: cs))
let symbol = word ==> (fun s -> SYMBOL s)
let lparen = chr '(' ==> always LPAREN
let rparen = chr ')' ==> always RPAREN

let token = any [symbol; lparen; rparen]

Here charset s is a parser that accepts and returns any of the characters in the string s, upper and lower parse any single upper- or lower-case letter respectively, and word parses an upper-case letter followed by any number of lowercase letters. The symbol, lparen and rparen parsers respectively accept a word, a left parenthesis and a right parenthesis, and each returns a corresponding value of type token.

Recognizing S-Expressions However, the main interest of parser combinators lies in their ability to handle context-free languages such as s-expressions. Below, we give a simple combinator grammar parsing s-expressions, in which an s-expression is parsed as either a symbol, or a list of s-expressions between a left and a right parenthesis.

type sexp = Sym | Seq of sexp list

let paren p = lparen ++ p ++ rparen ==> fun ((_, x), _) -> x

let sexp = fix (fun sexp -> any [
    symbol ==> always Sym;
    paren (star sexp) ==> (fun s -> Seq s)])

The symbol case is wrapped in a Sym constructor, and the paren case ignores the data for the left and right parentheses and wraps the list s of s-expressions in a Seq constructor.

Infix Expressions One common criticism of techniques based on recursive descent is that they make it difficult to handle naturally left-recursive grammars such as arithmetic expressions without awkward grammar transformations. However, with combinators it is possible to package up these transformations once-for-all.

The infixr combinator takes a parser recognizing infix operators (returning an operator function of type 'a -> 'a -> 'a) and a base parser (returning elements of type 'a), and produces a right-associative parser for that operator class:

val infixr : ('a -> 'a -> 'a) t -> 'a t -> 'a t
let infixr op base =
  fix (fun g ->
    base ++ option (op ++ g) ==> function
      | (e, None) -> e
      | (e, Some (f, e')) -> f e e')

This works by parsing a base phrase, and then checking whether an operator and a subexpression follow. If they do not, then the base expression is returned; otherwise the operator function f is applied to combine the two subexpressions. Likewise, it is possible to define a combinator infixl to generate parsers for left-associative operators.

val infixl : ('a -> 'a -> 'a) t -> 'a t -> 'a t
let infixl op base =
  let reassociate (e, oes) =
    List.fold_left (fun e (f, e') -> f e e') e oes
  in base ++ star (op ++ base) ==> reassociate

This combinator uses one of the classical techniques for handling left-associative operators in recursive descent—it parses an expression e1 ^ e2 ^ e3 ^ e4 into a head e1 and a list of pairs [(^, e2); (^, e3); (^, e4)], and then uses a left fold to associate the subterms as (((e1 ^ e2) ^ e3) ^ e4).

These two functions can be used to build a function infix, which takes a list of associativities and operator parsers ordered by precedence, and returns a parser for the whole grammar.

type assoc = Left | Right
val infix : (assoc * ('a -> 'a -> 'a) t) list -> 'a t -> 'a t
let infix aos base =
  let make_level base = function
    | Left, op -> infixl op base
    | Right, op -> infixr op base
  in List.fold_left make_level base aos

Below, we give a simple expression parser which defines arithmetic expressions. The base expressions are either numeric literals or fully-parenthesized expressions, and these are fed to the infix operator along with a list of associativities and operator parsers.

val num : float t (* definition omitted for space *)

let arith = fix (fun arith ->
  infix [(Left,  chr '+' ==> always Float.add);
         (Left,  chr '*' ==> always Float.mul);
         (Right, chr '^' ==> always Float.pow)]
        (any [num; paren arith]))

Non-Examples A key feature of our parser combinator library is that it rejects certain expressions. The following expressions are all elements of 'a t, which our parser function will reject with a TypeError exception.

let bad1 = any [chr 'a' ==> always 1;
                chr 'a' ==> always 2]

The bad1 definition is rejected because it suffers from disjunctive non-determinism. Given the string "a", it could legitimately return either 1 or 2, depending on whether the first or second alternative is taken.

let bad2 = (option (chr 'a')) ++ (option (chr 'a'))

The bad2 definition is rejected because it has sequential non-determinism. It is ambiguous whether the parser should return (Some 'a', None) or (None, Some 'a') when parsing the string "a".

let bad3 p = fix (fun left -> any [
    eps ==> always [];
    (left ++ p) ==> (fun (xs, x) -> List.append xs [x])])

The function bad3 is an alternative definition of the Kleene star that is left-recursive. This definition would make recursive descent parsers loop indefinitely, so our library rejects it.

let bad4 = any [chr 'a' ++ chr 'b' ==> always 1;
                chr 'a' ++ chr 'c' ==> always 2]

The bad4 definition is not ambiguous, but it is also not left-factored. It is consequently unclear whether to take the left or the right branch with only a single token of lookahead.

3 Context-Free Expressions

We begin the formal development by defining the grammar of context-free expressions in Figure 1. Given a finite set of characters Σ and a countably infinite set of variables V, the context-free expressions are: ⊥, denoting the empty language; the expression g ∨ g′, denoting the language which is the union of the languages denoted by g and g′; the expression ϵ, denoting the language containing only the empty string; the expression c, denoting the language containing the one-element string c; the expression g · g′, denoting the strings which are concatenations of strings from g and strings from g′; and the variable reference x and the least fixed point operator µx. g. Variable scoping is handled as usual: the fixed point is considered to be a binding operator, free and bound variables are defined as usual, and terms are considered only up to α-equivalence. As a notational convention, we use x, y, z to indicate variables, and a, b, c to represent characters from the alphabet Σ. Intuitively, the context-free expressions can be understood as an extension of the regular expressions with a least fixed point operator. We omit the Kleene star g∗ since it is definable by means of a fixed point: g∗ ≜ µx. ϵ ∨ g · x (as indeed we saw in the examples).

g ::= ⊥ | g ∨ g′ | ϵ | c | g · g′ | x | µx. g

⟦⊥⟧γ       = ∅
⟦g ∨ g′⟧γ  = ⟦g⟧γ ∪ ⟦g′⟧γ
⟦ϵ⟧γ       = {ε}
⟦c⟧γ       = {c}
⟦g · g′⟧γ  = {w · w′ | w ∈ ⟦g⟧γ ∧ w′ ∈ ⟦g′⟧γ}
⟦x⟧γ       = γ(x)
⟦µx. g⟧γ   = fix(λX. ⟦g⟧(γ, X/x))

fix(f) = ⋃ { Li | i ∈ ℕ }   where L0 = ∅ and Ln+1 = f(Ln)

Figure 1. Syntax and semantics of context-free expressions

The context-free expressions are equivalent in expressive power to Backus–Naur form. Their main difference is that BNF offers a single n-ary mutually-recursive fixed point at the outermost level, while context-free expressions only have unary fixed points, but permit nesting them. These two forms of recursion are interderivable via Bekič's lemma [Bekič and Jones 1984]. (See Grathwohl et al. [2014] for the proof in the context of language theory rather than domain theory.) However, parsing algorithms for general context-free grammars (e.g., Earley or CYK) have superlinear worst-case complexity. Furthermore, parsing may be ambiguous, with multiple possible parse trees for the same string. It is well-known, however, that grammars falling into more restrictive classes, such as the LL(k) and LR(k) classes, can be parsed efficiently in linear time. In this paper, we give a type system for grammars parseable by recursive descent.

Semantics and Untyped Equational Theory The denotational semantics of context-free expressions is also given in Figure 1. We interpret each context-free expression as a language (i.e., a subset of the set of all strings Σ∗). The interpretation function interprets each context-free expression as a function taking an interpretation of the free variables to a language. The meaning of ⊥ is the empty set; the meaning of g ∨ g′ is the union of the meanings of g and g′; the meaning of ϵ is the singleton set containing the empty string; the meaning of c is the singleton set containing the one-character string c drawn from the alphabet Σ; and the meaning of g · g′ is the set of strings formed from a prefix drawn from g and a suffix drawn from g′. Variables x are looked up in the environment, and µx. g is interpreted as the least fixed point of g with respect to x.

Proposition 3.1. The context-free expressions satisfy the equations of an idempotent semiring with (∨, ⊥) as addition and its unit, and (·, ϵ) as the multiplication. In addition, fixed points satisfy the following equations:

    µx. g = [µx. g/x]g        µx. g0 ∨ x · g1 = µx. g0 · g1∗

The semiring equations are all standard. The first fixed point equation is the standard unrolling equation for fixed points. The second equation is familiar to anyone who has implemented a recursive descent parser: it is the rule for left-recursion elimination. Leiß [1991] proves this rule in the general setting of Kleene algebra with fixed points, but it is easily proved directly (by induction on the number of unrollings) as well.

Languages form a lattice with a partial order given by set inclusion, and this lifts to environments in the expected way: if γ = (L1/x1, ..., Ln/xn) and γ′ = (L′1/x1, ..., L′n/xn), we write γ ⊆ γ′ if Li ⊆ L′i for all i in 1...n. This lets us show that the interpretation of grammars is also monotone in its free variables.

Proposition 3.2. (Monotonicity of µ-regular expressions) If γ ⊆ γ′ then ⟦g⟧γ ⊆ ⟦g⟧γ′.
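To make the left-recursion elimination equation concrete, here is a small worked instance of our own (not an example from the paper): a left-recursive grammar of sums over a numeral n, and the result of applying the equation with $g_0 = n$ and $g_1 = ({+}\cdot n)$:

$$\mu x.\; n \;\vee\; x \cdot ({+}\cdot n) \;=\; \mu x.\; n \cdot ({+}\cdot n)^{*} \;=\; n \cdot ({+}\cdot n)^{*}$$

The final step simply drops the now-unused binder x. The right-hand side is exactly the base ++ star (op ++ base) shape used by the infixl combinator of §2.2.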

Types for Languages There are two main sources of ambiguity in predictive parsing. First, when we parse a string w against a grammar of the form g1 ∨ g2, we have to decide whether w belongs to g1 or to g2. If we cannot predict which branch to take, then we have to backtrack. (Naive parser combinators [Hutton 1992] are particularly prone to this problem.) Second, when we parse a string w against a grammar of the form g1 · g2, we have to break it into two pieces w1 and w2 so that w = w1 · w2 and w1 belongs to g1 and w2 belongs to g2. If there are many possible ways for a string to be split into the g1-fragment and the g2-fragment, then we have to try them all, again introducing backtracking into the parsing process.

Hence we need properties we can use to classify the languages which can be parsed efficiently. To do so, we introduce the following functions on languages:

Null  ∈ P(Σ∗) → 2        Null(L)  = if ϵ ∈ L then true else false
First ∈ P(Σ∗) → P(Σ)     First(L) = {c | ∃w ∈ Σ∗. c · w ∈ L}
FLast ∈ P(Σ∗) → P(Σ)     FLast(L) = {c | ∃w ∈ L \ {ϵ}, w′ ∈ Σ∗. w · c · w′ ∈ L}

Null(L) returns true if the empty string is in L, and false otherwise. First(L) is the set of characters that can start any string in L, and FLast(L) is the set of characters which can follow the last character of a string in L. (The FLast set is used as an alternative to the Follow sets traditionally used in LL(1) parsing.)

We now define a type τ (see Figure 2) as a record of three fields, recording nullability, a First set, and an FLast set. We say that a language L satisfies a type τ (written L |= τ) when:

    L |= τ  ⟺  (Null(L) ⟹ τ.Null)  ∧  First(L) ⊆ τ.First  ∧  FLast(L) ⊆ τ.FLast

This ensures that the type τ is an overapproximation of the properties of L. What makes this notion of type useful are the following two lemmas.

Lemma 3.3. (Disjunctive unique decomposition) Suppose L and M are languages. If First(L) ∩ First(M) = ∅ and ¬(Null(L) ∧ Null(M)), then L ∩ M = ∅.

Lemma 3.4. (Sequential unique decomposition) Suppose L and M are languages. If FLast(L) ∩ First(M) = ∅ and ¬Null(L) and w ∈ L · M, then there are unique wL ∈ L and wM ∈ M such that wL · wM = w.

The first property ensures that when we take a union L ∪ M, each string in the union belongs uniquely to either L or M. This eliminates disjunctive non-determinism from parsing: if we satisfy this condition, then parsing an alternative g ∨ g′ will lead to parsing exactly one or the other branch; there will never be a case where both g and g′ contain the same string. So we can tell whether to parse with g or g′ just by looking at the first token of the input.

The second property eliminates sequential non-determinism: it gives a condition ensuring that if we concatenate two languages L and M, each string in the concatenated language can be uniquely broken into an L-part and an M-part. Therefore if L and M satisfy this condition, then as soon as we have recognised an L-word in the input, we can move immediately to parsing an M-word (when parsing for L · M).

Practical use of language types requires being able to easily calculate types from smaller types. Nullability and First sets have this property immediately, but it is not obvious for FLast sets. Indeed, the Follow sets of traditional LL(1) parsing lack this property! Follow is the set of characters that can follow a particular nonterminal, and so is a property of a grammar rather than a property of languages.

FLast sets were originally introduced by Brüggemann-Klein and Wood [1992]. In that paper, they prove a Kleene-style theorem for the deterministic regular languages, which are those regular expressions which can be compiled to state machines without an exponential blowup in the NFA-to-DFA construction. As part of their characterisation, they defined the notion of an FLast set. It was not explicitly remarked upon in that paper, but their definition, which both rules out sequential non-determinism and is defined in a grammar-independent way, offers a compositional alternative to the traditional follow set computation in LL parsing.

Types τ ∈ {Null : 2; First : P(Σ); FLast : P(Σ)}

τ1 ⊛ τ2  ≜  τ1.FLast ∩ τ2.First = ∅  ∧  ¬τ1.Null
τ1 # τ2  ≜  (τ1.First ∩ τ2.First = ∅)  ∧  ¬(τ1.Null ∧ τ2.Null)
b ⇒ S    ≜  if b then S else ∅

τ⊥ = {Null = false; First = ∅;   FLast = ∅}
τϵ = {Null = true;  First = ∅;   FLast = ∅}
τc = {Null = false; First = {c}; FLast = ∅}

τ1 ∨ τ2 = { Null  = τ1.Null ∨ τ2.Null;
            First = τ1.First ∪ τ2.First;
            FLast = τ1.FLast ∪ τ2.FLast }

τ⊥ · τ = τ⊥        τ · τ⊥ = τ⊥
τ1 · τ2 = { Null  = τ1.Null ∧ τ2.Null;
            First = τ1.First ∪ (τ1.Null ⇒ τ2.First);
            FLast = τ2.FLast ∪ (τ2.Null ⇒ (τ2.First ∪ τ1.FLast)) }

τ∗ = { Null  = true;
       First = τ.First;
       FLast = τ.FLast ∪ τ.First }

Figure 2. Definition of Types

In Figure 2, we give both the grammar of types (just a ternary record), as well as a collection of basic types and operations on them. We define τ⊥ to be the type of the empty language: it is not nullable and has empty First and FLast sets. We define τϵ to be the type of the language containing just the empty string: it is nullable but has empty First and FLast sets. τc is the type of the language containing just the single-character string c, and so it is not nullable, has a First set of {c}, and has an empty FLast set. The τ1 ∨ τ2 operation constructs the join of two types, taking the logical-or of the Null fields and the unions of the First and FLast sets.

The τ1 · τ2 operation calculates a type for a language concatenation by taking the conjunction of the Null fields. The First set is the First set of τ1, unioned with the First set of τ2 when τ1 is nullable. The FLast set is the FLast set of τ2, merged with the FLast set of τ1 and the First set of τ2 when τ2 is nullable. Additionally, if either of the types is the type of the empty language, we apply an optimization that returns the type of the empty language, since the concatenation of any language with the empty language is the empty language.

One fact of particular note is that these definitions of First and FLast are not correct in general. That is, if L |= τ1 and M |= τ2, then in general L · M ̸|= τ1 · τ2. It only holds when L and M are separable: that is, they must meet the preconditions of the decomposition lemma (Lemma 3.4). We define a separability relation τ1 ⊛ τ2 to indicate this, which holds when τ1.FLast and τ2.First are disjoint and τ1.Null is false. Similarly, we must also define apartness τ1 # τ2 for non-overlapping languages, by checking that at most one of τ1.Null and τ2.Null holds, and that the First sets are disjoint (Lemma 3.3). This lets us prove the following properties of the type operators and language satisfaction:

Lemma 3.5. (Properties of Satisfaction)
1. L |= τ⊥ if and only if L = ∅.
2. L |= τϵ if and only if L = {ϵ}.
3. If L = {c} then L |= τc.
4. If L |= τ and M |= τ′ and τ ⊛ τ′, then L · M |= τ · τ′.
5. If L |= τ and M |= τ′ and τ # τ′, then L ∪ M |= τ ∨ τ′.
6. If L |= τ and τ ⊛ τ, then L∗ |= τ∗.
7. If F is a monotone function on languages such that for all L, L |= τ implies F(L) |= τ, then µF |= τ.

Proof. Most of these properties are straightforward, except for (4), the satisfaction property for language concatenation. Even for this property, the only interesting case is the soundness of FLast, whose proof we sketch below.

Assume that we have languages L and M, and types τ and τ′, such that L |= τ and M |= τ′ and τ ⊛ τ′ holds. Note that, by the definition of satisfaction, First(M) ⊆ τ′.First, FLast(L) ⊆ τ.FLast, and the empty string is not in L. Therefore, by the unique decomposition lemma, we know that for every word w in L · M, there are unique wL ∈ L and wM ∈ M such that wL · wM = w.

Now, we want to show that FLast(L · M) ⊆ (τ · τ′).FLast. To show this, assume that c ∈ FLast(L · M). So there exist w ∈ L · M \ {ϵ} and w′ ∈ Σ∗ such that w · c · w′ ∈ L · M. Since w ∈ L · M, we know that w decomposes into wL ∈ L and wM ∈ M such that wL · wM = w, with wL nonempty. So we know that wL · wM · c · w′ ∈ L · M with wL nonempty. Now, consider whether wM is the empty string or not.

If wM is the empty string, then we know that τ′.Null must be true. In addition, we know that wL · c · w′ ∈ L · M. By unique decomposition, we know that there are unique wL′ ∈ L and wM′ ∈ M such that wL′ · wM′ = wL · c · w′. Depending on whether wM′ is the empty string or not, we can conclude that either wL′ = wL · c · w′ and wM′ = ϵ, or that wL′ = wL and c · w′ ∈ M (which follows since we know that c is not in FLast(L)). In the first case, c ∈ FLast(L), and since τ′.Null we know (τ · τ′).FLast = τ′.FLast ∪ τ′.First ∪ τ.FLast, hence c ∈ (τ · τ′).FLast. In the second case, c ∈ First(M), and again we know (τ · τ′).FLast = τ′.FLast ∪ τ′.First ∪ τ.FLast. So either way, c ∈ (τ · τ′).FLast.

If wM is nonempty, then we know that wL · wM · c · w′ ∈ L · M, and by unique decomposition we have wL′ ∈ L and wM′ ∈ M such that wL′ · wM′ = wL · wM · c · w′. We know that wL′ cannot be a proper prefix of wL, because otherwise we would violate the unique decomposition property. We also know that wL′ cannot be longer than wL, because otherwise the first character of wM would be in FLast(L), which contradicts the property that FLast(L) ∩ First(M) = ∅. Hence wL′ = wL and wM′ = wM · c · w′. Hence c ∈ FLast(M), which immediately means that c ∈ (τ · τ′).FLast. □
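The type operators of Figure 2 transcribe almost directly into code. The following is an illustrative sketch of our own (simplified from the record type that appears in §5.2, which also carries a guarded field, and assuming a CharSet module with the usual set operations) of the τ1 ∨ τ2 and τ1 · τ2 operators together with the ⊛ and # checks:

type tp = { null : bool; first : CharSet.t; flast : CharSet.t }

(* τ1 ∨ τ2: join of the two types (Figure 2). *)
let tp_alt t1 t2 =
  { null  = t1.null || t2.null;
    first = CharSet.union t1.first t2.first;
    flast = CharSet.union t1.flast t2.flast }

(* τ1 · τ2: type of a concatenation (Figure 2); the τ⊥ short-circuit
   optimization is omitted here for brevity. *)
let tp_seq t1 t2 =
  { null  = t1.null && t2.null;
    first = CharSet.union t1.first
              (if t1.null then t2.first else CharSet.empty);
    flast = CharSet.union t2.flast
              (if t2.null then CharSet.union t2.first t1.flast
               else CharSet.empty) }

(* Separability τ1 ⊛ τ2 and apartness τ1 # τ2. *)
let separable t1 t2 =
  not t1.null && CharSet.is_empty (CharSet.inter t1.flast t2.first)
let apart t1 t2 =
  not (t1.null && t2.null)
  && CharSet.is_empty (CharSet.inter t1.first t2.first)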

4 Typing µ-regular Expressions

We now use the semantic types and type operators defined in the previous section to define a syntactic type system for grammars parseable by recursive descent. The main judgement we introduce (in Figure 4) is the typing judgement Γ; ∆ ⊢ g : τ, which is read as "under ordinary hypotheses Γ and guarded hypotheses ∆, the grammar g has the type τ".

Contexts        Γ, ∆ ::= • | Γ, x : τ
Substitutions   γ, δ ::= • | γ, L/x

Figure 3. Contexts and Substitutions

This judgement has two contexts for variables, one for ordinary variables and one for the "guarded" variables, which are variables that must occur to the right of a nonempty string not containing that variable. So if x : τ is guarded, the grammar x is considered ill-typed, but the grammar c · x is permitted. The TVar rule implements this restriction: it says that if x : τ ∈ Γ, then x has the type τ. Note that it does not permit referring to variables in the guarded context ∆. The TEps rule says that the empty string has the empty string type τϵ, and similarly the rules for the other constants, TChar and TBot, return the types τc and τ⊥ for the singleton-string and empty-grammar constants.

TEps:   Γ; ∆ ⊢ ϵ : τϵ
TChar:  Γ; ∆ ⊢ c : τc
TBot:   Γ; ∆ ⊢ ⊥ : τ⊥
TVar:   if x : τ ∈ Γ then Γ; ∆ ⊢ x : τ
TFix:   if Γ; ∆, x : τ ⊢ g : τ then Γ; ∆ ⊢ µx : τ. g : τ
TCat:   if Γ; ∆ ⊢ g : τ and Γ, ∆; • ⊢ g′ : τ′ and τ ⊛ τ′ then Γ; ∆ ⊢ g · g′ : τ · τ′
TVee:   if Γ; ∆ ⊢ g : τ and Γ; ∆ ⊢ g′ : τ′ and τ # τ′ then Γ; ∆ ⊢ g ∨ g′ : τ ∨ τ′

Figure 4. Typing for Context-free Expressions

The TCat rule governs when a concatenation g · g′ is well-typed. Obviously, both g and g′ have to be well-typed at τ and τ′, but in addition the two types have to be separable (i.e., τ ⊛ τ′ must hold). One consequence of the separability condition is that τ.Null = false; that is, g must not match the empty string. As a result, we can allow g′ to refer freely to the guarded variables, and so when typechecking g′ we move all of the guarded hypotheses into the unrestricted context.

The TVee rule explains when a union is well-typed: if g and g′ are well-typed at τ and τ′, and the two types are apart (i.e., τ # τ′), then the union is well-typed at τ ∨ τ′.

This machinery is put to use in the TFix rule. It says that a context-free expression µx : τ. g is well-typed when, under the guarded hypothesis that x has type τ, the whole grammar g has type τ. (We have added a type annotation to the binder to make typing syntax-directed; as an abuse of notation, we will not distinguish annotated from unannotated context-free expressions. Our implementation (Sections 5–6) does not require annotations: since grammar types form a lattice with a bottom element, we perform a fixed point computation to find the minimal type which works as an annotation.) Since the binder of the fixed point is a guarded variable, this ensures that the fixed point as a whole is guarded, and that no left-recursive definitions are typeable. (This is similar to the typing rule for guarded recursion in Atkey and McBride [2013]; Krishnaswami [2013].)

The type system satisfies the expected syntactic properties, such as weakening and substitution.

Lemma 4.1. (Weakening and Transfer) We have that:
1. If Γ; ∆ ⊢ g : τ, then Γ, x : τ′; ∆ ⊢ g : τ.
2. If Γ; ∆ ⊢ g : τ, then Γ; ∆, x : τ′ ⊢ g : τ.
3. If Γ; ∆, x : τ′ ⊢ g : τ, then Γ, x : τ′; ∆ ⊢ g : τ.

Proof. By induction on derivations. We prove these properties sequentially, first proving 1, then 2, then 3. □

We can weaken in both contexts, and additionally support a transfer property, which lets us move variables from the guarded to the unguarded context. The intuition behind transfer is that since guarded variables can be used in fewer places than unguarded ones, a term that typechecks with a guarded variable x will also typecheck when x is unguarded. We use these properties to prove syntactic substitution.

Lemma 4.2. (Syntactic Substitution) We have that:
1. If Γ, x : τ; ∆ ⊢ g′ : τ′ and Γ; ∆ ⊢ g : τ, then Γ; ∆ ⊢ [g/x]g′ : τ′.
2. If Γ; ∆, x : τ ⊢ g′ : τ′ and Γ, ∆; • ⊢ g : τ, then Γ; ∆ ⊢ [g/x]g′ : τ′.

Proof. By induction on the relevant derivations. □

We give two substitution principles, one for unguarded variables and one for guarded variables. Since guarded variables are always used in a guarded context, we do not need to track the guardedness of the term we substitute. As a result, the premise of the guarded substitution lemma only requires that the term g has the typing Γ, ∆; • ⊢ g : τ; it does not need to enforce any requirements on the guardedness of the term being substituted.

Next, we show that the type system is sound—that the language each well-typed context-free expression denotes in fact satisfies the type that the type system ascribes to it. We first extend satisfaction to contexts, and then show that well-typed terms are sound.

Definition 1. (Context Satisfaction) We define context satisfaction γ |= Γ as the recursive definition:

    • |= •                     ≜  always
    (γ, L/x) |= (Γ, x : τ)     ≜  γ |= Γ and L |= τ

This says that a substitution γ satisfies a context Γ if the language each variable in γ refers to satisfies the type that Γ ascribes to it. We can now prove soundness:

Theorem 4.3. (Semantic Soundness) If Γ; ∆ ⊢ g : τ and γ |= Γ and δ |= ∆, then ⟦g⟧(γ, δ) |= τ.

Proof. The proof is by induction on the typing derivation. All of the cases are straightforward, with the main point of interest being the proof of the fixed point rule TFix. We have γ |= Γ and δ |= ∆. By induction, we know that if (δ, X/x) |= (∆, x : τ), then ⟦g⟧(γ, δ, X/x) |= τ. Let F = λX. ⟦g⟧(γ, δ, X/x). Now, note that F is monotone by Proposition 3.2. Hence by Lemma 3.5, µF |= τ, which is the conclusion we sought. □
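To make the interplay of the two contexts concrete, here is a small worked instance of our own (not from the paper). Consider typing c · x under a guarded hypothesis x : τ, as arises inside a fixed point µx : τ. c · x:

    By TChar,  •; x : τ ⊢ c : τc.
    By TVar,   x : τ; • ⊢ x : τ   (TCat's second premise moves the guarded hypothesis into the unrestricted context).
    τc ⊛ τ holds, since τc.FLast = ∅ and τc.Null = false.
    Hence, by TCat,  •; x : τ ⊢ c · x : τc · τ.

By contrast, x · c is rejected: TCat's first premise would require •; x : τ ⊢ x : τ, and TVar does not allow using a variable from the guarded context ∆.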

4.1 Typed Expressions are Unambiguous

We will now show that there is at most one way to parse a typed context-free expression. We first give a judgement g ⇒ w (in Figure 5), which supplies inference rules explaining when a grammar g can produce a word w. The rules are fairly straightforward. DEps states that the empty-string grammar ϵ can produce the empty string, and DChar states that a single-character grammar c can produce the singleton string c. The DCat rule says that if g ⇒ w and g′ ⇒ w′, then g · g′ ⇒ w · w′. The D∨L rule says that a disjunctive grammar g1 ∨ g2 can produce a string w if g1 can, and symmetrically the D∨R rule says that it can produce w if g2 can; and the DFix rule asserts that a fixed point grammar µx : τ. g can generate a word if its unfolding can. There is no rule for the empty grammar ⊥ (it denotes the empty language), or for variables (since we only consider closed expressions).

DEps:   ϵ ⇒ ϵ
DChar:  c ⇒ c
DCat:   if g ⇒ w and g′ ⇒ w′ then g · g′ ⇒ w · w′
D∨L:    if g1 ⇒ w then g1 ∨ g2 ⇒ w
D∨R:    if g2 ⇒ w then g1 ∨ g2 ⇒ w
DFix:   if [µx : τ. g/x]g ⇒ w then µx : τ. g ⇒ w

Figure 5. Inference Rules for Language Membership

Generally, there can be multiple ways that a single string can be generated from the same grammar. For example, c ∨ c ⇒ c either along the left branch or the right branch, using either the D∨L or D∨R derivation rule. This reflects the fact that the grammar c ∨ c is ambiguous: there are multiple possible derivations for it. So we will prove that our type system identifies unambiguous grammars by proving that, for each typed, closed grammar g and each word w, there is exactly one derivation of g ⇒ w just when w ∈ ⟦g⟧•.

Proving this directly on the syntax is a bit inconvenient, since unfolding a grammar can increase its size. So we first identify a metric on typed grammars which is invariant under unfolding. Define the rank of an expression as follows:

Definition 2. (Rank of a context-free expression)

rank(g) = 1 + rank(g′)                when g = µx : τ. g′
          1 + rank(g′) + rank(g′′)    when g = g′ ∨ g′′
          1 + rank(g′)                when g = g′ · g′′
          0                           otherwise

Intuitively, the rank of a context-free expression is the size of the subtree within which a guarded variable cannot appear at the front. Since the variables in a fixed point expression µx : τ. g are always guarded, the rank of the fixed point will not change when it is unrolled.

Lemma 4.4. (Rank Preservation) If Γ; ∆, x : τ′ ⊢ g : τ and Γ, ∆; • ⊢ g′ : τ′, then rank(g) = rank([g′/x]g).

Proof. This follows from an induction on derivations. Since guarded variables always occur underneath the right-hand side of a concatenation, substituting anything for one will not change the rank of the result. □

Theorem 4.5. (Unambiguous Parse Derivations) If •; • ⊢ g : τ, then w ∈ ⟦g⟧• iff there is a unique derivation D :: g ⇒ w.

Proof. (Sketch) The proof is by a lexicographic induction on the length of w and the rank of g. We then do a case analysis on the shape of g, analysing each case in turn. This proof relies heavily on the semantic soundness of typing. For example, soundness ensures that the interpretations of the two branches of g1 ∨ g2 are disjoint, which ensures that at most one of D∨L or D∨R can apply. □

4.2 Recursive Descent Parsing for Typed Grammars

Since the type system essentially enforces the constraints necessary for predictive parsing, it is possible to read off a parsing algorithm from the structure of a typed context-free expression. In Figure 6, we give a simple algorithm to generate a parser from a typing derivation. This algorithm defines a parsing function P(−), which takes as arguments a typing derivation Γ; ∆ ⊢ g : τ, as well as environments γ̂ and δ̂ which give parsers for each of the free variables in g. This function defines a parser by recursion over the structure of the typing derivation. We define a parser (really, a recognizer) as a partial function on strings of characters (Σ∗ ⇀ Σ∗). Since the algorithm is entirely compositional, it can be understood as a species of combinator parsing (indeed, this is how we implement it in OCaml).

Soundness We write p ▷ L to indicate that the parser p is sound for the language L, which means:

    For all w, w′′, if p(w) = w′′ then there is a w′ ∈ L such that w = w′ · w′′.

So if a parser takes w as input and returns w′′, then the stream w can be divided into a prefix w′ and a suffix w′′, and that prefix w′ is a word in L. We write p ▷n L to constrain this to the subset of L consisting of strings of length n or less.

This lifts to environments in the obvious way. Given an environment γ̂ of recognizers (p1/x1, ..., pn/xn) and a substitution γ of languages (L1/x1, ..., Ln/xn), we write γ̂ ▷ γ when pi ▷ Li for each i ∈ {1 ... n}. The constrained variant γ̂ ▷n γ is defined similarly.

Theorem 4.6. (Soundness of Parsing) For all n, assume Γ; ∆ ⊢ g : τ, and γ |= Γ and δ |= ∆, and γ̂ ▷n γ and δ̂ ▷n−1 δ. Then we have that P(Γ; ∆ ⊢ g : τ) γ̂ δ̂ ▷n ⟦g⟧(γ, δ).

Proof. The proof is by a lexicographic induction on the size of n and the structure of the derivation of Γ; ∆ ⊢ g : τ. The two most interesting cases are the fixed point µx. g and the sequential composition g1 · g2. The fixed point case relies upon the fact that the induction hypothesis tells us that the recursive parser is sound for strings strictly smaller than n, which lets us put it into δ̂. The sequential composition case relies upon the fact that g1 recognizes only non-empty strings (i.e., of length greater than 0) to justify combining δ̂ and γ̂ as the typing rule requires. □

Completeness We write p ▶ L to indicate that a parser p is complete for a language L, which means:

    For all w ∈ L, c ∈ Σ and w′′ ∈ Σ∗ such that c ∉ FLast(L), and also c ∉ First(L) when ε ∈ L, we have p(w · c · w′′) = c · w′′.

We write p ▶n L to constrain this to the subset of L consisting of strings of length n or less.

Just as with soundness, this lifts to environments: an environment γ̂ of recognizers (p1/x1, ..., pn/xn) is complete for a substitution γ of languages (L1/x1, ..., Ln/xn) when pi ▶ Li for each i. The constrained variant γ̂ ▶n γ is defined similarly.

Theorem 4.7. (Completeness of Parsing) For all n, assume Γ; ∆ ⊢ g : τ and γ |= Γ and δ |= ∆ and γ̂ ▶n γ and δ̂ ▶n−1 δ. Then P(Γ; ∆ ⊢ g : τ) γ̂ δ̂ ▶n ⟦g⟧(γ, δ).

Proof. The proof is by a lexicographic induction on the size of n and the structure of the derivation of Γ; ∆ ⊢ g : τ. As with soundness, sequential composition g1 · g2 is the interesting case, and requires considering whether g2 is nullable or not. In this case, we need to argue that c ∉ First(⟦g2⟧γ δ), which follows because the FLast set of ⟦g1 · g2⟧γ δ includes the First set of ⟦g2⟧γ δ when g2 is nullable. □

Soundness tells us that if the parser succeeds, then it consumed a word of the language. Completeness tells us that if we supply the parser a stream prefixed with a word of the language and a delimiter, then it will consume precisely that word, stopping at the delimiter.

5 Unstaged Implementation

The theory described in the previous sections is for recognizers, which test language membership. However, what we really want are parsers, which also construct a value (such as an abstract syntax tree). In this section, we describe how we can implement these combinators in OCaml.

The main design issue is that our combinators are only guaranteed to be well-behaved if they operate on a grammar which is well-typed with respect to the type system defined in §4, and typechecking is easiest to implement on a first-order representation. In addition, we want to ensure that the parsers we generate are also well-typed in the sense that we know what type of OCaml values they produce.

In other words, our goal is to provide the usual surface API for parser combinators (see §2), but to give the implementation a first-order representation. To achieve this, we begin with a first-order representation, to which we offer an interface based on higher-order abstract syntax to turn user-written higher-order code into an analyzable first-order representation (see §5.3).

5.1 Representing Grammars

To write a typechecker as a standard tree-traversal, we need to use a first-order syntax tree to represent grammars, variables and fixed points. However, matters are complicated by the need to attach binding and OCaml type information to our abstract syntax trees. First, tree types need to be additionally indexed by the type of the OCaml value the parser will produce. Second, since our language of grammars has binding structure, we need a representation of variables which also remembers the type information each parser is obliged to produce. Below, we give the grammar datatype:

type ('ctx, 'a) t =
  | Eps : ('ctx, unit) t
  | Seq : ('ctx, 'a) t * ('ctx, 'b) t -> ('ctx, 'a * 'b) t
  | Chr : char -> ('ctx, char) t
  | Bot : ('ctx, 'a) t
  | Alt : ('ctx, 'a) t * ('ctx, 'a) t -> ('ctx, 'a) t
  | Map : ('a -> 'b) * ('ctx, 'a) t -> ('ctx, 'b) t
  | Fix : ('a * 'ctx, 'a) t -> ('ctx, 'a) t
  | Var : ('ctx, 'a) var -> ('ctx, 'a) t

type ('ctx, 'a) var =
  | Z : ('a * 'ctx, 'a) var
  | S : ('rest, 'a) var -> ('b * 'rest, 'a) var

This datatype is almost a plain algebraic datatype, but it is not quite as boring as one might hope: it is a generalized algebraic datatype (GADT). The type ('ctx, 'a) t is indexed by a type parameter 'a, which indicates the return type of the parser, and a parameter 'ctx, which denotes the context. Variables are represented in a basically de Bruijn fashion: the datatype ('ctx, 'a) var is basically a dependent Peano numeral, which says that the n-th element of a context of type 'ctx is a hypothetical parser returning elements of type 'a.

Each of the constructors of this datatype corresponds closely to a production of the grammar of context-free expressions. The main addition is the Map constructor, which wraps a function of type 'a -> 'b to apply to the inner grammar, thereby taking a grammar of type ('ctx, 'a) t to one of type ('ctx, 'b) t. Morally, this is just a representation of a parser action. Binding structure comes into play with the Var constructor, which says that if the n-th component of the context type 'ctx is the type 'a, then Var n has the type ('ctx, 'a) t. The fixed point construction Fix takes a single argument of type ('a * 'ctx, 'a) t, with the first component of the context being the recursive binding.

Manually building grammars with this representation is possible, but inconvenient, since it is an intrinsic de Bruijn representation. This is why our surface API (§2.1) uses HOAS to hide this from client programs.
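For example (our own illustration, assuming the constructors above live in the Grammar module whose name appears in the typeof signature of §5.2), the grammar star (chr 'a') from §2.2 corresponds to the following first-order value, with Var Z referring to the enclosing Fix binder:

let star_a =
  let open Grammar in
  (* µx. ϵ ∨ ('a' · x), with Map wrapping the list-building actions;
     the value has type ('ctx, char list) Grammar.t for any 'ctx. *)
  Fix (Alt (Map ((fun () -> []), Eps),
            Map ((fun (c, cs) -> c :: cs), Seq (Chr 'a', Var Z))))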

P(Γ; ∆ ⊢ g : τ) ∈ Env(Γ) → Env(∆) → Σ∗ ⇀ Σ∗

P(Γ; ∆ ⊢ ⊥ : τ⊥) γ̂ δ̂ s            = fail
P(Γ; ∆ ⊢ ϵ : τϵ) γ̂ δ̂ s            = s
P(Γ; ∆ ⊢ c : τc) γ̂ δ̂ (c′ :: s)    = if c = c′ then s else fail
P(Γ; ∆ ⊢ g · g′ : τ · τ′) γ̂ δ̂ s   = let s′ = P(Γ; ∆ ⊢ g : τ) γ̂ δ̂ s in
                                      P(Γ; ∆ ⊢ g′ : τ′) (γ̂, δ̂) • s′
P(Γ; ∆ ⊢ g ∨ g′ : τ ∨ τ′) γ̂ δ̂ []  = []    when (τ ∨ τ′).Null
                                    = fail  otherwise
P(Γ; ∆ ⊢ g ∨ g′ : τ ∨ τ′) γ̂ δ̂ ((c :: _) as s)
                                    = P(Γ; ∆ ⊢ g : τ) γ̂ δ̂ s    when c ∈ τ.First, or τ.Null ∧ c ∉ (τ ∨ τ′).First
                                    = P(Γ; ∆ ⊢ g′ : τ′) γ̂ δ̂ s  when c ∈ τ′.First, or τ′.Null ∧ c ∉ (τ ∨ τ′).First
                                    = fail                      otherwise
P(Γ; ∆ ⊢ x : τ) γ̂ δ̂ s             = γ̂(x) s
P(Γ; ∆ ⊢ µx : τ. g : τ) γ̂ δ̂ s     = fix(λF. P(Γ; ∆, x : τ ⊢ g : τ) γ̂ (δ̂, F/x)) s

Figure 6. Parsing Algorithm

5.2 Typechecking and Parsing

The GADT constraints on the grammar datatype ensure that we will have a coherent ML typing for the parsers, but they make no effort to check that these grammars satisfy the type discipline described in §4. To enforce that, we write our own typechecker, exploiting the fact that grammars are "just" data. To implement the typechecker, we represent the types of our type system as an OCaml datatype, and then implement the typechecker as a recursive function.

type t = { first : CharSet.t; flast : CharSet.t;
           null : bool; guarded : bool }

This is almost the same as the types defined in §3. The only difference is the extra field guarded, which tracks whether a variable is in the Γ or the ∆ context. This saves us from having to manage two contexts as GADT indices.

Typechecking is done with the typeof function, which just walks over the structure of the grammar, doing type checking at each step. It takes a representation of the type context and a grammar, and returns a grammatical type, raising an exception if typechecking fails.

val typeof : type ctx a d.
  ctx tp_env -> (ctx, a) Grammar.t -> Tp.t

The formal system in §4 has a type annotation for fixed points, but we do not require that in our implementation. Since types form a complete lattice, we can perform a fixed point iteration from the least type to infer a type for a fixed point. Once the fixed point is computed, we can then check whether the resulting type is guarded (as it will be the same as the type of the binding).

The type we use for parsers is very simple:

type 'a parser = (char Stream.t -> 'a)

This takes a stream of characters, and either returns a value or fails with an exception. This type does not support backtracking; since we take an imperative stream, we can only step past each character once. We implement one combinator for each of the data constructors, all of which have obvious implementations. The most interesting is the alt combinator:

let alt tp1 p1 tp2 p2 s =
  match Stream.peek s with
  | None ->
    if tp1.null then p1 s
    else if tp2.null then p2 s
    else error "Unexpected end of stream"
  | Some c ->
    if CharSet.mem c tp1.first then p1 s
    else if CharSet.mem c tp2.first then p2 s
    else if tp1.null then p1 s
    else if tp2.null then p2 s
    else error "No progress possible"

This function peeks at the next token, and uses that information together with the static type information (i.e., the First and Null sets) to decide whether to invoke p1 or p2. In this way backtracking can be avoided. Then we can write a function

val parse : type ctx a d.
  (ctx, a) Grammar.t -> ctx parse_env -> a parser

to traverse the tree, invoking combinators to build a parser.

5.3 From Higher-Order to First-Order

While first-order representations are convenient for analyses such as type checking, higher-order interfaces are more convenient for programming. Following Atkey et al. [2009], we combine the advantages of both approaches by translating the higher-order combinators into the first-order representation. We implement the interface type 'a t of §2.1 as a function from a context to a value of the first-order datatype of §5.1. In OCaml, higher-rank polymorphism, needed here to quantify over the context type 'ctx, can be introduced using a record type with a polymorphic field:

type 'a t = { tdb : 'ctx. 'ctx Ctx.t -> ('ctx, 'a) Grammar.t }

The implementation of the non-binding combinators is straightforwardly homomorphic, passing the context argument through to child nodes. For example, here is the conversion from <|> in the interface to Alt in the representation:

let (<|>) f g = { tdb = fun i -> Alt (f.tdb i, g.tdb i) }

The interesting case is fix, which must adjust the context in the Var term passed to the argument:

let fix f =
  let tdb i =
    Fix ((f { tdb = fun j -> Var (tshift j (S i)) }).tdb (S i))
  in { tdb }
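The remaining non-binding combinators follow the same homomorphic pattern as <|>. A sketch of our own (with the Grammar constructors in scope; not necessarily the paper's exact code) of the eps, chr, seq and map conversions looks like this:

(* Each combinator simply rebuilds the corresponding first-order
   constructor, threading the context argument i through unchanged. *)
let eps     = { tdb = fun _ -> Eps }
let chr c   = { tdb = fun _ -> Chr c }
let seq f g = { tdb = fun i -> Seq (f.tdb i, g.tdb i) }
let map h f = { tdb = fun i -> Map (h, f.tdb i) }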

[Figure 7 (pipeline diagram): HOAS interface (§2.1) —unembedding (§5.3)→ typed AST (§5.1) —typechecking (§5.2)→ typechecked AST —normalization (§6.1)→ CPS-based IR (§6.2) —generation (§6.3)→ code (§6.4).]

Figure 7. From combinators to specialized code

6 Adding Staging for Performance

The typed combinators of §2.1 enjoy a number of advantages over both parser combinators and parser generators: unlike most combinator parsers they run in linear time with no backtracking; unlike yacc-style parser generators they support binding and substitution. However, they are much slower than parsers generated by ocamlyacc: in the benchmarks of §7 they are between 4.5 and 125 times slower.

We address this shortcoming using multi-stage programming, a technique for annotating programs to cause evaluation to take place over several stages. A staged program receives its input in several stages, and each stage generates a new program specialized to the available input, avoiding the performance penalty of abstraction over arbitrary inputs. Improving the performance of parsing using multi-stage programming has a long history, which we discuss further in §8. Our parser combinators build on that history: we make use of several existing techniques from the literature, but also add a new idea to the panoply, showing how the types computed for our grammars improve the generated code.

The modified version of our combinators described in this section evaluates in two stages. The first stage takes the grammatical structure as input and generates a parser specialized to that grammar, which is evaluated in the second stage, when the input string is supplied. In our benchmarks, parsers constructed in this way are more efficient than equivalent parsers written using the unstaged combinators of §2.1, and even outperform the code generated by ocamlyacc.

Our staged combinators are written in the multi-stage programming language MetaOCaml [Kiselyov 2014], which extends OCaml with two quasiquotation constructs for controlling evaluation: brackets .< e >. around an expression block its evaluation, while escaping a sub-expression .~e within a bracketed expression indicates that the sub-expression e should be evaluated and the result inserted into the enclosing bracketed expression. Quoted expressions of type t have type t code, and MetaOCaml's type system prevents safety violations such as evaluation of open code.

The changes involved in staging the combinators are mostly internal. Only one change to the interface is needed: the map combinator should now work with quoted code expressions:

val map : ('a code -> 'b code) -> 'a t -> 'b t

The type of Map in the GADT needs a similar adjustment.

After this change, we know the value of c in each branch e efficient compilation) to flatten the code into a top-level and can further specialize the program. mutually-recursive group. Our staged combinators gener- In practice, branching on every possible value of c gener- ate one mutually-recursive group for each complete gram- ates unwieldy code; it is sufficient to branch on whether c mar using a recent extension to MetaOCaml for generating belongs to the set passed to peek. We consequently change mutually-recursive functions [Yallop and Kiselyov 2019]. peek one last time to return an indication of whether the next character in the stream belongs to the set: 6.4 Generated code let alt tp1 p1 tp2 p2= Our staged combinators perform better than traditional com- peek tp1.first( function binators, as we demonstrate in §7. The following generated | `Eof -> if tp1.null then p1 else p2 code (for s-expressions) shows why: | `Yes -> p1 let rec f1 i n s= ... if i >=n then failwith "Expected chr" Now the expression to the right of the arrow has no dynamic else let c1=s.[i] in dependencies and tp1 is no longer needed in the second stage. if 'a' = c1 then ((),(1+i)) else if '(' = c1 then 6.2 An intermediate representation let (x,j)= f2(1+i)ns in if j >=n then failwith "Expected chr" While CPS translation is a binding-time improvement, the else let c2=s.[j] in higher-order code that results is difficult to further analyse match c2 with ')' ->((),(1+j)) directly. Consequently, we once again reduce our higher- |_ -> failwith "wrong token" else failwith "No progress possible!" order code to a datatype representation [Ager et al. 2003], and f2 i n s= just as we did with the grammar interface (§5.3). The peek if i >=n then ([],i) function constructs a value of an IR which can be traversed else let c=s.[i] in to generate efficient code. if ('(' =c) ||( 'a' =c) then let (x1,j1)= f1 i n s in 6.3 Generating code let (x2,j2)= f2 j1 n s in ((x1:: x2), j2) The final step after the above transformations is to generate else ([],i) code from the IR. Code generation is largely a straightfor- in (fun index s -> f1 index( String.length s)n) ward traversal of the IR, generating code for each node and The output resembles low-level code one might write by composing the results. However, two points deserve note. hand, without abstractions, back-tracking, unnecessary tests, Branch pruning One source of inefficiency in the un- or intermediate data structures. There are two mutually- staged combinators is that they may examine the first el- recursive functions, corresponding to the two fixed points in ement in the input stream more than once. For example, the input grammar, for Kleene star and s-expression nesting. when the parser chr 'a' <|> chr 'b' reads a character 'a' Each function accepts three arguments: the current index, the from the input, there are two calls to Stream.peek—first in input length, and the input string. Each function examines alt to determine which branch to take, and then in chr to the input character-by-character, branching immediately and determine whether the character matches. In the staged pro- proceeding deterministically. Although the example is small, gram this redundancy can be avoided by taking into account this code is representative of the larger examples in §7. the context in which chr is called. Our type system gives Pipeline summary Figure 7 summarizes the pipeline that the left-hand branch chr 'a' a type whose first set is simply turns higher-order combinators into specialized code. 
6.3 Generating code

The final step after the above transformations is to generate code from the IR. Code generation is largely a straightforward traversal of the IR, generating code for each node and composing the results. However, two points deserve note.

Branch pruning One source of inefficiency in the unstaged combinators is that they may examine the first element in the input stream more than once. For example, when the parser chr 'a' <|> chr 'b' reads a character 'a' from the input, there are two calls to Stream.peek—first in alt to determine which branch to take, and then in chr to determine whether the character matches. In the staged program this redundancy can be avoided by taking into account the context in which chr is called. Our type system gives the left-hand branch chr 'a' a type whose first set is simply {'a'}: there is therefore no need to call Stream.peek a second time in the code generated by this call to chr. Once again, therefore, the information from the type checking phase leads to improved code. (Branch pruning itself is an essential aspect of several existing multi-stage programs [Inoue 2014; Rompf et al. 2013]; however, using a type system to provide the information needed to prune branches is novel.)

Recursion One of the combinators in our interface (§2.1) is a fixed-point operator, fix. In the unstaged implementation, a call to fix constructs a single recursive function, and nested calls to fix result in nested recursive functions. Once again, naively staging the implementation of fix yields sub-optimal results—i.e. it leads to similar nested structure in the generated code. It is better (both for legibility and efficient compilation) to flatten the code into a top-level mutually-recursive group. Our staged combinators generate one mutually-recursive group for each complete grammar using a recent extension to MetaOCaml for generating mutually-recursive functions [Yallop and Kiselyov 2019].

6.4 Generated code

Our staged combinators perform better than traditional combinators, as we demonstrate in §7. The following generated code (for s-expressions) shows why:

  let rec f1 i n s =
    if i >= n then failwith "Expected chr"
    else let c1 = s.[i] in
    if 'a' = c1 then ((), (1 + i))
    else if '(' = c1 then
      let (x, j) = f2 (1 + i) n s in
      if j >= n then failwith "Expected chr"
      else let c2 = s.[j] in
      (match c2 with
       | ')' -> ((), (1 + j))
       | _ -> failwith "wrong token")
    else failwith "No progress possible!"
  and f2 i n s =
    if i >= n then ([], i)
    else let c = s.[i] in
    if ('(' = c) || ('a' = c) then
      let (x1, j1) = f1 i n s in
      let (x2, j2) = f2 j1 n s in
      ((x1 :: x2), j2)
    else ([], i)
  in (fun index s -> f1 index (String.length s) s)

The output resembles low-level code one might write by hand, without abstractions, back-tracking, unnecessary tests, or intermediate data structures. There are two mutually-recursive functions, corresponding to the two fixed points in the input grammar, for Kleene star and s-expression nesting. Each function accepts three arguments: the current index, the input length, and the input string. Each function examines the input character-by-character, branching immediately and proceeding deterministically. Although the example is small, this code is representative of the larger examples in §7.

Pipeline summary Figure 7 summarizes the pipeline that turns higher-order combinators into specialized code. First, the combinators build the first-order representation using Atkey et al.'s unembedding technique. Second, the GADT is typechecked using the type system for context-free expressions. Third, the typechecked term is converted to a first-order CPS-based intermediate representation, making use of the types built in the second phase. Finally, the IR is converted to efficient code using MetaOCaml's quasiquotation facilities, again making use of the output of the typechecking phase. We emphasize that although the diagram in Figure 7 resembles the structure of a compiler, our implementation is a pure library, written entirely in user code.

Generalizing to non-character tokens Our description of the library has focused on characters, but our implementation is parameterized over the token type, and can build parsers for streams of arbitrary tokens. §7 describes the use of this parameterization to construct a parsing program from separate lexer and parser components, each written using our combinators, but instantiated with different token types. Similarly, although we have described the operation of the library to construct parsers that accept strings, the implementation is parameterized over the input type, and the library can work with a variety of input sources, including input channels and streams that produce tokens one by one rather than materializing a complete list of tokens.
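As a rough illustration of what such a parameterized interface might look like (the module type and operation names below are ours, not the library's actual API), the combinators can be presented as a functor over an abstract token type:

  (* Illustrative sketch only: a combinator signature parameterized over the
     token type.  TOKEN, GRAMMAR and the operation names are hypothetical. *)
  module type TOKEN = sig
    type t
    val compare : t -> t -> int                (* needed to compute first/follow sets *)
    val pp : Format.formatter -> t -> unit     (* used in error messages              *)
  end

  module type GRAMMAR = functor (Tok : TOKEN) -> sig
    type 'a t                                  (* grammars denoting values of type 'a *)
    val eps : 'a -> 'a t                       (* the empty string                    *)
    val tok : Tok.t -> Tok.t t                 (* a single token                      *)
    val seq : 'a t -> 'b t -> ('a * 'b) t      (* concatenation                       *)
    val alt : 'a t -> 'a t -> 'a t             (* alternation                         *)
    val fix : ('a t -> 'a t) -> 'a t           (* guarded recursion                   *)
  end

Instantiating Tok.t with char recovers the string parsers described so far, while instantiating it with a lexer's token type yields the separate scanning and parsing phases measured in §7.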

7 Performance Measurements

Parsing non-trivial languages typically involves a distinct scanning phase that turns a sequence of characters into a sequence of tokens, to be consumed by the parser proper. These two phases are often implemented using different tools. For example, OCaml programs often use ocamllex to generate scanners and ocamlyacc to generate parsers.

Using our library, both the scanner and the parser can be written with the same combinators. First, the combinators are used to describe a scanner that builds a token stream from a character stream. The same combinators are then used to implement a parser that consumes the output of the scanner. As the measurements in this section show, the performance of parsers written in this way compares favourably with the code generated by ocamllex and ocamlyacc.
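As a concrete, if simplified, illustration of this two-phase structure, here is a sketch in which the token type and the hand-written scan function stand in for combinator-built components; none of these names come from the library itself.

  (* Sketch only: the scanner/parser split used in the benchmarks.  The
     scanner is written by hand here to keep the example self-contained; in
     the paper both phases are built from the same combinators, instantiated
     at char and token types respectively. *)
  type token = LPAREN | RPAREN | ATOM of string

  let scan (s : string) : token list =
    let n = String.length s in
    let is_delim c = c = '(' || c = ')' || c = ' ' || c = '\n' || c = '\t' in
    let rec go i acc =
      if i >= n then List.rev acc
      else match s.[i] with
        | '(' -> go (i + 1) (LPAREN :: acc)
        | ')' -> go (i + 1) (RPAREN :: acc)
        | ' ' | '\n' | '\t' -> go (i + 1) acc
        | _ ->
          (* an atom: a maximal run of non-delimiter characters *)
          let j = ref i in
          while !j < n && not (is_delim s.[!j]) do incr j done;
          go !j (ATOM (String.sub s i (!j - i)) :: acc)
    in
    go 0 []

  (* A token-level parser then consumes (scan input) instead of the raw
     string, mirroring the ocamllex/ocamlyacc division of labour. *)
  let run_pipeline ~(parse : token list -> 'a) (input : string) : 'a =
    parse (scan input)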
Figure 8 compares the throughput of three families of parsers. The unstaged bars show the throughput of the combinators described in §5. Each parser is written using two applications of the combinators, to build a scanner and a parser. The staged bars show the throughput of parsers written using the staged combinators (§6). These parsers resemble the unstaged implementations, but they are annotated with staging annotations (brackets and escapes), and so generate code specialized to each grammar. Finally, the ocamlyacc/ocamllex bars show the throughput of parsers written with the standard OCaml tools, following the same grammatical structure as the combinator-based implementations.

[Figure 8. Parser throughput: unstaged & staged combinators and ocamlyacc/ocamllex. Error bars show 95% CIs. Bar chart of throughput (MB/s) for the arith, pgn, ppm, sexp and json benchmarks.]

Our measurements avoid the overhead of allocating AST nodes in the benchmarks; instead, each parser directly computes some function of the input—e.g. a count of certain tokens, or a check that some non-syntactic property holds. The languages are as follows:

1. (arith) A mini programming language with arithmetic, comparison, let-binding and branching. The semantic actions evaluate the parsed expression.
2. (pgn) Chess game descriptions in PGN format (https://en.wikipedia.org/wiki/Portable_Game_Notation). The semantic actions extract each game's result. The input is a corpus of 6759 Grand Master games.
3. (ppm) Image files in PPM format (https://en.wikipedia.org/wiki/Netpbm_format). The semantic actions validate the non-syntactic constraints of the format, such as colour range and pixel count.
4. (sexp) S-expressions with alphanumeric atoms. The semantic actions of the parser count the number of atoms in each s-expression.
5. (json) A parser for JavaScript Object Notation (JSON). The semantic actions count the number of objects represented in the input. We use the simple JSON grammar given by Jonnalagedda et al. [2014].

The benchmarks were compiled with BER MetaOCaml N107 and run on an AMD FX 8320 machine with 16GB of memory running Debian Linux, using the Core_bench micro-benchmarking library [Hardin and James 2013]. The error bars represent 95% confidence intervals.

Even though our type system ensures that parsing has linear-time performance, the abstraction overhead involved in parser combinators makes the performance of the unstaged version uncompetitive—at best (for PPM) around 22% of the throughput of the corresponding ocamlyacc/ocamllex implementation, and at worst (for JSON) less than 1% of the throughput. However, staging both eliminates this overhead and avoids the interpretive overhead involved in the table-based ocamlyacc implementation. For every language except JSON, the throughput of the staged combinator implementation exceeds the throughput of the corresponding ocamlyacc version, often significantly (e.g. by 75% for PPM).

Linear-time performance Figure 9 provides empirical support for the claim that our parser combinators, like yacc-based systems, offer linear-time performance. Each graph plots running time against input size. (For reasons of space, we omit the similar results for the unstaged version.)

[Figure 9. Typed grammar expressions guarantee linear-time parsing. Each panel plots run time (ms) against input size (arith and sexp in MB, pgn in games, ppm in pixels, json in messages) for ocamlyacc and the staged combinators.]

Further performance improvements There are several opportunities to further improve combinator performance. The handling of tokens carrying strings is inefficient (and accounts for the relatively poor performance in the JSON benchmark), since the minimal combinator API necessitates reading strings as char lists before conversion to a flat representation; we plan to remedy this by extending the API. More ambitiously, we plan to combine the code generated for the scanning and parsing phases into a single unit to enable further optimizations, e.g. avoiding token materialization. Preliminary experiments suggest that throughput improvements of 2× upwards may be available in this way.

8 Related Work

Adding fixed points to regular expressions was described in Salomaa [1973], in a construction called the "regular-like languages". Later, Leiß [1991] extended Kleene algebra [Kozen 1990] with least fixed points, giving the untyped µ-regular expressions we use, and also noted that context-free languages offer a model. Ésik and Leiß [2005] showed that the equational theory sufficed to translate grammars to Greibach normal form. More recently, Grathwohl et al. [2014] have completely axiomatized the inequational theory. This work is for untyped µ-regular expressions; restricting CFGs with a type system seems to be a novelty of our approach.

Our notion of type—especially the FLast set—drew critical inspiration from the work of Brüggemann-Klein and Wood [1992]. They prove a Kleene-style theorem for the deterministic regular languages, which can be compiled to state machines without exponentially larger state sets. Johnstone and Scott [1998] independently invented the FLast property (which they named "follow-determinism") while studying generalised recursive descent parsing. They prove that for nonempty, left-factored grammars, using follow-determinism is equivalent to the traditional Follow set computation. Thus, our type system can be understood as the observation that LL-class parsing arises from adding guarded recursion to the deterministic regular languages.

Winter et al. [2011] presented their own variant of the context-free expressions. Their syntax was similar to our own and that of Leiß [1991], with the key difference being that their fixed point operator was required to be syntactically guarded—every branch in a fixed point expression µx. g had to begin with a leading alphabetic character. Their goal was to ensure that the Brzozowski derivative [Brzozowski 1964] could be extended to fixed point expressions. Our type system replaces this syntactic constraint with a type-based guardedness restriction on variables.

Staging and parsers share a long history. In an early work on regular expression matching, Thompson [1968] takes a staged approach, dynamically generating machine code to recognize user-supplied regular expressions, and in one of the earliest papers on multi-stage programming, Davies and Pfenning [1996] present a staged regular expression matcher as a motivating example. Sperber and Thiemann [2000] apply the closely-related techniques of partial evaluation to build LR parsers from functional specifications. More recently, Jonnalagedda et al. [2014] present an implementation of parser combinators written with the Scala Lightweight Modular Staging framework. The present work shares high-level goals with Jonnalagedda et al.'s work, notably eliminating the abstraction overhead in standard parser combinator implementations. However, they focus on ambiguous and input-dependent grammars that may require backtracking, while our type system ensures that our parsers are deterministic and guarantees linear-time performance.

Swierstra and Duponcheel [1996] also gave parser combinators which statically calculated first sets and nullability to control which branch of an alternative to take. Since they used a higher-order representation of parsers, they were unable to calculate follow sets ahead of time, and so they had to calculate those dynamically, as the parser consumed data. By using GADTs, we are able to give a fully-analyzable first-order representation, which enables us to give much stronger guarantees about runtime.

Acknowledgements We thank Matthew Pickering, Isaac Elliott, Nada Amin and the anonymous reviewers for comments on earlier drafts.

References

Mads Sig Ager, Dariusz Biernacki, Olivier Danvy, and Jan Midtgaard. 2003. From Interpreter to Compiler and Virtual Machine: A Functional Derivation. Technical Report BRICS RS-03-14. Aarhus, Denmark.

Robert Atkey, Sam Lindley, and Jeremy Yallop. 2009. Unembedding Domain-Specific Languages. In Proceedings of the 2nd ACM SIGPLAN Symposium on Haskell (Haskell '09). ACM, New York, NY, USA, 37–48. https://doi.org/10.1145/1596638.1596644
Robert Atkey and Conor McBride. 2013. Productive Coprogramming with Guarded Recursion. In Proceedings of the 18th ACM SIGPLAN International Conference on Functional Programming (ICFP '13). ACM, New York, NY, USA, 197–208. https://doi.org/10.1145/2500365.2500597
H. Bekič and C. B. Jones (Eds.). 1984. Programming Languages and Their Definition: Selected Papers of H. Bekič. LNCS, Vol. 177. Springer-Verlag. https://doi.org/10.1007/BFb0048933
Anders Bondorf. 1992. Improving Binding Times Without Explicit CPS-Conversion. In Proceedings of the 1992 ACM Conference on LISP and Functional Programming (LFP '92). ACM, New York, NY, USA, 1–10. https://doi.org/10.1145/141471.141483
Anne Brüggemann-Klein and Derick Wood. 1992. Deterministic Regular Languages. In 9th Annual Symposium on Theoretical Aspects of Computer Science (STACS 92). 173–184. https://doi.org/10.1007/3-540-55210-3_182
Janusz A. Brzozowski. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 (Oct. 1964), 481–494. https://doi.org/10.1145/321239.321249
Olivier Danvy, Karoline Malmkjær, and Jens Palsberg. 1996. Eta-Expansion Does The Trick. ACM Trans. Program. Lang. Syst. 18, 6 (1996), 730–751. https://doi.org/10.1145/236114.236119
Rowan Davies and Frank Pfenning. 1996. A Modal Analysis of Staged Computation. In Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '96). ACM, New York, NY, USA, 258–270. https://doi.org/10.1145/237721.237788
Niels Bjørn Bugge Grathwohl, Fritz Henglein, and Dexter Kozen. 2014. Infinitary Axiomatization of the Equational Theory of Context-Free Languages. In Fixed Points in Computer Science (FICS 2013), Vol. 126. 44–55.
Dick Grune and Ceriel J.H. Jacobs. 2007. Parsing Techniques: A Practical Guide (2nd ed.). Springer Science, New York, NY, USA.
Christopher S. Hardin and Roshan P. James. 2013. Core_bench: Micro-Benchmarking for OCaml. OCaml Workshop.
Graham Hutton. 1992. Higher-Order Functions for Parsing. Journal of Functional Programming 2, 3 (July 1992), 323–343. https://doi.org/10.1017/S0956796800000411
Jun Inoue. 2014. Supercompiling with Staging. In Fourth International Valentin Turchin Workshop on Metacomputation.
Clinton L. Jeffery. 2003. Generating LR Syntax Error Messages from Examples. ACM Trans. Program. Lang. Syst. 25, 5 (Sept. 2003), 631–640. https://doi.org/10.1145/937563.937566
Adrian Johnstone and Elizabeth Scott. 1998. Generalised Recursive Descent Parsing and Follow-Determinism. In CC (Lecture Notes in Computer Science), Vol. 1383. Springer, 16–30.
Manohar Jonnalagedda, Thierry Coppey, Sandro Stucki, Tiark Rompf, and Martin Odersky. 2014. Staged Parser Combinators for Efficient Data Processing. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA '14). ACM, New York, NY, USA, 637–653. https://doi.org/10.1145/2660193.2660241
Oleg Kiselyov. 2014. The Design and Implementation of BER MetaOCaml. In Functional and Logic Programming, Michael Codish and Eijiro Sumii (Eds.). Springer International Publishing, Cham, 86–102.
Donald E. Knuth. 1965. On the Translation of Languages from Left to Right. Information and Control 8, 6 (1965), 607–639.
Dexter Kozen. 1990. On Kleene Algebras and Closed Semirings. In International Symposium on Mathematical Foundations of Computer Science. Springer, 26–47.
Neelakantan R. Krishnaswami. 2013. Higher-Order Functional Reactive Programming without Spacetime Leaks. In ACM SIGPLAN International Conference on Functional Programming (ICFP '13), Boston, MA, USA, September 25–27, 2013. 221–232. https://doi.org/10.1145/2500365.2500588
Hans Leiß. 1991. Towards Kleene Algebra with Recursion. In Computer Science Logic (CSL). 242–256.
P. M. Lewis, II and R. E. Stearns. 1968. Syntax-Directed Transduction. J. ACM 15, 3 (July 1968), 465–488. https://doi.org/10.1145/321466.321477
Conor McBride and Ross Paterson. 2008. Applicative Programming with Effects. J. Funct. Program. 18, 1 (2008), 1–13.
Kristian Nielsen and Morten Heine Sørensen. 1995. Call-By-Name CPS-Translation As a Binding-Time Improvement. In Proceedings of the Second International Symposium on Static Analysis (SAS '95). Springer-Verlag, London, UK, 296–313. http://dl.acm.org/citation.cfm?id=647163.717677
François Pottier and Yann Régis-Gianas. 2017. Menhir Reference Manual. INRIA. http://gallium.inria.fr/~fpottier/menhir/
Tiark Rompf, Arvind K. Sujeeth, Nada Amin, Kevin J. Brown, Vojin Jovanovic, HyoukJoong Lee, Manohar Jonnalagedda, Kunle Olukotun, and Martin Odersky. 2013. Optimizing Data Structures in High-Level Programs: New Directions for Extensible Compilers Based on Staging. In The 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '13), Rome, Italy, January 23–25, 2013, Roberto Giacobazzi and Radhia Cousot (Eds.). ACM, 497–510. https://doi.org/10.1145/2429069.2429128
Arto Salomaa. 1973. Formal Languages. Academic Press.
Michael Sperber and Peter Thiemann. 2000. Generation of LR Parsers by Partial Evaluation. ACM Trans. Program. Lang. Syst. 22, 2 (March 2000), 224–264. https://doi.org/10.1145/349214.349219
S. Doaitse Swierstra and Luc Duponcheel. 1996. Deterministic, Error-Correcting Combinator Parsers. In Advanced Functional Programming, Second International School, Olympia, WA, USA, August 26–30, 1996, Tutorial Text. 184–207. https://doi.org/10.1007/3-540-61628-4_7
Walid Mohamed Taha. 1999. Multistage Programming: Its Theory and Applications. Ph.D. Dissertation. Oregon Graduate Institute of Science and Technology. AAI9949870.
Ken Thompson. 1968. Programming Techniques: Regular Expression Search Algorithm. Commun. ACM 11, 6 (June 1968), 419–422. https://doi.org/10.1145/363347.363387
Joost Winter, Marcello M. Bonsangue, and Jan Rutten. 2011. Context-Free Languages, Coalgebraically. In International Conference on Algebra and Coalgebra in Computer Science. Springer, 359–376.
Jeremy Yallop and Oleg Kiselyov. 2019. Generating Mutually Recursive Definitions. In Proceedings of the 2019 ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation (PEPM 2019). ACM, New York, NY, USA, 75–81. https://doi.org/10.1145/3294032.3294078
Zoltán Ésik and Hans Leiß. 2005. Algebraically Complete Semirings and Greibach Normal Form. Annals of Pure and Applied Logic 133, 1–3 (2005), 173–203.
