Packrat Parsers Can Support Left Recursion ∗
Alessandro Warth
University of California, Los Angeles and Viewpoints Research Institute

James R. Douglass
The Boeing Company

Todd Millstein
University of California, Los Angeles

Abstract

Packrat parsing offers several advantages over other parsing techniques, such as the guarantee of linear parse times while supporting backtracking and unlimited look-ahead. Unfortunately, the limited support for left recursion in packrat parser implementations makes them difficult to use for a large class of grammars (Java’s, for example). This paper presents a modification to the memoization mechanism used by packrat parser implementations that makes it possible for them to support (even indirectly or mutually) left-recursive rules. While it is possible for a packrat parser with our modification to yield super-linear parse times for some left-recursive grammars, our experiments show that this is not the case for typical uses of left recursion.

Categories and Subject Descriptors  D.3.4 [Programming Languages]: Processors - Parsing

General Terms  Languages, Algorithms, Design, Performance

Keywords  packrat parsing, left recursion

∗ This material is based upon work supported by the National Science Foundation under Grant Nos. IIS-0639876, CCF-0427202, and CCF-0545850. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

This paper is also available as VPRI Technical Report TR-2007-002.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

PEPM’08, January 7–8, 2008, San Francisco, California, USA.
Copyright © 2008 ACM 978-1-59593-977-7/08/0001. $5.00

1. Introduction

Packrat parsers [2] are an attractive choice for programming language implementers because:

• They provide “the power and flexibility of backtracking and unlimited look-ahead, but nevertheless [guarantee] linear parse times.” [2]
• They support syntactic and semantic predicates.
• They are easy to understand: because packrat parsers only support ordered choice (as opposed to unordered choice, as found in Context-Free Grammars (CFGs)), there are no ambiguities and no shift-reduce/reduce-reduce conflicts, which can be difficult to resolve.
• They impose no separation between lexical analysis and parsing. This feature, sometimes referred to as scannerless parsing [10], eliminates the need for moded lexers [9] when combining grammars (e.g., in Domain-Specific Embedded Language (DSEL) implementations).

Unfortunately, “like other recursive descent parsers, packrat parsers cannot support left-recursion” [6], which is typically used to express the syntax of left-associative operators. To better understand this limitation, consider the following rule for parsing expressions:

    expr ::= <expr> "-" <num> / <num>

Note that the first alternative in expr begins with expr itself. Because the choice operator in packrat parsers (denoted here by “/”) tries each alternative in order, this recursion will never terminate: an application of expr will result in another application of expr without consuming any input, which in turn will result in yet another application of expr, and so on. The second choice (the non-left-recursive case) will never be used.

We could change the order of the choices in expr,

    expr ::= <num> / <expr> "-" <num>

but to no avail. Since all valid expressions begin with a number, the second choice (the left-recursive case) would never be used. For example, applying the expr rule to the input “1-2” would succeed after consuming only the “1”, and leave the rest of the input, “-2”, unprocessed.

Some packrat parser implementations, including Pappy [1] and Rats! [6], circumvent this limitation by automatically transforming directly left-recursive rules into equivalent non-left-recursive rules. This technique is called left recursion elimination. As an example, the left-recursive rule above can be transformed to

    expr ::= <num> ("-" <num>)*

which is not left-recursive and therefore can be handled correctly by a packrat parser. Note that the transformation shown here is overly simplistic; a suitable transformation must preserve the left-associativity of the parse trees generated by the resulting non-left-recursive rule, as well as the meaning of the original rule’s semantic actions.

Now consider the following minor modification to the original grammar, which has no effect on the language accepted by expr:

    x    ::= <expr>
    expr ::= <x> "-" <num> / <num>

When given this grammar, the Pappy packrat parser generator [1] reports the following error message:

    Illegal left recursion: x -> expr -> x

This happens because expr is now indirectly left-recursive, and Pappy does not support indirect left recursion (also referred to as mutual left recursion). In fact, to the best of our knowledge, none of the currently-available packrat parser implementations supports indirectly left-recursive rules.

Although this example is certainly contrived, indirect left recursion does in fact arise in real-world grammars. For instance, Roman Redziejowski [8] discusses the difficulty of implementing a packrat parser for Java [5], whose Primary rule (for expressions) is indirectly left-recursive with five other rules. While programmers can always refactor grammars manually in order to eliminate indirect left recursion, doing so is tedious and error-prone, and in the end it is generally difficult to be convinced that the resulting grammar is equivalent to the original.

This paper presents a modification to the memoization mechanism used by packrat parser implementations that enables them to support both direct and indirect left recursion directly (i.e., without first having to transform rules). While it is possible for a packrat parser with our modification to yield super-linear parse times for some left-recursive grammars, our experiments (Section 5) show that this is not the case for typical uses of left recursion.

The rest of this paper is structured as follows. Section 2 gives a brief overview of packrat parsing. Section 3 describes our modification to the memoization mechanism, first showing how direct left recursion can be supported, and then extending the approach to support indirect left recursion. Section 4 validates this work by showing that it enables packrat parsers to support a grammar that closely mirrors Java’s heavily left-recursive Primary rule. Section 5 discusses the effects of our modification on parse times. Section 6 discusses related work, and Section 7 concludes.

2. An Overview of Packrat Parsing

Packrat parsers are able to guarantee linear parse times while supporting backtracking and unlimited look-ahead “by saving all intermediate parsing results as they are computed and ensuring that no result is evaluated more than once” [2]. For example, consider what happens when the rule

    expr ::= <num> "+" <num>
           / <num> "-" <num>

(where num matches a sequence of digits) is applied to the input “1234-5”.

Since choices are always evaluated in order, our parser begins by trying to match the input with the pattern

    <num> "+" <num>

The first term in this pattern, <num>, successfully matches the first four characters of the input stream (“1234”). Next, the parser attempts to match the next character on the input stream, “-”, with the next term in the pattern, "+". This match fails, and thus we backtrack to the position at which the previous choice started (0) and try the second alternative:

    <num> "-" <num>

At this point, a conventional top-down backtracking parser would have to apply num to the input, just like we did while evaluating the first alternative.

    MEMO : (RULE, POS) → MEMOENTRY

where

    MEMOENTRY : (ans : AST, pos : POS)

In other words, MEMO maps a rule-position pair (R, P) to a tuple consisting of

• the AST (or the special value FAIL¹) resulting from applying R at position P, and
• the position of the next character on the input stream,

or NIL, if there is no entry in the memo table for the given rule-position pair.

The APPLY-RULE procedure (see Figure 1), used in every rule application, ensures that no rule is ever evaluated more than once at a given position. When rule R is applied at position P, APPLY-RULE consults the memo table. If the memo table indicates that R was previously applied at P, the appropriate parse tree node is returned, and the parser’s current position is updated accordingly. Otherwise, APPLY-RULE evaluates the rule, stores the result in the memo table, and returns the corresponding parse tree node.

By using the memo table as shown in this section, packrat parsers are able to support backtracking and unlimited look-ahead while guaranteeing linear parse times. In the next section, we present modifications to the memo table and the APPLY-RULE procedure that make it possible for packrat parsers to support left recursion.

3. Adding Support for Left Recursion

In Section 1, we showed informally that the original version of the expr rule,

    expr ::= <expr> "-" <num> / <num>

causes packrat parsers to go into infinite recursion. We now revisit the same example, this time from Section 2’s more detailed point of view.

Consider what happens when expr is applied to the input “1-2-3”. Since the parser’s current position is initially 0, this application is encoded as APPLY-RULE(expr, 0). APPLY-RULE, shown in Figure 1, begins by searching the parser’s memo table for the

    APPLY-RULE(R, P)
      let m = MEMO(R, P)
      if m = NIL
        then let ans = EVAL(R.body)
             m ← new MEMOENTRY(ans, Pos)
             MEMO(R, P) ← m
             return ans
        else Pos ← m.pos
             return m.ans

    Figure 1. The original APPLY-RULE procedure.