Packrat Parsers Can Support Left Recursion
Total Page:16
File Type:pdf, Size:1020Kb
Packrat Parsers Can Support Left Recursion Alessandro Warth, James R. Douglass, Todd Millstein To be published as part of ACM SIGPLAN 2008 Workshop on Partial VPRI Technical Report TR-2007-002 Evaluation and Program Manipulation (PEPM ’08) January 2008. Viewpoints Research Institute, 1209 Grand Central Avenue, Glendale, CA 91201 t: (818) 332-3001 f: (818) 244-9761 Packrat Parsers Can Support Left Recursion ∗ Alessandro Warth James R. Douglass Todd Millstein University of California, Los Angeles The Boeing Company University of California, Los Angeles and Viewpoints Research Institute [email protected] [email protected] [email protected] Abstract ing [10], eliminates the need for moded lexers [9] when combin- Packrat parsing offers several advantages over other parsing tech- ing grammars (e.g., in Domain-Specific Embedded Language niques, such as the guarantee of linear parse times while supporting (DSEL) implementations). backtracking and unlimited look-ahead. Unfortunately, the limited Unfortunately, “like other recursive descent parsers, packrat support for left recursion in packrat parser implementations makes parsers cannot support left-recursion” [6], which is typically used them difficult to use for a large class of grammars (Java’s, for exam- to express the syntax of left-associative operators. To better under- ple). This paper presents a modification to the memoization mech- stand this limitation, consider the following rule for parsing expres- anism used by packrat parser implementations that makes it possi- sions: ble for them to support (even indirectly or mutually) left-recursive rules. While it is possible for a packrat parser with our modification expr ::= <expr> "-" <num> / <num> to yield super-linear parse times for some left-recursive grammars, Note that the first alternative in expr begins with expr itself. Be- our experiments show that this is not the case for typical uses of cause the choice operator in packrat parsers (denoted here by “/”) left recursion. tries each alternative in order, this recursion will never terminate: an Categories and Subject Descriptors D.3.4 [Programming Lan- application of expr will result in another application of expr with- guages]: Processors—Parsing out consuming any input, which in turn will result in yet another application of expr, and so on. The second choice—the non-left- General Terms Languages, Algorithms, Design, Performance recursive case—will never be used. We could change the order of the choices in expr, Keywords packrat parsing, left recursion expr ::= <num> / <expr> "-" <num> 1. Introduction but to no avail. Since all valid expressions begin with a number, the Packrat parsers [2] are an attractive choice for programming lan- second choice—the left-recursive case—would never be used. For guage implementers because: example, applying the expr rule to the input “1-2” would succeed after consuming only the “1”, and leave the rest of the input, “-2”, • They provide “the power and flexibility of backtracking and unprocessed. unlimited look-ahead, but nevertheless [guarantee] linear parse Some packrat parser implementations, including Pappy [1] and times.” [2] Rats! [6], circumvent this limitation by automatically transforming • They support syntactic and semantic predicates. directly left-recursive rules into equivalent non-left-recursive rules. • They are easy to understand: because packrat parsers only sup- This technique is called left recursion elimination. As an example, port ordered choice—as opposed to unordered choice, as found the left-recursive rule above can be transformed to in Context-Free Grammars (CFGs)—there are no ambiguities expr ::= <num> ("-" <num>)* and no shift-reduce/reduce-reduce conflicts, which can be diffi- cult to resolve. which is not left-recursive and therefore can be handled correctly by a packrat parser. Note that the transformation shown here is • They impose no separation between lexical analysis and pars- overly simplistic; a suitable transformation must preserve the left- ing. This feature, sometimes referred to as scannerless pars- associativity of the parse trees generated by the resulting non-left- ∗ This material is based upon work supported by the National Science Foun- recursive rule, as well as the meaning of the original rule’s semantic dation under Grant Nos. IIS-0639876, CCF-0427202, and CCF-0545850. actions. Any opinions, findings, and conclusions or recommendations expressed in Now consider the following minor modification to the original this material are those of the authors and do not necessarily reflect the views grammar, which has no effect on the language accepted by expr: of the National Science Foundation. x ::= <expr> expr ::= <x> "-" <num> / <num> When given this grammar, the Pappy packrat parser generator [1] Permission to make digital or hard copies of all or part of this work for personal or reports the following error message: classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation Illegal left recursion: x -> expr -> x on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This happens because expr is now indirectly left-recursive, and PEPM’08, January 7–8, 2008, San Francisco, California, USA. Pappy does not support indirect left recursion (also referred to as Copyright c 2008 ACM 978-1-59593-977-7/08/0001. $5.00 mutual left recursion). In fact, to the best of our knowledge, none of the currently-available packrat parser implementations supports indirectly left-recursive rules. APPLY-RULE(R;P) Although this example is certainly contrived, indirect left recur- let m = MEMO(R;P) sion does in fact arise in real-world grammars. For instance, Roman if m = NIL Redziejowski [8] discusses the difficulty of implementing a packrat then let ans = EVAL(R:body) parser for Java [5], whose Primary rule (for expressions) is indi- m new MEMOENTRY(ans;Pos) rectly left-recursive with five other rules. While programmers can MEMO(R;P) m always refactor grammars manually in order to eliminate indirect return ans left recursion, doing so is tedious and error-prone, and in the end it else Pos m:pos is generally difficult to be convinced that the resulting grammar is return m:ans equivalent to the original. This paper presents a modification to the memoization mecha- nism used by packrat parser implementations that enables them to Figure 1. The original APPLY-RULE procedure. support both direct and indirect left recursion directly (i.e., without first having to transform rules). While it is possible for a packrat parser with our modification to yield super-linear parse times for where some left-recursive grammars, our experiments (Section 5) show MEMOENTRY : (ans : AST; pos :POS) that this is not the case for typical uses of left recursion. The rest of this paper is structured as follows. Section 2 gives In other words, MEMO maps a rule-position pair (R;P) to a tuple a brief overview of packrat parsing. Section 3 describes our mod- consisting of ification to the memoization mechanism, first showing how direct • 1 left recursion can be supported, and then extending the approach the AST (or the special value FAIL ) resulting from applying R to support indirect left recursion. Section 4 validates this work by at position P, and showing that it enables packrat parsers to support a grammar that • the position of the next character on the input stream. closely mirrors Java’s heavily left-recursive Primary rule. Section 5 or NIL, if there is no entry in the memo table for the given rule- discusses the effects of our modification on parse times. Section 6 position pair. discusses related work, and Section 7 concludes. The APPLY-RULE procedure (see Figure 1), used in every rule application, ensures that no rule is ever evaluated more than once 2. An Overview of Packrat Parsing at a given position. When rule R is applied at position P,APPLY- Packrat parsers are able to guarantee linear parse times while sup- RULE consults the memo table. If the memo table indicates that porting backtracking and unlimited look-ahead “by saving all in- R was previously applied at P, the appropriate parse tree node is termediate parsing results as they are computed and ensuring that returned, and the parser’s current position is updated accordingly. no result is evaluated more than once” [2]. For example, consider Otherwise, APPLY-RULE evaluates the rule, stores the result in the what happens when the rule memo table, and returns the corresponding parse tree node. By using the memo table as shown in this section, packrat expr ::= <num> "+" <num> parsers are able to support backtracking and unlimited look-ahead / <num> "-" <num> while guaranteeing linear parse times. In the next section, we (where num matches a sequence of digits) is applied to the input present modifications to the memo table and the APPLY-RULE pro- “1234-5”. cedure that make it possible for packrat parsers to support left re- Since choices are always evaluated in order, our parser begins cursion. by trying to match the input with the pattern <num> "+" <num> 3. Adding Support for Left Recursion In Section 1, we showed informally that the original version of the <num> The first term in this pattern, , successfully matches the expr rule, first four characters of the input stream (“1234”). Next, the parser attempts to match the next character on the input stream, “-”, with expr ::= <expr> "-" <num> / <num> the next term in the pattern, "+". This match fails, and thus we backtrack to the position at which the previous choice started (0) causes packrat parsers to go into infinite recursion. We now revisit and try the second alternative: the same example, this time from Section 2’s more detailed point of view. <num> "-" <num> Consider what happens when expr is applied to the input At this point, a conventional top-down backtracking parser would “1-2-3”.