Adaptive LL(*) Parsing: The Power of Dynamic Analysis

Terence Parr, University of San Francisco, [email protected]
Sam Harwell, University of Texas at Austin, [email protected]
Kathleen Fisher, Tufts University, [email protected]

Abstract

Despite the advances made by modern parsing strategies such as PEG, LL(*), GLR, and GLL, parsing is not a solved problem. Existing approaches suffer from a number of weaknesses, including difficulties supporting side-effecting embedded actions, slow and/or unpredictable performance, and counter-intuitive matching strategies. This paper introduces the ALL(*) parsing strategy that combines the simplicity, efficiency, and predictability of conventional top-down LL(k) parsers with the power of a GLR-like mechanism to make parsing decisions. The critical innovation is to move grammar analysis to parse-time, which lets ALL(*) handle any non-left-recursive context-free grammar. ALL(*) is O(n^4) in theory but consistently performs linearly on grammars used in practice, outperforming general strategies such as GLL and GLR by orders of magnitude. ANTLR 4 generates ALL(*) parsers and supports direct left-recursion through grammar rewriting. Widespread ANTLR 4 use (5000 downloads/month in 2013) provides evidence that ALL(*) is effective for a wide variety of applications.

Categories and Subject Descriptors F.4.2 Grammars and Other Rewriting Systems [Parsing]; D.3.1 Formal Languages [Syntax]

General Terms Algorithms, Languages, Theory

Keywords nondeterministic parsing, DFA, augmented transition networks, grammar, ALL(*), LL(*), GLR, GLL, PEG

1. Introduction

Computer language parsing is still not a solved problem in practice, despite the sophistication of modern parsing strategies and a long history of academic study. When machine resources were scarce, it made sense to force programmers to contort their grammars to fit the constraints of deterministic LALR(k) or LL(k) parser generators.[1] As machine resources grew, researchers developed more powerful, but more costly, nondeterministic parsing strategies following both "bottom-up" (LR-style) and "top-down" (LL-style) approaches. Strategies include GLR [26], Parser Expression Grammar (PEG) [9], LL(*) [20] from ANTLR 3, and, recently, GLL [25], a fully general top-down strategy.

[1] We use the term deterministic in the way that deterministic finite automata (DFA) differ from nondeterministic finite automata (NFA): the next symbol(s) uniquely determine the action.

Although these newer strategies are much easier to use than LALR(k) and LL(k) parser generators, they suffer from a variety of weaknesses. First, nondeterministic parsers sometimes have unanticipated behavior. GLL and GLR return multiple parse trees (forests) for ambiguous grammars because they were designed to handle natural language grammars, which are typically ambiguous. For computer languages, ambiguity is almost always an error. One can certainly walk a parse forest to disambiguate it, but that approach costs extra time, space, and machinery for the uncommon case. PEGs are unambiguous by definition but have a quirk where rule A → a | ab (meaning "A matches either a or ab") can never match ab since PEGs choose the first alternative that matches a prefix of the remaining input. Nested backtracking makes debugging PEGs difficult.

Second, side-effecting programmer-supplied actions (mutators), like print statements, should be avoided in any strategy that continuously speculates (PEG) or supports multiple interpretations of the input (GLL and GLR) because such actions may never really take place [17]. (Though DParser [24] supports "final" actions when the programmer is certain a reduction is part of an unambiguous final parse.) Without side effects, actions must buffer data for all interpretations in immutable data structures or provide undo actions. The former mechanism is limited by memory size and the latter is not always easy or possible. The typical approach to avoiding mutators is to construct a parse tree for post-parse processing, but such artifacts fundamentally limit parsing to input files whose trees fit in memory. Parsers that build parse trees cannot analyze large files or infinite streams, such as network traffic, unless they can be processed in logical chunks.

Third, our experiments (Section 7) show that GLL and GLR can be slow and unpredictable in time and space.

Their complexities are, respectively, O(n^3) and O(n^{p+1}) where p is the length of the longest production in the grammar [14]. (GLR is typically quoted as O(n^3) as Kipps [15] gave such an algorithm, albeit with a constant so high as to be impractical.) In theory, general parsers should handle deterministic grammars in near-linear time. In practice, we found GLL and GLR to be ~135x slower than ALL(*) on a corpus of 12,920 Java 6 library source files (123M) and 6 orders of magnitude slower on a single 3.2M Java file, respectively.

LL(*) addresses these weaknesses by providing a mostly deterministic parsing strategy that uses regular expressions, represented as deterministic finite automata (DFA), to potentially examine the entire remaining input rather than the fixed k-sequences of LL(k). Using DFA for lookahead limits LL(*) decisions to distinguishing alternatives with regular lookahead languages, even though lookahead languages (the set of all possible remaining input phrases) are often context-free. But the main problem is that the LL(*) grammar condition is statically undecidable and grammar analysis sometimes fails to find regular expressions that distinguish between alternative productions. ANTLR 3's static analysis detects and avoids potentially-undecidable situations, failing over to backtracking parsing decisions instead. This gives LL(*) the same a | ab quirk as PEGs for such decisions. Backtracking decisions that choose the first matching alternative also cannot detect obvious ambiguities such as A → α | α, where α is some sequence of grammar symbols that makes α | α non-LL(*).

1.1 Dynamic grammar analysis

In this paper, we introduce Adaptive LL(*), or ALL(*), parsers that combine the simplicity of deterministic top-down parsers with the power of a GLR-like mechanism to make parsing decisions. Specifically, LL parsing suspends at each prediction decision point (nonterminal) and then resumes once the prediction mechanism has chosen the appropriate production to expand. The critical innovation is to move grammar analysis to parse-time; no static grammar analysis is needed. This choice lets us avoid the undecidability of static LL(*) grammar analysis and lets us generate correct parsers (Theorem 6.1) for any non-left-recursive context-free grammar (CFG). While static analysis must consider all possible input sequences, dynamic analysis need only consider the finite collection of input sequences actually seen.

The idea behind the ALL(*) prediction mechanism is to launch subparsers at a decision point, one per alternative production. The subparsers operate in pseudo-parallel to explore all possible paths. Subparsers die off as their paths fail to match the remaining input. The subparsers advance through the input in lockstep so analysis can identify a sole survivor at the minimum lookahead depth that uniquely predicts a production. If multiple subparsers coalesce together or reach the end of file, the predictor announces an ambiguity and resolves it in favor of the lowest production number associated with a surviving subparser. (Productions are numbered to express precedence as an automatic means of resolving ambiguities, like PEGs; Bison also resolves conflicts by choosing the production specified first.) Programmers can also embed semantic predicates [22] to choose between ambiguous interpretations.

ALL(*) parsers memoize analysis results, incrementally and dynamically building up a cache of DFA that map lookahead phrases to predicted productions. (We use the term analysis in the sense that ALL(*) analysis yields lookahead DFA like static LL(*) analysis.) The parser can make future predictions at the same parser decision and lookahead phrase quickly by consulting the cache. Unfamiliar input phrases trigger the grammar analysis mechanism, simultaneously predicting an alternative and updating the DFA. DFA are suitable for recording prediction results, despite the fact that the lookahead language at a given decision typically forms a context-free language. Dynamic analysis only needs to consider the finite context-free language subsets encountered during a parse, and any finite set is regular.

To avoid the exponential nature of nondeterministic subparsers, prediction uses a graph-structured stack (GSS) [25] to avoid redundant computations. GLR uses essentially the same strategy except that ALL(*) only predicts productions with such subparsers whereas GLR actually parses with them. Consequently, GLR must push terminals onto the GSS but ALL(*) does not.

ALL(*) parsers handle the task of matching terminals and expanding nonterminals with the simplicity of LL but have O(n^4) theoretical time complexity (Theorem 6.3) because, in the worst case, the parser must make a prediction at each input symbol and each prediction must examine the entire remaining input; examining an input symbol can cost O(n^2). O(n^4) is in line with the complexity of GLR. In Section 7, we show empirically that ALL(*) parsers for common languages are efficient and exhibit linear behavior in practice.

The advantages of ALL(*) stem from moving grammar analysis to parse time, but this choice places an additional burden on grammar functional testing. As with all dynamic approaches, programmers must cover as many grammar position and input sequence combinations as possible to find grammar ambiguities. Standard source code coverage tools can help programmers measure grammar coverage for ALL(*) parsers. High coverage in the generated code corresponds to high grammar coverage.

The ALL(*) algorithm is the foundation of the ANTLR 4 parser generator (ANTLR 3 is based upon LL(*)). ANTLR 4 was released in January 2013 and gets about 5000 downloads/month (source, binary, or ANTLRworks2 development environment, counting non-robot entries in web logs with unique IP addresses to get a lower bound). Such activity provides evidence that ALL(*) is useful and usable.
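The cache-centric control flow just described can be summarized in a few lines of Java. This sketch is our illustration with hypothetical names (dfaByDecision, match, simulateATNAndExtendDFA), not ANTLR's implementation:

    // Our sketch of the per-decision DFA cache (hypothetical types and names).
    int adaptivePredict(int decision, TokenStream input) {
        LookaheadDFA dfa = dfaByDecision[decision]; // built incrementally at parse-time
        Integer alt = dfa.match(input);             // cache hit: walk DFA edges, no analysis
        if (alt != null) {
            return alt;                             // production number recorded earlier
        }
        // cache miss: simulate the ATN subparsers on this lookahead phrase,
        // record the result as new DFA states/edges, and return the prediction
        return simulateATNAndExtendDFA(dfa, input);
    }

On familiar input phrases the cache answers in time proportional to the lookahead consumed, which is why the DFA cache dominates parsing speed in practice (Section 7).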

The remainder of this paper is organized as follows. We begin by introducing the ANTLR 4 parser generator (Section 2) and discussing the ALL(*) parsing strategy (Section 3). Next, we define predicated grammars, their augmented transition network representation, and lookahead DFA (Section 4). Then, we describe ALL(*) grammar analysis and present the parsing algorithm itself (Section 5). Finally, we support our claims regarding ALL(*) correctness (Section 6) and efficiency (Section 7) and examine related work (Section 8). Appendix A has proofs for key ALL(*) theorems, Appendix B discusses algorithm pragmatics, and Appendix C has left-recursion elimination details.

2. ANTLR 4

ANTLR 4 accepts as input any context-free grammar that does not contain indirect or hidden left-recursion.[2] From the grammar, ANTLR 4 generates a recursive-descent parser that uses an ALL(*) production prediction function (Section 3). ANTLR currently generates parsers in Java or C#. ANTLR 4 grammars use yacc-like syntax with extended BNF (EBNF) operators such as Kleene star (*) and token literals in single quotes. Grammars contain both lexical and syntactic rules in a combined specification for convenience. ANTLR 4 generates both a lexer and a parser from the combined specification. By using individual characters as input symbols, ANTLR 4 grammars can be scannerless and composable because ALL(*) languages are closed under union (Theorem 6.2), providing the benefits of modularity described by Grimm [10]. (We will henceforth refer to ANTLR 4 as ANTLR and explicitly mark earlier versions.)

[2] Indirectly left-recursive rules call themselves through another rule; e.g., A → B, B → A. Hidden left-recursion occurs when an empty production exposes left recursion; e.g., A → BA, B → ε.

    grammar Ex; // generates class ExParser
    // action defines ExParser member: enum_is_keyword
    @members {boolean enum_is_keyword = true;}
    stat: expr '=' expr ';'        // production 1
        | expr ';'                 // production 2
        ;
    expr: expr '*' expr
        | expr '+' expr
        | expr '(' expr ')'        // f(x)
        | id
        ;
    id  : ID | {!enum_is_keyword}? 'enum' ;
    ID  : [A-Za-z]+ ;              // match id with upper, lowercase
    WS  : [ \t\r\n]+ -> skip ;     // ignore whitespace

Figure 1. Sample left-recursive ANTLR 4 predicated grammar Ex

Programmers can embed side-effecting actions (mutators), written in the host language of the parser, in the grammar. The actions have access to the current state of the parser. The parser ignores mutators during speculation to prevent actions from "launching missiles" speculatively. Actions typically extract information from the input stream and create data structures.

ANTLR also supports semantic predicates, which are side-effect-free Boolean-valued expressions written in the host language that determine the semantic viability of a particular production. Semantic predicates that evaluate to false during the parse render the surrounding production nonviable, dynamically altering the language generated by the grammar at parse-time.[3] Predicates significantly increase the strength of a parsing strategy because predicates can examine the parse stack and surrounding input context to provide an informal context-sensitive parsing capability. Semantic actions and predicates typically work together to alter the parse based upon previously-discovered information. For example, a C grammar could have embedded actions to define type symbols from typedef constructs, like typedef int i32;, and predicates to distinguish type names from other identifiers in subsequent definitions like i32 x;.

[3] Previous versions of ANTLR supported syntactic predicates to disambiguate cases where static grammar analysis failed; this facility is not needed in ANTLR 4 because of ALL(*)'s dynamic analysis.

2.1 Sample grammar

Figure 1 illustrates ANTLR's yacc-like metalanguage by giving the grammar for a simple programming language with assignment and expression statements terminated by semicolons. There are two grammar features that render this grammar non-LL(*) and, hence, unacceptable to ANTLR 3. First, rule expr is left recursive. ANTLR 4 automatically rewrites the rule to be non-left-recursive and unambiguous, as described in Section 2.2. Second, the alternative productions of rule stat have a common recursive prefix (expr), which is sufficient to render stat undecidable from an LL(*) perspective. ANTLR 3 would detect recursion on production left edges and fail over to a backtracking decision at runtime.

Predicate {!enum_is_keyword}? in rule id allows or disallows enum as a valid identifier according to the predicate at the moment of prediction. When the predicate is false, the parser treats id as id : ID ;, disallowing enum as an id, as the lexer matches enum as a separate token from ID. This example demonstrates how predicates allow a single grammar to describe subsets or variations of the same language.

2.2 Left-recursion removal

The ALL(*) parsing strategy itself does not support left-recursion, but ANTLR supports direct left-recursion through grammar rewriting prior to parser generation. Direct left-recursion covers the most common cases, such as arithmetic expression productions, like E → E . id, and C declarators. We made an engineering decision not to support indirect or hidden left-recursion because these forms are much less common and removing all left recursion can lead to exponentially-big transformed grammars. For example, the C11 language specification grammar contains lots of direct left-recursion but no indirect or hidden recursion. See Appendix C for more details.
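For concreteness, here is one EBNF form of rule expr from Figure 1 with the direct left-recursion removed. This is our simplified illustration of the idea rather than ANTLR's literal output, which additionally encodes operator precedence and associativity (see Appendix C):

    expr : id ( '*' expr | '+' expr | '(' expr ')' )* ;

Each loop iteration consumes one operator-and-operand tail, so the rule matches the same language without referring to itself on a left edge.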

    void stat() { // parse according to rule stat
      switch (adaptivePredict("stat", call stack)) {
        case 1 : // predict production 1
          expr(); match('='); expr(); match(';');
          break;
        case 2 : // predict production 2
          expr(); match(';');
          break;
      }
    }

Figure 2. Recursive-descent code for stat in grammar Ex

Figure 3. ATN for ANTLR rule stat in grammar Ex: from the stat entry state, an ε edge to state p begins the path expr '=' expr ';' for production 1, and an ε edge to state q begins the path expr ';' for production 2; both paths end with an ε edge to the stop state.

2.3 Lexical analysis with ALL(*)

ANTLR uses a variation of ALL(*) for lexing that fully matches tokens instead of just predicting productions like ALL(*) parsers do. After warm-up, the lexer will have built a DFA similar to what regular-expression-based tools such as lex would create statically. The key difference is that ALL(*) lexers are predicated context-free grammars, not just regular expressions, so they can recognize context-free tokens such as nested comments and can gate tokens in and out according to semantic context. This design is possible because ALL(*) is fast enough to handle lexing as well as parsing.

ALL(*) is also suitable for scannerless parsing because of its recognition power, which comes in handy for context-sensitive lexical problems like merging C and SQL languages. Such a union has no clear lexical sentinels demarcating lexical regions:

    int next = select ID from users where name='Raj'+1;
    int from = 1, select = 2;
    int x = select * from;

See grammar extras/CSQL in [19] for a proof of concept.

3. Introduction to ALL(*) parsing

In this section, we explain the ideas and intuitions behind ALL(*) parsing. Section 5 will then present the algorithm more formally. The strength of a top-down parsing strategy is related to how the strategy chooses which alternative production to expand for the current nonterminal. Unlike LL(k) and LL(*) parsers, ALL(*) parsers always choose the first alternative that leads to a valid parse. All non-left-recursive grammars are therefore ALL(*).

Instead of relying on static grammar analysis, an ALL(*) parser adapts to the input sentences presented to it at parse-time. The parser analyzes the current decision point (nonterminal with multiple productions) using a GLR-like mechanism to explore all possible decision paths with respect to the current "call" stack of in-process nonterminals and the remaining input on-demand. The parser incrementally and dynamically builds a lookahead DFA per decision that records a mapping from lookahead sequence to predicted production number. If the DFA constructed to date matches the current lookahead, the parser can skip analysis and immediately expand the predicted alternative. Experiments in Section 7 show that ALL(*) parsers usually get DFA cache hits and that DFA are critical to performance.

Because ALL(*) differs from deterministic top-down methods only in the prediction mechanism, we can construct conventional recursive-descent LL parsers but with an important twist. ALL(*) parsers call a special prediction function, adaptivePredict, that analyzes the grammar to construct lookahead DFA instead of simply comparing the lookahead to a statically-computed token set. Function adaptivePredict takes a nonterminal and parser call stack as parameters and returns the predicted production number or throws an exception if there is no viable production. For example, rule stat from the example in Section 2.1 yields a parsing procedure similar to Figure 2.

ALL(*) prediction has a structure similar to the well-known NFA-to-DFA subset construction algorithm. The goal is to discover the set of states the parser could reach after having seen some or all of the remaining input relative to the current decision. As in subset construction, an ALL(*) DFA state is the set of parser configurations possible after matching the input leading to that state. Instead of an NFA, however, ALL(*) simulates the actions of an augmented recursive transition network (ATN) [27] representation of the grammar since ATNs closely mirror grammar structure. (ATNs look just like syntax diagrams that can have actions and semantic predicates.) LL(*)'s static analysis also operates on an ATN for the same reason. Figure 3 shows the ATN submachine for rule stat.

An ATN configuration represents the execution state of a subparser and tracks the ATN state, predicted production number, and ATN subparser call stack: tuple (p, i, γ).[4] Configurations include production numbers so prediction can identify which production matches the current lookahead. Unlike static LL(*) analysis, ALL(*) incrementally builds DFA considering just the lookahead sequences it has seen instead of all possible sequences.

[4] Component i does not exist in the machine configurations of GLL, GLR, or Earley [8].

When parsing reaches a decision for the first time, adaptivePredict initializes the lookahead DFA for that decision by creating a DFA start state, D0. D0 is the set of ATN subparser configurations reachable without consuming an input symbol starting at each production left edge. For example, construction of D0 for nonterminal stat in Figure 3 would first add ATN configurations (p, 1, []) and (q, 2, []) where p and q are ATN states corresponding to production 1 and 2's left edges and [] is the empty subparser call stack (if stat is the start symbol).
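For concreteness, an ATN configuration and a DFA state can be modeled as small value types. The following is a minimal Java sketch with our own names; the ANTLR 4 runtime's versions carry additional bookkeeping, and ATNState/StackNode are assumed types (StackNode is sketched in Section 5.1):

    // Our minimal model of tuple (p, i, gamma) and of a DFA state as a set of tuples.
    final class ATNConfig {
        final ATNState state;    // p: where this subparser is in the ATN
        final int alt;           // i: production this subparser predicts
        final StackNode stack;   // gamma: ATN call stack (see Section 5.1)
        ATNConfig(ATNState state, int alt, StackNode stack) {
            this.state = state; this.alt = alt; this.stack = stack;
        }
        @Override public boolean equals(Object o) {
            return o instanceof ATNConfig c
                && state == c.state && alt == c.alt && stack.equals(c.stack);
        }
        @Override public int hashCode() {
            return java.util.Objects.hash(state, alt, stack);
        }
    }
    final class DFAState {
        final java.util.Set<ATNConfig> configs = new java.util.HashSet<>();
        final java.util.Map<Integer, DFAState> edges = new java.util.HashMap<>(); // token type -> target
    }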

Figure 4. Prediction DFA for decision stat: (a) after x=y;, the DFA is D0 --ID--> s1 --'='--> :1; (b) after x=y; and f(x);, s1 also gains a path over '(' ID ')' ';' to accept state :2, and the DFA has edges for both the = and ; symbols.

Analysis next computes a new DFA state indicating where ATN simulation could reach after consuming the first lookahead symbol and then connects the two DFA states with an edge labeled with that symbol. Analysis continues, adding new DFA states, until all ATN configurations in a newly-created DFA state predict the same production: (−, i, −). Function adaptivePredict marks that state as an accept state and returns to the parser with that production number. Figure 4a shows the lookahead DFA for decision stat after adaptivePredict has analyzed input sentence x=y;. The DFA does not look beyond = because = is sufficient to uniquely distinguish expr's productions. (Notation :1 means "predict production 1.")

In the typical case, adaptivePredict finds an existing DFA for a particular decision. The goal is to find or build a path through the DFA to an accept state. If adaptivePredict reaches a (non-accept) DFA state without an edge for the current lookahead symbol, it reverts to ATN simulation to extend the DFA (without rewinding the input). For example, to analyze a second input phrase for stat, such as f(x);, adaptivePredict finds an existing ID edge from D0 and jumps to s1 without ATN simulation. There is no existing edge from s1 for the left parenthesis so analysis simulates the ATN to complete a path to an accept state, which predicts the second production, as shown in Figure 4b. Note that because sequence ID(ID) predicts both productions, analysis continues until the DFA has edges for the = and ; symbols.

If ATN simulation computes a new target state that already exists in the DFA, simulation adds a new edge targeting the existing state and switches back to DFA simulation mode starting at that state. Targeting existing states is how cycles can appear in the DFA. Extending the DFA to handle unfamiliar phrases empirically decreases the likelihood of future ATN simulation, thereby increasing parsing speed (Section 7).

3.1 Predictions sensitive to the call stack

Parsers cannot always rely upon lookahead DFA to make correct decisions. To handle all non-left-recursive grammars, ALL(*) prediction must occasionally consider the parser call stack available at the start of prediction (denoted γ0 in Section 5). To illustrate the need for stack-sensitive predictions, consider that predictions made while recognizing a Java method definition might depend on whether the method was defined within an interface or class definition. (Java interface methods cannot have bodies.) Here is a simplified grammar that exhibits a stack-sensitive decision in nonterminal A:

    S → xB | yC    B → Aa    C → Aba    A → b | ε

Without the parser stack, no amount of lookahead can uniquely distinguish between A's productions. Lookahead ba predicts A → b when B invokes A but predicts A → ε when C invokes A. If prediction ignores the parser call stack, there is a prediction conflict upon ba.

Parsers that ignore the parser call stack for prediction are called Strong LL (SLL) parsers. The recursive-descent parsers programmers build by hand are in the SLL class. By convention, the literature refers to SLL as LL but we distinguish the terms since "real" LL is needed to handle all grammars. The above grammar is LL(2) but not SLL(k) for any k, though duplicating A for each call site renders the grammar SLL(2).

Creating a different lookahead DFA for each possible parser call stack is not feasible as the number of stack permutations is exponential in the stack depth. Instead, we take advantage of the fact that most decisions are not stack-sensitive and build DFA ignoring the parser call stack. If SLL ATN simulation finds a prediction conflict (Section 5.3), it cannot be sure if the lookahead phrase is ambiguous or stack-sensitive. In this case, adaptivePredict must re-examine the lookahead using the parser stack γ0. This hybrid or optimized LL mode improves performance by caching stack-insensitive prediction results in lookahead DFA when possible while retaining full stack-sensitive prediction power. Optimized LL mode applies on a per-decision basis, but two-stage parsing, described next, can often avoid LL simulation completely. (We henceforth use SLL to indicate stack-insensitive parsing and LL to indicate stack-sensitive.)

3.2 Two-stage ALL(*) parsing

SLL is weaker but faster than LL. Since we have found that most decisions are SLL in practice, it makes sense to attempt parsing entire inputs in "SLL only mode," which is stage one of the two-stage ALL(*) parsing algorithm. If, however, SLL mode finds a syntax error, it might have found an SLL weakness or a real syntax error, so we have to retry the entire input using optimized LL mode, which is stage two. This counterintuitive strategy, which potentially parses entire inputs twice, can dramatically increase speed over the single-stage optimized LL mode stage. For example, two-stage parsing with the Java grammar (Section 7) is 8x faster than one-stage optimized LL mode to parse a 123M corpus. The two-stage strategy relies on the fact that SLL either behaves like LL or gets an error (Theorem 6.5). For invalid sentences, there is no derivation for the input regardless of how the parser chooses productions. For valid sentences, SLL chooses productions as LL would or picks a production that ultimately leads to an error (LL finds that choice nonviable). Even in the presence of ambiguities, SLL often resolves conflicts as LL would. For example, despite a few ambiguities in our Java grammar, SLL mode correctly parses all inputs we have tried without failing over to LL. Nonetheless, the second (LL) stage must remain to ensure correctness.

4. Predicated grammars, ATNs, and DFA

To formalize ALL(*) parsing, we first need to recall some background material, specifically, the formal definitions of predicated grammars, ATNs, and lookahead DFA.

4.1 Predicated grammars

To formalize ALL(*) parsing, we first need to formally define the predicated grammars from which ALL(*) parsers are derived. A predicated grammar G = (N, T, P, S, Π, M) has elements:

• N is the set of nonterminals (rule names)
• T is the set of terminals (tokens)
• P is the set of productions
• S ∈ N is the start symbol
• Π is a set of side-effect-free semantic predicates
• M is a set of actions (mutators)

    A ∈ N                    nonterminal
    a, b, c, d ∈ T           terminal
    X ∈ (N ∪ T)              production element
    α, β, δ ∈ X*             sequence of grammar symbols
    u, v, w, x, y ∈ T*       sequence of terminals
    ε                        empty string
    $                        end-of-file "symbol"
    π ∈ Π                    predicate in host language
    μ ∈ M                    action in host language
    λ ∈ (N ∪ Π ∪ M)          reduction label
    ~λ = λ1..λn              sequence of reduction labels

    Production rules:
    A → αi                   ith context-free production of A
    A → {πi}? αi             ith production predicated on semantics
    A → {μi}                 ith production with mutator

Figure 5. Predicated grammar notation

    Prod:    A → α                                    implies (S, uAδ) ⇒ (S, uαδ)
    Sem:     A → {π}? α  and  π(S)                    implies (S, uAδ) ⇒ (S, uαδ)
    Action:  A → {μ}                                  implies (S, uAδ) ⇒ (μ(S), uδ)
    Closure: (S, α) ⇒ (S′, α′), (S′, α′) ⇒* (S″, β)   implies (S, α) ⇒* (S″, β)

Figure 6. Predicated grammar leftmost derivation rules

Predicated ALL(*) grammars differ from those of LL(*) [20] only in that ALL(*) grammars do not need or support syntactic predicates. Predicated grammars in the formal sections of this paper use the notation shown in Figure 5. The derivation rules in Figure 6 define the meaning of a predicated grammar. To support semantic predicates and mutators, the rules reference state S, which abstracts user state during parsing. The judgment form (S, α) ⇒ (S′, β) means: "In machine state S, grammar sequence α reduces in one step to modified state S′ and grammar sequence β." The judgment (S, α) ⇒* (S′, β) denotes repeated applications of the one-step reduction rule. These reduction rules specify a leftmost derivation. A production with semantic predicate πi is viable only if πi is true of current state S. Finally, an action production uses the specified mutator μi to update S.

Formally, the language generated by grammar sequence α in user state S is L(S, α) = {w | (S, α) ⇒* (S′, w)} and the language of grammar G is L(S0, G) = {w | (S0, S) ⇒* (S, w)} for initial user state S0 (S0 can be empty). If u is a prefix of w or equal to w, we write u ⪯ w. Language L is ALL(*) iff there exists an ALL(*) grammar for L. Theoretically, the language class of L(G) is recursively enumerable because each mutator could be a Turing machine. In reality, grammar writers do not use this generality so it is standard practice to consider the language class to be the context-sensitive languages instead. The class is context-sensitive rather than context-free as predicates can examine the call stack and terminals to the left and right.

This formalism has various syntactic restrictions not present in actual ANTLR grammars, for example, forcing mutators into their own rules and disallowing the common Extended BNF (EBNF) notation such as α* and α+ closures. We can make these restrictions without loss of generality because any grammar in the general form can be translated into this more restricted form.
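As a concrete illustration of the judgment forms (our example, not the paper's), consider a grammar fragment with productions S → Ac and A → b, no predicates, and no mutators. Two applications of rule Prod give the leftmost derivation

    (S0, S) ⇒ (S0, Ac) ⇒ (S0, bc)

so bc ∈ L(S0, G). The user state S0 is threaded through unchanged because no Action rule fires; an action production A → {μ} would instead rewrite the state to μ(S0) at the step where it is expanded.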

4.2 Resolving ambiguity

An ambiguous grammar is one in which the same input sequence can be recognized in multiple ways. The rules in Figure 6 do not preclude ambiguity. However, for a practical programming language parser, each input should correspond to a unique parse. To that end, ALL(*) uses the order of the productions in a rule to resolve ambiguities in favor of the production with the lowest number. This strategy is a concise way for programmers to resolve ambiguities automatically and resolves the well-known if-then-else ambiguity in the usual way by associating the else with the most recent if. PEGs and Bison parsers have the same resolution policy.

To resolve ambiguities that depend on current state S, programmers can insert semantic predicates but must make them mutually exclusive for all ambiguous input sequences to render such productions unambiguous. Statically, mutual exclusion cannot be enforced because predicates are written in a Turing-complete language. At parse-time, however, ALL(*) evaluates predicates and dynamically reports input phrases for which multiple predicated productions are viable. If the programmer fails to satisfy mutual exclusivity, ALL(*) uses production order to resolve the ambiguity.

4.3 Augmented transition networks

Given predicated grammar G = (N, T, P, S, Π, M), the corresponding ATN M_G = (Q, Σ, Δ, E, F) has elements [20]:

• Q is the set of states
• Σ is the edge alphabet N ∪ T ∪ Π ∪ M
• Δ is the transition relation mapping Q × (Σ ∪ ε) → Q
• E ⊆ Q = {p_A | A ∈ N} is the set of submachine entry states
• F ⊆ Q = {p′_A | A ∈ N} is the set of submachine final states

    Input grammar element                      Resulting ATN transitions
    A → αi                                     p_A --ε--> p_A,i --αi--> ... --ε--> p′_A
    A → {πi}? αi                               p_A --ε--> p_A,i --πi--> --αi--> ... --ε--> p′_A
    A → {μi}                                   p_A --ε--> p_A,i --μi--> p′_A
    A → εi                                     p_A --ε--> p_A,i --ε--> p′_A
    αi = X1 X2 ... Xm, Xj ∈ N ∪ T, j = 1..m    p0 --X1--> p1 --X2--> ... --Xm--> pm

Figure 7. Predicated grammar to ATN transformation

Figure 8. ATN for grammar G with P = {S → Ac | Ad, A → aA | b}: submachine S has paths p_S --ε--> p_S,1 --A--> p1 --c--> p2 --ε--> p′_S and p_S --ε--> p_S,2 --A--> p3 --d--> p4 --ε--> p′_S; submachine A has paths p_A --ε--> p_A,1 --a--> p5 --A--> p6 --ε--> p′_A and p_A --ε--> p_A,2 --b--> p7 --ε--> p′_A.

ATNs resemble the syntax diagrams used to document programming languages, with an ATN submachine for each nonterminal. Figure 7 shows how to construct the set of states Q and edges Δ from grammar productions. The start state for A is p_A ∈ Q and targets p_A,i, created from the left edge of αi, with an edge in Δ. The last state created from αi targets p′_A. Nonterminal edges p --A--> q are like function calls. They transfer control of the ATN to A's submachine, pushing return state q onto a state call stack so that control can continue from q after reaching the stop state for A's submachine, p′_A. Figure 8 gives the ATN for a simple grammar. The language matched by the ATN is the same as the language of the original grammar.

4.4 Lookahead DFA

ALL(*) parsers record prediction results obtained from ATN simulation with lookahead DFA, which are DFA augmented with accept states that yield predicted production numbers. There is one accept state per production of a decision.

Definition 4.1. Lookahead DFA are DFA with augmented accept states that yield predicted production numbers. For predicated grammar G = (N, T, P, S, Π, M), DFA M = (Q, Σ, Δ, D0, F) where:

• Q is the set of states
• Σ = T is the edge alphabet
• Δ is the transition function mapping Q × Σ → Q
• D0 ∈ Q is the start state
• F ⊆ Q = {f1, f2, ..., fn} is the set of final states, one fi per production i

A transition in Δ from state p to state q on symbol a ∈ Σ has the form p --a--> q, and we require that p --a--> q′ implies q = q′.

5. ALL(*) parsing algorithm

With the definitions of grammars, ATNs, and lookahead DFA formalized, we can present the key functions of the ALL(*) parsing algorithm. This section starts with a summary of the functions and how they fit together, then discusses a critical graph data structure, before presenting the functions themselves. We finish with an example of how the algorithm works.

Parsing begins with function parse, which behaves like a conventional top-down LL(k) parse function except that ALL(*) parsers predict productions with a special function called adaptivePredict, instead of the usual "switch on next k token types" mechanism. Function adaptivePredict simulates an ATN representation of the original predicated grammar to choose an αi production to expand for decision point A → α1 | ... | αn. Conceptually, prediction is a function of the current parser call stack γ0, remaining input w_r, and user state S if A has predicates. For efficiency, prediction ignores γ0 when possible (Section 3.1) and uses the minimum lookahead from w_r.

To avoid repeating ATN simulations for the same input and nonterminal, adaptivePredict assembles DFA that memoize input-to-predicted-production mappings, one DFA per nonterminal. Recall that each DFA state, D, is the set of ATN configurations possible after matching the lookahead symbols leading to that state. Function adaptivePredict calls startState to create the initial DFA state, D0, and then SLLpredict to begin simulation.

Function SLLpredict adds paths to the lookahead DFA that match some or all of w_r through repeated calls to target. Function target computes DFA target state D′ from current state D using move and closure operations similar to those found in subset construction. Function move finds all ATN configurations reachable on the current input symbol, and closure finds all configurations reachable without traversing a terminal edge. The primary difference from subset construction is that closure simulates the call and return of ATN submachines associated with nonterminals.

If SLL simulation finds a conflict (Section 5.3), SLLpredict rewinds the input and calls LLpredict to retry prediction, this time considering γ0. Function LLpredict is similar to SLLpredict but does not update a nonterminal's DFA because DFA must be stack-insensitive to be applicable in all stack contexts. Conflicts within ATN configuration sets discovered by LLpredict represent ambiguities. Both prediction functions use getConflictSetsPerLoc to detect conflicts, which are configurations representing the same parser location but different productions. To avoid failing over to LLpredict unnecessarily, SLLpredict uses getProdSetsPerState to see if a potentially non-conflicting DFA path remains when getConflictSetsPerLoc reports a conflict. If so, it is worth continuing with SLLpredict on the chance that more lookahead will resolve the conflict without recourse to full LL parsing.

Before describing these functions in detail, we review a fundamental graph data structure that they use to efficiently manage multiple call stacks à la GLL and GLR.

5.1 Graph-structured call stacks

The simplest way to implement ALL(*) prediction would be a classic backtracking approach, launching a subparser for each αi. The subparsers would consume all remaining input because backtracking subparsers do not know when to stop parsing; they are unaware of other subparsers' status. The independent subparsers would also lead to exponential time complexity. We address both issues by having the prediction subparsers advance in lockstep through w_r. Prediction terminates after consuming prefix u ⪯ w_r when all subparsers but one die off or when prediction identifies a conflict. Operating in lockstep also provides an opportunity for subparsers to share call stacks, thus avoiding redundant computations.

Two subparsers at ATN state p that share the same ATN stack top, qγ1 and qγ2, will mirror each other's behavior until simulation pops q from their stacks. Prediction can treat those subparsers as a single subparser by merging stacks. We merge stacks for all configurations in a DFA state of the form (p, i, γ1) and (p, i, γ2), forming a general configuration (p, i, Γ) with graph-structured stack (GSS) [25] Γ = γ1 ⊎ γ2, where ⊎ means graph merge. Γ can be the empty stack [], a special stack # used for SLL prediction (addressed shortly), an individual stack, or a graph of stack nodes. Merging individual stacks into a GSS reduces the potential size from exponential to linear complexity (Theorem 6.4). To represent a GSS, we use an immutable graph data structure with maximal sharing of nodes; for example, two merged stacks such as qγ0 ⊎ q′γ0, which share the parser stack γ0 at the bottom, represent the shared γ0 suffix only once in the graph. In the functions that follow, all additions to configuration sets, such as with operator +=, implicitly merge stacks.

There is a special case related to the stack condition at the start of prediction. Γ must distinguish between an empty stack and no stack information. For LL prediction, the initial ATN simulation stack is the current parser call stack γ0. The initial stack is only empty, γ0 = [], when the decision entry rule is the start symbol. Stack-insensitive SLL prediction, on the other hand, ignores the parser call stack and uses an initial stack of #, indicating no stack information. This distinction is important when computing the closure (Function 7) of configurations representing submachine stop states. Without parser stack information, a subparser that returns from decision entry rule A must consider all possible invocation sites; i.e., closure sees configuration (p′_A, −, #).

The empty stack [] is treated like any other node for LL prediction: Γ ⊎ [] yields the graph equivalent of set {Γ, []}, meaning that both Γ and the empty stack are possible. Pushing state p onto [] yields p[] not p because popping p must leave the [] empty-stack symbol. For SLL prediction, Γ ⊎ # = # for any graph Γ because # acts like a wildcard and represents the set of all stacks. The wildcard therefore contains any Γ. Pushing state p onto # yields p#.

5.2 ALL(*) parsing functions

We can now present the key ALL(*) functions, shown as numbered function boxes interspersed within the text of this section. Our discussion follows a top-down order and assumes that the ATN corresponding to grammar G, the semantic state S, the DFA under construction, and the input are in scope for all functions of the algorithm and that semantic predicates and actions can directly access S.

Function parse. The main entry point is function parse (shown in Function 1), which initiates parsing at the start symbol, argument S. The function begins by initializing a simulation call stack γ to an empty stack and setting ATN state "cursor" p to p_S,i, the ATN state on the left edge of S's production number i predicted by adaptivePredict. The function loops until the cursor reaches p′_S, the submachine stop state for S. If the cursor reaches another submachine stop state, p′_B, parse simulates a "return" by popping the return state q off the call stack and moving p to q.

    Function 1: parse(S)
      γ := []; i := adaptivePredict(S, γ); p := p_S,i;
      while true do
        if p = p′_B (i.e., p is a rule stop state) then
          if B = S (finished matching start rule S) then return;
          else let γ = qγ′ in γ := γ′; p := q;
        else
          switch t, where p --t--> q, do
            case b (i.e., terminal symbol transition):
              if b = input.curr() then p := q; input.advance();
              else parse error;
            case B: γ := qγ; i := adaptivePredict(B, γ); p := p_B,i;
            case μ: S := μ(S); p := q;
            case π: if π(S) then p := q else parse error;
            case ε: p := q;
          endsw

For p not at a stop state, parse processes ATN transition p --t--> q. There can be only one transition from p because of the way ATNs are constructed. If t is a terminal edge and matches the current input symbol, parse traverses the edge and moves to the next symbol. If t is a nonterminal edge referencing some B, parse simulates a submachine call by pushing return state q onto the stack, choosing the appropriate production left edge in B by calling adaptivePredict, and setting the cursor appropriately. For action edges, parse updates the state according to mutator μ and transitions to q. For predicate edges, parse transitions only if predicate π evaluates to true. During the parse, failed predicates behave like mismatched tokens. Upon an ε edge, parse moves to q. Function parse does not explicitly check that parsing stops at end-of-file because applications like development environments need to parse input subphrases.
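To make Section 5.1's representation concrete, here is a minimal Java sketch of immutable GSS nodes with merging. The names are ours (the analogous ANTLR 4 runtime type is PredictionContext), and the empty-stack and wildcard cases are simplified relative to the text:

    // Immutable GSS node: one or more stack tops, each with the rest of its stack
    // as a shared parent. merge() unions (top, parent) pairs and recursively merges
    // parents under equal tops. Simplified: [] handling and node caching elided.
    final class StackNode {
        static final StackNode WILDCARD = new StackNode(new int[0], new StackNode[0]); // #
        final int[] returnStates;   // sorted stack tops; length > 1 after merges
        final StackNode[] parents;  // parents[i] = remainder of stack under returnStates[i]

        StackNode(int[] returnStates, StackNode[] parents) {
            this.returnStates = returnStates; this.parents = parents;
        }
        StackNode push(int q) {     // receiver is shared by the new stack, not copied
            return new StackNode(new int[]{q}, new StackNode[]{this});
        }
        static StackNode merge(StackNode a, StackNode b) {
            if (a == b) return a;                                 // identity: shared subgraph
            if (a == WILDCARD || b == WILDCARD) return WILDCARD;  // # merged with anything is #
            java.util.TreeMap<Integer, StackNode> tops = new java.util.TreeMap<>();
            for (int i = 0; i < a.returnStates.length; i++) tops.put(a.returnStates[i], a.parents[i]);
            for (int i = 0; i < b.returnStates.length; i++)
                tops.merge(b.returnStates[i], b.parents[i], StackNode::merge);
            int[] rs = new int[tops.size()];
            StackNode[] ps = new StackNode[tops.size()];
            int i = 0;
            for (java.util.Map.Entry<Integer, StackNode> e : tops.entrySet()) {
                rs[i] = e.getKey(); ps[i++] = e.getValue();
            }
            return new StackNode(rs, ps);
        }
    }

Because nodes are immutable and shared, the identity check at the top of merge frequently succeeds, which is also what makes the graph-equality heuristic of Section 5.3 cheap.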

Function adaptivePredict. To predict a production, parse calls adaptivePredict (Function 2), which is a function of the decision nonterminal A and the current parser stack γ0. Because prediction only evaluates predicates during full LL simulation, adaptivePredict delegates to LLpredict if at least one of the productions is predicated.[5] For decisions that do not yet have a DFA, adaptivePredict creates DFA dfa_A with start state D0 in preparation for SLLpredict to add DFA paths. D0 is the set of ATN configurations reachable without traversing a terminal edge. Function adaptivePredict also constructs the set of final states F_DFA, which contains one final state fi for each production of A. The set of DFA states, Q_DFA, is the union of D0, F_DFA, and the error state D_error. Vocabulary Σ_DFA is the set of grammar terminals T. For unpredicated decisions with existing DFA, adaptivePredict calls SLLpredict to obtain a prediction from the DFA, possibly extending the DFA through ATN simulation in the process. Finally, since adaptivePredict is looking ahead, not parsing, it must undo any changes made to the input cursor, which it does by capturing the input index as start upon entry and rewinding to start before returning.

[5] SLL prediction does not incorporate predicates for clarity in this exposition, but in practice, ANTLR incorporates predicates into DFA accept states (Section B.2). ANTLR 3 DFA used predicated edges, not predicated accept states.

    Function 2: adaptivePredict(A, γ0) returns int alt
      start := input.index();  // checkpoint input
      if ∃ A → {πi}? αi then
        alt := LLpredict(A, start, γ0);
        input.seek(start);  // undo stream position changes
        return alt;
      if ∄ dfa_A then
        D0 := startState(A, #);
        F_DFA := {fi | fi := DFA_State(i), ∀ A → αi};
        Q_DFA := D0 ∪ F_DFA ∪ D_error;
        dfa_A := DFA(Q_DFA, Σ_DFA = T, Δ_DFA = ∅, D0, F_DFA);
      alt := SLLpredict(A, D0, start, γ0);
      input.seek(start);  // undo stream position changes
      return alt;

Function startState. To create DFA start state D0, startState (Function 3) adds configurations (p_A,i, i, γ) for each A → αi and A → {πi}? αi, if πi evaluates to true. When called from adaptivePredict, call stack argument γ is the special symbol # needed by SLL prediction, indicating "no parser stack information." When called from LLpredict, γ is parser stack γ0. Computing the closure of the configurations completes D0.

    Function 3: startState(A, γ) returns DFA State D0
      D0 := ∅;
      foreach p_A --ε--> p_A,i ∈ ATN do
        if p_A --ε--> p_A,i --πi--> p then π := πi else π := ε;
        if π = ε or eval(πi) then
          D0 += closure({}, D0, (p_A,i, i, γ));
      return D0;

Function SLLpredict. Function SLLpredict (Function 4) performs both DFA and SLL ATN simulation, incrementally adding paths to the DFA. In the best case, there is already a DFA path from D0 to an accept state, fi, for prefix u ⪯ w_r and some production number i. In the worst case, ATN simulation is required for all a in sequence u. The main loop of SLLpredict finds an existing edge emanating from DFA state cursor D upon a or computes a new one via target. It is possible that target will compute a target state D′ that already exists in the DFA, in which case function target returns that D′ because it may already have outgoing edges computed; it is inefficient to discard work by replacing D′. At the next iteration, SLLpredict will consider edges from D′, effectively switching back to DFA simulation.

    Function 4: SLLpredict(A, D0, start, γ0) returns int prod
      a := input.curr(); D := D0;
      while true do
        let D′ be DFA target D --a--> D′;
        if ∄ D′ then D′ := target(D, a);
        if D′ = D_error then parse error;
        if D′ stack sensitive then
          input.seek(start); return LLpredict(A, start, γ0);
        if D′ = fi ∈ F_DFA then return i;
        D := D′; a := input.next();

Once SLLpredict acquires a target state, D′, it checks for errors, stack sensitivity, and completion. If target marked D′ as stack-sensitive, prediction requires full LL simulation and SLLpredict calls LLpredict. If D′ is accept state fi, as determined by target, SLLpredict returns i. In this case, all the configurations in D predicted the same production i; further analysis is unnecessary and the algorithm can stop. For any other D′, the algorithm sets D to D′, gets the next symbol, and repeats.

Function target. Using a combined move-closure operation, target discovers the set of ATN configurations reachable from D upon a single terminal symbol a ∈ T. Function move computes the configurations reachable directly upon a by traversing a terminal edge:

    move(D, a) = {(q, i, Γ) | p --a--> q, (p, i, Γ) ∈ D}

Those configurations and their closure form D′. If D′ is empty, no alternative is viable because none can match a from the current state, so target returns error state D_error. If all configurations in D′ predict the same production number i, target adds edge D --a--> fi and returns accept state fi. If D′ has conflicting configurations, target marks D′ as stack-sensitive. The conflict could be an ambiguity or a weakness stemming from SLL's lack of parser stack information. (Conflicts, along with getConflictSetsPerLoc and getProdSetsPerState, are described in Section 5.3.)
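The move half of target is a simple set transformation. Here is a minimal Java sketch using the hypothetical ATNConfig/ATNState types sketched in Section 3, where targetOnTerminal is an assumed helper returning the target of the terminal edge p --a--> q, if any:

    // Our sketch of move(D, a): follow terminal edges only; stacks are unchanged.
    java.util.Set<ATNConfig> move(java.util.Set<ATNConfig> d, int a) {
        java.util.Set<ATNConfig> reached = new java.util.HashSet<>();
        for (ATNConfig c : d) {
            ATNState q = c.state.targetOnTerminal(a); // p --a--> q if such an edge exists
            if (q != null) reached.add(new ATNConfig(q, c.alt, c.stack));
        }
        return reached;
    }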

The function finishes by adding state D′, if an equivalent state is not already in the DFA, and adding edge D --a--> D′.

    Function 5: target(D, a) returns DFA State D′
      mv := move(D, a);
      D′ := ∪_{c ∈ mv} closure({}, c);
      if D′ = ∅ then Δ_DFA += D --a--> D_error; return D_error;
      if {j | (−, j, −) ∈ D′} = {i} then
        Δ_DFA += D --a--> fi; return fi;  // predict production i
      // look for a conflict among configurations of D′
      a_conflict := ∃ alts ∈ getConflictSetsPerLoc(D′) : |alts| > 1;
      viablealt := ∃ alts ∈ getProdSetsPerState(D′) : |alts| = 1;
      if a_conflict and not viablealt then mark D′ as stack sensitive;
      if D′ = D″ ∈ Q_DFA then D′ := D″; else Q_DFA += D′;
      Δ_DFA += D --a--> D′;
      return D′;

Function LLpredict. Upon SLL simulation conflict, SLLpredict rewinds the input and calls LLpredict (Function 6) to get a prediction based upon LL ATN simulation, which considers the full parser stack γ0. Function LLpredict is similar to SLLpredict. It uses DFA state D as a cursor and state D′ as the transition target for consistency, but does not update A's DFA, so SLL prediction can continue to use the DFA. LL prediction continues until either D′ = ∅, D′ uniquely predicts an alternative, or D′ has a conflict. If D′ from LL simulation has a conflict as SLL did, the algorithm reports the ambiguous phrase (input from start to the current index) and resolves to the minimum production number among the conflicting configurations. (Section 5.3 explains ambiguity detection.) Otherwise, cursor D moves to D′ and considers the next input symbol.

    Function 6: LLpredict(A, start, γ0) returns int alt
      D := D0 := startState(A, γ0);
      while true do
        mv := move(D, input.curr());
        D′ := ∪_{c ∈ mv} closure({}, c);
        if D′ = ∅ then parse error;
        if {j | (−, j, −) ∈ D′} = {i} then return i;
        /* If all (p, Γ) pairs predict > 1 alt and all such
           production sets are the same, the input is ambiguous. */
        altsets := getConflictSetsPerLoc(D′);
        if ∀ x, y ∈ altsets : x = y and |x| > 1 then
          x := any set in altsets;
          report ambiguous alts x at start..input.index();
          return min(x);
        D := D′; input.advance();

Function closure. The closure operation (Function 7) chases through all ε edges reachable from p, the ATN state projected from configuration parameter c, and also simulates the call and return of submachines. Function closure treats μ and π edges as ε edges because mutators should not be executed during prediction and predicates are only evaluated during start state computation. From parameter c = (p, i, Γ) and edge p --ε--> q, closure adds (q, i, Γ) to local working set C. For submachine call edge p --B--> q, closure adds the closure of (p_B, i, qΓ). Returning from a submachine stop state p′_B adds the closure of configuration (q, i, Γ), in which case c would have been of the form (p′_B, i, qΓ). In general, a configuration stack Γ is a graph representing multiple individual stacks. Function closure must simulate a return from each of the Γ stack tops. The algorithm uses notation qΓ′ ∈ Γ to represent all stack tops q of Γ. To avoid non-termination due to SLL right recursion and ε edges in subrules such as ()+, closure uses a busy set shared among all closure operations used to compute the same D′.

When closure reaches stop state p′_A for decision entry rule A, LL and SLL predictions behave differently. LL prediction pops from the parser call stack γ0 and "returns" to the state that invoked A's submachine. SLL prediction, on the other hand, has no access to the parser call stack and must consider all possible A invocation sites. Function closure finds Γ = # (and p′_B = p′_A) in this situation because startState will have set the initial stack to # rather than γ0. The return behavior at the decision entry rule is what differentiates SLL from LL parsing.

    Function 7: closure(busy, c = (p, i, Γ)) returns set C
      if c ∈ busy then return ∅; else busy += c;
      C := {c};
      if p = p′_B (i.e., p is any stop state, including p′_A) then
        if Γ = # (i.e., stack is SLL wildcard) then
          C += closure(busy, (p2, i, #)) ∀ p2 : p1 --B--> p2 ∈ ATN;  // all call sites
        else  // nonempty SLL or LL stack
          for qΓ′ ∈ Γ (i.e., each stack top q in graph Γ) do
            C += closure(busy, (q, i, Γ′));  // "return" to q
        return C;
      foreach edge, where p --edge--> q, do
        switch edge do
          case B: C += closure(busy, (p_B, i, qΓ));
          case π, μ, ε: C += closure(busy, (q, i, Γ));
      return C;

5.3 Conflict and ambiguity detection

The notion of conflicting configurations is central to ALL(*) analysis. Conflicts trigger failover to full LL prediction during SLL prediction and signal an ambiguity during LL prediction. A sufficient condition for a conflict between configurations is when they differ only in the predicted alternative: (p, i, Γ) and (p, j, Γ). Detecting conflicts is aided by two functions.

The first, getConflictSetsPerLoc (Function 8), collects the sets of production numbers associated with all (p, −, Γ) configurations. If a ⟨p, Γ⟩ pair predicts more than a single production, a conflict exists. Here is a sample configuration set and the associated set of conflict sets:

    {(p, 1, Γ), (p, 2, Γ), (p, 3, Γ), (p, 1, Γ′), (p, 2, Γ′), (r, 2, Γ″)}
      yields conflict sets {1, 2, 3}, {1, 2}, {2}

These conflict sets indicate that location ⟨p, Γ⟩ is reachable from productions {1, 2, 3}, location ⟨p, Γ′⟩ is reachable from productions {1, 2}, and location ⟨r, Γ″⟩ is reachable from production {2}.

    Function 8: getConflictSetsPerLoc(D) returns set of sets
      // for each ⟨p, Γ⟩, get the set of alts i from (p, −, Γ) ∈ D configs
      s := ∅;
      for (p, −, Γ) ∈ D do
        prds := {i | (p, i, Γ) ∈ D};
        s := s ∪ {prds};
      return s;
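A direct Java rendering of Function 8, under the same hypothetical ATNConfig type sketched earlier: group a DFA state's configurations by location (p, Γ) and collect the production numbers seen at each location.

    // Our sketch: any returned set with size() > 1 is a conflict at its location.
    java.util.Collection<java.util.Set<Integer>> getConflictSetsPerLoc(java.util.Set<ATNConfig> d) {
        java.util.Map<java.util.List<Object>, java.util.Set<Integer>> byLoc = new java.util.HashMap<>();
        for (ATNConfig c : d) {
            byLoc.computeIfAbsent(java.util.List.of(c.state, c.stack), k -> new java.util.HashSet<>())
                 .add(c.alt); // alts predicted at location (p, Gamma)
        }
        return byLoc.values();
    }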

The second function, getProdSetsPerState (Function 9), is similar but collects the production numbers associated with just ATN state p. For the same configuration set, getProdSetsPerState computes the sets {1, 2, 3} (for state p) and {2} (for state r).

    Function 9: getProdSetsPerState(D) returns set of sets
      // for each p, get the set of alts i from (p, −, −) ∈ D configs
      s := ∅;
      for (p, −, −) ∈ D do
        prds := {i | (p, i, −) ∈ D};
        s := s ∪ {prds};
      return s;

A sufficient condition for failing over to LL prediction (LLpredict) from SLL would be when there is at least one set of conflicting configurations: getConflictSetsPerLoc returns at least one set with more than a single production number; e.g., configurations (p, i, Γ) and (p, j, Γ) exist in parameter D. However, our goal is to continue using SLL prediction as long as possible because SLL prediction updates the lookahead DFA cache. To that end, SLL prediction continues if there is at least one nonconflicting configuration (when getProdSetsPerState returns at least one set of size 1). The hope is that more lookahead will lead to a configuration set that predicts a unique production via that nonconflicting configuration. For example, the decision for S → a | a | ab is ambiguous upon a between productions 1 and 2 but is unambiguous upon ab. (Location p is the ATN state between a and b in the third production.) After matching input a, the configuration set would be {(p′_S, 1, []), (p′_S, 2, []), (p, 3, [])}. Function getConflictSetsPerLoc returns {{1, 2}, {3}}. The next move-closure upon b leads to nonconflicting configuration set {(p′_S, 3, [])} from (p, 3, []), bypassing the conflict. If all sets returned from getConflictSetsPerLoc predict more than one alternative, no amount of lookahead will lead to a unique prediction; analysis must try again with call stack γ0 via LLpredict.

Conflicts during LL simulation are ambiguities and occur when each conflict set from getConflictSetsPerLoc contains more than one production; that is, every location in D is reachable from more than one production. Once multiple subparsers reach the same (p, −, Γ), all future simulation derived from (p, −, Γ) will behave identically. More lookahead will not resolve the ambiguity. Prediction could terminate at this point and report lookahead prefix u as ambiguous, but LLpredict continues until it is sure for which productions u is ambiguous. Consider conflict sets {1, 2, 3} and {2, 3}. Because both have degree greater than one, the sets represent an ambiguity, but additional input will identify whether u ⪯ w_r is ambiguous upon {1, 2, 3} or {2, 3}. Function LLpredict continues until all conflict sets that identify ambiguities are equal; the condition x = y and |x| > 1 ∀ x, y ∈ altsets embodies this test.

To detect conflicts, the algorithm compares graph-structured stacks frequently. Technically, a conflict occurs when configurations (p, i, Γ) and (p, j, Γ′) occur in the same configuration set with i ≠ j and at least one stack trace γ in common to both Γ and Γ′. Because checking for graph intersection is expensive, the algorithm uses equality, Γ = Γ′, as a heuristic. Equality is much faster because of the shared subgraphs. The graph equality algorithm can often check node identity to compare two entire subgraphs. In the worst case, the equality-versus-subset heuristic delays conflict detection until the GSS between conflicting configurations are simple linear stacks, where graph intersection is the same as graph equality. The cost of this heuristic is deeper lookahead.

5.4 Sample DFA construction

To illustrate algorithm behavior, consider inputs bc and bd for the grammar and ATN in Figure 8. ATN simulation for decision S launches subparsers at left edge nodes p_S,1 and p_S,2 with initial D0 configurations (p_S,1, 1, []) and (p_S,2, 2, []). Function closure adds three more configurations to D0 as it "calls" A with "return" nodes p1 and p3. Here is the DFA resulting from ATN simulation upon bc and then bd (in each target state, move contributes the configurations reached over the terminal edge and closure adds the rest):

    D0 = { (p_S,1, 1, []), (p_A, 1, p1), (p_A,1, 1, p1), (p_A,2, 1, p1),
           (p_S,2, 2, []), (p_A, 2, p3), (p_A,1, 2, p3), (p_A,2, 2, p3) }

    D0 --b--> D′ = { (p7, 1, p1), (p′_A, 1, p1), (p1, 1, []),
                     (p7, 2, p3), (p′_A, 2, p3), (p3, 2, []) }

    D′ --c--> f1 = { (p2, 1, []), (p′_S, 1, []) }
    D′ --d--> f2 = { (p4, 2, []), (p′_S, 2, []) }

After bc prediction, the DFA has states D0, D′, and f1. From DFA state D′, closure reaches the end of A and pops from the stacks, returning to ATN states in S. State f1 uniquely predicts production number 1. State f2 is created and connected to the DFA during prediction of the second phrase, bd. Function adaptivePredict first uses DFA simulation to get to D′ from D0 upon b.

6. Theoretical results

This section identifies the key ALL(*) theorems and shows parser time complexity. See Appendix A for detailed proofs.

Theorem 6.1. (Correctness). The ALL(*) parser for non-left-recursive G recognizes sentence w iff w ∈ L(G).

Theorem 6.2. ALL(*) languages are closed under union.

Theorem 6.3. ALL(*) parsing of n symbols has O(n⁴) time.

Theorem 6.4. A GSS has O(n) nodes for n input symbols.

Theorem 6.5. Two-stage parsing for non-left-recursive G recognizes sentence w iff w ∈ L(G).

7. Empirical results

We performed experiments to compare the performance of ALL(*) Java parsers with other strategies, to examine ALL(*) throughput for a variety of other languages, to highlight the effect of the lookahead DFA cache on parsing speed, and to provide evidence of linear ALL(*) performance in practice.

7.1 Comparing ALL(*)'s speed to other parsers

Our first experiment compared Java parsing speed across 10 tools and 8 parsing strategies: hand-tuned recursive-descent with precedence parsing, LL(k), LL(*), PEG, LALR(1), ALL(*), GLR, and GLL. Figure 9 shows the time for each tool to parse the 12,920 source files of the Java 6 library and compiler.

[Figure 9 (bar chart; per-tool times not reproduced here). Comparing Java parse times for 10 tools and 8 strategies on Java 6 library and compiler source code (smaller is faster): 12,920 files, 3.6M lines, size 123M. Tool descriptors comprise "tool name version [strategy]": ANTLR4 4.1 [ALL(*)]; Javac 7 [handbuilt recursive-descent and precedence parser for expressions]; JavaCC 5.0 [LL(k)]; Elkhound 2005.08.22b [GLR] (tested on Windows); ANTLR3 3.5 [LL(*)]; Rats! 2.3.1 [PEG]; SableCC 3.7 [LALR(1)]; DParser 1.2 [GLR]; JSGLR (from Spoofax) 1.1 [GLR]; Rascal 0.6.1 [GLL]. Tests build trees when marked with †. Tests run 10x with generous memory; average/stddev computed on the last 8 runs to avoid JIT cost. Error bars are negligible but show some variability due to garbage collection. To avoid a log scale, a separate graph shows the GLR and GLL parse times.]

We chose Java because it was the most commonly available grammar among tools and sample Java source is plentiful. The Java grammars used in this experiment came directly from the associated tool except for DParser and Elkhound, which did not offer suitable Java grammars. We ported ANTLR's Java grammar to the meta-syntax of those tools using unambiguous arithmetic expression rules. We also embedded merge actions in the Elkhound grammar to disambiguate during the parse to mimic ANTLR's ambiguity resolution.

All input files were loaded into RAM before parsing and times reflect the average time measured over 10 complete corpus passes, skipping the first two to ensure JIT compiler warm-up. For ALL(*), we used the two-stage parse from Section 3.2. The test machine was a 6-core 3.33GHz 16G RAM Mac OS X 10.7 desktop running the Java 7 virtual machine. Elkhound and DParser parsers were implemented in C/C++, which does not have a garbage collector running concurrently. Elkhound was last updated in 2005 and no longer builds on Linux or OS X, but we were able to build it on Windows 7 (4-core 2.67GHz 24G RAM). Elkhound also can only read from a file, so Elkhound parse times are not comparable. In an effort to control for machine speed differences and RAM vs SSD, we computed the time ratio of our Java test rig on our OS X machine reading from RAM to the test rig running on Windows pulling from SSD. Our reported Elkhound times are the Windows times multiplied by that OS X to Windows ratio.

For this experiment, ALL(*) outperforms the other parser generators and is only about 20% slower than the handbuilt parser in the Java compiler itself. When comparing runs with tree construction (marked with † in Figure 9), ANTLR 4 is about 4.4x faster than Elkhound, the fastest GLR tool we tested, and 135x faster than GLL (Rascal). ANTLR 4's nondeterministic ALL(*) parser was slightly faster than JavaCC's deterministic LL(k) parser and about 2x faster than Rats!'s PEG. In a separate test, we found that ALL(*) outperforms Rats! on its own PEG grammar converted to ANTLR syntax (8.77s vs 12.16s). The LALR(1) parser did not perform well against the LL tools, but that could be SableCC's implementation rather than a deficiency of LALR(1). (The Java grammar from JavaCUP, another LALR(1) tool, was incomplete and unable to parse the corpus.) When reparsing the corpus, ALL(*) lookahead gets cache hits at each decision and parsing is 30% faster at 3.73s. When reparsing with tree construction (time not shown), ALL(*) outperforms handbuilt Javac (4.4s vs 4.73s). Reparsing speed matters to tools like development environments.

The GLR parsers we tested are up to two orders of magnitude slower at Java parsing than ALL(*). Of the GLR tools, Elkhound has the best performance, primarily because it relies on a linear LR(1) stack instead of a GSS whenever possible. Further, we allowed Elkhound to disambiguate during the parse like ALL(*). Elkhound uses a separate lexer, unlike JSGLR and DParser, which are scannerless. A possible explanation for the observed performance difference with ALL(*) is that the Java grammar we ported to Elkhound and DParser is biased towards ALL(*), but this objection is not well-founded: GLR should also benefit from highly-deterministic and unambiguous grammars. GLL has the slowest speed here, perhaps because Rascal's team ported SDF's GLR Java grammar, which is not optimized for GLL (grammar variations can affect performance). Rascal is also scannerless and is currently the only available GLL tool.
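
The two-stage parse used for ALL(*) above is available to any ANTLR 4 user through the runtime API. The following is the standard idiom (a sketch of what our test rig does, not code from the paper); JavaParser and compilationUnit stand in for any generated parser and start rule.

    import org.antlr.v4.runtime.*;
    import org.antlr.v4.runtime.atn.PredictionMode;
    import org.antlr.v4.runtime.misc.ParseCancellationException;
    import org.antlr.v4.runtime.tree.ParseTree;

    final class TwoStage {
        static ParseTree parse(JavaParser parser, CommonTokenStream tokens) {
            // Stage 1: optimistic SLL prediction; bail out instead of recovering.
            parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
            parser.setErrorHandler(new BailErrorStrategy());
            try {
                return parser.compilationUnit();
            } catch (ParseCancellationException e) {
                // Stage 2 (rare): rewind and reparse with full LL prediction.
                tokens.seek(0);
                parser.reset();
                parser.setErrorHandler(new DefaultErrorStrategy());
                parser.getInterpreter().setPredictionMode(PredictionMode.LL);
                return parser.compilationUnit();
            }
        }
    }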

The biggest issue with general algorithms is that they are highly unpredictable in time and space, which can make them unsuitable for some commercial applications. Figure 10 summarizes the performance of the same tools against a single 3.2M Java file. Elkhound took 7.65s to parse the 123M Java corpus, but took 3.35 minutes to parse the 3.2M Java file. It crashed (out of memory) with parse forest construction on. DParser's time jumped from a corpus time of 98s to 10.5 hours on the 3.2M file. Rascal and JSGLR scale reasonably well to the 3.2M file, but use 2.6G and 1G RAM, respectively. In contrast, ALL(*) parses the 3.2M file in 360ms with tree construction using 8M. ANTLR 3 is fast but is slower and uses more memory (due to backtracking memoization) than ANTLR 4.

Figure 10. Time and space to parse and optionally build trees for a 3.2M Java file. Space is the median reported after GC during the parse using the -XX:+PrintGC option (process monitoring for C++). Times include lexing; all input preloaded. † building trees. (a) disambiguating during the parse, no trees, estimated time.

  Tool              Time         RAM (M)
  Javac†            89 ms        7
  ANTLR4            201 ms       8
  JavaCC            206 ms       7
  ANTLR4†           360 ms       8
  ANTLR3            1,048 ms     143
  SableCC†          1,174 ms     201
  Rats!†            1,365 ms     22
  JSGLR†            15.4 sec     1,030
  Rascal†           24.9 sec     2,622
  (no DFA) ANTLR4   42.5 sec     27
  elkhound(a)       3.35 min     3
  DParser†          10.5 hours   100+
  elkhound†         out of mem   5,390+

7.2 ALL(*) performance across languages

Figure 11 gives the bytes-per-second throughput of ALL(*) parsers for 8 languages, including Java for comparison. The number of test files and file sizes vary greatly (according to the input we could reasonably collect); smaller files yield higher parse-time variance.

• C: Derived from the C11 specification; has no indirect left-recursion; altered stack-sensitive rule to render SLL (see text below). 813 preprocessed files, 159.8M source from the postgres database.
• Verilog2001: Derived from the Verilog 2001 spec; removed indirect left-recursion. 385 files, 659k from [3] and the web.
• JSON: Derived from the spec. 4 files, 331k from twitter.
• DOT: Derived from the spec. 48 files, 19.5M collected from the web.
• Lua: Derived from the Lua 5.2 spec. 751 files, 123k from github.
• XML: Derived from the spec. 1 file, 117M from an XML benchmark.
• Erlang: Derived from an LALR(1) grammar. 500 preprocessed files, 8M.

Some of these grammars yield reasonable but much slower parse times compared to Java and XML, but demonstrate that programmers can convert a language specification to ANTLR's meta-syntax and get a working grammar without major modifications. (In our experience, grammar specifications are rarely tuned to a particular tool or parsing strategy and are often ambiguous.) Later, programmers can use ANTLR's profiling and diagnostics to improve performance, as with any programming task. For example, the C11 specification grammar is LL not SLL because of rule declarationSpecifiers, which we altered to be SLL in our C grammar (getting a 7x speed boost).

7.3 Effect of lookahead DFA on performance

The lookahead DFA cache is critical to ALL(*) performance. To demonstrate the cache's effect on parsing speed, we disabled the DFA and repeated our Java experiments. Consider the 3.73s parse time from Figure 9 to reparse the Java corpus with pure cache hits. With the lookahead DFA cache disabled completely, the parser took 12 minutes (717.6s). Figure 10 shows that disabling the cache increases parse time from 203ms to 42.5s on the 3.2M file. This performance is in line with the high cost of GLL and GLR parsers that also do not reduce parser speculation by memoizing parsing decisions. As an intermediate value, clearing the DFA cache before parsing each corpus file yields a total time of 34s instead of 12 minutes. This isolates cache use to a single file and demonstrates that cache warm-up occurs quickly even within a single file.

DFA size increases linearly as the parser encounters new lookahead phrases. Figure 12 shows the growth in the number of DFA states as the (slowest four) parsers from Figure 11 encounter new files. Languages like C that have constructs with common left-prefixes require deep lookahead in LL parsers to distinguish phrases; e.g., struct {...} x; and struct {...} f(); share a large left-prefix. In contrast, the Verilog2001 parser uses very few DFA states (but runs slower due to a non-SLL rule). Similarly, after seeing the entire 123M Java corpus, the Java parser uses just 31,626 DFA states, adding an average of ~2.5 states per file parsed. DFA size does, however, continue to grow as the parser encounters unfamiliar input. Programmers can clear the cache and ALL(*) will adapt to subsequent input.

7.4 Empirical parse-time complexity

Given the wide range of throughput in Figure 11, one could suspect nonlinear behavior for the slower parsers. To investigate, we plotted parse time versus file size in Figure 13 and drew least-squares regression and LOWESS [6] data-fitting curves. LOWESS curves are parametrically unconstrained (not required to be a line or any particular polynomial) and they virtually mirror each regression line, providing strong evidence that the relationship between parse time and input size is linear.

The same methodology shows that the parser generated from the non-SLL grammar (not shown) taken from the C11 specification is also linear, despite being much slower than our SLL version.

We have yet to see nonlinear behavior in practice, but the theoretical worst-case behavior of ALL(*) parsing is O(n⁴). Experimental parse-time data for the following contrived worst-case grammar exhibits quartic behavior for input a, aa, aaa, ..., aⁿ (with n ≤ 120 symbols, the most we could test in a reasonable amount of time): S → A$, A → aAA | aA | a. The generated parser requires a prediction upon each input symbol and each prediction must examine all remaining input. The closure operation performed for each input symbol must examine the entire depth of the GSS, which could be size n. Finally, merging two GSS can take O(n) in our implementation, yielding O(n⁴) complexity.

From these experiments, we conclude that shifting grammar analysis to parse-time to get ALL(*) strength is not only practical but yields extremely efficient parsers, competing with the hand-tuned recursive-descent parser of the Java compiler. Memoizing analysis results with DFA is critical to such performance. Despite O(n⁴) theoretical complexity, ALL(*) appears to be linear in practice and does not exhibit the unpredictable performance or large memory footprint of the general algorithms.

Figure 11. Throughput in KByte/s (lexing + parsing; all input preloaded in RAM):

  Grammar       KB/sec
  XML           45,993
  Java          24,972
  JSON          17,696
  DOT           16,152
  Lua            5,698
  C              4,238
  Verilog2001    1,994
  Erlang           751

[Figure 12 (plot; data not reproduced). DFA growth rate: number of DFA states vs number of files parsed for the Erlang, C, Lua, and Verilog2001 parsers; files parsed in disk order.]

[Figure 13 (plots; data not reproduced). Linear parse time (ms) vs file size (KB) for 740 C files, 385 Verilog2001 files, 500 Erlang files, and 642 Lua files. Linear regression (dashed line) and LOWESS unconstrained curve coincide, giving strong evidence of linearity. Curves computed on the bottom 99% of file sizes, zooming in for detail on the bottom 40% of parse times.]

8. Related work

For decades, researchers have worked towards increasing the recognition strength of efficient but non-general LL and LR parsers and increasing the efficiency of general algorithms such as Earley's O(n³) algorithm [8]. Parr [21] and Charles [4] statically generated LL(k) and LR(k) parsers for k > 1. Parr and Fisher's LL(*) [20] and Bermudez and Schimpf's LAR(m) [2] statically computed LL and LR parsers augmented with cyclic DFA that could examine arbitrary amounts of lookahead. These parsers were based upon LL-regular [12] and LR-regular [7] parsers, which have the undesirable property of being undecidable in general. Introducing backtracking dramatically increases recognition strength and avoids static grammar analysis undecidability issues, but it is undesirable because it has embedded mutator issues, reduces performance, and complicates single-step debugging. Packrat parsers (PEGs) [9] try decision productions in order and pick the first that succeeds. PEGs are O(n) because they memoize partial parsing results, but they suffer from the A → a | ab quirk where ab is silently unmatchable.

To improve general parsing performance, Tomita [26] introduced GLR, a general algorithm based upon LR(k) that conceptually forks subparsers at each conflicting LR(k) state at parse-time to explore all possible paths. Tomita shows GLR to be 5x-10x faster than Earley. A key component of GLR parsing is the graph-structured stack (GSS) [26] that prevents parsing the same input twice in the same way. (GLR pushes input symbols and LR states on the GSS whereas ALL(*) pushes ATN states.) Elkhound [18] introduced hybrid GLR parsers that use a single stack for all LR(1) decisions and a GSS when necessary to match ambiguous portions of the input. (We found Elkhound's parsers to be faster than those of other GLR tools.) GLL [25] is the LL analog of GLR and also uses subparsers and a GSS to explore all possible paths; GLL uses k = 1 lookahead where possible for efficiency. GLL is O(n³) and GLR is O(n^(p+1)) where p is the length of the longest grammar production.

Earley parsers scale gracefully from O(n) for deterministic grammars to O(n³) in the worst case for ambiguous grammars, but performance is not good enough for general use. LR(k) state machines can improve the performance of such parsers by statically computing as much as possible. LRE [16] is one such example. Despite these optimizations, general algorithms are still very slow compared to deterministic parsers augmented with deep lookahead.

The problem with arbitrary lookahead is that it is impossible to compute statically for many useful grammars (the LL-regular condition is undecidable). By shifting lookahead analysis to parse-time, ALL(*) gains the power to handle any grammar without left recursion because it can launch subparsers to determine which path leads to a valid parse. Unlike GLR, speculation stops when all remaining subparsers are associated with a single alternative production, thus computing the minimum lookahead sequence.

To get performance, ALL(*) records a mapping from that lookahead sequence to the predicted production using a DFA for use by subsequent decisions. The context-free language subsets encountered during a parse are finite and, therefore, ALL(*) lookahead languages are regular. Ancona et al. [1] also performed parse-time analysis, but they only computed fixed LR(k) lookahead and did not adapt to the actual input as ALL(*) does. Perlin [23] operated on an RTN like ALL(*) and computed k = 1 lookahead during the parse.

ALL(*) is similar to Earley in that both are top-down and operate on a representation of the grammar at parse-time, but Earley is parsing, not computing lookahead DFA. In that sense, Earley is not performing grammar analysis. Earley also does not manage an explicit GSS during the parse. Instead, items in Earley states have "parent pointers" that refer to other states that, when threaded together, form a GSS. Earley's SCANNER operation corresponds to ALL(*)'s move function. The PREDICTOR and COMPLETER operations correspond to the push and pop operations in ALL(*)'s closure function. An Earley state is the set of all parser configurations reachable at some absolute input depth, whereas an ALL(*) DFA state is a set of configurations reachable from a lookahead depth relative to the current decision. Unlike the completely general algorithms, ALL(*) seeks a single parse of the input, which allows the use of an efficient LL stack during the parse.

Parsing strategies that continuously speculate or support ambiguity have difficulty with mutators because they are hard to undo. A lack of mutators reduces the generality of semantic predicates that alter the parse, as they cannot test arbitrary state computed previously during the parse. Rats! [10] supports restricted semantic predicates and Yakker [13] supports semantic predicates that are functions of previously-parsed terminals. Because ALL(*) does not speculate during the actual parse, it supports arbitrary mutators and semantic predicates. Space considerations preclude a more detailed discussion of related work here; a more detailed analysis can be found in reference [20].

9. Conclusion

ANTLR 4 generates an ALL(*) parser for any CFG without indirect or hidden left-recursion. ALL(*) combines the simplicity, efficiency, and predictability of conventional top-down LL(k) parsers with the power of a GLR-like mechanism to make parsing decisions. The critical innovation is to shift grammar analysis to parse-time, caching analysis results in lookahead DFA for efficiency. Experiments show ALL(*) outperforms general (Java) parsers by orders of magnitude, exhibiting linear time and space behavior for 8 languages. The speed of the ALL(*) Java parser is within 20% of the Java compiler's hand-tuned recursive-descent parser. In theory, ALL(*) is O(n⁴), in line with the low polynomial bound of GLR. ANTLR is widely used in practice, indicating that ALL(*) provides practical parsing power without sacrificing the flexibility and simplicity of recursive-descent parsers desired by programmers.

10. Acknowledgments

We thank Elizabeth Scott, Adrian Johnstone, and Mark Johnson for discussions on parsing complexity. Eelco Visser and Jurgen Vinju provided parser test code from JSGLR and Rascal. Etienne Gagnon generated SableCC parsers.

References

[1] Ancona, M., Dodero, G., Gianuzzi, V., and Morgavi, M. Efficient construction of LR(k) states and tables. ACM Trans. Program. Lang. Syst. 13, 1 (Jan. 1991), 150–178.
[2] Bermudez, M. E., and Schimpf, K. M. Practical arbitrary lookahead LR parsing. Journal of Computer and System Sciences 41, 2 (1990).
[3] Brown, S., and Vranesic, Z. Fundamentals of Digital Logic with Verilog Design. McGraw-Hill Series in ECE, 2003.
[4] Charles, P. A Practical Method for Constructing Efficient LALR(k) Parsers with Automatic Error Recovery. PhD thesis, New York University, New York, NY, USA, 1991.
[5] Clarke, K. The top-down parsing of expressions. Unpublished technical report, Dept. of Computer Science and Statistics, Queen Mary College, London, June 1986.
[6] Cleveland, W. S. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74 (1979), 829–836.
[7] Cohen, R., and Culik, K. LR-regular grammars: an extension of LR(k) grammars. In SWAT '71 (Washington, DC, USA, 1971), IEEE Computer Society, pp. 153–165.
[8] Earley, J. An efficient context-free parsing algorithm. Communications of the ACM 13, 2 (1970), 94–102.
[9] Ford, B. Parsing expression grammars: A recognition-based syntactic foundation. In POPL (2004), ACM Press.
[10] Grimm, R. Better extensibility through modular syntax. In PLDI (2006), ACM Press, pp. 38–51.
[11] Hopcroft, J., and Ullman, J. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Massachusetts, 1979.
[12] Jarzabek, S., and Krawczyk, T. LL-regular grammars. Information Processing Letters 4, 2 (1975), 31–37.
[13] Jim, T., Mandelbaum, Y., and Walker, D. Semantics and algorithms for data-dependent grammars. In POPL (2010).
[14] Johnson, M. The computational complexity of GLR parsing. In Generalized LR Parsing, M. Tomita, Ed. Kluwer, 1991.
[15] Kipps, J. Generalized LR Parsing. Springer, 1991, pp. 43–59.
[16] McLean, P., and Horspool, R. N. A faster Earley parser. In CC (1996), Springer, pp. 281–293.
[17] McPeak, S. Elkhound: A fast, practical GLR parser generator. Tech. rep., UC Berkeley (EECS), Dec. 2002.
[18] McPeak, S., and Necula, G. C. Elkhound: A fast, practical GLR parser generator. In CC (2004), pp. 73–88.
[19] Parr, T. The Definitive ANTLR 4 Reference. The Pragmatic Programmers, 2013.

[20] Parr, T., and Fisher, K. LL(*): The foundation of the ANTLR parser generator. In PLDI (2011), pp. 425–436.
[21] Parr, T. J. Obtaining practical variants of LL(k) and LR(k) for k > 1 by splitting the atomic k-tuple. PhD thesis, Purdue University, West Lafayette, IN, USA, 1993.
[22] Parr, T. J., and Quong, R. W. Adding semantic and syntactic predicates to LL(k): pred-LL(k). In CC (1994).
[23] Perlin, M. LR recursive transition networks for Earley and Tomita parsing. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (1991), ACL '91.
[24] Plevyak, J. DParser: GLR parser generator, Oct. 2013.
[25] Scott, E., and Johnstone, A. GLL parsing. Electron. Notes Theor. Comput. Sci. 253, 7 (Sept. 2010), 177–189.
[26] Tomita, M. Efficient Parsing for Natural Language. Kluwer Academic Publishers, 1986.
[27] Woods, W. A. Transition network grammars for natural language analysis. Comm. of the ACM 13, 10 (1970).

A. Correctness and complexity analysis

Theorem A.1. ALL(*) languages are closed under union.

Proof. Let predicated grammars G1 = (N1, T, P1, S1, Π1, M1) and G2 = (N2, T, P2, S2, Π2, M2) describe L(G1) and L(G2), respectively. For applicability to both parsers and scannerless parsers, assume that the terminal space T is the set of valid characters. Assume N1 ∩ N2 = ∅ by renaming nonterminals if necessary. Assume that the predicates and mutators of G1 and G2 operate in disjoint environments, S1 and S2. Construct

  G′ = (N1 ∪ N2, T, P1 ∪ P2, S′, Π1 ∪ Π2, M1 ∪ M2)

with S′ → S1 | S2. Then, L(G′) = L(G1) ∪ L(G2). ∎

Lemma A.1. The ALL(*) parser for non-left-recursive G with lookahead DFA deactivated recognizes sentence w iff w ∈ L(G).

Proof. The ATN for G recognizes w iff w ∈ L(G). Therefore, we can equivalently prove that ALL(*) is a faithful implementation of an ATN. Without lookahead DFA, prediction is a straightforward ATN simulator: a top-down parser that makes accurate parsing decisions using GLR-like subparsers that can examine the entire remaining input and ATN submachine call stack. ∎

Theorem A.2. (Correctness). The ALL(*) parser for non-left-recursive G recognizes sentence w iff w ∈ L(G).

Proof. Lemma A.1 shows that an ALL(*) parser correctly recognizes w without the DFA cache. The essence of the proof then is to show that ALL(*)'s adaptive lookahead DFA do not break the parse by giving different prediction decisions than straightforward ATN simulation. We only need to consider the case of unpredicated SLL parsing, as ALL(*) only caches decision results in this case.

if case: By induction on the state of the lookahead DFA for any given decision A. Base case. The first prediction for A begins with an empty DFA and must activate ATN simulation to choose alternative αi using prefix u of the remaining input wr. As ATN simulation yields proper predictions, the ALL(*) parser correctly predicts αi from a cold start and then records the mapping u : i in the DFA. If there is a single viable alternative, i is the associated production number. If ATN simulation finds multiple viable alternatives, i is the minimum production number associated with alternatives from that set.

Induction step. Assume the lookahead DFA correctly predicts productions for every prefix u of wr seen by the parser at A. We must show that, starting with an existing DFA, ALL(*) properly adds a path through the DFA for an unfamiliar prefix u of w′r. There are several cases:

1. u is a prefix of both w′r and wr for a previous wr. The lookahead DFA gives the correct answer for u by the induction assumption. The DFA is not updated.

2. w′r = bx and all previous wr = ay for some a ≠ b. This case reduces to the cold-start base case because there is no D0 --b--> D edge. ATN simulation predicts αi and adds the path for prefix u of w′r from D0 to fi.

3. w′r = vax and wr = vby for some previously seen wr with common prefix v and a ≠ b. DFA simulation reaches D from D0 for input v. D has an edge for b but not a. ATN simulation predicts αi and augments the DFA, starting with an edge on a from D and leading eventually to fi.

only if case: The ALL(*) parser reports an error for w ∉ L(G). Assume the opposite, that the parser successfully parses w. That would imply that there exists an ATN configuration derivation (S, pS, [], w) ⇒* (S, p′S, [], ε) for w through G's corresponding ATN. But that would require w ∈ L(G). Therefore the ALL(*) parser reports a syntax error for w. The accuracy of the ALL(*) lookahead cache is irrelevant because no path exists through the ATN or the parser. ∎

Lemma A.2. The set of viable productions for LL is always a subset of SLL's viable productions for a given decision A and remaining input string wr.

Proof. If the move-closure analysis operation does not reach stop state p′A for submachine A, SLL and LL behave identically and so they share the same set of viable productions. If closure reaches the stop state for the decision entry rule, p′A, there are configurations of the form (p′A, −, γ) where, for convenience, the usual GSS Γ is split into single stacks, γ. In LL prediction mode, γ = γ0, which is either a single stack or empty if A = S. In SLL mode, γ = #, signaling no stack information. Function closure must consider all possible γ0 parser stacks. Since any single stack must be contained within the set of all possible stacks, LL closure operations consider at most the same number of ATN paths as SLL. ∎

Lemma A.3. For w ∉ L(G) and non-left-recursive G, SLL reports a syntax error.

Proof. As in the only if case of Theorem 6.1, there is no valid ATN configuration derivation for w regardless of how adaptivePredict chooses productions. ∎

Theorem A.3. Two-stage parsing for non-left-recursive G recognizes sentence w iff w ∈ L(G).

Proof. By Lemma A.3, SLL and LL behave the same when w ∉ L(G). It remains to show that SLL prediction either behaves like LL for input w ∈ L(G) or reports a syntax error, signalling a need for the LL second stage. Let V and V′ be the sets of viable production numbers for A using SLL and LL, respectively. By Lemma A.2, V′ ⊆ V. There are two cases to consider:

1. If min(V) = min(V′), SLL and LL choose the same production. SLL succeeds for w. E.g., V = {1, 2, 3} and V′ = {1, 3}, or V = {1} and V′ = {1}.

2. If min(V) ≠ min(V′), then min(V) ∉ V′ because LL finds min(V) nonviable. SLL would report a syntax error. E.g., V = {1, 2, 3} and V′ = {2, 3}, or V = {1, 2} and V′ = {2}.

In all possible combinations of V and V′, SLL behaves like LL or reports a syntax error for w ∈ L(G). ∎

Theorem A.4. A GSS has O(n) nodes for n input symbols.

Proof. For nonterminals N and ATN states Q, there are |N| × |Q| p --A--> q ATN transitions if every grammar position invokes every nonterminal. That limits the number of new GSS nodes to |Q|² for a closure operation (which cannot transition terminal edges). ALL(*) performs n + 1 closures for n input symbols, giving |Q|²(n + 1) nodes, or O(n) as Q is not a function of the input. ∎

Lemma A.4. Examining a lookahead symbol has O(n²) time.

Proof. Lookahead is a move-closure operation that computes new target DFA state D′ as a function of the ATN configurations in D. There are |Q| × m ≈ |Q|² configurations of the form (p, i, −) ∈ D for |Q| ATN states and m alternative productions in the current decision. The cost of move is not a function of input size n. Closure of D computes closure(c) ∀ c ∈ D, and closure(c) can walk the entire GSS back to the root (the empty stack). That gives a cost of |Q|² configurations times |Q|²(n + 1) GSS nodes (per Theorem A.4), or O(n) add operations to build D′. Adding a configuration is dominated by the graph merge, which (in our implementation) is proportional to the depth of the graph. The total cost for move-closure is O(n²). ∎

Theorem A.5. ALL(*) parsing of n input symbols has O(n⁴) time.

Proof. Worst case, the parser must examine all remaining input symbols during prediction for each of the n input symbols, giving O(n²) lookahead operations. The cost of each lookahead operation is O(n²) by Lemma A.4, giving an overall parsing cost of O(n⁴). ∎

B. Pragmatics

This section describes some of the practical considerations associated with implementing the ALL(*) algorithm.

B.1 Reducing warm-up time

Many decisions in a grammar are LL(1) and they are easy to identify statically. Instead of always generating "switch on adaptivePredict" decisions in the recursive-descent parsers, ANTLR generates "switch on token type" decisions whenever possible. This LL(1) optimization does not affect the size of the generated parser but reduces the number of lookahead DFA that the parser must compute (see the sketch at the end of this subsection).

Originally, we anticipated "training" a parser on a large input corpus and then serializing the lookahead DFA to disk to avoid re-computing DFA for subsequent parser runs. As shown in Section 7, DFA construction is fast enough that serializing and deserializing the DFA is unnecessary.
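
To illustrate the two decision shapes, here is a simplified sketch of generated code; the token types, decision number, and production methods are hypothetical, and real ANTLR-generated code differs in detail.

    // Illustrative shapes of the two decision styles (not ANTLR's actual output).
    class DecisionShapes {
        static final int ID = 1, ENUM = 2;   // hypothetical token types

        interface TokenStream { int LA(int i); }

        // LL(1) decision identified statically: a plain switch on the next
        // token type; no lookahead DFA is ever built for this decision.
        void ll1Decision(TokenStream input) {
            switch (input.LA(1)) {
                case ID:   matchIdProduction();   break;
                case ENUM: matchEnumProduction(); break;
                default:   throw new RuntimeException("no viable alternative");
            }
        }

        // Non-LL(1) decision: ask adaptivePredict, which consults and extends
        // the shared lookahead DFA for this decision at parse-time.
        void adaptiveDecision(TokenStream input) {
            switch (adaptivePredict(input, 7)) { // 7 = hypothetical decision number
                case 1: production1(); break;
                case 2: production2(); break;
            }
        }

        void matchIdProduction() { }
        void matchEnumProduction() { }
        void production1() { }
        void production2() { }
        int adaptivePredict(TokenStream input, int decision) { return 1; } // stub
    }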

B.2 Semantic predicate evaluation

For clarity, the algorithm described in this paper uses pure ATN simulation for all decisions that have semantic predicates on production left edges. In practice, ANTLR uses lookahead DFA that track predicates in accept states to handle semantic-context-sensitive prediction. Tracking the predicates in the DFA allows prediction to avoid expensive ATN simulation if predicate evaluation during SLL simulation predicts a unique production. Semantic predicates are not common but are critical to solving some context-sensitive parsing problems; e.g., predicates are used internally by ANTLR to encode operator precedence when rewriting left-recursive rules. So it is worth the extra complexity to evaluate predicates during SLL prediction. Consider the predicated rule from Section 2.1:

  id : ID | {!enum_is_keyword}? 'enum' ;

The second production is viable only when !enum_is_keyword evaluates to true. In the abstract, that means the parser would need two lookahead DFA, one per semantic condition. Instead, ANTLR's ALL(*) implementation creates a DFA (via SLL prediction) with edge D0 --enum--> f2 where f2 is an augmented DFA accept state that tests !enum_is_keyword. Function adaptivePredict returns production 2 upon enum if !enum_is_keyword, else it throws a no-viable-alternative exception.

The algorithm described in this paper also does not support semantic predicates outside of the decision entry rule. In practice, ALL(*) analysis must evaluate all predicates reachable from the decision entry rule without stepping over a terminal edge in the ATN. For example, the simplified ALL(*) algorithm in this paper considers only predicates π1 and π2 for the productions of S in the following (ambiguous) grammar:

  S → {π1}? A b | {π2}? A b
  A → {π3}? a | {π4}? a

Input ab matches either alternative of S and, in practice, ANTLR evaluates "π1 and (π3 or π4)" to test the viability of S's first production, not just π1. After simulating S and A's ATN submachines, the lookahead DFA for S would be D0 --a--> D′ --b--> f1,2. Augmented accept state f1,2 predicts productions 1 or 2 depending on semantic contexts π1 ∧ (π3 ∨ π4) and π2 ∧ (π3 ∨ π4), respectively. To keep track of semantic context during SLL simulation, ANTLR ATN configurations contain an extra element π: (p, i, γ, π). Element π carries along semantic context, and ANTLR stores predicate-to-production pairs in the augmented DFA accept states.

B.3 Error reporting and recovery

ALL(*) prediction can scan arbitrarily far ahead, so erroneous lookahead sequences can be long. By default, ANTLR-generated parsers print the entire sequence. To recover, parsers consume tokens until a token appears that could follow the current rule. ANTLR provides hooks to override reporting and recovery strategies.

ANTLR parsers issue error messages for invalid input phrases and attempt to recover. For mismatched tokens, ANTLR attempts single-token insertion and deletion to resynchronize. If the remaining input is not consistent with any production of the current nonterminal, the parser consumes tokens until it finds a token that could reasonably follow the current nonterminal. Then the parser continues parsing as if the current nonterminal had succeeded. ANTLR improves error recovery over ANTLR 3 for EBNF subrules by inserting synchronization checks at the start and at the "loop" continuation test to avoid prematurely exiting the subrule. For example, consider the following class definition rule:

  classdef : 'class' ID '{' member+ '}' ;
  member : 'int' ID ';' ;

An extra semicolon in the member list such as int i;; int j; should not force surrounding rule classdef to abort. Instead, the parser ignores the extra semicolon and looks for another member. To reduce cascading error messages, the parser issues no further messages until it correctly matches a token.

B.4 Multi-threaded execution

Applications often require parallel execution of multiple parser instances, even for the same language. For example, web-based application servers parse multiple incoming XML or JSON data streams using multiple instances of the same parser. For memory efficiency, all ALL(*) parser instances for a given language must share lookahead DFA. The Java code that ANTLR generates uses a shared memory model and threads for concurrency, which means parsers must update shared DFA in a thread-safe manner. Multiple threads can be simulating the DFA while other threads are adding states and edges to it. Our goal is thread safety, but concurrency also provides a small speed-up for lookahead DFA construction (observed empirically).

The key to thread safety in Java while maintaining high throughput lies in avoiding excessive locking (synchronized blocks). There are only two data structures that require locking: Q, the set of DFA states, and ∆, the set of edges. Our implementation factors state addition, Q += D′, into an addDFAState function that waits on a lock for Q before testing a DFA state for membership or adding a state. This is not a bottleneck, as DFA simulation can proceed during DFA construction without locking since it traverses edges to visit existing DFA states without examining Q.

Adding DFA edges to an existing state requires fine-grained locking, but only on that specific DFA state, as our implementation maintains an edge array for each DFA state. We allow multiple readers but a single writer. A lock on testing the edges is unnecessary even if another thread is racing to set that edge. If edge D --a--> D′ exists, the simulation simply transitions to D′. If simulation does not find an existing edge, it launches ATN simulation starting from D to compute D′ and then sets element edge[a] for D. Two threads could find a missing edge on a and both launch ATN simulation, racing to add D --a--> D′. D′ would be the same in either case, so there is no hazard as long as that specific edge array is updated safely using synchronization. To encounter a contested lock, two or more ATN simulation threads must try to add an edge to the same DFA state.
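
A sketch of this locking discipline, under assumed types and ignoring Java memory-model publication subtleties that a real implementation must also handle; this is not ANTLR's actual code.

    // Sketch of the locking discipline described above (illustrative).
    import java.util.HashMap;
    import java.util.Map;

    class SharedDFA {
        static final int MAX_TOKEN_TYPES = 256; // assumed bound for the sketch

        static class DFAState {
            // One edge array per state; guarded by this state's own monitor.
            final DFAState[] edges = new DFAState[MAX_TOKEN_TYPES];
        }

        // Q: real states hash/compare their configuration sets; elided here.
        private final Map<DFAState, DFAState> states = new HashMap<>();

        // Q += D': membership test and insertion under a single lock on Q.
        synchronized DFAState addDFAState(DFAState d) {
            DFAState existing = states.get(d);
            if (existing != null) return existing; // another thread won the race
            states.put(d, d);
            return d;
        }

        // Readers race freely: a stale null just means "run ATN simulation";
        // both racers compute the same D', so the last write is harmless.
        DFAState getEdge(DFAState d, int a) {
            return d.edges[a];                     // unlocked read
        }

        void setEdge(DFAState d, int a, DFAState target) {
            synchronized (d) {                     // fine-grained: only this state
                d.edges[a] = target;
            }
        }
    }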

C. Left-recursion elimination

ANTLR supports directly left-recursive rules by rewriting them to a non-left-recursive version that also removes any ambiguities. For example, the natural grammar for describing arithmetic expression syntax is one of the most common (ambiguous) left-recursive rules. The following grammar supports simple modulo and additive expressions:

  E → E % E | E + E | id

E is directly left-recursive because at least one production begins with E (∃ E ⇒ Eα), which is a problem for top-down parsers. Grammars meant for top-down parsers must use a cumbersome non-left-recursive equivalent instead, with a separate nonterminal for each level of operator precedence:

  E′ → M (+ M)*   Additive, lower precedence
  M → P (% P)*    Modulo, higher precedence
  P → id          Primary (id means identifier)

The deeper the rule invocation, the higher the precedence. At parse-time, matching a single identifier, a, requires l rule invocations for l precedence levels.

E is easier to read than E′, but the left-recursive version is ambiguous, as there are two interpretations of input a+b+c: (a+b)+c and a+(b+c). Bottom-up parser generators such as bison use operator precedence specifiers (e.g., %left '%') to resolve such ambiguities. The non-left-recursive grammar E′ is unambiguous because it implicitly encodes precedence rules according to the nonterminal nesting depth.

Ideally, a parser generator would support left-recursion and provide a way to resolve ambiguities implicitly with the grammar itself, without resorting to external precedence specifiers. ANTLR does this by rewriting nonterminals with direct left-recursion and inserting semantic predicates to resolve ambiguities according to the order of productions. The rewriting process leads to generated parsers that mimic Clarke's [5] technique.

We chose to eliminate just direct left-recursion because general left-recursion elimination can result in transformed grammars orders of magnitude larger than the original [11] and yields parse trees only loosely related to those of the original grammar. ANTLR automatically constructs parse trees appropriate for the original left-recursive grammar, so the programmer is unaware of the internal restructuring. Direct left-recursion also covers the most common grammar cases (from long experience building grammars). This discussion focuses on grammars for arithmetic expressions, but the transformation rules work just as well for other left-recursive constructs such as C declarators: D → *D, D → D[], D → D(), D → id.

Eliminating direct left-recursion without concern for ambiguity is straightforward [11]. Let A → αj for j = 1..s be the non-left-recursive productions and A → Aβk for k = 1..r be the directly left-recursive productions, where αj, βk ⇏* ε. Replace those productions with:

  A → α1 A′ | ... | αs A′
  A′ → β1 A′ | ... | βr A′ | ε

The transformation is easier to see using EBNF:

  A → A′ A″*
  A′ → α1 | ... | αs
  A″ → β1 | ... | βr

or just A → (α1 | ... | αs)(β1 | ... | βr)*. For example, the left-recursive E rule becomes:

  E → id (% E | + E)*

This non-left-recursive version is still ambiguous because there are two derivations for a+b+c. The default ambiguity resolution policy chooses to match input as soon as possible, resulting in interpretation (a+b)+c.

The difference in associativity does not matter for expressions using a single operator, but expressions with a mixture of operators must associate operands and operators according to operator precedence. For example, the parser must recognize a%b+c as (a%b)+c, not a%(b+c). The two interpretations are shown in Figure 14.

To choose the appropriate interpretation, the generated parser must compare the previous operator's precedence to the current operator's precedence in the (% E | + E)* "loop." In Figure 14, the marked expansion of E is the critical one: it must match just id and return immediately, allowing the invoking E to match the + to form the parse tree in (a) as opposed to (b).

To support such comparisons, productions get precedence numbers that are the reverse of production numbers. The precedence of the ith production is n − i + 1 for n original productions of E. That assigns precedence 3 to E → E % E, precedence 2 to E → E + E, and precedence 1 to E → id.

Next, each nested invocation of E needs operator precedence information from the invoking E. The simplest mechanism is to pass a precedence parameter, pr, to E and require: an expansion of E[pr] can match only those subexpressions whose precedence meets or exceeds pr.

To enforce this, the left-recursion elimination procedure inserts predicates into the (% E | + E)* loop. Here is the transformed, unambiguous, and non-left-recursive rule:

  E[pr] → id ({3 ≥ pr}? % E[4] | {2 ≥ pr}? + E[3])*

References to E elsewhere in the grammar become E[0]; e.g., S → E becomes S → E[0]. Input a%b+c yields the parse tree for E[0] shown in (a) of Figure 15.

Production "{3 ≥ pr}? % E[4]" is viable when the precedence of the modulo operation, 3, meets or exceeds parameter pr. The first invocation of E has pr = 0 and, since 3 ≥ 0, the parser expands "% E[4]" in E[0]. When parsing invocation E[4], predicate {2 ≥ pr}? fails because the precedence of the + operator is too low: 2 < 4. Consequently, E[4] does not match the + operator, deferring to the invoking E[0].

A key element of the transformation is the choice of E parameters, E[4] and E[3] in this grammar. For left-associative operators like % and +, the right operand gets one more precedence level than the operator itself. This guarantees that the invocation of E for the right operand matches only operations of higher precedence.
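
The predicated E[pr] rule corresponds to a precedence-climbing recursive-descent method. A minimal Java sketch under assumed helpers, eliding tree construction (names are illustrative, not ANTLR's generated code):

    // Sketch of the method generated from
    // E[pr] -> id ({3 >= pr}? % E[4] | {2 >= pr}? + E[3])*
    class ExprParser {
        interface Lexer { String peek(); void consume(); } // assumed token source
        final Lexer input;
        ExprParser(Lexer input) { this.input = input; }

        // e(0) parses a full expression, mirroring references E[0].
        void e(int pr) {
            match("id");                          // primary: E -> id
            while (true) {
                if ("%".equals(input.peek()) && 3 >= pr) {
                    input.consume();
                    e(4);                         // left-assoc: right operand E[op+1]
                } else if ("+".equals(input.peek()) && 2 >= pr) {
                    input.consume();
                    e(3);
                } else {
                    return;                       // predicate failed or no operator:
                }                                 // defer to the invoking E
            }
        }

        void match(String t) {
            if (!t.equals(input.peek())) throw new RuntimeException("expected " + t);
            input.consume();
        }
    }

Calling e(0) parses a full expression; the recursive calls e(4) and e(3) encode the left-associative operand precedences chosen above.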

[Figure 14 (parse trees; diagrams not reproduced). Parse trees for a%b+c under E → id (% E | + E)*: (a) (a%b)+c and (b) a%(b+c).]

[Figure 15 (nonterminal expansion trees; diagrams not reproduced). Nonterminal expansion trees for E[pr] → id ({3 ≥ pr}? % E[4] | {2 ≥ pr}? + E[3])*: (a) a%b+c associates as (a%b)+c; (b) a=b=c associates as a=(b=c).]

[Figure 16 (call trees and parse trees; diagrams not reproduced). Nonterminal call trees and parse trees for productions E → -E | E! | E%E | id: (a), (b) --a!! interpreted as ((-(-a))!)!; (c), (d) -a%b! interpreted as (-a)%(b!).]

For right-associative operations, the E operand gets the same precedence as the current operator. Here is a variation on the expression grammar that has a right-associative assignment operator instead of the addition operator:

  E → E % E | E =right E | id

where notation =right is shorthand for the actual ANTLR syntax "<assoc=right> E = E". The interpretation of a=b=c should be right associative, a=(b=c). To get that associativity, the transformed rule need differ only in the right operand, E[2] versus E[3]:

  E[pr] → id ({3 ≥ pr}? % E[4] | {2 ≥ pr}? = E[2])*

The E[2] expansion can match an assignment, as shown in (b) of Figure 15, since predicate 2 ≥ 2 is true.

Unary prefix and suffix operators are hardwired as right- and left-associative, respectively. Consider the following E with a negation prefix and a "not" suffix operator:

  E → -E | E! | E % E | id

Prefix operators are not left-recursive and so they go into the first subrule, whereas left-recursive suffix operators go into the predicated loop like binary operators:

  E[pr] → (id | -E[4]) ({3 ≥ pr}? ! | {2 ≥ pr}? % E[3])*

Figure 16 illustrates the rule invocation tree (a record of the call stacks) and associated parse trees resulting from an ANTLR-generated parser. Unary operations in contiguous productions all have the same relative precedence and are, therefore, "evaluated" in the order specified. E.g., E → -E | +E | id must interpret -+a as -(+a), not +(-a).

Nonconforming left-recursive productions E → E or E → ε are rewritten without concern for ambiguity using the typical elimination technique.

Because of the need to resolve ambiguities with predicates and compute A parameters, ANTLR uses the following procedure rather than the pure transformation given earlier.

C.1 Left-Recursion Elimination Rules

To eliminate direct left-recursion in nonterminals and resolve ambiguities, ANTLR looks for the four patterns:

  Ai → A α A   (binary and ternary operator)
  Ai → A α     (suffix operator)
  Ai → α A     (prefix operator)
  Ai → α       (primary or "other")

The subscript on productions, Ai, captures the production number in the original grammar when needed. Hidden and indirect left-recursion results in static left-recursion errors from ANTLR. The transformation procedure from G to G′ is:

1. Strip away directly left-recursive nonterminal references.
2. Collect prefix and primary productions into newly-created A′.
3. Collect binary, ternary, and suffix productions into newly-created A″.
4. Prefix productions in A″ with the precedence-checking semantic predicate {pr(i) >= pr}? where pr(i) = n − i + 1.
5. Rewrite A references among binary, ternary, and prefix productions as A[nextpr(i, assoc)], where nextpr(i, assoc) = (assoc == left ? i + 1 : i).
6. Rewrite any other A references within any production in P (including A′ and A″) as A[0].
7. Rewrite the original A rule as A[pr] → A′ A″*.

In practice, ANTLR uses the EBNF form rather than A′ A″*.
