Adaptive LL(*) Parsing: The Power of Dynamic Analysis

Terence Parr, University of San Francisco, [email protected]
Sam Harwell, University of Texas at Austin, [email protected]
Kathleen Fisher, Tufts University, [email protected]

Abstract

Despite the advances made by modern parsing strategies such as PEG, LL(*), GLR, and GLL, parsing is not a solved problem. Existing approaches suffer from a number of weaknesses, including difficulties supporting side-effecting embedded actions, slow and/or unpredictable performance, and counter-intuitive matching strategies. This paper introduces the ALL(*) parsing strategy that combines the simplicity, efficiency, and predictability of conventional top-down LL(k) parsers with the power of a GLR-like mechanism to make parsing decisions. The critical innovation is to move grammar analysis to parse-time, which lets ALL(*) handle any non-left-recursive context-free grammar. ALL(*) is O(n^4) in theory but consistently performs linearly on grammars used in practice, outperforming general strategies such as GLL and GLR by orders of magnitude. ANTLR 4 generates ALL(*) parsers and supports direct left-recursion through grammar rewriting. Widespread ANTLR 4 use (5000 downloads/month in 2013) provides evidence that ALL(*) is effective for a wide variety of applications.

Categories and Subject Descriptors F.4.2 Grammars and Other Rewriting Systems [Parsing]; D.3.1 Formal Languages [Syntax]

General Terms Algorithms, Languages, Theory

Keywords nondeterministic parsing, DFA, augmented transition networks, grammar, ALL(*), LL(*), GLR, GLL, PEG

1. Introduction

Computer language parsing is still not a solved problem in practice, despite the sophistication of modern parsing strategies and a long history of academic study. When machine resources were scarce, it made sense to force programmers to contort their grammars to fit the constraints of deterministic LALR(k) or LL(k) parser generators.[1] As machine resources grew, researchers developed more powerful, but more costly, nondeterministic parsing strategies following both "bottom-up" (LR-style) and "top-down" (LL-style) approaches. Strategies include GLR [26], Parser Expression Grammar (PEG) [9], LL(*) [20] from ANTLR 3, and, recently, GLL [25], a fully general top-down strategy.

[1] We use the term deterministic in the way that deterministic finite automata (DFA) differ from nondeterministic finite automata (NFA): the next symbol(s) uniquely determine the action.

Although these newer strategies are much easier to use than LALR(k) and LL(k) parser generators, they suffer from a variety of weaknesses. First, nondeterministic parsers sometimes have unanticipated behavior. GLL and GLR return multiple parse trees (forests) for ambiguous grammars because they were designed to handle natural language grammars, which are typically ambiguous. For computer languages, ambiguity is almost always an error. One can certainly walk a parse forest to disambiguate it, but that approach costs extra time, space, and machinery for the uncommon case. PEGs are unambiguous by definition but have a quirk where rule A → a | ab (meaning "A matches either a or ab") can never match ab since PEGs choose the first alternative that matches a prefix of the remaining input. Nested backtracking makes debugging PEGs difficult.

Second, side-effecting programmer-supplied actions (mutators), like print statements, should be avoided in any strategy that continuously speculates (PEG) or supports multiple interpretations of the input (GLL and GLR) because such actions may never really take place [17]. (Though DParser [24] supports "final" actions when the programmer is certain a reduction is part of an unambiguous final parse.) Without side effects, actions must buffer data for all interpretations in immutable data structures or provide undo actions. The former mechanism is limited by memory size and the latter is not always easy or possible. The typical approach to avoiding mutators is to construct a parse tree for post-parse processing, but such artifacts fundamentally limit parsing to input files whose trees fit in memory. Parsers that build parse trees cannot analyze large files or infinite streams, such as network traffic, unless they can be processed in logical chunks.

Third, our experiments (Section 7) show that GLL and GLR can be slow and unpredictable in time and space.

Their complexities are, respectively, O(n^3) and O(n^{p+1}) where p is the length of the longest production in the grammar [14]. (GLR is typically quoted as O(n^3) as Kipps [15] gave such an algorithm, albeit with a constant so high as to be impractical.) In theory, general parsers should handle deterministic grammars in near-linear time. In practice, we found GLL and GLR to be ~135x slower than ALL(*) on a corpus of 12,920 Java 6 library source files (123M) and 6 orders of magnitude slower on a single 3.2M Java file, respectively.

LL(*) addresses these weaknesses by providing a mostly deterministic parsing strategy that uses regular expressions, represented as deterministic finite automata (DFA), to potentially examine the entire remaining input rather than the fixed k-sequences of LL(k). Using DFA for lookahead limits LL(*) decisions to distinguishing alternatives with regular lookahead languages, even though lookahead languages (the set of all possible remaining input phrases) are often context-free. But the main problem is that the LL(*) grammar condition is statically undecidable and grammar analysis sometimes fails to find regular expressions that distinguish between alternative productions. ANTLR 3's static analysis detects and avoids potentially-undecidable situations, failing over to backtracking parsing decisions instead. This gives LL(*) the same a | ab quirk as PEGs for such decisions. Backtracking decisions that choose the first matching alternative also cannot detect obvious ambiguities such as A → α | α, where α is some sequence of grammar symbols that makes α | α non-LL(*).

1.1 Dynamic grammar analysis

In this paper, we introduce Adaptive LL(*), or ALL(*), parsers that combine the simplicity of deterministic top-down parsers with the power of a GLR-like mechanism to make parsing decisions. Specifically, LL parsing suspends at each prediction decision point (nonterminal) and then resumes once the prediction mechanism has chosen the appropriate production to expand. The critical innovation is to move grammar analysis to parse-time; no static grammar analysis is needed. This choice lets us avoid the undecidability of static LL(*) grammar analysis and lets us generate correct parsers (Theorem 6.1) for any non-left-recursive context-free grammar (CFG). While static analysis must consider all possible input sequences, dynamic analysis need only consider the finite collection of input sequences actually seen.

The idea behind the ALL(*) prediction mechanism is to launch subparsers at a decision point, one per alternative production. The subparsers operate in pseudo-parallel to explore all possible paths. Subparsers die off as their paths fail to match the remaining input. The subparsers advance through the input in lockstep so analysis can identify a sole survivor at the minimum lookahead depth that uniquely predicts a production. If multiple subparsers coalesce together or reach the end of file, the predictor announces an ambiguity and resolves it in favor of the lowest production number associated with a surviving subparser. (Productions are numbered to express precedence as an automatic means of resolving ambiguities, like PEGs; Bison also resolves conflicts by choosing the production specified first.) Programmers can also embed semantic predicates [22] to choose between ambiguous interpretations.

ALL(*) parsers memoize analysis results, incrementally and dynamically building up a cache of DFA that map lookahead phrases to predicted productions. (We use the term analysis in the sense that ALL(*) analysis yields lookahead DFA like static LL(*) analysis.) The parser can make future predictions at the same parser decision and lookahead phrase quickly by consulting the cache. Unfamiliar input phrases trigger the grammar analysis mechanism, simultaneously predicting an alternative and updating the DFA. DFA are suitable for recording prediction results, despite the fact that the lookahead language at a given decision typically forms a context-free language. Dynamic analysis only needs to consider the finite context-free language subsets encountered during a parse, and any finite set is regular.

To avoid the exponential nature of nondeterministic subparsers, prediction uses a graph-structured stack (GSS) [25] to avoid redundant computations. GLR uses essentially the same strategy except that ALL(*) only predicts productions with such subparsers whereas GLR actually parses with them. Consequently, GLR must push terminals onto the GSS but ALL(*) does not.

ALL(*) parsers handle the task of matching terminals and expanding nonterminals with the simplicity of LL but have O(n^4) theoretical time complexity (Theorem 6.3) because, in the worst case, the parser must make a prediction at each input symbol and each prediction must examine the entire remaining input; examining an input symbol can cost O(n^2). O(n^4) is in line with the complexity of GLR. In Section 7, we show empirically that ALL(*) parsers for common languages are efficient and exhibit linear behavior in practice.

The advantages of ALL(*) stem from moving grammar analysis to parse time, but this choice places an additional burden on grammar functional testing. As with all dynamic approaches, programmers must cover as many grammar position and input sequence combinations as possible to find grammar ambiguities. Standard source code coverage tools can help programmers measure grammar coverage for ALL(*) parsers. High coverage in the generated code corresponds to high grammar coverage.

The ALL(*) algorithm is the foundation of the ANTLR 4 parser generator (ANTLR 3 is based upon LL(*)). ANTLR 4 was released in January 2013 and gets about 5000 downloads/month (source, binary, or ANTLRworks2 development environment, counting non-robot entries in web logs with unique IP addresses to get a lower bound). Such activity provides evidence that ALL(*) is useful and usable.
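The cache-centric control flow just described can be summarized in a few lines of Java. This sketch is our illustration with hypothetical names (dfaByDecision, match, simulateATNAndExtendDFA), not ANTLR's implementation:

    // Our sketch of the per-decision DFA cache (hypothetical types and names).
    int adaptivePredict(int decision, TokenStream input) {
        LookaheadDFA dfa = dfaByDecision[decision]; // built incrementally at parse-time
        Integer alt = dfa.match(input);             // cache hit: walk DFA edges, no analysis
        if (alt != null) {
            return alt;                             // production number recorded earlier
        }
        // cache miss: simulate the ATN subparsers on this lookahead phrase,
        // record the result as new DFA states/edges, and return the prediction
        return simulateATNAndExtendDFA(dfa, input);
    }

On familiar input phrases the cache answers in time proportional to the lookahead consumed, which is why the DFA cache dominates parsing speed in practice (Section 7).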

The remainder of this paper is organized as follows. We begin by introducing the ANTLR 4 parser generator (Section 2) and discussing the ALL(*) parsing strategy (Section 3). Next, we define predicated grammars, their augmented transition network representation, and lookahead DFA (Section 4). Then, we describe ALL(*) grammar analysis and present the parsing algorithm itself (Section 5). Finally, we support our claims regarding ALL(*) correctness (Section 6) and efficiency (Section 7) and examine related work (Section 8). Appendix A has proofs for key ALL(*) theorems, Appendix B discusses algorithm pragmatics, and Appendix C has left-recursion elimination details.

2. ANTLR 4

ANTLR 4 accepts as input any context-free grammar that does not contain indirect or hidden left-recursion.[2] From the grammar, ANTLR 4 generates a recursive-descent parser that uses an ALL(*) production prediction function (Section 3). ANTLR currently generates parsers in Java or C#. ANTLR 4 grammars use yacc-like syntax with extended BNF (EBNF) operators such as Kleene star (*) and token literals in single quotes. Grammars contain both lexical and syntactic rules in a combined specification for convenience. ANTLR 4 generates both a lexer and a parser from the combined specification. By using individual characters as input symbols, ANTLR 4 grammars can be scannerless and composable because ALL(*) languages are closed under union (Theorem 6.2), providing the benefits of modularity described by Grimm [10]. (We will henceforth refer to ANTLR 4 as ANTLR and explicitly mark earlier versions.)

[2] Indirectly left-recursive rules call themselves through another rule; e.g., A → B, B → A. Hidden left-recursion occurs when an empty production exposes left recursion; e.g., A → BA, B → ε.

    grammar Ex; // generates class ExParser
    // action defines ExParser member: enum_is_keyword
    @members {boolean enum_is_keyword = true;}
    stat: expr '=' expr ';'        // production 1
        | expr ';'                 // production 2
        ;
    expr: expr '*' expr
        | expr '+' expr
        | expr '(' expr ')'        // f(x)
        | id
        ;
    id  : ID | {!enum_is_keyword}? 'enum' ;
    ID  : [A-Za-z]+ ;              // match id with upper, lowercase
    WS  : [ \t\r\n]+ -> skip ;     // ignore whitespace

Figure 1. Sample left-recursive ANTLR 4 predicated grammar Ex

Programmers can embed side-effecting actions (mutators), written in the host language of the parser, in the grammar. The actions have access to the current state of the parser. The parser ignores mutators during speculation to prevent actions from "launching missiles" speculatively. Actions typically extract information from the input stream and create data structures.

ANTLR also supports semantic predicates, which are side-effect-free Boolean-valued expressions written in the host language that determine the semantic viability of a particular production. Semantic predicates that evaluate to false during the parse render the surrounding production nonviable, dynamically altering the language generated by the grammar at parse-time.[3] Predicates significantly increase the strength of a parsing strategy because predicates can examine the parse stack and surrounding input context to provide an informal context-sensitive parsing capability. Semantic actions and predicates typically work together to alter the parse based upon previously-discovered information. For example, a C grammar could have embedded actions to define type symbols from typedef constructs, like typedef int i32;, and predicates to distinguish type names from other identifiers in subsequent definitions like i32 x;.

[3] Previous versions of ANTLR supported syntactic predicates to disambiguate cases where static grammar analysis failed; this facility is not needed in ANTLR 4 because of ALL(*)'s dynamic analysis.

2.1 Sample grammar

Figure 1 illustrates ANTLR's yacc-like metalanguage by giving the grammar for a simple programming language with assignment and expression statements terminated by semicolons. There are two grammar features that render this grammar non-LL(*) and, hence, unacceptable to ANTLR 3. First, rule expr is left recursive. ANTLR 4 automatically rewrites the rule to be non-left-recursive and unambiguous, as described in Section 2.2. Second, the alternative productions of rule stat have a common recursive prefix (expr), which is sufficient to render stat undecidable from an LL(*) perspective. ANTLR 3 would detect recursion on production left edges and fail over to a backtracking decision at runtime.

Predicate {!enum_is_keyword}? in rule id allows or disallows enum as a valid identifier according to the predicate at the moment of prediction. When the predicate is false, the parser treats id as id : ID ;, disallowing enum as an id, as the lexer matches enum as a separate token from ID. This example demonstrates how predicates allow a single grammar to describe subsets or variations of the same language.

2.2 Left-recursion removal

The ALL(*) parsing strategy itself does not support left-recursion, but ANTLR supports direct left-recursion through grammar rewriting prior to parser generation. Direct left-recursion covers the most common cases, such as arithmetic expression productions, like E → E . id, and C declarators. We made an engineering decision not to support indirect or hidden left-recursion because these forms are much less common and removing all left recursion can lead to exponentially-big transformed grammars. For example, the C11 language specification grammar contains lots of direct left-recursion but no indirect or hidden recursion. See Appendix C for more details.
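For concreteness, here is one EBNF form of rule expr from Figure 1 with the direct left-recursion removed. This is our simplified illustration of the idea rather than ANTLR's literal output, which additionally encodes operator precedence and associativity (see Appendix C):

    expr : id ( '*' expr | '+' expr | '(' expr ')' )* ;

Each loop iteration consumes one operator-and-operand tail, so the rule matches the same language without referring to itself on a left edge.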

    void stat() { // parse according to rule stat
      switch (adaptivePredict("stat", call stack)) {
        case 1 : // predict production 1
          expr(); match('='); expr(); match(';');
          break;
        case 2 : // predict production 2
          expr(); match(';');
          break;
      }
    }

Figure 2. Recursive-descent code for stat in grammar Ex

Figure 3. ATN for ANTLR rule stat in grammar Ex: from the stat entry state, an ε edge to state p begins the path expr '=' expr ';' for production 1, and an ε edge to state q begins the path expr ';' for production 2; both paths end with an ε edge to the stop state.

2.3 Lexical analysis with ALL(*)

ANTLR uses a variation of ALL(*) for lexing that fully matches tokens instead of just predicting productions like ALL(*) parsers do. After warm-up, the lexer will have built a DFA similar to what regular-expression-based tools such as lex would create statically. The key difference is that ALL(*) lexers are predicated context-free grammars, not just regular expressions, so they can recognize context-free tokens such as nested comments and can gate tokens in and out according to semantic context. This design is possible because ALL(*) is fast enough to handle lexing as well as parsing.

ALL(*) is also suitable for scannerless parsing because of its recognition power, which comes in handy for context-sensitive lexical problems like merging C and SQL languages. Such a union has no clear lexical sentinels demarcating lexical regions:

    int next = select ID from users where name='Raj'+1;
    int from = 1, select = 2;
    int x = select * from;

See grammar extras/CSQL in [19] for a proof of concept.

3. Introduction to ALL(*) parsing

In this section, we explain the ideas and intuitions behind ALL(*) parsing. Section 5 will then present the algorithm more formally. The strength of a top-down parsing strategy is related to how the strategy chooses which alternative production to expand for the current nonterminal. Unlike LL(k) and LL(*) parsers, ALL(*) parsers always choose the first alternative that leads to a valid parse. All non-left-recursive grammars are therefore ALL(*).

Instead of relying on static grammar analysis, an ALL(*) parser adapts to the input sentences presented to it at parse-time. The parser analyzes the current decision point (nonterminal with multiple productions) using a GLR-like mechanism to explore all possible decision paths with respect to the current "call" stack of in-process nonterminals and the remaining input on-demand. The parser incrementally and dynamically builds a lookahead DFA per decision that records a mapping from lookahead sequence to predicted production number. If the DFA constructed to date matches the current lookahead, the parser can skip analysis and immediately expand the predicted alternative. Experiments in Section 7 show that ALL(*) parsers usually get DFA cache hits and that DFA are critical to performance.

Because ALL(*) differs from deterministic top-down methods only in the prediction mechanism, we can construct conventional recursive-descent LL parsers but with an important twist. ALL(*) parsers call a special prediction function, adaptivePredict, that analyzes the grammar to construct lookahead DFA instead of simply comparing the lookahead to a statically-computed token set. Function adaptivePredict takes a nonterminal and parser call stack as parameters and returns the predicted production number or throws an exception if there is no viable production. For example, rule stat from the example in Section 2.1 yields a parsing procedure similar to Figure 2.

ALL(*) prediction has a structure similar to the well-known NFA-to-DFA subset construction algorithm. The goal is to discover the set of states the parser could reach after having seen some or all of the remaining input relative to the current decision. As in subset construction, an ALL(*) DFA state is the set of parser configurations possible after matching the input leading to that state. Instead of an NFA, however, ALL(*) simulates the actions of an augmented recursive transition network (ATN) [27] representation of the grammar since ATNs closely mirror grammar structure. (ATNs look just like syntax diagrams that can have actions and semantic predicates.) LL(*)'s static analysis also operates on an ATN for the same reason. Figure 3 shows the ATN submachine for rule stat.

An ATN configuration represents the execution state of a subparser and tracks the ATN state, predicted production number, and ATN subparser call stack: tuple (p, i, γ).[4] Configurations include production numbers so prediction can identify which production matches the current lookahead. Unlike static LL(*) analysis, ALL(*) incrementally builds DFA considering just the lookahead sequences it has seen instead of all possible sequences.

[4] Component i does not exist in the machine configurations of GLL, GLR, or Earley [8].

When parsing reaches a decision for the first time, adaptivePredict initializes the lookahead DFA for that decision by creating a DFA start state, D0. D0 is the set of ATN subparser configurations reachable without consuming an input symbol starting at each production left edge. For example, construction of D0 for nonterminal stat in Figure 3 would first add ATN configurations (p, 1, []) and (q, 2, []) where p and q are ATN states corresponding to production 1 and 2's left edges and [] is the empty subparser call stack (if stat is the start symbol).
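For concreteness, an ATN configuration and a DFA state can be modeled as small value types. The following is a minimal Java sketch with our own names; the ANTLR 4 runtime's versions carry additional bookkeeping, and ATNState/StackNode are assumed types (StackNode is sketched in Section 5.1):

    // Our minimal model of tuple (p, i, gamma) and of a DFA state as a set of tuples.
    final class ATNConfig {
        final ATNState state;    // p: where this subparser is in the ATN
        final int alt;           // i: production this subparser predicts
        final StackNode stack;   // gamma: ATN call stack (see Section 5.1)
        ATNConfig(ATNState state, int alt, StackNode stack) {
            this.state = state; this.alt = alt; this.stack = stack;
        }
        @Override public boolean equals(Object o) {
            return o instanceof ATNConfig c
                && state == c.state && alt == c.alt && stack.equals(c.stack);
        }
        @Override public int hashCode() {
            return java.util.Objects.hash(state, alt, stack);
        }
    }
    final class DFAState {
        final java.util.Set<ATNConfig> configs = new java.util.HashSet<>();
        final java.util.Map<Integer, DFAState> edges = new java.util.HashMap<>(); // token type -> target
    }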

Figure 4. Prediction DFA for decision stat: (a) after x=y;, the DFA is D0 --ID--> s1 --'='--> :1; (b) after x=y; and f(x);, s1 also gains a path over '(' ID ')' ';' to accept state :2, and the DFA has edges for both the = and ; symbols.

Analysis next computes a new DFA state indicating where ATN simulation could reach after consuming the first lookahead symbol and then connects the two DFA states with an edge labeled with that symbol. Analysis continues, adding new DFA states, until all ATN configurations in a newly-created DFA state predict the same production: (−, i, −). Function adaptivePredict marks that state as an accept state and returns to the parser with that production number. Figure 4a shows the lookahead DFA for decision stat after adaptivePredict has analyzed input sentence x=y;. The DFA does not look beyond = because = is sufficient to uniquely distinguish expr's productions. (Notation :1 means "predict production 1.")

In the typical case, adaptivePredict finds an existing DFA for a particular decision. The goal is to find or build a path through the DFA to an accept state. If adaptivePredict reaches a (non-accept) DFA state without an edge for the current lookahead symbol, it reverts to ATN simulation to extend the DFA (without rewinding the input). For example, to analyze a second input phrase for stat, such as f(x);, adaptivePredict finds an existing ID edge from D0 and jumps to s1 without ATN simulation. There is no existing edge from s1 for the left parenthesis so analysis simulates the ATN to complete a path to an accept state, which predicts the second production, as shown in Figure 4b. Note that because sequence ID(ID) predicts both productions, analysis continues until the DFA has edges for the = and ; symbols.

If ATN simulation computes a new target state that already exists in the DFA, simulation adds a new edge targeting the existing state and switches back to DFA simulation mode starting at that state. Targeting existing states is how cycles can appear in the DFA. Extending the DFA to handle unfamiliar phrases empirically decreases the likelihood of future ATN simulation, thereby increasing parsing speed (Section 7).

3.1 Predictions sensitive to the call stack

Parsers cannot always rely upon lookahead DFA to make correct decisions. To handle all non-left-recursive grammars, ALL(*) prediction must occasionally consider the parser call stack available at the start of prediction (denoted γ0 in Section 5). To illustrate the need for stack-sensitive predictions, consider that predictions made while recognizing a Java method definition might depend on whether the method was defined within an interface or class definition. (Java interface methods cannot have bodies.) Here is a simplified grammar that exhibits a stack-sensitive decision in nonterminal A:

    S → xB | yC    B → Aa    C → Aba    A → b | ε

Without the parser stack, no amount of lookahead can uniquely distinguish between A's productions. Lookahead ba predicts A → b when B invokes A but predicts A → ε when C invokes A. If prediction ignores the parser call stack, there is a prediction conflict upon ba.

Parsers that ignore the parser call stack for prediction are called Strong LL (SLL) parsers. The recursive-descent parsers programmers build by hand are in the SLL class. By convention, the literature refers to SLL as LL but we distinguish the terms since "real" LL is needed to handle all grammars. The above grammar is LL(2) but not SLL(k) for any k, though duplicating A for each call site renders the grammar SLL(2).

Creating a different lookahead DFA for each possible parser call stack is not feasible as the number of stack permutations is exponential in the stack depth. Instead, we take advantage of the fact that most decisions are not stack-sensitive and build DFA ignoring the parser call stack. If SLL ATN simulation finds a prediction conflict (Section 5.3), it cannot be sure if the lookahead phrase is ambiguous or stack-sensitive. In this case, adaptivePredict must re-examine the lookahead using the parser stack γ0. This hybrid or optimized LL mode improves performance by caching stack-insensitive prediction results in lookahead DFA when possible while retaining full stack-sensitive prediction power. Optimized LL mode applies on a per-decision basis, but two-stage parsing, described next, can often avoid LL simulation completely. (We henceforth use SLL to indicate stack-insensitive parsing and LL to indicate stack-sensitive.)

3.2 Two-stage ALL(*) parsing

SLL is weaker but faster than LL. Since we have found that most decisions are SLL in practice, it makes sense to attempt parsing entire inputs in "SLL only mode," which is stage one of the two-stage ALL(*) parsing algorithm. If, however, SLL mode finds a syntax error, it might have found an SLL weakness or a real syntax error, so we have to retry the entire input using optimized LL mode, which is stage two. This counterintuitive strategy, which potentially parses entire inputs twice, can dramatically increase speed over the single-stage optimized LL mode stage. For example, two-stage parsing with the Java grammar (Section 7) is 8x faster than one-stage optimized LL mode to parse a 123M corpus. The two-stage strategy relies on the fact that SLL either behaves like LL or gets an error (Theorem 6.5). For invalid sentences, there is no derivation for the input regardless of how the parser chooses productions. For valid sentences, SLL chooses productions as LL would or picks a production that ultimately leads to an error (LL finds that choice nonviable). Even in the presence of ambiguities, SLL often resolves conflicts as LL would. For example, despite a few ambiguities in our Java grammar, SLL mode correctly parses all inputs we have tried without failing over to LL. Nonetheless, the second (LL) stage must remain to ensure correctness.

4. Predicated grammars, ATNs, and DFA

To formalize ALL(*) parsing, we first need to recall some background material, specifically, the formal definitions of predicated grammars, ATNs, and lookahead DFA.

4.1 Predicated grammars

To formalize ALL(*) parsing, we first need to formally define the predicated grammars from which ALL(*) parsers are derived. A predicated grammar G = (N, T, P, S, Π, M) has elements:

• N is the set of nonterminals (rule names)
• T is the set of terminals (tokens)
• P is the set of productions
• S ∈ N is the start symbol
• Π is a set of side-effect-free semantic predicates
• M is a set of actions (mutators)

    A ∈ N                    nonterminal
    a, b, c, d ∈ T           terminal
    X ∈ (N ∪ T)              production element
    α, β, δ ∈ X*             sequence of grammar symbols
    u, v, w, x, y ∈ T*       sequence of terminals
    ε                        empty string
    $                        end-of-file "symbol"
    π ∈ Π                    predicate in host language
    μ ∈ M                    action in host language
    λ ∈ (N ∪ Π ∪ M)          reduction label
    ~λ = λ1..λn              sequence of reduction labels

    Production rules:
    A → αi                   ith context-free production of A
    A → {πi}? αi             ith production predicated on semantics
    A → {μi}                 ith production with mutator

Figure 5. Predicated grammar notation

    Prod:    A → α                                    implies (S, uAδ) ⇒ (S, uαδ)
    Sem:     A → {π}? α  and  π(S)                    implies (S, uAδ) ⇒ (S, uαδ)
    Action:  A → {μ}                                  implies (S, uAδ) ⇒ (μ(S), uδ)
    Closure: (S, α) ⇒ (S′, α′), (S′, α′) ⇒* (S″, β)   implies (S, α) ⇒* (S″, β)

Figure 6. Predicated grammar leftmost derivation rules

Predicated ALL(*) grammars differ from those of LL(*) [20] only in that ALL(*) grammars do not need or support syntactic predicates. Predicated grammars in the formal sections of this paper use the notation shown in Figure 5. The derivation rules in Figure 6 define the meaning of a predicated grammar. To support semantic predicates and mutators, the rules reference state S, which abstracts user state during parsing. The judgment form (S, α) ⇒ (S′, β) means: "In machine state S, grammar sequence α reduces in one step to modified state S′ and grammar sequence β." The judgment (S, α) ⇒* (S′, β) denotes repeated applications of the one-step reduction rule. These reduction rules specify a leftmost derivation. A production with semantic predicate πi is viable only if πi is true of current state S. Finally, an action production uses the specified mutator μi to update S.

Formally, the language generated by grammar sequence α in user state S is L(S, α) = {w | (S, α) ⇒* (S′, w)} and the language of grammar G is L(S0, G) = {w | (S0, S) ⇒* (S, w)} for initial user state S0 (S0 can be empty). If u is a prefix of w or equal to w, we write u ⪯ w. Language L is ALL(*) iff there exists an ALL(*) grammar for L. Theoretically, the language class of L(G) is recursively enumerable because each mutator could be a Turing machine. In reality, grammar writers do not use this generality so it is standard practice to consider the language class to be the context-sensitive languages instead. The class is context-sensitive rather than context-free as predicates can examine the call stack and terminals to the left and right.

This formalism has various syntactic restrictions not present in actual ANTLR grammars, for example, forcing mutators into their own rules and disallowing the common Extended BNF (EBNF) notation such as α* and α+ closures. We can make these restrictions without loss of generality because any grammar in the general form can be translated into this more restricted form.
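As a concrete illustration of the judgment forms (our example, not the paper's), consider a grammar fragment with productions S → Ac and A → b, no predicates, and no mutators. Two applications of rule Prod give the leftmost derivation

    (S0, S) ⇒ (S0, Ac) ⇒ (S0, bc)

so bc ∈ L(S0, G). The user state S0 is threaded through unchanged because no Action rule fires; an action production A → {μ} would instead rewrite the state to μ(S0) at the step where it is expanded.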

4.2 Resolving ambiguity

An ambiguous grammar is one in which the same input sequence can be recognized in multiple ways. The rules in Figure 6 do not preclude ambiguity. However, for a practical programming language parser, each input should correspond to a unique parse. To that end, ALL(*) uses the order of the productions in a rule to resolve ambiguities in favor of the production with the lowest number. This strategy is a concise way for programmers to resolve ambiguities automatically and resolves the well-known if-then-else ambiguity in the usual way by associating the else with the most recent if. PEGs and Bison parsers have the same resolution policy.

To resolve ambiguities that depend on current state S, programmers can insert semantic predicates but must make them mutually exclusive for all ambiguous input sequences to render such productions unambiguous. Statically, mutual exclusion cannot be enforced because predicates are written in a Turing-complete language. At parse-time, however, ALL(*) evaluates predicates and dynamically reports input phrases for which multiple predicated productions are viable. If the programmer fails to satisfy mutual exclusivity, ALL(*) uses production order to resolve the ambiguity.

4.3 Augmented transition networks

Given predicated grammar G = (N, T, P, S, Π, M), the corresponding ATN M_G = (Q, Σ, Δ, E, F) has elements [20]:

• Q is the set of states
• Σ is the edge alphabet N ∪ T ∪ Π ∪ M
• Δ is the transition relation mapping Q × (Σ ∪ ε) → Q
• E ⊆ Q = {p_A | A ∈ N} is the set of submachine entry states
• F ⊆ Q = {p′_A | A ∈ N} is the set of submachine final states

    Input grammar element                      Resulting ATN transitions
    A → αi                                     p_A --ε--> p_A,i --αi--> ... --ε--> p′_A
    A → {πi}? αi                               p_A --ε--> p_A,i --πi--> --αi--> ... --ε--> p′_A
    A → {μi}                                   p_A --ε--> p_A,i --μi--> p′_A
    A → εi                                     p_A --ε--> p_A,i --ε--> p′_A
    αi = X1 X2 ... Xm, Xj ∈ N ∪ T, j = 1..m    p0 --X1--> p1 --X2--> ... --Xm--> pm

Figure 7. Predicated grammar to ATN transformation

Figure 8. ATN for grammar G with P = {S → Ac | Ad, A → aA | b}: submachine S has paths p_S --ε--> p_S,1 --A--> p1 --c--> p2 --ε--> p′_S and p_S --ε--> p_S,2 --A--> p3 --d--> p4 --ε--> p′_S; submachine A has paths p_A --ε--> p_A,1 --a--> p5 --A--> p6 --ε--> p′_A and p_A --ε--> p_A,2 --b--> p7 --ε--> p′_A.

ATNs resemble the syntax diagrams used to document programming languages, with an ATN submachine for each nonterminal. Figure 7 shows how to construct the set of states Q and edges Δ from grammar productions. The start state for A is p_A ∈ Q and targets p_A,i, created from the left edge of αi, with an edge in Δ. The last state created from αi targets p′_A. Nonterminal edges p --A--> q are like function calls. They transfer control of the ATN to A's submachine, pushing return state q onto a state call stack so that control can continue from q after reaching the stop state for A's submachine, p′_A. Figure 8 gives the ATN for a simple grammar. The language matched by the ATN is the same as the language of the original grammar.

4.4 Lookahead DFA

ALL(*) parsers record prediction results obtained from ATN simulation with lookahead DFA, which are DFA augmented with accept states that yield predicted production numbers. There is one accept state per production of a decision.

Definition 4.1. Lookahead DFA are DFA with augmented accept states that yield predicted production numbers. For predicated grammar G = (N, T, P, S, Π, M), DFA M = (Q, Σ, Δ, D0, F) where:

• Q is the set of states
• Σ = T is the edge alphabet
• Δ is the transition function mapping Q × Σ → Q
• D0 ∈ Q is the start state
• F ⊆ Q = {f1, f2, ..., fn} is the set of final states, one fi per production i

A transition in Δ from state p to state q on symbol a ∈ Σ has the form p --a--> q, and we require that p --a--> q′ implies q = q′.

5. ALL(*) parsing algorithm

With the definitions of grammars, ATNs, and lookahead DFA formalized, we can present the key functions of the ALL(*) parsing algorithm. This section starts with a summary of the functions and how they fit together, then discusses a critical graph data structure, before presenting the functions themselves. We finish with an example of how the algorithm works.

Parsing begins with function parse, which behaves like a conventional top-down LL(k) parse function except that ALL(*) parsers predict productions with a special function called adaptivePredict, instead of the usual "switch on next k token types" mechanism. Function adaptivePredict simulates an ATN representation of the original predicated grammar to choose an αi production to expand for decision point A → α1 | ... | αn. Conceptually, prediction is a function of the current parser call stack γ0, remaining input w_r, and user state S if A has predicates. For efficiency, prediction ignores γ0 when possible (Section 3.1) and uses the minimum lookahead from w_r.

To avoid repeating ATN simulations for the same input and nonterminal, adaptivePredict assembles DFA that memoize input-to-predicted-production mappings, one DFA per nonterminal. Recall that each DFA state, D, is the set of ATN configurations possible after matching the lookahead symbols leading to that state. Function adaptivePredict calls startState to create the initial DFA state, D0, and then SLLpredict to begin simulation.

Function SLLpredict adds paths to the lookahead DFA that match some or all of w_r through repeated calls to target. Function target computes DFA target state D′ from current state D using move and closure operations similar to those found in subset construction. Function move finds all ATN configurations reachable on the current input symbol, and closure finds all configurations reachable without traversing a terminal edge. The primary difference from subset construction is that closure simulates the call and return of ATN submachines associated with nonterminals.

If SLL simulation finds a conflict (Section 5.3), SLLpredict rewinds the input and calls LLpredict to retry prediction, this time considering γ0. Function LLpredict is similar to SLLpredict but does not update a nonterminal's DFA because DFA must be stack-insensitive to be applicable in all stack contexts. Conflicts within ATN configuration sets discovered by LLpredict represent ambiguities. Both prediction functions use getConflictSetsPerLoc to detect conflicts, which are configurations representing the same parser location but different productions. To avoid failing over to LLpredict unnecessarily, SLLpredict uses getProdSetsPerState to see if a potentially non-conflicting DFA path remains when getConflictSetsPerLoc reports a conflict. If so, it is worth continuing with SLLpredict on the chance that more lookahead will resolve the conflict without recourse to full LL parsing.

Before describing these functions in detail, we review a fundamental graph data structure that they use to efficiently manage multiple call stacks à la GLL and GLR.

5.1 Graph-structured call stacks

The simplest way to implement ALL(*) prediction would be a classic backtracking approach, launching a subparser for each αi. The subparsers would consume all remaining input because backtracking subparsers do not know when to stop parsing; they are unaware of other subparsers' status. The independent subparsers would also lead to exponential time complexity. We address both issues by having the prediction subparsers advance in lockstep through w_r. Prediction terminates after consuming prefix u ⪯ w_r when all subparsers but one die off or when prediction identifies a conflict. Operating in lockstep also provides an opportunity for subparsers to share call stacks, thus avoiding redundant computations.

Two subparsers at ATN state p that share the same ATN stack top, qγ1 and qγ2, will mirror each other's behavior until simulation pops q from their stacks. Prediction can treat those subparsers as a single subparser by merging stacks. We merge stacks for all configurations in a DFA state of the form (p, i, γ1) and (p, i, γ2), forming a general configuration (p, i, Γ) with graph-structured stack (GSS) [25] Γ = γ1 ⊎ γ2, where ⊎ means graph merge. Γ can be the empty stack [], a special stack # used for SLL prediction (addressed shortly), an individual stack, or a graph of stack nodes. Merging individual stacks into a GSS reduces the potential size from exponential to linear complexity (Theorem 6.4). To represent a GSS, we use an immutable graph data structure with maximal sharing of nodes; for example, two merged stacks such as qγ0 ⊎ q′γ0, which share the parser stack γ0 at the bottom, represent the shared γ0 suffix only once in the graph. In the functions that follow, all additions to configuration sets, such as with operator +=, implicitly merge stacks.

There is a special case related to the stack condition at the start of prediction. Γ must distinguish between an empty stack and no stack information. For LL prediction, the initial ATN simulation stack is the current parser call stack γ0. The initial stack is only empty, γ0 = [], when the decision entry rule is the start symbol. Stack-insensitive SLL prediction, on the other hand, ignores the parser call stack and uses an initial stack of #, indicating no stack information. This distinction is important when computing the closure (Function 7) of configurations representing submachine stop states. Without parser stack information, a subparser that returns from decision entry rule A must consider all possible invocation sites; i.e., closure sees configuration (p′_A, −, #).

The empty stack [] is treated like any other node for LL prediction: Γ ⊎ [] yields the graph equivalent of set {Γ, []}, meaning that both Γ and the empty stack are possible. Pushing state p onto [] yields p[] not p because popping p must leave the [] empty-stack symbol. For SLL prediction, Γ ⊎ # = # for any graph Γ because # acts like a wildcard and represents the set of all stacks. The wildcard therefore contains any Γ. Pushing state p onto # yields p#.

5.2 ALL(*) parsing functions

We can now present the key ALL(*) functions, shown as numbered function boxes interspersed within the text of this section. Our discussion follows a top-down order and assumes that the ATN corresponding to grammar G, the semantic state S, the DFA under construction, and the input are in scope for all functions of the algorithm and that semantic predicates and actions can directly access S.

Function parse. The main entry point is function parse (shown in Function 1), which initiates parsing at the start symbol, argument S. The function begins by initializing a simulation call stack γ to an empty stack and setting ATN state "cursor" p to p_S,i, the ATN state on the left edge of S's production number i predicted by adaptivePredict. The function loops until the cursor reaches p′_S, the submachine stop state for S. If the cursor reaches another submachine stop state, p′_B, parse simulates a "return" by popping the return state q off the call stack and moving p to q.

    Function 1: parse(S)
      γ := []; i := adaptivePredict(S, γ); p := p_S,i;
      while true do
        if p = p′_B (i.e., p is a rule stop state) then
          if B = S (finished matching start rule S) then return;
          else let γ = qγ′ in γ := γ′; p := q;
        else
          switch t, where p --t--> q, do
            case b (i.e., terminal symbol transition):
              if b = input.curr() then p := q; input.advance();
              else parse error;
            case B: γ := qγ; i := adaptivePredict(B, γ); p := p_B,i;
            case μ: S := μ(S); p := q;
            case π: if π(S) then p := q else parse error;
            case ε: p := q;
          endsw

For p not at a stop state, parse processes ATN transition p --t--> q. There can be only one transition from p because of the way ATNs are constructed. If t is a terminal edge and matches the current input symbol, parse traverses the edge and moves to the next symbol. If t is a nonterminal edge referencing some B, parse simulates a submachine call by pushing return state q onto the stack, choosing the appropriate production left edge in B by calling adaptivePredict, and setting the cursor appropriately. For action edges, parse updates the state according to mutator μ and transitions to q. For predicate edges, parse transitions only if predicate π evaluates to true. During the parse, failed predicates behave like mismatched tokens. Upon an ε edge, parse moves to q. Function parse does not explicitly check that parsing stops at end-of-file because applications like development environments need to parse input subphrases.
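To make Section 5.1's representation concrete, here is a minimal Java sketch of immutable GSS nodes with merging. The names are ours (the analogous ANTLR 4 runtime type is PredictionContext), and the empty-stack and wildcard cases are simplified relative to the text:

    // Immutable GSS node: one or more stack tops, each with the rest of its stack
    // as a shared parent. merge() unions (top, parent) pairs and recursively merges
    // parents under equal tops. Simplified: [] handling and node caching elided.
    final class StackNode {
        static final StackNode WILDCARD = new StackNode(new int[0], new StackNode[0]); // #
        final int[] returnStates;   // sorted stack tops; length > 1 after merges
        final StackNode[] parents;  // parents[i] = remainder of stack under returnStates[i]

        StackNode(int[] returnStates, StackNode[] parents) {
            this.returnStates = returnStates; this.parents = parents;
        }
        StackNode push(int q) {     // receiver is shared by the new stack, not copied
            return new StackNode(new int[]{q}, new StackNode[]{this});
        }
        static StackNode merge(StackNode a, StackNode b) {
            if (a == b) return a;                                 // identity: shared subgraph
            if (a == WILDCARD || b == WILDCARD) return WILDCARD;  // # merged with anything is #
            java.util.TreeMap<Integer, StackNode> tops = new java.util.TreeMap<>();
            for (int i = 0; i < a.returnStates.length; i++) tops.put(a.returnStates[i], a.parents[i]);
            for (int i = 0; i < b.returnStates.length; i++)
                tops.merge(b.returnStates[i], b.parents[i], StackNode::merge);
            int[] rs = new int[tops.size()];
            StackNode[] ps = new StackNode[tops.size()];
            int i = 0;
            for (java.util.Map.Entry<Integer, StackNode> e : tops.entrySet()) {
                rs[i] = e.getKey(); ps[i++] = e.getValue();
            }
            return new StackNode(rs, ps);
        }
    }

Because nodes are immutable and shared, the identity check at the top of merge frequently succeeds, which is also what makes the graph-equality heuristic of Section 5.3 cheap.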

Function adaptivePredict. To predict a production, parse calls adaptivePredict (Function 2), which is a function of the decision nonterminal A and the current parser stack γ0. Because prediction only evaluates predicates during full LL simulation, adaptivePredict delegates to LLpredict if at least one of the productions is predicated.[5] For decisions that do not yet have a DFA, adaptivePredict creates DFA dfa_A with start state D0 in preparation for SLLpredict to add DFA paths. D0 is the set of ATN configurations reachable without traversing a terminal edge. Function adaptivePredict also constructs the set of final states F_DFA, which contains one final state fi for each production of A. The set of DFA states, Q_DFA, is the union of D0, F_DFA, and the error state D_error. Vocabulary Σ_DFA is the set of grammar terminals T. For unpredicated decisions with existing DFA, adaptivePredict calls SLLpredict to obtain a prediction from the DFA, possibly extending the DFA through ATN simulation in the process. Finally, since adaptivePredict is looking ahead, not parsing, it must undo any changes made to the input cursor, which it does by capturing the input index as start upon entry and rewinding to start before returning.

[5] SLL prediction does not incorporate predicates for clarity in this exposition, but in practice, ANTLR incorporates predicates into DFA accept states (Section B.2). ANTLR 3 DFA used predicated edges, not predicated accept states.

    Function 2: adaptivePredict(A, γ0) returns int alt
      start := input.index();  // checkpoint input
      if ∃ A → {πi}? αi then
        alt := LLpredict(A, start, γ0);
        input.seek(start);  // undo stream position changes
        return alt;
      if ∄ dfa_A then
        D0 := startState(A, #);
        F_DFA := {fi | fi := DFA_State(i), ∀ A → αi};
        Q_DFA := D0 ∪ F_DFA ∪ D_error;
        dfa_A := DFA(Q_DFA, Σ_DFA = T, Δ_DFA = ∅, D0, F_DFA);
      alt := SLLpredict(A, D0, start, γ0);
      input.seek(start);  // undo stream position changes
      return alt;

Function startState. To create DFA start state D0, startState (Function 3) adds configurations (p_A,i, i, γ) for each A → αi and A → {πi}? αi, if πi evaluates to true. When called from adaptivePredict, call stack argument γ is the special symbol # needed by SLL prediction, indicating "no parser stack information." When called from LLpredict, γ is parser stack γ0. Computing the closure of the configurations completes D0.

    Function 3: startState(A, γ) returns DFA State D0
      D0 := ∅;
      foreach p_A --ε--> p_A,i ∈ ATN do
        if p_A --ε--> p_A,i --πi--> p then π := πi else π := ε;
        if π = ε or eval(πi) then
          D0 += closure({}, D0, (p_A,i, i, γ));
      return D0;

Function SLLpredict. Function SLLpredict (Function 4) performs both DFA and SLL ATN simulation, incrementally adding paths to the DFA. In the best case, there is already a DFA path from D0 to an accept state, fi, for prefix u ⪯ w_r and some production number i. In the worst case, ATN simulation is required for all a in sequence u. The main loop of SLLpredict finds an existing edge emanating from DFA state cursor D upon a or computes a new one via target. It is possible that target will compute a target state D′ that already exists in the DFA, in which case function target returns that D′ because it may already have outgoing edges computed; it is inefficient to discard work by replacing D′. At the next iteration, SLLpredict will consider edges from D′, effectively switching back to DFA simulation.

    Function 4: SLLpredict(A, D0, start, γ0) returns int prod
      a := input.curr(); D := D0;
      while true do
        let D′ be DFA target D --a--> D′;
        if ∄ D′ then D′ := target(D, a);
        if D′ = D_error then parse error;
        if D′ stack sensitive then
          input.seek(start); return LLpredict(A, start, γ0);
        if D′ = fi ∈ F_DFA then return i;
        D := D′; a := input.next();

Once SLLpredict acquires a target state, D′, it checks for errors, stack sensitivity, and completion. If target marked D′ as stack-sensitive, prediction requires full LL simulation and SLLpredict calls LLpredict. If D′ is accept state fi, as determined by target, SLLpredict returns i. In this case, all the configurations in D predicted the same production i; further analysis is unnecessary and the algorithm can stop. For any other D′, the algorithm sets D to D′, gets the next symbol, and repeats.

Function target. Using a combined move-closure operation, target discovers the set of ATN configurations reachable from D upon a single terminal symbol a ∈ T. Function move computes the configurations reachable directly upon a by traversing a terminal edge:

    move(D, a) = {(q, i, Γ) | p --a--> q, (p, i, Γ) ∈ D}

Those configurations and their closure form D′. If D′ is empty, no alternative is viable because none can match a from the current state, so target returns error state D_error. If all configurations in D′ predict the same production number i, target adds edge D --a--> fi and returns accept state fi. If D′ has conflicting configurations, target marks D′ as stack-sensitive. The conflict could be an ambiguity or a weakness stemming from SLL's lack of parser stack information. (Conflicts, along with getConflictSetsPerLoc and getProdSetsPerState, are described in Section 5.3.)
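The move half of target is a simple set transformation. Here is a minimal Java sketch using the hypothetical ATNConfig/ATNState types sketched in Section 3, where targetOnTerminal is an assumed helper returning the target of the terminal edge p --a--> q, if any:

    // Our sketch of move(D, a): follow terminal edges only; stacks are unchanged.
    java.util.Set<ATNConfig> move(java.util.Set<ATNConfig> d, int a) {
        java.util.Set<ATNConfig> reached = new java.util.HashSet<>();
        for (ATNConfig c : d) {
            ATNState q = c.state.targetOnTerminal(a); // p --a--> q if such an edge exists
            if (q != null) reached.add(new ATNConfig(q, c.alt, c.stack));
        }
        return reached;
    }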

The function finishes by adding state D′, if an equivalent state is not already in the DFA, and adding edge D --a--> D′.

    Function 5: target(D, a) returns DFA State D′
      mv := move(D, a);
      D′ := ∪_{c ∈ mv} closure({}, c);
      if D′ = ∅ then Δ_DFA += D --a--> D_error; return D_error;
      if {j | (−, j, −) ∈ D′} = {i} then
        Δ_DFA += D --a--> fi; return fi;  // predict production i
      // look for a conflict among configurations of D′
      a_conflict := ∃ alts ∈ getConflictSetsPerLoc(D′) : |alts| > 1;
      viablealt := ∃ alts ∈ getProdSetsPerState(D′) : |alts| = 1;
      if a_conflict and not viablealt then mark D′ as stack sensitive;
      if D′ = D″ ∈ Q_DFA then D′ := D″; else Q_DFA += D′;
      Δ_DFA += D --a--> D′;
      return D′;

Function LLpredict. Upon SLL simulation conflict, SLLpredict rewinds the input and calls LLpredict (Function 6) to get a prediction based upon LL ATN simulation, which considers the full parser stack γ0. Function LLpredict is similar to SLLpredict. It uses DFA state D as a cursor and state D′ as the transition target for consistency, but does not update A's DFA, so SLL prediction can continue to use the DFA. LL prediction continues until either D′ = ∅, D′ uniquely predicts an alternative, or D′ has a conflict. If D′ from LL simulation has a conflict as SLL did, the algorithm reports the ambiguous phrase (input from start to the current index) and resolves to the minimum production number among the conflicting configurations. (Section 5.3 explains ambiguity detection.) Otherwise, cursor D moves to D′ and considers the next input symbol.

    Function 6: LLpredict(A, start, γ0) returns int alt
      D := D0 := startState(A, γ0);
      while true do
        mv := move(D, input.curr());
        D′ := ∪_{c ∈ mv} closure({}, c);
        if D′ = ∅ then parse error;
        if {j | (−, j, −) ∈ D′} = {i} then return i;
        /* If all (p, Γ) pairs predict > 1 alt and all such
           production sets are the same, the input is ambiguous. */
        altsets := getConflictSetsPerLoc(D′);
        if ∀ x, y ∈ altsets : x = y and |x| > 1 then
          x := any set in altsets;
          report ambiguous alts x at start..input.index();
          return min(x);
        D := D′; input.advance();

Function closure. The closure operation (Function 7) chases through all ε edges reachable from p, the ATN state projected from configuration parameter c, and also simulates the call and return of submachines. Function closure treats μ and π edges as ε edges because mutators should not be executed during prediction and predicates are only evaluated during start state computation. From parameter c = (p, i, Γ) and edge p --ε--> q, closure adds (q, i, Γ) to local working set C. For submachine call edge p --B--> q, closure adds the closure of (p_B, i, qΓ). Returning from a submachine stop state p′_B adds the closure of configuration (q, i, Γ), in which case c would have been of the form (p′_B, i, qΓ). In general, a configuration stack Γ is a graph representing multiple individual stacks. Function closure must simulate a return from each of the Γ stack tops. The algorithm uses notation qΓ′ ∈ Γ to represent all stack tops q of Γ. To avoid non-termination due to SLL right recursion and ε edges in subrules such as ()+, closure uses a busy set shared among all closure operations used to compute the same D′.

When closure reaches stop state p′_A for decision entry rule A, LL and SLL predictions behave differently. LL prediction pops from the parser call stack γ0 and "returns" to the state that invoked A's submachine. SLL prediction, on the other hand, has no access to the parser call stack and must consider all possible A invocation sites. Function closure finds Γ = # (and p′_B = p′_A) in this situation because startState will have set the initial stack to # rather than γ0. The return behavior at the decision entry rule is what differentiates SLL from LL parsing.

    Function 7: closure(busy, c = (p, i, Γ)) returns set C
      if c ∈ busy then return ∅; else busy += c;
      C := {c};
      if p = p′_B (i.e., p is any stop state, including p′_A) then
        if Γ = # (i.e., stack is SLL wildcard) then
          C += closure(busy, (p2, i, #)) ∀ p2 : p1 --B--> p2 ∈ ATN;  // all call sites
        else  // nonempty SLL or LL stack
          for qΓ′ ∈ Γ (i.e., each stack top q in graph Γ) do
            C += closure(busy, (q, i, Γ′));  // "return" to q
        return C;
      foreach edge, where p --edge--> q, do
        switch edge do
          case B: C += closure(busy, (p_B, i, qΓ));
          case π, μ, ε: C += closure(busy, (q, i, Γ));
      return C;

5.3 Conflict and ambiguity detection

The notion of conflicting configurations is central to ALL(*) analysis. Conflicts trigger failover to full LL prediction during SLL prediction and signal an ambiguity during LL prediction. A sufficient condition for a conflict between configurations is when they differ only in the predicted alternative: (p, i, Γ) and (p, j, Γ). Detecting conflicts is aided by two functions.

The first, getConflictSetsPerLoc (Function 8), collects the sets of production numbers associated with all (p, −, Γ) configurations. If a ⟨p, Γ⟩ pair predicts more than a single production, a conflict exists. Here is a sample configuration set and the associated set of conflict sets:

    {(p, 1, Γ), (p, 2, Γ), (p, 3, Γ), (p, 1, Γ′), (p, 2, Γ′), (r, 2, Γ″)}
      yields conflict sets {1, 2, 3}, {1, 2}, {2}

These conflict sets indicate that location ⟨p, Γ⟩ is reachable from productions {1, 2, 3}, location ⟨p, Γ′⟩ is reachable from productions {1, 2}, and location ⟨r, Γ″⟩ is reachable from production {2}.

    Function 8: getConflictSetsPerLoc(D) returns set of sets
      // for each ⟨p, Γ⟩, get the set of alts i from (p, −, Γ) ∈ D configs
      s := ∅;
      for (p, −, Γ) ∈ D do
        prds := {i | (p, i, Γ) ∈ D};
        s := s ∪ {prds};
      return s;
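A direct Java rendering of Function 8, under the same hypothetical ATNConfig type sketched earlier: group a DFA state's configurations by location (p, Γ) and collect the production numbers seen at each location.

    // Our sketch: any returned set with size() > 1 is a conflict at its location.
    java.util.Collection<java.util.Set<Integer>> getConflictSetsPerLoc(java.util.Set<ATNConfig> d) {
        java.util.Map<java.util.List<Object>, java.util.Set<Integer>> byLoc = new java.util.HashMap<>();
        for (ATNConfig c : d) {
            byLoc.computeIfAbsent(java.util.List.of(c.state, c.stack), k -> new java.util.HashSet<>())
                 .add(c.alt); // alts predicted at location (p, Gamma)
        }
        return byLoc.values();
    }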

The second function, getProdSetsPerState (Function 9), is similar but collects the production numbers associated with just ATN state p. For the same configuration set, getProdSetsPerState computes the sets {1, 2, 3} (for state p) and {2} (for state r).

    Function 9: getProdSetsPerState(D) returns set of sets
      // for each p, get the set of alts i from (p, −, −) ∈ D configs
      s := ∅;
      for (p, −, −) ∈ D do
        prds := {i | (p, i, −) ∈ D};
        s := s ∪ {prds};
      return s;

A sufficient condition for failing over to LL prediction (LLpredict) from SLL would be when there is at least one set of conflicting configurations: getConflictSetsPerLoc returns at least one set with more than a single production number; e.g., configurations (p, i, Γ) and (p, j, Γ) exist in parameter D. However, our goal is to continue using SLL prediction as long as possible because SLL prediction updates the lookahead DFA cache. To that end, SLL prediction continues if there is at least one nonconflicting configuration (when getProdSetsPerState returns at least one set of size 1). The hope is that more lookahead will lead to a configuration set that predicts a unique production via that nonconflicting configuration. For example, the decision for S → a | a | ab is ambiguous upon a between productions 1 and 2 but is unambiguous upon ab. (Location p is the ATN state between a and b in the third production.) After matching input a, the configuration set would be {(p′_S, 1, []), (p′_S, 2, []), (p, 3, [])}. Function getConflictSetsPerLoc returns {{1, 2}, {3}}. The next move-closure upon b leads to nonconflicting configuration set {(p′_S, 3, [])} from (p, 3, []), bypassing the conflict. If all sets returned from getConflictSetsPerLoc predict more than one alternative, no amount of lookahead will lead to a unique prediction; analysis must try again with call stack γ0 via LLpredict.

Conflicts during LL simulation are ambiguities and occur when each conflict set from getConflictSetsPerLoc contains more than one production; that is, every location in D is reachable from more than one production. Once multiple subparsers reach the same (p, −, Γ), all future simulation derived from (p, −, Γ) will behave identically. More lookahead will not resolve the ambiguity. Prediction could terminate at this point and report lookahead prefix u as ambiguous, but LLpredict continues until it is sure for which productions u is ambiguous. Consider conflict sets {1, 2, 3} and {2, 3}. Because both have degree greater than one, the sets represent an ambiguity, but additional input will identify whether u ⪯ w_r is ambiguous upon {1, 2, 3} or {2, 3}. Function LLpredict continues until all conflict sets that identify ambiguities are equal; the condition x = y and |x| > 1 ∀ x, y ∈ altsets embodies this test.

To detect conflicts, the algorithm compares graph-structured stacks frequently. Technically, a conflict occurs when configurations (p, i, Γ) and (p, j, Γ′) occur in the same configuration set with i ≠ j and at least one stack trace γ in common to both Γ and Γ′. Because checking for graph intersection is expensive, the algorithm uses equality, Γ = Γ′, as a heuristic. Equality is much faster because of the shared subgraphs. The graph equality algorithm can often check node identity to compare two entire subgraphs. In the worst case, the equality-versus-subset heuristic delays conflict detection until the GSS between conflicting configurations are simple linear stacks, where graph intersection is the same as graph equality. The cost of this heuristic is deeper lookahead.

5.4 Sample DFA construction

To illustrate algorithm behavior, consider inputs bc and bd for the grammar and ATN in Figure 8. ATN simulation for decision S launches subparsers at left edge nodes p_S,1 and p_S,2 with initial D0 configurations (p_S,1, 1, []) and (p_S,2, 2, []). Function closure adds three more configurations to D0 as it "calls" A with "return" nodes p1 and p3. Here is the DFA resulting from ATN simulation upon bc and then bd (in each target state, move contributes the configurations reached over the terminal edge and closure adds the rest):

    D0 = { (p_S,1, 1, []), (p_A, 1, p1), (p_A,1, 1, p1), (p_A,2, 1, p1),
           (p_S,2, 2, []), (p_A, 2, p3), (p_A,1, 2, p3), (p_A,2, 2, p3) }

    D0 --b--> D′ = { (p7, 1, p1), (p′_A, 1, p1), (p1, 1, []),
                     (p7, 2, p3), (p′_A, 2, p3), (p3, 2, []) }

    D′ --c--> f1 = { (p2, 1, []), (p′_S, 1, []) }
    D′ --d--> f2 = { (p4, 2, []), (p′_S, 2, []) }

After bc prediction, the DFA has states D0, D′, and f1. From DFA state D′, closure reaches the end of A and pops from the stacks, returning to ATN states in S. State f1 uniquely predicts production number 1. State f2 is created and connected to the DFA during prediction of the second phrase, bd. Function adaptivePredict first uses DFA simulation to get to D′ from D0 upon b.

6. Theoretical results

This section identifies the key ALL(*) theorems and shows parser time complexity. See Appendix A for detailed proofs.

Theorem 6.1. (Correctness). The ALL(*) parser for non-left-recursive G recognizes sentence w iff w ∈ L(G).

Theorem 6.2. ALL(*) languages are closed under union.

Theorem 6.3. ALL(*) parsing of n symbols has O(n⁴) time.

Theorem 6.4. A GSS has O(n) nodes for n input symbols.

Theorem 6.5. Two-stage parsing for non-left-recursive G recognizes sentence w iff w ∈ L(G).

7. Empirical results

We performed experiments to compare the performance of ALL(*) Java parsers with other strategies, to examine ALL(*) throughput for a variety of other languages, to highlight the effect of the lookahead DFA cache on parsing speed, and to provide evidence of linear ALL(*) performance in practice.

7.1 Comparing ALL(*)'s speed to other parsers

Our first experiment compared Java parsing speed across 10 tools and 8 parsing strategies: hand-tuned recursive-descent with precedence parsing, LL(k), LL(*), PEG, LALR(1), ALL(*), GLR, and GLL. Figure 9 shows the time for each tool to parse the 12,920 source files of the Java 6 library and compiler.

[Figure 9 (bar chart; per-tool times not reproduced here). Comparing Java parse times for 10 tools and 8 strategies on Java 6 library and compiler source code (smaller is faster): 12,920 files, 3.6M lines, size 123M. Tool descriptors comprise "tool name version [strategy]": ANTLR4 4.1 [ALL(*)]; Javac 7 [handbuilt recursive-descent and precedence parser for expressions]; JavaCC 5.0 [LL(k)]; Elkhound 2005.08.22b [GLR] (tested on Windows); ANTLR3 3.5 [LL(*)]; Rats! 2.3.1 [PEG]; SableCC 3.7 [LALR(1)]; DParser 1.2 [GLR]; JSGLR (from Spoofax) 1.1 [GLR]; Rascal 0.6.1 [GLL]. Tests build trees when marked with †. Tests run 10x with generous memory; average/stddev computed on the last 8 runs to avoid JIT cost. Error bars are negligible but show some variability due to garbage collection. To avoid a log scale, a separate graph shows the GLR and GLL parse times.]

We chose Java because it was the most commonly available grammar among tools and sample Java source is plentiful. The Java grammars used in this experiment came directly from the associated tool except for DParser and Elkhound, which did not offer suitable Java grammars. We ported ANTLR's Java grammar to the meta-syntax of those tools using unambiguous arithmetic expression rules. We also embedded merge actions in the Elkhound grammar to disambiguate during the parse to mimic ANTLR's ambiguity resolution.

All input files were loaded into RAM before parsing and times reflect the average time measured over 10 complete corpus passes, skipping the first two to ensure JIT compiler warm-up. For ALL(*), we used the two-stage parse from Section 3.2. The test machine was a 6-core 3.33GHz 16G RAM Mac OS X 10.7 desktop running the Java 7 virtual machine. Elkhound and DParser parsers were implemented in C/C++, which does not have a garbage collector running concurrently. Elkhound was last updated in 2005 and no longer builds on Linux or OS X, but we were able to build it on Windows 7 (4-core 2.67GHz 24G RAM). Elkhound also can only read from a file, so Elkhound parse times are not comparable. In an effort to control for machine speed differences and RAM vs SSD, we computed the time ratio of our Java test rig on our OS X machine reading from RAM to the test rig running on Windows pulling from SSD. Our reported Elkhound times are the Windows times multiplied by that OS X to Windows ratio.

For this experiment, ALL(*) outperforms the other parser generators and is only about 20% slower than the handbuilt parser in the Java compiler itself. When comparing runs with tree construction (marked with † in Figure 9), ANTLR 4 is about 4.4x faster than Elkhound, the fastest GLR tool we tested, and 135x faster than GLL (Rascal). ANTLR 4's nondeterministic ALL(*) parser was slightly faster than JavaCC's deterministic LL(k) parser and about 2x faster than Rats!'s PEG. In a separate test, we found that ALL(*) outperforms Rats! on its own PEG grammar converted to ANTLR syntax (8.77s vs 12.16s). The LALR(1) parser did not perform well against the LL tools, but that could be SableCC's implementation rather than a deficiency of LALR(1). (The Java grammar from JavaCUP, another LALR(1) tool, was incomplete and unable to parse the corpus.) When reparsing the corpus, ALL(*) lookahead gets cache hits at each decision and parsing is 30% faster at 3.73s. When reparsing with tree construction (time not shown), ALL(*) outperforms handbuilt Javac (4.4s vs 4.73s). Reparsing speed matters to tools like development environments.

The GLR parsers we tested are up to two orders of magnitude slower at Java parsing than ALL(*). Of the GLR tools, Elkhound has the best performance, primarily because it relies on a linear LR(1) stack instead of a GSS whenever possible. Further, we allowed Elkhound to disambiguate during the parse like ALL(*). Elkhound uses a separate lexer, unlike JSGLR and DParser, which are scannerless. A possible explanation for the observed performance difference with ALL(*) is that the Java grammar we ported to Elkhound and DParser is biased towards ALL(*), but this objection is not well-founded: GLR should also benefit from highly-deterministic and unambiguous grammars. GLL has the slowest speed here, perhaps because Rascal's team ported SDF's GLR Java grammar, which is not optimized for GLL (grammar variations can affect performance). Rascal is also scannerless and is currently the only available GLL tool.
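
The two-stage parse used for ALL(*) above is available to any ANTLR 4 user through the runtime API. The following is the standard idiom (a sketch of what our test rig does, not code from the paper); JavaParser and compilationUnit stand in for any generated parser and start rule.

    import org.antlr.v4.runtime.*;
    import org.antlr.v4.runtime.atn.PredictionMode;
    import org.antlr.v4.runtime.misc.ParseCancellationException;
    import org.antlr.v4.runtime.tree.ParseTree;

    final class TwoStage {
        static ParseTree parse(JavaParser parser, CommonTokenStream tokens) {
            // Stage 1: optimistic SLL prediction; bail out instead of recovering.
            parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
            parser.setErrorHandler(new BailErrorStrategy());
            try {
                return parser.compilationUnit();
            } catch (ParseCancellationException e) {
                // Stage 2 (rare): rewind and reparse with full LL prediction.
                tokens.seek(0);
                parser.reset();
                parser.setErrorHandler(new DefaultErrorStrategy());
                parser.getInterpreter().setPredictionMode(PredictionMode.LL);
                return parser.compilationUnit();
            }
        }
    }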

The biggest issue with general algorithms is that they are highly unpredictable in time and space, which can make them unsuitable for some commercial applications. Figure 10 summarizes the performance of the same tools against a single 3.2M Java file. Elkhound took 7.65s to parse the 123M Java corpus, but took 3.35 minutes to parse the 3.2M Java file. It crashed (out of memory) with parse forest construction on. DParser's time jumped from a corpus time of 98s to 10.5 hours on the 3.2M file. Rascal and JSGLR scale reasonably well to the 3.2M file, but use 2.6G and 1G RAM, respectively. In contrast, ALL(*) parses the 3.2M file in 360ms with tree construction using 8M. ANTLR 3 is fast but is slower and uses more memory (due to backtracking memoization) than ANTLR 4.

Figure 10. Time and space to parse and optionally build trees for a 3.2M Java file. Space is the median reported after GC during the parse using the -XX:+PrintGC option (process monitoring for C++). Times include lexing; all input preloaded. † building trees. (a) disambiguating during the parse, no trees, estimated time.

  Tool              Time         RAM (M)
  Javac†            89 ms        7
  ANTLR4            201 ms       8
  JavaCC            206 ms       7
  ANTLR4†           360 ms       8
  ANTLR3            1,048 ms     143
  SableCC†          1,174 ms     201
  Rats!†            1,365 ms     22
  JSGLR†            15.4 sec     1,030
  Rascal†           24.9 sec     2,622
  (no DFA) ANTLR4   42.5 sec     27
  elkhound(a)       3.35 min     3
  DParser†          10.5 hours   100+
  elkhound†         out of mem   5,390+

7.2 ALL(*) performance across languages

Figure 11 gives the bytes-per-second throughput of ALL(*) parsers for 8 languages, including Java for comparison. The number of test files and file sizes vary greatly (according to the input we could reasonably collect); smaller files yield higher parse-time variance.

• C: Derived from the C11 specification; has no indirect left-recursion; altered stack-sensitive rule to render SLL (see text below). 813 preprocessed files, 159.8M source from the postgres database.
• Verilog2001: Derived from the Verilog 2001 spec; removed indirect left-recursion. 385 files, 659k from [3] and the web.
• JSON: Derived from the spec. 4 files, 331k from twitter.
• DOT: Derived from the spec. 48 files, 19.5M collected from the web.
• Lua: Derived from the Lua 5.2 spec. 751 files, 123k from github.
• XML: Derived from the spec. 1 file, 117M from an XML benchmark.
• Erlang: Derived from an LALR(1) grammar. 500 preprocessed files, 8M.

Some of these grammars yield reasonable but much slower parse times compared to Java and XML, but demonstrate that programmers can convert a language specification to ANTLR's meta-syntax and get a working grammar without major modifications. (In our experience, grammar specifications are rarely tuned to a particular tool or parsing strategy and are often ambiguous.) Later, programmers can use ANTLR's profiling and diagnostics to improve performance, as with any programming task. For example, the C11 specification grammar is LL not SLL because of rule declarationSpecifiers, which we altered to be SLL in our C grammar (getting a 7x speed boost).

7.3 Effect of lookahead DFA on performance

The lookahead DFA cache is critical to ALL(*) performance. To demonstrate the cache's effect on parsing speed, we disabled the DFA and repeated our Java experiments. Consider the 3.73s parse time from Figure 9 to reparse the Java corpus with pure cache hits. With the lookahead DFA cache disabled completely, the parser took 12 minutes (717.6s). Figure 10 shows that disabling the cache increases parse time from 203ms to 42.5s on the 3.2M file. This performance is in line with the high cost of GLL and GLR parsers that also do not reduce parser speculation by memoizing parsing decisions. As an intermediate value, clearing the DFA cache before parsing each corpus file yields a total time of 34s instead of 12 minutes. This isolates cache use to a single file and demonstrates that cache warm-up occurs quickly even within a single file.

DFA size increases linearly as the parser encounters new lookahead phrases. Figure 12 shows the growth in the number of DFA states as the (slowest four) parsers from Figure 11 encounter new files. Languages like C that have constructs with common left-prefixes require deep lookahead in LL parsers to distinguish phrases; e.g., struct {...} x; and struct {...} f(); share a large left-prefix. In contrast, the Verilog2001 parser uses very few DFA states (but runs slower due to a non-SLL rule). Similarly, after seeing the entire 123M Java corpus, the Java parser uses just 31,626 DFA states, adding an average of ~2.5 states per file parsed. DFA size does, however, continue to grow as the parser encounters unfamiliar input. Programmers can clear the cache and ALL(*) will adapt to subsequent input.

7.4 Empirical parse-time complexity

Given the wide range of throughput in Figure 11, one could suspect nonlinear behavior for the slower parsers. To investigate, we plotted parse time versus file size in Figure 13 and drew least-squares regression and LOWESS [6] data-fitting curves. LOWESS curves are parametrically unconstrained (not required to be a line or any particular polynomial) and they virtually mirror each regression line, providing strong evidence that the relationship between parse time and input size is linear.

The same methodology shows that the parser generated from the non-SLL grammar (not shown) taken from the C11 specification is also linear, despite being much slower than our SLL version.

We have yet to see nonlinear behavior in practice, but the theoretical worst-case behavior of ALL(*) parsing is O(n⁴). Experimental parse-time data for the following contrived worst-case grammar exhibits quartic behavior for input a, aa, aaa, ..., aⁿ (with n ≤ 120 symbols, the most we could test in a reasonable amount of time): S → A$, A → aAA | aA | a. The generated parser requires a prediction upon each input symbol and each prediction must examine all remaining input. The closure operation performed for each input symbol must examine the entire depth of the GSS, which could be size n. Finally, merging two GSS can take O(n) in our implementation, yielding O(n⁴) complexity.

From these experiments, we conclude that shifting grammar analysis to parse-time to get ALL(*) strength is not only practical but yields extremely efficient parsers, competing with the hand-tuned recursive-descent parser of the Java compiler. Memoizing analysis results with DFA is critical to such performance. Despite O(n⁴) theoretical complexity, ALL(*) appears to be linear in practice and does not exhibit the unpredictable performance or large memory footprint of the general algorithms.

Figure 11. Throughput in KByte/s (lexing + parsing; all input preloaded in RAM):

  Grammar       KB/sec
  XML           45,993
  Java          24,972
  JSON          17,696
  DOT           16,152
  Lua            5,698
  C              4,238
  Verilog2001    1,994
  Erlang           751

[Figure 12 (plot; data not reproduced). DFA growth rate: number of DFA states vs number of files parsed for the Erlang, C, Lua, and Verilog2001 parsers; files parsed in disk order.]

[Figure 13 (plots; data not reproduced). Linear parse time (ms) vs file size (KB) for 740 C files, 385 Verilog2001 files, 500 Erlang files, and 642 Lua files. Linear regression (dashed line) and LOWESS unconstrained curve coincide, giving strong evidence of linearity. Curves computed on the bottom 99% of file sizes, zooming in for detail on the bottom 40% of parse times.]

8. Related work

For decades, researchers have worked towards increasing the recognition strength of efficient but non-general LL and LR parsers and increasing the efficiency of general algorithms such as Earley's O(n³) algorithm [8]. Parr [21] and Charles [4] statically generated LL(k) and LR(k) parsers for k > 1. Parr and Fisher's LL(*) [20] and Bermudez and Schimpf's LAR(m) [2] statically computed LL and LR parsers augmented with cyclic DFA that could examine arbitrary amounts of lookahead. These parsers were based upon LL-regular [12] and LR-regular [7] parsers, which have the undesirable property of being undecidable in general. Introducing backtracking dramatically increases recognition strength and avoids static grammar analysis undecidability issues, but it is undesirable because it has embedded mutator issues, reduces performance, and complicates single-step debugging. Packrat parsers (PEGs) [9] try decision productions in order and pick the first that succeeds. PEGs are O(n) because they memoize partial parsing results, but they suffer from the A → a | ab quirk where ab is silently unmatchable.

To improve general parsing performance, Tomita [26] introduced GLR, a general algorithm based upon LR(k) that conceptually forks subparsers at each conflicting LR(k) state at parse-time to explore all possible paths. Tomita shows GLR to be 5x-10x faster than Earley. A key component of GLR parsing is the graph-structured stack (GSS) [26] that prevents parsing the same input twice in the same way. (GLR pushes input symbols and LR states on the GSS whereas ALL(*) pushes ATN states.) Elkhound [18] introduced hybrid GLR parsers that use a single stack for all LR(1) decisions and a GSS when necessary to match ambiguous portions of the input. (We found Elkhound's parsers to be faster than those of other GLR tools.) GLL [25] is the LL analog of GLR and also uses subparsers and a GSS to explore all possible paths; GLL uses k = 1 lookahead where possible for efficiency. GLL is O(n³) and GLR is O(n^(p+1)) where p is the length of the longest grammar production.

Earley parsers scale gracefully from O(n) for deterministic grammars to O(n³) in the worst case for ambiguous grammars, but performance is not good enough for general use. LR(k) state machines can improve the performance of such parsers by statically computing as much as possible. LRE [16] is one such example. Despite these optimizations, general algorithms are still very slow compared to deterministic parsers augmented with deep lookahead.

The problem with arbitrary lookahead is that it is impossible to compute statically for many useful grammars (the LL-regular condition is undecidable). By shifting lookahead analysis to parse-time, ALL(*) gains the power to handle any grammar without left recursion because it can launch subparsers to determine which path leads to a valid parse. Unlike GLR, speculation stops when all remaining subparsers are associated with a single alternative production, thus computing the minimum lookahead sequence.

To get performance, ALL(*) records a mapping from that lookahead sequence to the predicted production using a DFA for use by subsequent decisions. The context-free language subsets encountered during a parse are finite and, therefore, ALL(*) lookahead languages are regular. Ancona et al. [1] also performed parse-time analysis, but they only computed fixed LR(k) lookahead and did not adapt to the actual input as ALL(*) does. Perlin [23] operated on an RTN like ALL(*) and computed k = 1 lookahead during the parse.

ALL(*) is similar to Earley in that both are top-down and operate on a representation of the grammar at parse-time, but Earley is parsing, not computing lookahead DFA. In that sense, Earley is not performing grammar analysis. Earley also does not manage an explicit GSS during the parse. Instead, items in Earley states have "parent pointers" that refer to other states that, when threaded together, form a GSS. Earley's SCANNER operation corresponds to ALL(*)'s move function. The PREDICTOR and COMPLETER operations correspond to the push and pop operations in ALL(*)'s closure function. An Earley state is the set of all parser configurations reachable at some absolute input depth, whereas an ALL(*) DFA state is a set of configurations reachable from a lookahead depth relative to the current decision. Unlike the completely general algorithms, ALL(*) seeks a single parse of the input, which allows the use of an efficient LL stack during the parse.

Parsing strategies that continuously speculate or support ambiguity have difficulty with mutators because they are hard to undo. A lack of mutators reduces the generality of semantic predicates that alter the parse, as they cannot test arbitrary state computed previously during the parse. Rats! [10] supports restricted semantic predicates and Yakker [13] supports semantic predicates that are functions of previously-parsed terminals. Because ALL(*) does not speculate during the actual parse, it supports arbitrary mutators and semantic predicates. Space considerations preclude a more detailed discussion of related work here; a more detailed analysis can be found in reference [20].

9. Conclusion

ANTLR 4 generates an ALL(*) parser for any CFG without indirect or hidden left-recursion. ALL(*) combines the simplicity, efficiency, and predictability of conventional top-down LL(k) parsers with the power of a GLR-like mechanism to make parsing decisions. The critical innovation is to shift grammar analysis to parse-time, caching analysis results in lookahead DFA for efficiency. Experiments show ALL(*) outperforms general (Java) parsers by orders of magnitude, exhibiting linear time and space behavior for 8 languages. The speed of the ALL(*) Java parser is within 20% of the Java compiler's hand-tuned recursive-descent parser. In theory, ALL(*) is O(n⁴), in line with the low polynomial bound of GLR. ANTLR is widely used in practice, indicating that ALL(*) provides practical parsing power without sacrificing the flexibility and simplicity of recursive-descent parsers desired by programmers.

10. Acknowledgments

We thank Elizabeth Scott, Adrian Johnstone, and Mark Johnson for discussions on parsing complexity. Eelco Visser and Jurgen Vinju provided parser test code from JSGLR and Rascal. Etienne Gagnon generated SableCC parsers.

References

[1] Ancona, M., Dodero, G., Gianuzzi, V., and Morgavi, M. Efficient construction of LR(k) states and tables. ACM Trans. Program. Lang. Syst. 13, 1 (Jan. 1991), 150–178.
[2] Bermudez, M. E., and Schimpf, K. M. Practical arbitrary lookahead LR parsing. Journal of Computer and System Sciences 41, 2 (1990).
[3] Brown, S., and Vranesic, Z. Fundamentals of Digital Logic with Verilog Design. McGraw-Hill Series in ECE, 2003.
[4] Charles, P. A Practical Method for Constructing Efficient LALR(k) Parsers with Automatic Error Recovery. PhD thesis, New York University, New York, NY, USA, 1991.
[5] Clarke, K. The top-down parsing of expressions. Unpublished technical report, Dept. of Computer Science and Statistics, Queen Mary College, London, June 1986.
[6] Cleveland, W. S. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74 (1979), 829–836.
[7] Cohen, R., and Culik, K. LR-regular grammars: an extension of LR(k) grammars. In SWAT '71 (Washington, DC, USA, 1971), IEEE Computer Society, pp. 153–165.
[8] Earley, J. An efficient context-free parsing algorithm. Communications of the ACM 13, 2 (1970), 94–102.
[9] Ford, B. Parsing expression grammars: A recognition-based syntactic foundation. In POPL (2004), ACM Press.
[10] Grimm, R. Better extensibility through modular syntax. In PLDI (2006), ACM Press, pp. 38–51.
[11] Hopcroft, J., and Ullman, J. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Massachusetts, 1979.
[12] Jarzabek, S., and Krawczyk, T. LL-regular grammars. Information Processing Letters 4, 2 (1975), 31–37.
[13] Jim, T., Mandelbaum, Y., and Walker, D. Semantics and algorithms for data-dependent grammars. In POPL (2010).
[14] Johnson, M. The computational complexity of GLR parsing. In Generalized LR Parsing, M. Tomita, Ed. Kluwer, 1991.
[15] Kipps, J. Generalized LR Parsing. Springer, 1991, pp. 43–59.
[16] McLean, P., and Horspool, R. N. A faster Earley parser. In CC (1996), Springer, pp. 281–293.
[17] McPeak, S. Elkhound: A fast, practical GLR parser generator. Tech. rep., UC Berkeley (EECS), Dec. 2002.
[18] McPeak, S., and Necula, G. C. Elkhound: A fast, practical GLR parser generator. In CC (2004), pp. 73–88.
[19] Parr, T. The Definitive ANTLR 4 Reference. The Pragmatic Programmers, 2013.

[20] Parr, T., and Fisher, K. LL(*): The foundation of the ANTLR parser generator. In PLDI (2011), pp. 425–436.
[21] Parr, T. J. Obtaining practical variants of LL(k) and LR(k) for k > 1 by splitting the atomic k-tuple. PhD thesis, Purdue University, West Lafayette, IN, USA, 1993.
[22] Parr, T. J., and Quong, R. W. Adding semantic and syntactic predicates to LL(k): pred-LL(k). In CC (1994).
[23] Perlin, M. LR recursive transition networks for Earley and Tomita parsing. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (1991), ACL '91.
[24] Plevyak, J. DParser: GLR parser generator, Oct. 2013.
[25] Scott, E., and Johnstone, A. GLL parsing. Electron. Notes Theor. Comput. Sci. 253, 7 (Sept. 2010), 177–189.
[26] Tomita, M. Efficient Parsing for Natural Language. Kluwer Academic Publishers, 1986.
[27] Woods, W. A. Transition network grammars for natural language analysis. Comm. of the ACM 13, 10 (1970).

A. Correctness and complexity analysis

Theorem A.1. ALL(*) languages are closed under union.

Proof. Let predicated grammars G1 = (N1, T, P1, S1, Π1, M1) and G2 = (N2, T, P2, S2, Π2, M2) describe L(G1) and L(G2), respectively. For applicability to both parsers and scannerless parsers, assume that the terminal space T is the set of valid characters. Assume N1 ∩ N2 = ∅ by renaming nonterminals if necessary. Assume that the predicates and mutators of G1 and G2 operate in disjoint environments, S1 and S2. Construct

  G′ = (N1 ∪ N2, T, P1 ∪ P2, S′, Π1 ∪ Π2, M1 ∪ M2)

with S′ → S1 | S2. Then, L(G′) = L(G1) ∪ L(G2). ∎

Lemma A.1. The ALL(*) parser for non-left-recursive G with lookahead DFA deactivated recognizes sentence w iff w ∈ L(G).

Proof. The ATN for G recognizes w iff w ∈ L(G). Therefore, we can equivalently prove that ALL(*) is a faithful implementation of an ATN. Without lookahead DFA, prediction is a straightforward ATN simulator: a top-down parser that makes accurate parsing decisions using GLR-like subparsers that can examine the entire remaining input and ATN submachine call stack. ∎

Theorem A.2. (Correctness). The ALL(*) parser for non-left-recursive G recognizes sentence w iff w ∈ L(G).

Proof. Lemma A.1 shows that an ALL(*) parser correctly recognizes w without the DFA cache. The essence of the proof then is to show that ALL(*)'s adaptive lookahead DFA do not break the parse by giving different prediction decisions than straightforward ATN simulation. We only need to consider the case of unpredicated SLL parsing, as ALL(*) only caches decision results in this case.

if case: By induction on the state of the lookahead DFA for any given decision A. Base case. The first prediction for A begins with an empty DFA and must activate ATN simulation to choose alternative αi using prefix u of the remaining input wr. As ATN simulation yields proper predictions, the ALL(*) parser correctly predicts αi from a cold start and then records the mapping u : i in the DFA. If there is a single viable alternative, i is the associated production number. If ATN simulation finds multiple viable alternatives, i is the minimum production number associated with alternatives from that set.

Induction step. Assume the lookahead DFA correctly predicts productions for every prefix u of wr seen by the parser at A. We must show that, starting with an existing DFA, ALL(*) properly adds a path through the DFA for an unfamiliar prefix u of w′r. There are several cases:

1. u is a prefix of both w′r and wr for a previous wr. The lookahead DFA gives the correct answer for u by the induction assumption. The DFA is not updated.

2. w′r = bx and all previous wr = ay for some a ≠ b. This case reduces to the cold-start base case because there is no D0 --b--> D edge. ATN simulation predicts αi and adds the path for prefix u of w′r from D0 to fi.

3. w′r = vax and wr = vby for some previously seen wr with common prefix v and a ≠ b. DFA simulation reaches D from D0 for input v. D has an edge for b but not a. ATN simulation predicts αi and augments the DFA, starting with an edge on a from D and leading eventually to fi.

only if case: The ALL(*) parser reports an error for w ∉ L(G). Assume the opposite, that the parser successfully parses w. That would imply that there exists an ATN configuration derivation (S, pS, [], w) ⇒* (S, p′S, [], ε) for w through G's corresponding ATN. But that would require w ∈ L(G). Therefore the ALL(*) parser reports a syntax error for w. The accuracy of the ALL(*) lookahead cache is irrelevant because no path exists through the ATN or the parser. ∎

Lemma A.2. The set of viable productions for LL is always a subset of SLL's viable productions for a given decision A and remaining input string wr.

Proof. If the move-closure analysis operation does not reach stop state p′A for submachine A, SLL and LL behave identically and so they share the same set of viable productions. If closure reaches the stop state for the decision entry rule, p′A, there are configurations of the form (p′A, −, γ) where, for convenience, the usual GSS Γ is split into single stacks, γ. In LL prediction mode, γ = γ0, which is either a single stack or empty if A = S. In SLL mode, γ = #, signaling no stack information. Function closure must consider all possible γ0 parser stacks. Since any single stack must be contained within the set of all possible stacks, LL closure operations consider at most the same number of ATN paths as SLL. ∎

Lemma A.3. For w ∉ L(G) and non-left-recursive G, SLL reports a syntax error.

Proof. As in the only if case of Theorem 6.1, there is no valid ATN configuration derivation for w regardless of how adaptivePredict chooses productions. ∎

Theorem A.3. Two-stage parsing for non-left-recursive G recognizes sentence w iff w ∈ L(G).

Proof. By Lemma A.3, SLL and LL behave the same when w ∉ L(G). It remains to show that SLL prediction either behaves like LL for input w ∈ L(G) or reports a syntax error, signalling a need for the LL second stage. Let V and V′ be the sets of viable production numbers for A using SLL and LL, respectively. By Lemma A.2, V′ ⊆ V. There are two cases to consider:

1. If min(V) = min(V′), SLL and LL choose the same production. SLL succeeds for w. E.g., V = {1, 2, 3} and V′ = {1, 3}, or V = {1} and V′ = {1}.

2. If min(V) ≠ min(V′), then min(V) ∉ V′ because LL finds min(V) nonviable. SLL would report a syntax error. E.g., V = {1, 2, 3} and V′ = {2, 3}, or V = {1, 2} and V′ = {2}.

In all possible combinations of V and V′, SLL behaves like LL or reports a syntax error for w ∈ L(G). ∎

Theorem A.4. A GSS has O(n) nodes for n input symbols.

Proof. For nonterminals N and ATN states Q, there are |N| × |Q| p --A--> q ATN transitions if every grammar position invokes every nonterminal. That limits the number of new GSS nodes to |Q|² for a closure operation (which cannot transition terminal edges). ALL(*) performs n + 1 closures for n input symbols, giving |Q|²(n + 1) nodes, or O(n) as Q is not a function of the input. ∎

Lemma A.4. Examining a lookahead symbol has O(n²) time.

Proof. Lookahead is a move-closure operation that computes new target DFA state D′ as a function of the ATN configurations in D. There are |Q| × m ≈ |Q|² configurations of the form (p, i, −) ∈ D for |Q| ATN states and m alternative productions in the current decision. The cost of move is not a function of input size n. Closure of D computes closure(c) ∀ c ∈ D, and closure(c) can walk the entire GSS back to the root (the empty stack). That gives a cost of |Q|² configurations times |Q|²(n + 1) GSS nodes (per Theorem A.4), or O(n) add operations to build D′. Adding a configuration is dominated by the graph merge, which (in our implementation) is proportional to the depth of the graph. The total cost for move-closure is O(n²). ∎

Theorem A.5. ALL(*) parsing of n input symbols has O(n⁴) time.

Proof. Worst case, the parser must examine all remaining input symbols during prediction for each of the n input symbols, giving O(n²) lookahead operations. The cost of each lookahead operation is O(n²) by Lemma A.4, giving an overall parsing cost of O(n⁴). ∎

B. Pragmatics

This section describes some of the practical considerations associated with implementing the ALL(*) algorithm.

B.1 Reducing warm-up time

Many decisions in a grammar are LL(1) and they are easy to identify statically. Instead of always generating "switch on adaptivePredict" decisions in the recursive-descent parsers, ANTLR generates "switch on token type" decisions whenever possible. This LL(1) optimization does not affect the size of the generated parser but reduces the number of lookahead DFA that the parser must compute (see the sketch at the end of this subsection).

Originally, we anticipated "training" a parser on a large input corpus and then serializing the lookahead DFA to disk to avoid re-computing DFA for subsequent parser runs. As shown in Section 7, DFA construction is fast enough that serializing and deserializing the DFA is unnecessary.
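
To illustrate the two decision shapes, here is a simplified sketch of generated code; the token types, decision number, and production methods are hypothetical, and real ANTLR-generated code differs in detail.

    // Illustrative shapes of the two decision styles (not ANTLR's actual output).
    class DecisionShapes {
        static final int ID = 1, ENUM = 2;   // hypothetical token types

        interface TokenStream { int LA(int i); }

        // LL(1) decision identified statically: a plain switch on the next
        // token type; no lookahead DFA is ever built for this decision.
        void ll1Decision(TokenStream input) {
            switch (input.LA(1)) {
                case ID:   matchIdProduction();   break;
                case ENUM: matchEnumProduction(); break;
                default:   throw new RuntimeException("no viable alternative");
            }
        }

        // Non-LL(1) decision: ask adaptivePredict, which consults and extends
        // the shared lookahead DFA for this decision at parse-time.
        void adaptiveDecision(TokenStream input) {
            switch (adaptivePredict(input, 7)) { // 7 = hypothetical decision number
                case 1: production1(); break;
                case 2: production2(); break;
            }
        }

        void matchIdProduction() { }
        void matchEnumProduction() { }
        void production1() { }
        void production2() { }
        int adaptivePredict(TokenStream input, int decision) { return 1; } // stub
    }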

B.2 Semantic predicate evaluation

For clarity, the algorithm described in this paper uses pure ATN simulation for all decisions that have semantic predicates on production left edges. In practice, ANTLR uses lookahead DFA that track predicates in accept states to handle semantic-context-sensitive prediction. Tracking the predicates in the DFA allows prediction to avoid expensive ATN simulation if predicate evaluation during SLL simulation predicts a unique production. Semantic predicates are not common but are critical to solving some context-sensitive parsing problems; e.g., predicates are used internally by ANTLR to encode operator precedence when rewriting left-recursive rules. So it is worth the extra complexity to evaluate predicates during SLL prediction. Consider the predicated rule from Section 2.1:

  id : ID | {!enum_is_keyword}? 'enum' ;

The second production is viable only when !enum_is_keyword evaluates to true. In the abstract, that means the parser would need two lookahead DFA, one per semantic condition. Instead, ANTLR's ALL(*) implementation creates a DFA (via SLL prediction) with edge D0 --enum--> f2 where f2 is an augmented DFA accept state that tests !enum_is_keyword. Function adaptivePredict returns production 2 upon enum if !enum_is_keyword, else it throws a no-viable-alternative exception.

The algorithm described in this paper also does not support semantic predicates outside of the decision entry rule. In practice, ALL(*) analysis must evaluate all predicates reachable from the decision entry rule without stepping over a terminal edge in the ATN. For example, the simplified ALL(*) algorithm in this paper considers only predicates π1 and π2 for the productions of S in the following (ambiguous) grammar:

  S → {π1}? A b | {π2}? A b
  A → {π3}? a | {π4}? a

Input ab matches either alternative of S and, in practice, ANTLR evaluates "π1 and (π3 or π4)" to test the viability of S's first production, not just π1. After simulating S and A's ATN submachines, the lookahead DFA for S would be D0 --a--> D′ --b--> f1,2. Augmented accept state f1,2 predicts productions 1 or 2 depending on semantic contexts π1 ∧ (π3 ∨ π4) and π2 ∧ (π3 ∨ π4), respectively. To keep track of semantic context during SLL simulation, ANTLR ATN configurations contain an extra element π: (p, i, γ, π). Element π carries along semantic context, and ANTLR stores predicate-to-production pairs in the augmented DFA accept states.

B.3 Error reporting and recovery

ALL(*) prediction can scan arbitrarily far ahead, so erroneous lookahead sequences can be long. By default, ANTLR-generated parsers print the entire sequence. To recover, parsers consume tokens until a token appears that could follow the current rule. ANTLR provides hooks to override reporting and recovery strategies.

ANTLR parsers issue error messages for invalid input phrases and attempt to recover. For mismatched tokens, ANTLR attempts single-token insertion and deletion to resynchronize. If the remaining input is not consistent with any production of the current nonterminal, the parser consumes tokens until it finds a token that could reasonably follow the current nonterminal. Then the parser continues parsing as if the current nonterminal had succeeded. ANTLR improves error recovery over ANTLR 3 for EBNF subrules by inserting synchronization checks at the start and at the "loop" continuation test to avoid prematurely exiting the subrule. For example, consider the following class definition rule:

  classdef : 'class' ID '{' member+ '}' ;
  member : 'int' ID ';' ;

An extra semicolon in the member list such as int i;; int j; should not force surrounding rule classdef to abort. Instead, the parser ignores the extra semicolon and looks for another member. To reduce cascading error messages, the parser issues no further messages until it correctly matches a token.

B.4 Multi-threaded execution

Applications often require parallel execution of multiple parser instances, even for the same language. For example, web-based application servers parse multiple incoming XML or JSON data streams using multiple instances of the same parser. For memory efficiency, all ALL(*) parser instances for a given language must share lookahead DFA. The Java code that ANTLR generates uses a shared memory model and threads for concurrency, which means parsers must update shared DFA in a thread-safe manner. Multiple threads can be simulating the DFA while other threads are adding states and edges to it. Our goal is thread safety, but concurrency also provides a small speed-up for lookahead DFA construction (observed empirically).

The key to thread safety in Java while maintaining high throughput lies in avoiding excessive locking (synchronized blocks). There are only two data structures that require locking: Q, the set of DFA states, and ∆, the set of edges. Our implementation factors state addition, Q += D′, into an addDFAState function that waits on a lock for Q before testing a DFA state for membership or adding a state. This is not a bottleneck, as DFA simulation can proceed during DFA construction without locking since it traverses edges to visit existing DFA states without examining Q.

Adding DFA edges to an existing state requires fine-grained locking, but only on that specific DFA state, as our implementation maintains an edge array for each DFA state. We allow multiple readers but a single writer. A lock on testing the edges is unnecessary even if another thread is racing to set that edge. If edge D --a--> D′ exists, the simulation simply transitions to D′. If simulation does not find an existing edge, it launches ATN simulation starting from D to compute D′ and then sets element edge[a] for D. Two threads could find a missing edge on a and both launch ATN simulation, racing to add D --a--> D′. D′ would be the same in either case, so there is no hazard as long as that specific edge array is updated safely using synchronization. To encounter a contested lock, two or more ATN simulation threads must try to add an edge to the same DFA state.
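
A sketch of this locking discipline, under assumed types and ignoring Java memory-model publication subtleties that a real implementation must also handle; this is not ANTLR's actual code.

    // Sketch of the locking discipline described above (illustrative).
    import java.util.HashMap;
    import java.util.Map;

    class SharedDFA {
        static final int MAX_TOKEN_TYPES = 256; // assumed bound for the sketch

        static class DFAState {
            // One edge array per state; guarded by this state's own monitor.
            final DFAState[] edges = new DFAState[MAX_TOKEN_TYPES];
        }

        // Q: real states hash/compare their configuration sets; elided here.
        private final Map<DFAState, DFAState> states = new HashMap<>();

        // Q += D': membership test and insertion under a single lock on Q.
        synchronized DFAState addDFAState(DFAState d) {
            DFAState existing = states.get(d);
            if (existing != null) return existing; // another thread won the race
            states.put(d, d);
            return d;
        }

        // Readers race freely: a stale null just means "run ATN simulation";
        // both racers compute the same D', so the last write is harmless.
        DFAState getEdge(DFAState d, int a) {
            return d.edges[a];                     // unlocked read
        }

        void setEdge(DFAState d, int a, DFAState target) {
            synchronized (d) {                     // fine-grained: only this state
                d.edges[a] = target;
            }
        }
    }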

C. Left-recursion elimination

ANTLR supports directly left-recursive rules by rewriting them to a non-left-recursive version that also removes any ambiguities. For example, the natural grammar for describing arithmetic expression syntax is one of the most common (ambiguous) left-recursive rules. The following grammar supports simple modulo and additive expressions:

  E → E % E | E + E | id

E is directly left-recursive because at least one production begins with E (∃ E ⇒ Eα), which is a problem for top-down parsers. Grammars meant for top-down parsers must use a cumbersome non-left-recursive equivalent instead, with a separate nonterminal for each level of operator precedence:

  E′ → M (+ M)*   Additive, lower precedence
  M → P (% P)*    Modulo, higher precedence
  P → id          Primary (id means identifier)

The deeper the rule invocation, the higher the precedence. At parse-time, matching a single identifier, a, requires l rule invocations for l precedence levels.

E is easier to read than E′, but the left-recursive version is ambiguous, as there are two interpretations of input a+b+c: (a+b)+c and a+(b+c). Bottom-up parser generators such as bison use operator precedence specifiers (e.g., %left '%') to resolve such ambiguities. The non-left-recursive grammar E′ is unambiguous because it implicitly encodes precedence rules according to the nonterminal nesting depth.

Ideally, a parser generator would support left-recursion and provide a way to resolve ambiguities implicitly with the grammar itself, without resorting to external precedence specifiers. ANTLR does this by rewriting nonterminals with direct left-recursion and inserting semantic predicates to resolve ambiguities according to the order of productions. The rewriting process leads to generated parsers that mimic Clarke's [5] technique.

We chose to eliminate just direct left-recursion because general left-recursion elimination can result in transformed grammars orders of magnitude larger than the original [11] and yields parse trees only loosely related to those of the original grammar. ANTLR automatically constructs parse trees appropriate for the original left-recursive grammar, so the programmer is unaware of the internal restructuring. Direct left-recursion also covers the most common grammar cases (from long experience building grammars). This discussion focuses on grammars for arithmetic expressions, but the transformation rules work just as well for other left-recursive constructs such as C declarators: D → *D, D → D[], D → D(), D → id.

Eliminating direct left-recursion without concern for ambiguity is straightforward [11]. Let A → αj for j = 1..s be the non-left-recursive productions and A → Aβk for k = 1..r be the directly left-recursive productions, where αj, βk ⇏* ε. Replace those productions with:

  A → α1 A′ | ... | αs A′
  A′ → β1 A′ | ... | βr A′ | ε

The transformation is easier to see using EBNF:

  A → A′ A″*
  A′ → α1 | ... | αs
  A″ → β1 | ... | βr

or just A → (α1 | ... | αs)(β1 | ... | βr)*. For example, the left-recursive E rule becomes:

  E → id (% E | + E)*

This non-left-recursive version is still ambiguous because there are two derivations for a+b+c. The default ambiguity resolution policy chooses to match input as soon as possible, resulting in interpretation (a+b)+c.

The difference in associativity does not matter for expressions using a single operator, but expressions with a mixture of operators must associate operands and operators according to operator precedence. For example, the parser must recognize a%b+c as (a%b)+c, not a%(b+c). The two interpretations are shown in Figure 14.

To choose the appropriate interpretation, the generated parser must compare the previous operator's precedence to the current operator's precedence in the (% E | + E)* "loop." In Figure 14, the marked expansion of E is the critical one: it must match just id and return immediately, allowing the invoking E to match the + to form the parse tree in (a) as opposed to (b).

To support such comparisons, productions get precedence numbers that are the reverse of production numbers. The precedence of the ith production is n − i + 1 for n original productions of E. That assigns precedence 3 to E → E % E, precedence 2 to E → E + E, and precedence 1 to E → id.

Next, each nested invocation of E needs operator precedence information from the invoking E. The simplest mechanism is to pass a precedence parameter, pr, to E and require: an expansion of E[pr] can match only those subexpressions whose precedence meets or exceeds pr.

To enforce this, the left-recursion elimination procedure inserts predicates into the (% E | + E)* loop. Here is the transformed, unambiguous, and non-left-recursive rule:

  E[pr] → id ({3 ≥ pr}? % E[4] | {2 ≥ pr}? + E[3])*

References to E elsewhere in the grammar become E[0]; e.g., S → E becomes S → E[0]. Input a%b+c yields the parse tree for E[0] shown in (a) of Figure 15.

Production "{3 ≥ pr}? % E[4]" is viable when the precedence of the modulo operation, 3, meets or exceeds parameter pr. The first invocation of E has pr = 0 and, since 3 ≥ 0, the parser expands "% E[4]" in E[0]. When parsing invocation E[4], predicate {2 ≥ pr}? fails because the precedence of the + operator is too low: 2 < 4. Consequently, E[4] does not match the + operator, deferring to the invoking E[0].

A key element of the transformation is the choice of E parameters, E[4] and E[3] in this grammar. For left-associative operators like % and +, the right operand gets one more precedence level than the operator itself. This guarantees that the invocation of E for the right operand matches only operations of higher precedence.
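
The predicated E[pr] rule corresponds to a precedence-climbing recursive-descent method. A minimal Java sketch under assumed helpers, eliding tree construction (names are illustrative, not ANTLR's generated code):

    // Sketch of the method generated from
    // E[pr] -> id ({3 >= pr}? % E[4] | {2 >= pr}? + E[3])*
    class ExprParser {
        interface Lexer { String peek(); void consume(); } // assumed token source
        final Lexer input;
        ExprParser(Lexer input) { this.input = input; }

        // e(0) parses a full expression, mirroring references E[0].
        void e(int pr) {
            match("id");                          // primary: E -> id
            while (true) {
                if ("%".equals(input.peek()) && 3 >= pr) {
                    input.consume();
                    e(4);                         // left-assoc: right operand E[op+1]
                } else if ("+".equals(input.peek()) && 2 >= pr) {
                    input.consume();
                    e(3);
                } else {
                    return;                       // predicate failed or no operator:
                }                                 // defer to the invoking E
            }
        }

        void match(String t) {
            if (!t.equals(input.peek())) throw new RuntimeException("expected " + t);
            input.consume();
        }
    }

Calling e(0) parses a full expression; the recursive calls e(4) and e(3) encode the left-associative operand precedences chosen above.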

[Figure 14 (parse trees; diagrams not reproduced). Parse trees for a%b+c under E → id (% E | + E)*: (a) (a%b)+c and (b) a%(b+c).]

[Figure 15 (nonterminal expansion trees; diagrams not reproduced). Nonterminal expansion trees for E[pr] → id ({3 ≥ pr}? % E[4] | {2 ≥ pr}? + E[3])*: (a) a%b+c associates as (a%b)+c; (b) a=b=c associates as a=(b=c).]

[Figure 16 (call trees and parse trees; diagrams not reproduced). Nonterminal call trees and parse trees for productions E → -E | E! | E%E | id: (a), (b) --a!! interpreted as ((-(-a))!)!; (c), (d) -a%b! interpreted as (-a)%(b!).]

For right-associative operations, the E operand gets the same precedence as the current operator. Here is a variation on the expression grammar that has a right-associative assignment operator instead of the addition operator:

  E → E % E | E =right E | id

where notation =right is shorthand for the actual ANTLR syntax "<assoc=right> E = E". The interpretation of a=b=c should be right associative, a=(b=c). To get that associativity, the transformed rule need differ only in the right operand, E[2] versus E[3]:

  E[pr] → id ({3 ≥ pr}? % E[4] | {2 ≥ pr}? = E[2])*

The E[2] expansion can match an assignment, as shown in (b) of Figure 15, since predicate 2 ≥ 2 is true.

Unary prefix and suffix operators are hardwired as right- and left-associative, respectively. Consider the following E with a negation prefix and a "not" suffix operator:

  E → -E | E! | E % E | id

Prefix operators are not left-recursive and so they go into the first subrule, whereas left-recursive suffix operators go into the predicated loop like binary operators:

  E[pr] → (id | -E[4]) ({3 ≥ pr}? ! | {2 ≥ pr}? % E[3])*

Figure 16 illustrates the rule invocation tree (a record of the call stacks) and associated parse trees resulting from an ANTLR-generated parser. Unary operations in contiguous productions all have the same relative precedence and are, therefore, "evaluated" in the order specified. E.g., E → -E | +E | id must interpret -+a as -(+a), not +(-a).

Nonconforming left-recursive productions E → E or E → ε are rewritten without concern for ambiguity using the typical elimination technique.

Because of the need to resolve ambiguities with predicates and compute A parameters, ANTLR uses the following procedure rather than the pure transformation given earlier.

C.1 Left-Recursion Elimination Rules

To eliminate direct left-recursion in nonterminals and resolve ambiguities, ANTLR looks for the four patterns:

  Ai → A α A   (binary and ternary operator)
  Ai → A α     (suffix operator)
  Ai → α A     (prefix operator)
  Ai → α       (primary or "other")

The subscript on productions, Ai, captures the production number in the original grammar when needed. Hidden and indirect left-recursion results in static left-recursion errors from ANTLR. The transformation procedure from G to G′ is:

1. Strip away directly left-recursive nonterminal references.
2. Collect prefix and primary productions into newly-created A′.
3. Collect binary, ternary, and suffix productions into newly-created A″.
4. Prefix productions in A″ with the precedence-checking semantic predicate {pr(i) >= pr}? where pr(i) = n − i + 1.
5. Rewrite A references among binary, ternary, and prefix productions as A[nextpr(i, assoc)], where nextpr(i, assoc) = (assoc == left ? i + 1 : i).
6. Rewrite any other A references within any production in P (including A′ and A″) as A[0].
7. Rewrite the original A rule as A[pr] → A′ A″*.

In practice, ANTLR uses the EBNF form rather than A′ A″*.
