LL(*): The Foundation of the ANTLR Parser Generator

DRAFT. Accepted to PLDI 2011.

Terence Parr, University of San Francisco
Kathleen S. Fisher, AT&T Labs Research

Abstract

Despite the power of Parsing Expression Grammars (PEGs) and GLR, parsing is not a solved problem. Adding nondeterminism (parser speculation) to traditional LL and LR parsers can lead to unexpected parse-time behavior and introduces practical issues with error handling, single-step debugging, and side-effecting embedded grammar actions. This paper introduces the LL(*) parsing strategy and an associated grammar analysis algorithm that constructs LL(*) parsing decisions from ANTLR grammars. At parse-time, decisions gracefully throttle up from conventional fixed k ≥ 1 lookahead to arbitrary lookahead and, finally, fail over to backtracking depending on the complexity of the parsing decision and the input symbols. LL(*) parsing strength reaches into the context-sensitive languages, in some cases beyond what GLR and PEGs can express. By statically removing as much speculation as possible, LL(*) provides the expressivity of PEGs while retaining LL's good error handling and unrestricted grammar actions. Widespread use of ANTLR (over 70,000 downloads/year) shows that it is effective for a wide variety of applications.

1. Introduction

Parsing is not a solved problem, despite its importance and long history of academic study. Because it is tedious and error-prone to write parsers by hand, researchers have spent decades studying how to generate efficient parsers from high-level grammars. Despite this effort, parser generators still suffer from problems of expressiveness and usability.

When parsing theory was originally developed, machine resources were scarce, and so parser efficiency was the paramount concern. In that era, it made sense to force programmers to contort their grammars to fit the constraints of LALR(1) or LL(1) parser generators. In contrast, modern computers are so fast that programmer efficiency is now more important. In response to this development, researchers have developed more powerful, but more costly, nondeterministic parsing strategies following both the "bottom-up" approach (LR-style parsing) and the "top-down" approach (LL-style parsing).

In the "bottom-up" world, Generalized LR (GLR) [16] parsers parse in linear to cubic time, depending on how closely the grammar conforms to classic LR. GLR essentially "forks" new subparsers to pursue all possible actions emanating from nondeterministic LR states, terminating any subparsers that lead to invalid parses. The result is a parse forest with all possible interpretations of the input. Elkhound [10] is a very efficient GLR implementation that achieves yacc-like parsing speeds when grammars are LALR(1). Programmers unfamiliar with LALR parsing theory, though, can easily get nonlinear GLR parsers.

In the "top-down" world, Ford introduced Packrat parsers and the associated Parsing Expression Grammars (PEGs) [5, 6]. PEGs preclude only the use of left-recursive grammar rules. Packrat parsers are backtracking parsers that attempt the alternative productions in the order specified. The first production that matches at an input position wins. Packrat parsers are linear rather than exponential because they memoize partial results, ensuring input states will never be parsed by the same production more than once. The Rats! [7] PEG-based tool vigorously optimizes away memoization events to improve speed and reduce the memory footprint.

A significant advantage of both GLR and PEG parser generators is that they accept any grammar that conforms to their meta-language (except left-recursive PEGs). Programmers no longer have to wade through reams of conflict messages. Despite this advantage, neither GLR nor PEG parsers are completely satisfactory, for a number of reasons.

First, GLR and PEG parsers do not always do what was intended. GLR silently accepts ambiguous grammars, those that match the same input in multiple ways, forcing programmers to detect ambiguities dynamically. PEGs have no concept of a grammar conflict because they always choose the "first" interpretation, which can lead to unexpected or inconvenient behavior. For example, the second production of PEG rule A → a | ab (meaning "A matches either a or ab") will never be used. Input ab never matches the second alternative since the first symbol, a, matches the first alternative. In a large grammar, such hazards are not always obvious and even experienced developers can miss them without exhaustive testing.

Second, debugging nondeterministic parsers can be very difficult. With bottom-up parsing, the state usually represents multiple locations within the grammar, making it difficult for programmers to predict what will happen next. Top-down parsers are easier to understand because there is a one-to-one mapping from LL grammar elements to parser operations. Further, recursive-descent LL implementations allow programmers to use standard source-level debuggers to step through parsers and embedded actions, facilitating understanding. This advantage is weakened significantly, however, for backtracking recursive-descent packrat parsers. Nested backtracking is very difficult to follow!

Third, generating high-quality error messages in nondeterministic parsers is difficult but very important to commercial developers. Providing good syntax error support relies on parser context. For example, to recover well from an invalid expression, a parser needs to know if it is parsing an array index or, say, an assignment. In the first case, the parser should resynchronize by skipping ahead to a ] token. In the second case, it should skip to a ; token. Top-down parsers have a rule invocation stack and can report things like "invalid expression in array index." Bottom-up parsers, on the other hand, only know for sure that they are matching an expression. They are typically less able to deal well with erroneous input. Packrat parsers also have ambiguous context since they are always speculating. In fact, they cannot recover from syntax errors because they cannot detect errors until they have seen the entire input.

Finally, nondeterministic parsing strategies cannot easily support arbitrary, embedded grammar actions, which are useful for manipulating symbol tables, constructing data structures, etc. Speculating parsers cannot execute side-effecting actions like print statements, since the speculated action may never really take place. Even side-effect-free actions such as those that compute rule return values can be awkward in GLR parsers [10]. For example, since the parser can match the same rule in multiple ways, it might have to execute multiple competing actions. (Should it merge all results somehow or just pick one?) GLR and PEG tools address this issue by either disallowing actions, disallowing arbitrary actions, or relying on the programmer to avoid side-effects in actions that could be executed speculatively.

1.1 ANTLR

This paper describes version 3.3 of the ANTLR parser generator and its underlying top-down parsing strategy, called LL(*), that address these deficiencies. The input to ANTLR is a context-free grammar augmented with syntactic [14] and semantic predicates and embedded actions. Syntactic predicates allow arbitrary lookahead, while semantic predicates allow the state constructed up to the point of a predicate to direct the parse. Syntactic predicates are given as a grammar fragment that must match the following input. Semantic predicates are given as arbitrary Boolean-valued code in the host language of the parser. Actions are written in the host language of the parser and have access to the current state. As with PEGs, ANTLR requires programmers to avoid [...]

[...] speculating parsers, which means it supports source-level debugging, produces high-quality error messages, and allows programmers to embed arbitrary actions. A survey of 89 ANTLR grammars [1] available from sourceforge.net and code.google.com reveals that 75% of them had embedded actions, counting conservatively, which suggests that such actions are a useful feature in the ANTLR community.

Widespread use shows that LL(*) fits within the programmer comfort zone and is effective for a wide variety of language applications. ANTLR 3.x has been downloaded 41,364 (binary jar file) + 62,086 (integrated into ANTLRworks) + 31,126 (source code) = 134,576 times according to Google Analytics (unique downloads from January 9, 2008 to October 28, 2010). Projects using ANTLR include Google App Engine (Python), IBM Tivoli Identity Manager, BEA/Oracle WebLogic, Yahoo! Query Language, Apple XCode IDE, Apple Keynote, Oracle SQL Developer IDE, Sun/Oracle JavaFX language, and NetBeans IDE.

This paper is organized as follows. We first introduce ANTLR grammars by example (Section 2). Next we formally define predicated grammars and a special subclass called predicated LL-regular grammars (Section 3). We then describe LL(*) parsers (Section 4), which implement parsing decisions for predicated LL-regular grammars. Next, we give an algorithm that builds lookahead DFA from ANTLR grammars (Section 5). Finally, we support our claims regarding LL(*) efficiency and reduced speculation (Section 6).

2. Introduction to LL(*)

In this section, we give an intuition for LL(*) parsing by explaining how it works for two ANTLR grammar fragments constructed to illustrate the algorithm. Consider nonterminal s, which uses the (omitted) nonterminal expr to match arithmetic expressions.