Nez: Practical Open Grammar Language
Kimio Kuramitsu Yokohama National University, JAPAN [email protected] http://nez-peg.github.io/
Abstract tions. Developers use a formal grammar, such as LALR(k), LL(k), Nez is a PEG(Parsing Expressing Grammar)-based open gram- or PEG, to specify programming language syntax. Based on the mar language that allows us to describe complex syntax constructs grammar specification, a parser generator tool, such as Yacc[14], without action code. Since open grammars are declarative and ANTLR3/4[29], or Rats![9], produces an efficient parser code that free from a host programming language of parsers, software engi- can be integrated with the host language of the compiler and inter- neering tools and other parser applications can reuse once-defined preter. This generative approach, obviously, enables developers to grammars across programming languages. avoid error-prone coding for lexical and syntactic analysis. A key challenge to achieve practical open grammars is the ex- Traditional parser generators, however, are not entirely free pressiveness of syntax constructs and the resulting parser perfor- from coding. One particular reason is that the formal grammars mance, as the traditional action code approach has provided very used today lack sufficient expressiveness for many of the pop- pragmatic solutions to these two issues. In Nez, we extend the ular programming language syntaxes, such as typedef in C/C++ symbol-based state management to recognize context-sensitive lan- and other context-sensitive syntaxes[3, 8, 9]. In addition, a formal guage syntax, which often appears in major programming lan- grammar itself is a syntactic specification of the input language, not guages. In addition, the Abstract Syntax Tree constructor allows us a schematic specification for tree representations that are needed in to make flexible tree structures, including the left-associative pair semantic analysis. To make up for these limitations, most parser of trees. Due to these extensions, we have demonstrated that Nez generators take an ad hoc approach called semantic action, a frag- can parse not all but many grammars. ment of code embedded in a formal grammar specification. The Nez can generate various types of parsers since all Nez op- embedded code is written in a host language and combined in a erations are independent of a specific parser language. To high- way that it is hooked at a parsing context that requires extended light this feature, we have implemented Nez with dynamic pars- recognition or tree constructions. ing, which allows users to integrate a Nez parser as a parser library The problem with arbitrary action code is that they decrease that loads a grammar at runtime. To achieve its practical perfor- the reusability of the grammar specification[28]. For example, Yacc mance, Nez operators are assembled into low-level virtual machine cannot generate the Java version of a parser simply because Yacc instructions, including automated state modifications when back- uses C-based action code. There are many Yacc ports to other tracking, transactional controls of AST construction, and efficient languages, but they do not reuse the original Yacc grammar without memoization in packrat parsing. We demonstrate that Nez dynamic porting of action code. This is an undesirable situation because the parsers achieve very competitive performance compared to existing well-defined grammar is demanded everywhere in IDEs and other efficient parser generators. software engineering tools[18]. Nez is a grammar language designed for open grammars. In Categories and Subject Descriptors D.3.1 [Programming Lan- other words, our aim is that once we write a grammar specification, guages]: Formal Definitions and Theory – Syntax; D.3.4 [Pro- anyone can use the grammar in any programming language. To gramming Languages]: Processors – Parsing achieve the openness of the grammar specification, Nez eliminates any action code from the beginning, and provides a declarative General Terms Languages, Algorithms but small set of operators that allow the context-sensitive parsing arXiv:1511.08307v1 [cs.PL] 26 Nov 2015 Keywords Parsing expression grammars, Context-sensitive gram- and flexible Abstract Syntax Tree construction that are needed for mars, Packrat parsing popular programming language syntaxes. This paper presents the implementation status of Nez with the experience of our grammar 1. Introduction development. The Nez grammar language is based on parsing expression A parser generator is a standard approach to reliable parser de- grammars (PEGs), a popular syntactic foundation formalized by velopment in programming languages and many parser applica- Ford[8]. PEGs are simple and portable and have many desirable properties for defining modern programming language syntax, but they still have several limitations in terms of defining popular pro- gramming languages. Typical limitations include typedef-defined name in C/C++ [8, 9], delimiting identifiers (such as < Unpublished, draft paper to be revised. Comments and suggestions are welcome 1 2015/11/30 parsed substrings to be handled as symbols, a specialized word PEG Type Proc. Description in another context. The symbol table is a runtime, growing, and ’’ Primary 5 Matches text recursive storage for such symbols. If we handle typedef-defined [] Primary 5 Matches character class names as symbols for example, we can realize different parsing . Primary 5 Any character behavior through referencing the symbol table. Nez provides a A Primary 5 Non-terminal application small set of symbol operators, including matching, containment (e) Primary 5 Grouping testing, and scoping. More importantly, since the symbol table e? Unary suffix 4 Option is a single, unified data structure for managing various types of e∗ Unary suffix 4 Zero-or-more repetitions symbols, we can easily trace state changes and then automate e+ Unary suffix 4 One-or-more repetitions state management when backtracking. In addition, the symbol table &e Unary prefix 3 And-predicate itself is simply a stack-based data structure; we can translate it into !e Unary prefix 3 Negation any programming language. e1e2 Binary 2 Sequencing Another important role of the traditional action code is the con- e1/e2 Binary 1 Prioritized Choice struction of ASTs. Basically, PEGs are just a syntactic specifica- AST tion, not a schematic specification for tree structures that represent { e } Construction 5 Constructor AST nodes. In particular, it is hard to derive the left-associative $(e) Construction 5 Connector pair of trees due to the limited left recursion in PEGs. In Nez, we {$ e } Construction 5 Left-folding introduce an operator to specify tree structures, modeled on cap- #t Construction 5 Tagging turing in perl compatible regular expressions (PCREs), and extend ‘‘ Construction 5 Replaces a string tree manipulations including tagging for typing a tree, labeling for Symbols sub-nodes, and left-folding for the left-associative structure. Due Unpublished, draft paper to be revised. Comments and suggestions are welcome 2 2015/11/30 • scanner-less parsing, avoiding the extensive use of lexer hacks Expr = Prod {$left (’+’ #Add / ’-’ #Sub) $right(Prod)}* such as in C++ grammars, and Prod = Val {$left (’*’ #Mul / ’/’ #Div) $right(Val)}* Val = { [0-9]+ #Int } • unlimited lookahead, recognizing highly nested structures such as {an bn cn | n > 0}. 1 + 2 * 3 1 * 2 + 3 Nez inherits all these properties from PEGs, and has extended #Mul #Mul features based on AST operators, symbol operators and conditional $left $right $left $right parsing, as listed in Table 1. #Add #Int [3] #Int [1] #Add 2.2 AST Operators $left $right $left $right The first extension of parsing expressions in Nez is a flexible #Int [1] #Int [2] #Int [1] #Int [2] construction of AST representations. Each node of an AST contains a substring extracted from the input characters. Nez has adopted Figure 1. Mathematical basic operators and example of ASTs an PCRE-like capturing operator, denoted by {}, to capture the substring. The productions Int and Long below are examples of capturing a sequence of digits (as defined in NUM.) Note that the AST representation in Nez is a so-called common NUM = [0-9]+ tree, a sort of generalized data structure. However, we can map a Int = { NUM } tag to a class and then map labels to fields in a mapped class, which Long = { NUM } [Ll] leads to automated conversion of concrete tree objects. Although mapping and type checking are an interesting challenge, they are A string to be captured is the exact one matched with NUM. That beyond the scope of this paper. is, Long accepts the input 0L but captures the only substring 0. An To the end, we show an example of mathematical basic oper- empty string may be captured by an empty expression. ators written in Nez. Figure 1 shows some of ASTs constructed Tagging is introduced to distinguish the type of nodes. We can when evaluate the nonterminal Expr. just add a #-prefixed tag to each of the nodes. 2.3 Symbol Table Int = { NUM #Int } PEGs are very expressive, but they sometimes suffer from insuffi- Long = { NUM #Long } [Ll] cient expressiveness for context-sensitive syntax, appearing even in A backquoat operator ‘‘ is a string operator that replaces the popular programming languages such as: captured string with the specified string. Using the empty capture, • Typedef-defined name in C/C++ [8, 9] we are allowed to create an arbitrary node at any point of a parsing expression. • Here document in Perl, Ruby, and many other scripting lan- guages DefaultValue = { ‘0‘ #Int } • Indentation-based code layout in Python and Haskell [1] The connector $(e) is used to make a tree structure by con- • Contextual keywords used in C# and other evolving languages[5] necting two AST nodes in a parent-child relationship. The prefix $ is used to specify a child node and append it to a node that is The problem with context-sensitive syntax is that PEGs are as- constructed on the left-hand side of $(e). As a result, the tree is sumed stateless parsing and cannot handle state changes in the constructed as the natural order of left-to-right and top-down pars- parser context. However, most of the state changes above can ing. be modeled by a string specialization; the meaning of words is Basically, the AST operators can transform a sequence of nodes changed in a certain context. We call such a specialized string a into a tree-formed structure. We can specify that a subtree is nested, symbol. Nez is designed to provide a symbol table and related sym- flattened, or ignored (by dropping the connector). In the following, bol operators to manage symbols in the parser context. we make the construction of a flattened list (#Add 1 2 3) and a To illustrate what a symbol is, let us suppose that the production nested right-associative pair (#Add 1 (#Add 2 3)) for the same NAME is defined: input 1+2+3. NAME = [A-Za-z] [A-Za-z0-9]* List = { $(Int) (’+’ $(int))* #Add } Binary = { $(Int) (’+’ $(Binary)) #Add } The symbolization operator takes the form of Add = Int {$left ’+’ $right(int) #Add}* Unpublished, draft paper to be revised. Comments and suggestions are welcome 3 2015/11/30 Here, we consider a simple case where XML tags need to match USERTYPE = [A-Za-z_] !W* closed tags with the same open tags. W = [A-ZA-z_0-9] S = [ \t\r\n] ... TypeDef Briefly, an XML element can be specified as follows: = ’typedef’ S* TypeName S* Unpublished, draft paper to be revised. Comments and suggestions are welcome 4 2015/11/30 A, B, C ∈ N Nonterminals e ::= : empty a, b ∈ Σ Characters | A : nonterminal x, y, z ∈ Σ∗ Sequence of characters | a : character xy Concatenation of x and y | e e0 : sequence e Parsing expressions | e / e0 : prioritized choice A = e Production | e∗ : repetition T Tree node | &e : and predicate [A, x] A-specialized symbol x | !e : not predicate S = [A, x][A0, y] State, sequence of labeled symbols | { e } : constructor empty string or sequence | {$ e } : left-folding | $(e) : connector Figure 3. Notations uses throughout this paper | #t : tag | ‘x‘ : string replacement | hsymbol Ai : symbol definition Input: if(a > b) return a; else { return b; } | hexists A : symbol existence | hexists A xi : symbol existence #If | hmatch Ai : symbol match | his Ai : symbol equivalence | hisa Ai : symbol containment | hblock ei : nested scope #GreaterThan #Return #Return | hlocal A ei : isolated local scope #Variable #Variable #Variable #Variable Figure 5. An abstract syntax of Nez expressions 'a' 'b' 'a' 'b' The initial state of the AST node is an empty tree. new(x) is a Figure 4. Pictorial notation of ASTs function that instantiates a new AST node including a given string x. Let v be a reference of a node of ASTs. That is, we say v = v0 if 0 Note that conditional parsing is a Boolean version of the symbol v is the mutated v with the following tree manipulation functions: C C = c e table. Let be an empty expression (i.e., ’’). Unpublished, draft paper to be revised. Comments and suggestions are welcome 5 2015/11/30 { e } (S, x, v) −→ (S, x, v) e (S, xy, v) −→ (SS0, y, v0) a a {$ e } (S, ax, v) −→ (S, x, v) (S, xy, T ) −−−−→ (SS0, y, new(x)) A = e e 0 0 $(e) (S, xy, v) −→ (SS , y, v ) e (S, xy, v) −→ (S0, y, v0) A (S, xy) −→ (SS0, y, v0) $(e) (S, xy, v) −−−→ (S0, y, link(v, v0)) e e0 {$ e } e 0 0 e e0 (S, xy, v) −→ ( S, y, v ) (S, xyz, v) −→ (SS0, yz, v0)(SS0, yz, v0) −→ (SS0S00, z, v00) {$ e } e e0 (S, xy, v) −−−−→ (0S, y, link(new(x), v)) (S, x, v) −−→ (SS0S00, z, v00) #t e/e0 #t e (S, x, v) −−→ (S, x, tag(v, #t)) (S, xy, v) −→ (SS0, y, v0) e/e0 ‘x‘ ‘x‘ (S, xy, v) −−−→ (SS0, y, v0) (S, x, v) −−→ (S, x, replace(v, x)) e e0 (S, xy, v) −→• (S, xy, v) −→ (SS0, y, v0) Figure 7. Semantics of AST operators e/e0 (S, xy, v) −−−→ (SS0, y, v0) &e e (S, xy, v) −→ (SS0, y, v0) hsymbol Ai &e A 0 (S, xy, v) −−→ (SS0, xy, v0) (S, xy, T ) −→ (S, y, T ) hsymbol Ai 0 !e e (S, xy, T ) −−−−−−−−→ (S[A, x], y, T ) (S, x, v) −→• !e hblock ei e (S, x, v) −→ (S, x, v) (S, xy, T ) −→ (SS0, y, T 0) hblock ei Figure 6. Semantics of PEG operators (S, xy, T ) −−−−−−→ (S, y, T 0) hlocal A ei e 0 0 e The transition form (S, xy, v) −→ (S , x, v ) may be read: In S = del(S, A)(S , xy, T ) −→ (S0, y, T 0) symbol table S, the input sequence xy is reduced to characters y A¯ A hlocal A ei while evaluating e (i.e, e consumes x). Simultaneously, the seman- (S, xy, T ) −−−−−−−→ (S, y, T ) tic value v is evaluated to v0. The symbol table is unchanged if e 0 0 0 hexists Ai (S, x, v) −→ (SS , y, v ) and S is empty. Likewise, the AST node ∃z[A, z] ∈ S e 0 0 0 is not mutated if (S, x, v) −→ (S , y, v ) and v = v . hexists Ai (S, x, T ) −−−−−−−→ (S, x, T ) We write • for a special failure state, suggesting backtracking to the alternative if it exists. Here is an example failure transition. hexists A xi ∃[A, z] ∈ S a hexists A xi (S, bx, v) −→• (a 6= b) (S, x, T ) −−−−−−−−→ (S, x, T ) Figure 6 is the definition of the operational semantics of PEG hmatch Ai operators. Due to the space constraint, we omit the failure transi- top(S, A) = z tion case. Instead, undefined transitions are regarded as the failure hmatch Ai transition. The semantics of Nez is conservative and shares with the (S, zx, T ) −−−−−−−→ (S, x, T ) same interpretation of PEGs. Notably, the and-predicate &e allows his Ai A the state transitions of S 7→ SS0 and v 7→ v0, meaning that the (S, zx, T ) −→ (S, x, T 0) top(S, A) = z his Ai lookahead definition of symbols and the duplication of local AST 0 nodes. (S, zx, T ) −−−−→ (S, x, T ) Figure 7 and Figure 8 show the semantics of AST operators and hisa Ai A symbol operators. The recognition of Nez is AST-independent, be- (S, zx, T ) −→ (S, x, T 0) ∃[A, z] ∈ S cause any values v are not premise for the transitions. Accordingly, hisa Ai (S, zx, T ) −−−−−→ (S, x, T 0) an AST-eliminated grammar accepts the same input with the origi- nal Nez grammar. Figure 8. Semantics of Nez’s symbol operators Formally, the language generated by the expression e is L(S, e) = {x|(S, xy) −→e (S0, y)}, and the language of grammar L(G) is e L(G) = {x|(, xy) −→s (S, y)}. Note that the semantic value is unnecessary in the language definition. As suggested in the study is that conditional parsing can be replaced with symbol operators of predicated grammars[29], the language class of L(G) is consid- and an empty symbol. In addition, we can eliminate conditions ered to be context-sensitive because predicates can check both the from a Nez grammar, and Nez parsers usually run based on the left and right context. eliminated grammar. This section describes how to eliminate the conditions. For simplicity, we consider the a single parsing condition, la- 4. Elimination of Parsing Condition beled c. Let x be a Boolean variable such as x ∈ {c, !c}, where !c In the previous section, we described the formal definition of Nez stand for not c, and f(e, x) be a conversion function of an expres- grammar language without conditional parsing. One reason for this sion e into the eliminated one. Unpublished, draft paper to be revised. Comments and suggestions are welcome 6 2015/11/30 Here we write G¯ for the eliminated grammar. Eliminating c- function calls. Backtracking is simply implemented over the call condition from G is a conversion from G into G¯, and defined: for stacks. each production A = e in G, two new productions Ac and A!c are Notoriously, backtracking might cause an exponential time added into G¯, as follows: parsing in the worst cases, while packrat parsing[7] is well es- tablished to guarantee the linear time parsing with PEGs. The idea behind packrat parsing is a memoization whereby all results of Ac = f(e, c) A!c = f(e, !c) nonterminal calls are stored in the memoization table to avoid re- Now, we define the conversion function f(e, x) recursively: dundant nonterminal calls. Although the memoization table may require some complexity for its efficiency, we use a simple and • f(A, x) = Ac if x = c, or A!c if x =!c constant-memory memoization table, presented in [21]. • f(e1e2, x) = f(e1, x)f(e2, x) In addition to the standard PEG parser runtime, Nez parsers re- • f(e1/e2, x) = f(e1, x)/f(e2, x) quire two additional state management to handle ASTs and sym- bols. Note that we assume that conditions are eliminated upfront, • f(&e, x) = &f(e, x) as described in Section 4. • f(!e, x) = !f(e, x) The AST construction runtime provides a parser with APIs • f( Unpublished, draft paper to be revised. Comments and suggestions are welcome 7 2015/11/30 e / e0 {symbolTable.startBlock()} τ(A = e, L) = A τ(e, L0) e { symbolTable.commit() } L0 ret / {symbolTable.abort()} e0 6. Case Studies and Experiences of parser productions through eliminating the conditional opera- We have developed many Nez grammars, ranging from program- tors. The same numbers of productions in #G and #P means no use ming languages to data formats. All developed grammars are avail- of conditional parsing. The column labeled ”Symbols and Condi- able online at http://github.com/nez-peg/nez-grammar. tions” indicates nonterminals and conditions that are used in the Table 3 shows a summary of the major developed grammars.The grammar. column labeled ”#G” indicates the number of defined productions Here are some short comments on each of the interesting devel- in a grammar, while the column labeled ”#P” indicates the number oped grammars. Unpublished, draft paper to be revised. Comments and suggestions are welcome 8 2015/11/30 • C – Based on two PEG grammars written in Mouse. The se- mantic code embedded to handle the typedef-defined name is 1000 converted into the TypeDef symbol. The nested typedef state- ments are not implemented. • C# – Developed from scratch, referencing C#5.0 Language 100 Specification (written in a natural language). The condition AwaitKeyword is used to express contextual keyword await. 10 • Java8 – Ported from Java8 grammar written in ANTLR41. Java can be specified without any Nez extensions. Latency [ms] 2 • JavaScript – Based on JavaScript grammar for PEG.js . We 1 simplify the grammar specification using the parsing condition parse match parse match parse match parse match InOperator that distinguishes expression rules from the for/in nez-j cnez nez-c nez-js context. processing 252 91 58 13 112 30 1746 476 jquery 37 27 9 4 16 9 241 120 • Konoha – Developed from scratch. Konoha is a statically typed java20 110 70 25 8 55 25 801 494 scripting language, which we have designed in [20]. The most xmark 185 117 41 9 118 35 753 405 syntactic constructs come from Java, but we use the parsing condition SemicolonInsertion for expressing the condi- Figure 12. Performance comparison of differently implemented tional end of a statement. More importantly, Konoha is a lan- Nez parsers: nez-j, nez-c, cnez, and nez-js guage implementation that uses ASTs constructed by the Nez parser. • Python3 – Ported from Python 3.0 abstract grammar3. The Intel Core i7-4770, 8GB of DDR3 RAM, and running on Linux Indent is used to express the indentation-based code block by Ubuntu 14.04.1 LTS. All C programs are compiled with clang- capturing white spaces. 3.6, all Java parsers run on Oracle JDK 1.8, and all JavaScript • Ruby – Developed from scratch. The Delimiter is used to parsers run on node.js version 5.0. Tests are run several times, and express the context-sensitive delimiting identifier in the Here we record the average time for each iteration. The execution time document. is measured in millisecond by System.nanoTime() in Java APIs and gettimeofday() in Linux. Tested source files are randomly • Haskell – Postponed. We didn’t express the indent symbol for collected from major open source repositories. To improve the time Haskell’s code layout since the indentation starts in the middle accuracy, we eliminate small files from the data set. of expressions. This is mainly because of a limitation of a All tested Nez grammars are available online as described in symbol operator in Nez; if Nez provided Haskell-specialized Section 6. For convenience, we label tested grammars: java.nez, symbol operator such as Unpublished, draft paper to be revised. Comments and suggestions are welcome 9 2015/11/30 Java (java.nez) JavaScript (js.nez) Python (python3.nez) 100 100 100 nez-j cnez nez-c nez-j cnez nez-c nez-j cnez nez-c 10 10 10 1 1 1 Latency [ms] 0.1 0.1 0.1 0.01 0.01 0.01 1 10 100 1 10 100 1000 1 10 100 File Size [KiB] Figure 11. Parsing time of the open source files for Java, JavaScript, and Python against the input size plotted as log-log base 10. 10000 1000 100 Latency [ms] 10 1 parse match parse match match parse match parse match parse parse match parse cnez nez-c yacc nez-j Rats! antlr4 nez-js pegjs processing 58 13 112 30 18 252 91 1598 1483 9167 1745 476 jquery 9 4 16 9 5 37 27 584 579 2199 240 120 1627 java20 25 8 55 25 110 70 191 128 150 801 494 2318 Figure 13. Performance comparison of processing, jquery, and java20 in various parsers. The pegjs parser cannot end parsing process due to its out of memory errors. 200 250 180 200 160 140 150 120 100 100 50 80 Latency [ms] 60 0 Latency [ms] parse match parse match parse match 40 nez-j cnez nez-c 20 non (xml) 185 117 41 9 118 35 0 symbol (xmlsym) 217 158 44 11 137 38 xerces- xerces- cnez nez-c yacc libxml nez-j antlr4 c java non (xml) symbol (xmlsym) parse 41 118 85 74 185 195 75 match 9 35 69 49 31 117 56 Figure 15. Effects of symbol operators Figure 14. Parsing time in xmark 2.5 times faster than nez-java, and almost 10 times faster than the nez-js. • xmark – a synthetic and scalable XML files that are provided by Next, we will turn to the performance comparison with existing XMark benchmark program [32]. We used a newly generated standard parsers. Here, we have chosen Yacc[14], ANTLR4[30], file in 10MB. Rats![9], and PEGjs[24], since they notably produce efficient parsers and are accepted in many serious projects. We set up these Figure 12 shows the parse time of Nez parsers. The data point parser generators as follows. parse indicates the parsing time including the AST construction, while match indicates the parsing time without the AST construc- • yacc – a standard LALR parser generator for C. In this exper- tion. The time scale is plotted as log base 10 due to the variety of iment, we suffered from the limited availability of grammars, results. The cnez parsers are fastest as easily predicted. In com- and could only run a JavaScript parser without AST construc- parisons among the parsing machines, the nez-c parser is almost tion. Unpublished, draft paper to be revised. Comments and suggestions are welcome 10 2015/11/30 • antlr4 – a standard parser generator based on the extended Data-dependent grammars [3, 13] share similar ideas in terms of LL(k) with some packrat parsing feature. All tested grammars recognizing context-sensitive syntax. Briefly, these grammars uses are derived from the official ANTLR4 grammar repository. a variable to be bound to a parsed result, which seems to be the • rats! – a PEG-based parser generator[9] for Java. Java gram- addition of symbols in Nez. Data-dependent grammars are more mar is derived from a part of the xtc implementation, while expressive in a way that they express more complex condition such JavaScript parser is generated from a Nez grammar js.nez by as ([n>0] OCTET{n:=n1}), while Nez is more declarative and the grammar translation. leads to better readability. In addition, the better expressiveness of data-dependent grammars suggests a future promising extension of • pegjs – a PEG-based parser generator[24] for JavaScript. Nez, including numerical values in the symbol table. JavaScript grammar is derived from its official site, and Java Among many formal grammars such as LALR(k) and LL(k), grammar is translated from java.nez by Nez tools. PEGs are simple and seemingly a suitable foundation for imple- menting portable parser runtime. In the reminder of this section, Figure 13 shows the parsing time of processing, jquery, and we would like to focus on PEG-specific related work. java20 in each of the examined parsers. All Nez parsers show bet- Extensions. With Rats!, Robert Grimm has shown an ele- ter performance with competitive parsers in each of the host en- gant integration of PEGs with action code despite its speculative vironments. In ANTLR4 and Rats!, some of results are surpris- parsing[9]. Many other PEG-based parser generators[10, 24, 31] ingly bad. We run fairly in terms of the time measurement and the have adopted the semantic action approach for expressing syntax JIT condition, but we are still not sure that Rats! and ANTLR4 constructs that PEGs can not recognize. Yet, there have been sev- parsers are best optimized in the experiment. As a result, we con- eral declarative attempts for PEG extensions. Adams has recently clude that Nez parsers achieve competitive performance compared formalized Indent-Sensitive CFGs [1] for the indentation-based to these practical parser generators. Likewise, we compare Nez code layout and has extended the PEG-version of IS-CFGs [2]. parsers for xmark with standard XML parsers such as libxml and Iwama et al have extended PEGs with the e U e operator to com- xerces, which are highly optimized by hand. Figure 14 shows the bine a black-boxed parser for a natural language[12]. Compared to comparison of parsing time in xmark. The data points match and these existing extensions, Nez more broadly covers various parsing parse in XML parsers correspond to SAX and DOM parsers, re- aspects. spectively. Although we have observed the overhead of AST con- AST Construction. Traditionally, a grammar developer usually struction in Nez parsers, they are also competitive. writes action code to construct some forms of ASTs as a seman- Finally, we focus on the performance effect of symbol opera- tic value[15]. However, writing the AST construction is tedious, tors. Figure 15 shows the performance comparison of xml.nez and and many parser generators have some annotation-based supports xmlsym.nez. We confirm that the costs of symbol operators, requir- to automate the AST construction. The underlying idea is to fil- ing Unpublished, draft paper to be revised. Comments and suggestions are welcome 11 2015/11/30 siveness of Nez with extensive case studies. To achieve the aim of [10] C. Hirsch and D. Frey. Parsing expression grammar template library. open grammars, a lot of interesting challenges remain. Future work https://code.google.com/p/pegtl/, 2014. that we will investigate includes grammar modularity and infer- [11] R. Ierusalimschy. A text pattern-matching tool based on parsing ence, a tree checker for better connectability of parser applications, expression grammars. Softw. Pract. Exper., 39(3):221–258, Mar. 2009. and a DFA-based machine for more efficient parsing. In the end, ISSN 0038-0644. . URL http://dx.doi.org/10.1002/spe. we hope that Nez parsers will be available in many programming v39:3. languages. Our developed tools and grammars are available online [12] F. Iwama, T. Nakamura, and H. Takeuchi. Constructing parser for at http://nez-peg.github.io/. industrial software specifications containing formal and natural lan- guage description. In Proceedings of the 34th International Confer- ence on Software Engineering, ICSE ’12, pages 1012–1021, Piscat- Acknowledgments away, NJ, USA, 2012. IEEE Press. ISBN 978-1-4673-1067-3. URL The C version of Nez parsers (cnez and virtual parsing ma- http://dl.acm.org/citation.cfm?id=2337223.2337353. chines) have been developed by Masahiro Ide and Shun Honda.The [13] T. Jim, Y. Mandelbaum, and D. Walker. Semantics and algorithms JavaScript version has been developed by Takeru Sudo. Finally, the for data-dependent grammars. In Proceedings of the 37th Annual anonymous reviewers are acknowledged for their helpful sugges- ACM SIGPLAN-SIGACT Symposium on Principles of Programming tions for improving an earlier draft of this paper. Languages, POPL ’10, pages 417–430, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-479-9. . URL http://doi.acm.org/ 10.1145/1706299.1706347. References [14] S. C. Johnson and R. Sethi. Unix vol. ii. chapter Yacc: A Parser [1] M. D. Adams. Principled parsing for indentation-sensitive languages: Generator, pages 347–374. W. B. Saunders Company, Philadelphia, Revisiting landin’s offside rule. In Proceedings of the 40th Annual PA, USA, 1990. ISBN 0-03-047529-5. URL http://dl.acm.org/ ACM SIGPLAN-SIGACT Symposium on Principles of Programming citation.cfm?id=107172.107192. Languages, POPL ’13, pages 511–522, New York, NY, USA, 2013. [15] A. Johnstone and E. Scott. Tear-insert-fold grammars. In Proceedings ACM. ISBN 978-1-4503-1832-7. . URL http://doi.acm.org/ of the Tenth Workshop on Language Descriptions, Tools and Appli- 10.1145/2429069.2429129. cations, LDTA ’10, pages 6:1–6:8, New York, NY, USA, 2010. ACM. [2] M. D. Adams and O. S. Agacan.˘ Indentation-sensitive parsing for ISBN 978-1-4503-0063-6. . URL http://doi.acm.org/10.1145/ parsec. In Proceedings of the 2014 ACM SIGPLAN Symposium on 1868281.1868287. Haskell, Haskell ’14, pages 121–132, New York, NY, USA, 2014. [16] M. Jonnalagedda, T. Coppey, S. Stucki, T. Rompf, and M. Odersky. ACM. ISBN 978-1-4503-3041-1. . URL http://doi.acm.org/ Staged parser combinators for efficient data processing. In Proceed- 10.1145/2633357.2633369. ings of the 2014 ACM International Conference on Object Oriented [3] A. Afroozeh and A. Izmaylova. One parser to rule them all. In 2015 Programming Systems Languages & Applications, OOPSLA ’14, ACM International Symposium on New Ideas, New Paradigms, and pages 637–653, New York, NY, USA, 2014. ACM. ISBN 978-1- Reflections on Programming and Software (Onward!), Onward! 2015, 4503-2585-1. . URL http://doi.acm.org/10.1145/2660193. pages 151–170, New York, NY, USA, 2015. ACM. ISBN 978-1- 2660241. 4503-3688-8. . URL http://doi.acm.org/10.1145/2814228. [17] L. C. Kats, E. Visser, and G. Wachsmuth. Pure and declarative syntax 2814242. definition: Paradise lost and regained. In Proceedings of the ACM [4] M. Bravenboer and E. Visser. Concrete syntax for objects: Domain- International Conference on Object Oriented Programming Systems specific language embedding and assimilation without restrictions. Languages and Applications, OOPSLA ’10, pages 918–932, New In Proceedings of the 19th Annual ACM SIGPLAN Conference on York, NY, USA, 2010. ACM. ISBN 978-1-4503-0203-6. . URL Object-oriented Programming, Systems, Languages, and Applica- http://doi.acm.org/10.1145/1869459.1869535. tions, OOPSLA ’04, pages 365–383, New York, NY, USA, 2004. [18] L. C. L. Kats, R. Vermaas, and E. Visser. Integrated language defini- ACM. ISBN 1-58113-831-8. . URL http://doi.acm.org/10. tion testing: Enabling test-driven language development. In K. Fisher, 1145/1028976.1029007. editor, Proceedings of the 26th Annual ACM SIGPLAN Conference on [5] M. Bravenboer, E. Tanter, and E. Visser. Declarative, formal, and Object-Oriented Programming, Systems, Languages, and Applications extensible syntax definition for aspectj. In Proceedings of the 21st (OOPSLA 2011), Portland, Oregon, USA, 2011. ACM. Annual ACM SIGPLAN Conference on Object-oriented Programming [19] P. Klint, T. van der Storm, and J. Vinju. On the impact of dsl tools Systems, Languages, and Applications, OOPSLA ’06, pages 209–228, on the maintainability of language implementations. In Proceedings New York, NY, USA, 2006. ACM. ISBN 1-59593-348-4. . URL of the Tenth Workshop on Language Descriptions, Tools and Applica- http://doi.acm.org/10.1145/1167473.1167491. tions, LDTA ’10, pages 10:1–10:9, New York, NY, USA, 2010. ACM. [6] M. Bravenboer, K. T. Kalleberg, R. Vermaas, and E. Visser. Stratego/xt ISBN 978-1-4503-0063-6. . URL http://doi.acm.org/10.1145/ 0.17. a language and toolset for program transformation. Sci. Comput. 1868281.1868291. Program., 72(1-2):52–70, June 2008. ISSN 0167-6423. . URL [20] K. Kuramitsu. Konoha: implementing a static scripting language with http://dx.doi.org/10.1016/j.scico.2007.11.003. dynamic behaviors. In S3 ’10: Workshop on Self-Sustaining Systems, [7] B. Ford. Packrat parsing:: Simple, powerful, lazy, linear time, func- pages 21–29, New York, NY, USA, 2010. ACM. ISBN 978-1-4503- tional pearl. In Proceedings of the Seventh ACM SIGPLAN Interna- 0491-7. . tional Conference on Functional Programming, ICFP ’02, pages 36– [21] K. Kuramitsu. Packrat parsing with elastic sliding window. Journal of 47, New York, NY, USA, 2002. ACM. ISBN 1-58113-487-8. . URL Infomration Processing, 23(4):505–512, 2015. . http://doi.acm.org/10.1145/581478.581483. [22] K. Kuramitsu. Fast, flexible, and declarative consturction of abstract [8] B. Ford. Parsing expression grammars: A recognition-based syntactic syntax trees with PEGs. Journal of Infomration Processing, 24(1):(to foundation. In Proceedings of the 31st ACM SIGPLAN-SIGACT Sym- appear), 2016. posium on Principles of Programming Languages, POPL ’04, pages 111–122, New York, NY, USA, 2004. ACM. ISBN 1-58113-729-X. . [23] K. Kuramitsu and T. Matsumura. A declarative extension of parsing URL http://doi.acm.org/10.1145/964001.964011. expression grammars for recognizing most programming languages. Journal of Infomration Processing, 24(2):(to appear), 2016. [9] R. Grimm. Better extensibility through modular syntax. In Pro- ceedings of the 2006 ACM SIGPLAN Conference on Programming [24] D. Majda. Peg.js - parser generator for javascript, 2015. URL Language Design and Implementation, PLDI ’06, pages 38–51, New http://http://pegjs.org/. York, NY, USA, 2006. ACM. ISBN 1-59593-320-4. . URL http: [25] S. McPeak and G. Necula. Elkhound: A fast, practical glr parser //doi.acm.org/10.1145/1133981.1133987. generator. In E. Duesterwald, editor, Compiler Construction, volume Unpublished, draft paper to be revised. Comments and suggestions are welcome 12 2015/11/30 2985 of Lecture Notes in Computer Science, pages 73–88. Springer Berlin Heidelberg, 2004. ISBN 978-3-540-21297-3. [26] S. Medeiros and R. Ierusalimschy. A parsing machine for pegs. In Proceedings of the 2008 Symposium on Dynamic Languages, DLS ’08, pages 2:1–2:12, New York, NY, USA, 2008. ACM. ISBN 978- 1-60558-270-2. . URL http://doi.acm.org/10.1145/1408681. 1408683. [27] S. Medeiros, F. Mascarenhas, and R. Ierusalimschy. Left recursion in parsing expression grammars. Sci. Comput. Program., 96(P2):177– 190, Dec. 2014. ISSN 0167-6423. . URL http://dx.doi.org/10. 1016/j.scico.2014.01.013. [28] T. Parr. The reuse of grammars with embedded semantic actions. In Proceedings of the 2008 The 16th IEEE International Conference on Program Comprehension, ICPC ’08, pages 5–10, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3176-2. . URL http://dx.doi.org/10.1109/ICPC.2008.36. [29] T. Parr and K. Fisher. Ll(*): The foundation of the antlr parser generator. In Proceedings of the 32Nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’11, pages 425–436, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0663- 8. . URL http://doi.acm.org/10.1145/1993498.1993548. [30] T. Parr, S. Harwell, and K. Fisher. Adaptive ll(*) parsing: The power of dynamic analysis. In Proceedings of the 2014 ACM In- ternational Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’14, pages 579–598, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2585-1. . URL http://doi.acm.org/10.1145/2660193.2660202. [31] R. R. Redziejowski. Parsing expression grammar as a primitive recursive-descent parser with backtracking. Fundam. Inf., 79(3-4): 513–524, Aug. 2007. ISSN 0169-2968. URL http://dl.acm.org/ citation.cfm?id=2367396.2367415. [32] A. Schmidt, F. Waas, M. Kersten, M. J. Carey, I. Manolescu, and R. Busse. Xmark: A benchmark for xml data management. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02, pages 974–985. VLDB Endowment, 2002. URL http://dl.acm.org/citation.cfm?id=1287369.1287455. [33] E. Visser. Syntax Definition for Language Prototyping. PhD thesis, University of Amsterdam, September 1997. Unpublished, draft paper to be revised. Comments and suggestions are welcome 13 2015/11/30