Journal of Visual Languages and Computing 49 (2018) 17–28

Contents lists available at ScienceDirect

Journal of Visual Languages and Computing

journal homepage: www.elsevier.com/locate/jvlc

Error recovery in expression grammars through labeled failures and T its implementation based on a parsing machine ⁎ Sérgio Queiroz de Medeiros ,a, Fabio Mascarenhasb a School of Science and Technology, UFRN, Natal, Brazil b Department of Computer Science, UFRJ, Rio de Janeiro, Brazil

ARTICLE INFO ABSTRACT

Keywords: Parsing Expression Grammars (PEGs) are a formalism used to describe top-down parsers with backtracking. As Parsing PEGs do not provide a good error recovery mechanism, PEG-based parsers usually do not recover from syntax Error recovery errors in the input, or recover from syntax errors using ad-hoc, implementation-specific features. The lack of Parsing expression grammars proper error recovery makes PEG parsers unsuitable for use with Integrated Development Environments (IDEs), Parsing machine which need to build syntactic trees even for incomplete, syntactically invalid programs. Natural semantics We discuss a conservative extension, based on PEGs with labeled failures, that adds a syntax error recovery mechanism for PEGs. This extension associates recovery expressionsto labels, where a label now not only reports a syntax error but also uses this recovery expression to reach a synchronization point in the input and resume parsing. We give an operational semantics of PEGs with this recovery mechanism, as well as an operational semantics for a parsing machinethat we can translate labeled PEGs with error recovery to, and prove the cor- rectness of this translation. We use an implementation of labeled PEGs with error recovery via a parsing machine to build robust parsers, which use different recovery strategies, for the Lua language. We evaluate the effec- tiveness of these parsers, alone and in comparison with a Lua parser with automatic error recovery generated by ANTLR, a popular parser generator .

1. Introduction backtrack and try another alternative. While PEGs cannot use error handling techniques that are often applied to predictive top-down Parsing Expression Grammars (PEGs) [1] are a formalism for de- parsers, because these techniques assume the parser reads the input scribing the syntax of programming languages. We can view a PEG as a without backtracking [2,3], some techniques for correctly reporting formal description of a top-down parser for the language it describes. syntactic errors in PEG parsers have been proposed, such as tracking the PEGs have a concrete syntax based on the syntax of regexes, or extended position of the farthest failure [2] and labeled failures [4,5]. regular expressions. Unlike Context-Free Grammars (CFGs), PEGs avoid While these error reporting techniques improve the quality of error ambiguities in the definition of the grammar’s language due to the use reporting in PEG parsers, they all assume that the parser aborts after of an ordered choice operator. reporting the first syntax error. We believe this is acceptable for alarge More specifically, a PEG can be interpreted as the specification ofa class of parsers, however Integrated Development Environments (IDEs) recursive descent parser with restricted (or local) backtracking. This often require parsers that can recover from syntax errors and build means that the alternatives of a choice are tried in order; when the first Abstract Syntax Trees (ASTs) even for syntactically invalid programs, in alternative recognizes an input prefix, no other alternative of this other to conduct further analyses necessary for IDE features such as choice is tried, but when an alternative fails to recognize an input auto-completion. These parsers should also be fast, as the user of an IDE prefix, the parser backtracks to try the next alternative. expects an almost instantaneous feedback. A naive interpretation of PEGs is problematic when dealing with Some PEG parser generators already provide ad-hoc mechanisms inputs with syntactic errors, as a failure during parsing an input is not that can be exploited to perform error recovery1, but the mechanisms necessarily an error, but can be just an indication that the parser should are specific to each implementation, tying the grammar to aspecific

⁎ Corresponding author. E-mail addresses: [email protected] (S. Queiroz de Medeiros), [email protected] (F. Mascarenhas). 1 See threads https://lists.csail.mit.edu/pipermail/peg/2014-May/000612.html and https://lists.csail.mit.edu/pipermail/peg/2017-May/000719.html of the PEG mailing list. https://doi.org/10.1016/j.jvlc.2018.10.003 Received 11 October 2018; Accepted 15 October 2018 Available online 16 October 2018 1045-926X/ © 2018 Elsevier Ltd. All rights reserved. S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28 implementation. To address this issue, we proposed in a prior work a Prog ← PUBLIC CLASS NAME LCUR PUBLIC STATIC VOID MAIN conservative extension of PEGs, based on labeled failures, that adds a LPAR STRING LBRA RBRA NAME RPAR BlockStmt RCUR recovery mechanism to the PEG formalism itself [6]. The mechanism BlockStmt ← LCUR (Stmt)∗ RCUR attaches recovery expressions to labels so that throwing those labels not Stmt ← IfStmt / WhileStmt / PrintStmt / DecStmt / AssignStmt / only reports syntax errors but also skips the erroneous input until BlockStmt reaching a synchronization point and resuming parsing. IfStmt ← IF LPAR Exp RPAR Stmt (ELSE Stmt / ε) In this paper we present a parsing machine for PEGs that can im- WhileStmt ← WHILE LPAR Exp RPAR Stmt plement our error recovery approach. Each PEG is translated to a pro- DecStmt ← INT NAME (ASSIGN Exp / ε) SEMI gram that is executed by a virtual parsing machine. A formal semantics of such a parsing machine for regular PEGs already exists [7], as well as AssignStmt ← NAME ASSIGN Exp SEMI an extension of this semantics for labeled PEGs [8]. We extend this PrintStmt ← PRINTLN LPAR Exp RPAR SEMI semantics with farthest failure tracking and error recovery, and prove Exp ← RelExp (EQ RelExp)∗ the correctness of our translation of labeled PEGs with recovery to this RelExp ← AddExp (LT AddExp)∗ extended machine. AddExp ← MulExp ((PLUS / MINUS) MulExp)∗ We use an implementation of this extended parsing machine to MulExp ← AtomExp ((TIMES / DIV) AtomExp)∗ build robust parsers, which use different recovery strategies, for the Lua AtomExp ← LPAR Exp RPAR / NUMBER / NAME language. Then we compare the error recovery behavior of these par- sers with a Lua parser generated with ANTLR [9,10], a popular parsing Fig. 1. A PEG for a tiny subset of Java. tool based on a top-down approach, by evaluating the AST built by each one. This comparison is broader than the comparison done previously [6], since that it implements different recovery strategies based on la- when the input matches p, not consuming any input in either case; we beled failures, and is more precise too, because it compares ASTs di- call it the negative predicate or the lookahead predicate. rectly instead of comparing error messages. Fig. 1 shows a PEG for a tiny subset of Java, where lexical rules In this extended version of our previous work [6], we also present a (shown in uppercase) have been elided. While simple (this PEG is more detailed discussion about other error recovery approaches, and equivalent to an LL(1) CFG), this subset is already rich enough to show we provide links for our implementations. the problems of PEG error reporting; a more complex grammar for a The remainder of this paper is organized as follows: the next section larger language just compounds these problems. (Section 2) revisits the error handling problem in PEG parsers and in- Fig. 2 is an example of Java program with two syntax errors (a troduces the semantics of labeled PEGs that supports syntactic error missing semicolon at the end of line 7, and an extra semicolon at the recovery; Section 3 discusses error recovery strategies that PEG-based end of line 8). A predictive top-down parser will detect the first error RCUR } parsers can implement using our recovery mechanism; Section 4 gives when reading the ( ) token at the beginning of line 8, and will the semantics of the extended parsing machine for labeled PEGs with know and report to the user that it was expecting a semicolon. SEMI failure tracking and error recovery; Section 5 evaluates our error re- In the case of our PEG, it will still fail when trying to parse the covery approach by comparing PEG-based parsers for the Lua language rule, which should match a ‘;’, while the input has a closing curly with an ANTLR-generated parser; Section 6 discusses related work on bracket, but as a failure does not guarantee the presence of an error the error recovery for top-down parsers with backtracking; finally, parser cannot report this to the user. Failure during parsing of a PEG Section 7 gives some concluding remarks. usually just means that the PEG should backtrack and try a different alternative in an ordered choice, or end a repetition. For example, three 2. PEGs with error recovery failures will occur while trying to match the BlockStmt rule inside Prog against the n at the beginning of line 3, first against IF in the IfStmt WHILE In this section, we revisit the problem of error handling in PEGs, and rule, then against in the WhileStmt rule, and finally against PRINTLN show how labeled failures [4,5] combined with the farthest failure in the PrintStmt rule. heuristic [2] can improve the error messages of a PEG-based parser. After all the failing and backtracking, the PEG in our example will RCUR Then we show how labeled PEGs can be the basis of an error recovery ultimately fail in the rule of the initial BlockStmt, after consuming main mechanism for PEGs, and show an extension of previous semantics for only the first two statements of the body of . Failing to match the SEMI labeled PEGs that adds recovery expressions. in AssignStmt against the closing curly bracket in the input will make the PEG backtrack to the beginning of the statement to try the 2.1. PEGs and error reporting other alternatives in Stmt, which also fail. This marks the end of the repetition inside the BlockStmt that is parsing the body of the while statement. The whole BlockStmt will fail trying to match RCUR against A PEG G is a tuple (V, T, P, pS) where V is a finite set of non- n terminals, T is a finite set of terminals, P is a total function from non- the in the beginning of line 7, this ultimately makes the whole WhileStmt fail, which makes the PEG backtrack to the beginning of line terminals to parsing expressions and pS is the initial parsing expression. We describe the function P as a set of rules of the form A ← p, where A ∈ V and p is a parsing expression. A parsing expression, when applied 1 public class Example { to an input string, either fails or consumes a prefix of the input and 2 public static void main(String[] args) { returns the remaining suffix. The abstract syntax of parsing expressions 3 int n = 5; is given as follows, where a is a terminal, A is a non-terminal, and p, p1 4 int f = 1; and p2 are parsing expressions: 5 while(0 < n) { p= | a | A | p p | p / p | p * | ! p 6 f = f * n; 1 2 1 2 7 n = n - 1 Intuitively, ε successfully matches the empty string, not changing 8 }; the input; a matches and consumes itself or fails otherwise; A tries to 9 System.out.println(f); match the expression P(A); p1p2 tries to match p1 followed by p2; p1/p2 10 } tries to match p1; if p1 fails, then it tries to match p2; p* repeatedly 11 } matches p until p fails, that is, it consumes as much as it can from the input; the matching of !p succeeds if the input does not match p and fails Fig. 2. A Java program with a syntax error.

18 S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28

5. Now the process repeats with the BlockStmt that is parsing the body approaches, and still track the position, and set of expected lexical of main. rules, of the furthest simple failure. The parser can fall back on auto- In the end, the PEG will report that it failed and cannot proceed at matically generated error messages whenever parsing fails without the beginning of line 5, complaining that the while in the input does producing a more specific error label. not match the RCUR that it expects, which does not help the pro- grammer in finding and fixing the actual error. To circumvent this problem, Ford [2] suggested that the furthest 2.2. Error recovery position in the input where a failure has occurred should be used for reporting an error. A similar approach for top-down parsers with The labeled PEGs with farthest failure tracking we described in the backtracking was also suggested by Grune and Jacobs [11]. previous section make it easier to report the first syntax error found ina In our previous example, the use of the farthest failure approach PEG, and we will use them as the first step towards an error recovery reports an error at the beginning of line 8, the same as a predictive mechanism. parser would. We can even use a map of lexical rules to token names to Before giving the full formal definition of PEGs with error recovery, track expected tokens in the error position, and report that a semicolon let us return to the example program in Fig. 2 and its two syntax errors: was expected. a missing semicolon at the end of line 7, and an extra semicolon at the semia If the programmer fixes this error, the parser will then fail re- end of line 8. The labeled PEG of Fig. 3 throws the label when it peatedly at the extra semicolon at line 8, while trying to match the first finds the first error, and finishes parsing. term of all the alternatives of Stmt. This will end the repetition inside If every syntactic error is labeled, to recover from them we need to BlockStmt, and then another failure will happen when trying to match a do the following: first, catch the label right after it is thrown, before the RCUR token against the semicolon, finally aborting the parse. The parser parser aborts, then log this error, possibly skip part of the input, and can use the furthest failure information to report an error at the exact finally resume parsing. In our example, for the first error we justneedto position of the semicolon, and a list of expected tokens that includes IF, log it and continue as if the semicolon was found, and for the second WHILE, NAME, LCUR, PRINTLN, and RCUR. error we need to log the error, skip until finding the end of a block The great advantage of using the farthest failure is that the grammar (taking care with nested blocks on the way), and then resume. writer does not need to do anything to get a parser with better error To achieve this, we extend labeled PEGs with a list of recovered reporting, as the error messages can be generated automatically. errors and a map of labels to recovery expressions. These recovery ex- However, although this approach gives us error messages with a fine pressions are responsible for skipping tokens in the input until finding a approximation of the error location, these messages may not give a place where parsing can continue. Fig. 4 presents the semantics of labeled PEGs with error recovery as good clue about how to fix the error, and may contain a long listof PEG a set of inference rules for a function. The notation expected tokens [5]. PEG We can get more precise error messages at the cost of manually G[](,?,) p R xy y v E represents a successful match of the parsing annotating the PEG with labeled failures, a conservative extension of the expression p in the context of a PEG G against the subject xy with a map PEG formalism. A labeled PEG G is a tuple (V , T , P , L ,fail, pS ) where L R from labels to recovery expressions, consuming x and leaving the is a finite set of labels, fail L is a failure label, and the expressions in P suffix y. The term v? is information for tracking the location of the have been extended with the throw operator, represented by ⇑. The furthest failure, and denotes either a suffix v of the original input or nil. l parsing expression ⇑ , where l ∈ L, generates a failure with label l. E is a list of pairs of a label and a suffix of the original input, denoting 2 A label l fail thrown by ⇑ cannot be caught by an ordered choice , errors that were logged and recovered. For an unsuccessful match the so it indicates an actual error during parsing, while fail is caught by a first element of the resulting triple is l, f, or fail, denoting a label. choice and indicates that the parser should backtrack. The lookahead The auxiliary function smallest that appears on Fig. 4 compares operator ! captures any label and turns it into a success, while turning a two possible error positions, denoted by a suffix of the input string, or success into a fail label. nil if no failure has occurred, and returns the furthest: any suffix of We can map different labels to different error messages, and then the input is a further possible error position than nil and a shorter annotate our PEG with these labels. Fig. 3 annotates the PEG of Fig. 1 suffix is a further possible error position than a longer suffix. l (except for the Prog rule). The expression [p] is syntactic sugar for (p / Most of the rules are conservative extensions of the rules for labeled l ⇑ ). PEGs [5,8], where the recovery map R is simply passed along, and any The strategy we used to annotate the grammar was the following: on lists of recovered errors are concatenated. The exceptions are the rules the right-hand side of a production, we annotate every symbol (term- for the syntactic predicate and for throwing labels. inal or non-terminal) that should not fail, that is, making the PEG The syntactic predicate turns any failure label into a success, using backtrack on failure of that symbol would be useless, as the whole parse an empty recovered map to make sure that errors are not recovered would either fail or not consume the whole input in that case. For an LL inside the predicate. Failure tracking information is also thrown away. (1) grammar like the one in our example, that means all symbols in the In essence, any error that happens inside a syntactic predicate is ex- right-hand side of a production except the one in the very beginning of pected, and not considered a syntax error in the input. the production. We apply a similar rule when the right-hand side has a Rule throw.1 is related to error reporting, while rules throw.2 and choice or a repetition as a subexpression. throw.3 are where error recovery happens. R(l) denotes the recovery Using this labeled PEG in our program, the first syntax error now expression associated with the label l. When a label l is thrown we check fails directly with a semia label, which we can map to a “missing if R has a recovery expression associated with it. If it does not semicolon in assignment” message. If the programmer fixes this, the (throw.1), we just append the label and current position to E and second error will fail with a rcblk label, which we can map to a propagate the error upwards, so parsing finishes after reaching the first “missing end of block” message. syntactical error. Compared with the farthest failure approach, one drawback of la- If label l has a recovery expression R(l), we append the current error beled failures is the annotation burden. But we can combine both to the list L of errors and try to match the current input by using R(l). Rule throw.2 deals with the case where the matching of R(l) succeeds, and rule throw.3 deals with the case where R(l) fails. Intuitively, rule 2 This is a simplification of the original formalization of labeled failures, throw.2 applies when the parser recovers from an error and regular where a new labeled ordered choice expression could catch labels; labeled or- parsing is resumed after, while we use rule throw.3 when another error dered choice is not necessary when using labels for error reporting. (or a regular failure) happens during recovery, which may finish the

19 S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28

Prog ← PUBLIC CLASS NAME LCUR PUBLIC STATIC VOID MAIN LPAR STRING LBRA RBRA NAME RPAR BlockStmt RCUR BlockStmt ← LCUR (Stmt)∗ [RCUR]rcblk Stmt ← IfStmt / WhileStmt / PrintStmt / DecStmt / AssignStmt / BlockStmt IfStmt ← IF [LPAR]lpif [Exp]condi [RPAR]rpif [Stmt]then (ELSE [Stmt]else / ε) WhileStmt ← WHILE [LPAR]lpw [Exp]condw [RPAR]rpw [Stmt]body DecStmt ← INT [NAME]ndec (ASSIGN [Exp]edec / ε)[SEMI]semid AssignStmt ← NAME [ASSIGN]assign [Exp]rval [SEMI]semia PrintStmt ← PRINT [LPAR]lpp [Exp]eprint [RPAR]rpp [SEMI]semip Exp ← RelExp (EQ [RelExp]relexp)∗ RelExp ← AddExp (LT [AddExp]addexp)∗ AddExp ← MulExp ((PLUS / MINUS)[MulExp]mulexp)∗ MulExp ← AtomExp ((TIMES / DIV)[AtomExp]atomexp)∗ AtomExp ← LPAR [Exp]parexp [RPAR]rpe / NUMBER / NAME

Fig. 3. A PEG with labels for a small subset of Java. parsing or not (the parser can still recover from this second error). symbols, as they should be consumed by the following expression. In our example from Fig. 3, we can recover from a semia error (as The recovery expression above could be automatically computed well as semip and semid) by using ε as its recovery expression, i.e., an from FOLLOW(Exp) and associated with labels condi, condw, edec, expression that matches the empty string and thus always succeeds. rval, eprint, and parexp. Another option is to compute a specific This is similar to making semicolons optional in the grammar, but re- FOLLOW set for each use of Exp. For example, the FOLLOW set asso- cording that the semicolon was not found instead of just ignoring the ciated with Exp in rules DecStmt and AssignStmt contains only SEMI, issue. while the FOLLOW set associated with Exp in IfStmt, WhileStmt, For the rcblk error, we could extend the grammar with the fol- AtomExp, and PrintStmt contains only RPAR. lowing auxiliary rule and use non-terminal SkipToRCUR as the recovery Fig. 5 extends our annotated PEG for a subset of Java with recovery expression of rcblk: rules computed automatically. For each expression of the form [p]l, we SkipToRCUR (!RCUR (LCUR SkipToRCUR / .))* RCUR added a recovery rule of the form l ← (!FOLLOW(p) .)*. In Fig. 5, the symbol EOF represents the end of input. The recovery expression as- This rule skips all tokens until finding and consuming a ‘}’ (RCUR) sociated with labels then, else, body, semid, semia, and semip is token, or reaching the end of input, taking care to correctly account for the same. For the sake of brevity, Fig. 5 shows the recovery rule for nested blocks. One drawback of this recovery expression is that it will label then and omits the recovery rules for these other labels. In a make the parser ignore anything from the point of the error to the similar way, this figure shows only the recovery rule for label condi closing brace, including any errors in that part of the input. and omits the recovery rules for condw, eprint, and parexp. This use Although our formalization of labeled failures uses a map R to as- of the FOLLOW set (possibly enhanced by some usual synchronization sociate labeled failures to recovery expressions, we could also have symbols, such as ‘;’) provides a default error recovery strategy. adapted function P. In this alternative formalization, P would have also Let us consider that the Java program from Fig. 2 has an error on recovery rules of the form l ← p, where l ∈ L and p is a labeled parsing line 5, inside the condition of while loop, as follows: expression. Our default error recovery strategy will report this error and resume In the next section, we will discuss error recovery strategies for parsing correctly at the following right parenthesis. In the resulting PEGs, and how we can modify the grammar to improve recovery of AST, the node for the while loop will have an empty condition, so we rcblk errors. lose the node corresponding to the use of the n variable, and the in- formation that the condition was a < expression. 3. Error recovery strategies for PEGs Now let us consider the default error recovery strategy for label rcblk.A BlockStmt can be followed by a Statement or by RCURL, or it A parser with a good recovery mechanism is essential for use in an can be the last statement of a program. Therefore, as we can see in rcblk IDE, where we want an AST that captures as much information as Fig. 5, the default recovery expression for synchronizes with a else ‘}’ possible about the program even in the presence of syntax errors due to token that indicates the beginning of a Statement, an block, a , an unfinished program. or the end of input. rcblk We can improve the error recovery quality of a PEG parser by using Unfortunately, this recovery strategy is not good for , as our the FIRST and FOLLOW sets of parsing expressions when throwing la- example program from Fig. 2 shows. The recovery expression for rcblk bels or recovering from an error. A detailed discussion about FIRST and will consume the ‘;’ at the end of line 8, and then stop at the FOLLOW sets in the context of PEGs can be found in other papers beginning of the next statement on line 9. But the parser just closed the [12–14]. BlockStmt of the main function of the program, and now expects another In our grammar for a subset of Java, we can see that whenever rule ‘;’ to close the class body in Prog. This will lead to a spurious error when Exp is used it should be followed by either a right parenthesis or a the parser finds the beginning of the print statement, so acustom rcblk semicolon, so we could define (!(RPAR / SEMI) . )* as a recovery ex- SkipToRCUR recovery expression is a better way to deal with pression, based on the FOLLOW set of Exp. Differently from the rcblk errors. recovery expression, this one does not consume the synchronization While SkipToRCUR avoids spurious errors, it does have the potential

20 S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28

Empty (empty.1) PEG G[ε] R x (x, nil, [])

PEG PEG G[P (A)] R xy (y, v?,E) G[P (A)] R x (l, v?,E) Non-terminal (var.1) (var.2) PEG PEG G[A] R xy (y, v?,E) G[A] R x (l, v?,E)

b 6= a Terminal (term.1) (term.2) (term.3) PEG PEG PEG G[a] R ax (x, nil, []) G[b] R ax (fail, ax, []) G[a] R ε (fail, ε, [])

PEG PEG G[p ] R xyz (yz, v?,E ) G[p ] R yz (z, w?,E ) Sequence 1 1 2 2 (seq.1) PEG G[p1 p2] R xyz (z, smallest(v?, w?),E1 ++ E2) PEG PEG G[p ] R xyz (yz, v?,E ) G[p ] R yz (fail, w?,E ) 1 1 2 2 (seq.2) PEG G[p1 p2] R xyz (fail, smallest(v?, w?),E1 ++ E2) PEG PEG PEG G[p ] R xyz (yz, v?,E ) G[p ] R yz (l, z, E ) l 6= fail G[p ] R x (l, v?,E) 1 1 2 2 (seq.3) 1 (seq.4) PEG PEG G[p1 p2] R xyz (l, z, E1 ++ E2) G[p1 p2] R x (l, v?,E)

PEG PEG PEG G[p] R x (fail, v?,E) G[p] R xyz (yz, v?,E ) G[p∗] R yz (z, w?,E ) Repetition (rep.1) 1 2 (rep.2) PEG PEG G[p∗] R x (x, v?,E) G[p∗] R xyz (z, smallest(v?, w?),E1 ++ E2) PEG PEG PEG G[p] R xy (l, y, E) l 6= fail G[p] R xyz (yz, v?,E ) G[p∗] R yz (l, z, E ) l 6= fail (rep.3) 1 2 (rep.4) PEG PEG G[p∗] R xy (l, y, E) G[p∗] R xyz (l, smallest(v?, z),E1 ++ E2)

PEG PEG G[p] {} x (l, v?,E) G[p] {} xy (y, v?,E) Negative Predicate (not.1) (not.2) PEG PEG G[!p] R x (x, nil, []) G[!p] R xy (fail, nil, [])

PEG PEG G[p ] R xy (y, v?,E) G[p ] R xy (l, y, E) l 6= fail Ordered Choice 1 (ord.1) 1 (ord.2) PEG PEG G[p1 / p2] R xy (y, v?,E) G[p1 / p2] R xy (l, y, E) PEG PEG G[p ] R xy (fail, v?,E ) G[p ] R xy (fail, w?,E ) 1 1 2 2 (ord.3) PEG G[p1 / p2] R xy (fail, smallest(v?, w?),E1 ++ E2) PEG PEG G[p ] R xy (fail, v?,E ) G[p ] R xy (y, w?,E ) 1 1 2 2 (ord.4) PEG G[p1 / p2] R xy (y, smallest(v?, w?),E1 ++ E2) PEG PEG G[p ] R xy (fail, v?,E ) G[p ] R xy (l, y, E ) l 6= fail 1 1 2 2 (ord.5) PEG G[p1 / p2] R xy (l, y, E1 ++ E2)

PEG l∈ / Dom(R) G[R(l)] R xy (y, v?,E) Recovery (throw.1) (throw.2) PEG PEG G[⇑l] R x (l, x, []) G[⇑l] R xy (y, v?, (l, xy) :: L) PEG G[R(l)] R xy (f, v?,E) (throw.3) PEG G[⇑l] R xy (f, v?, (l, xy) :: L)

Fig. 4. Semantics of PEGs with labels, recovery and farthest failure tracking. to skip a large portion of the input, leading to a poor AST. We can instead of blindly skipping tokens until finding the closing ‘)’, try to improve this by noticing that Stmt inside the repetition of BlockStmt see if we have a partial relational expression before giving up with the is not allowed to fail unless the next token is RCUR, so we can replace following recovery expression for condw: Stmt with !RCUR [Stmt ]stmtb. Now the second error in our program will relexp addexp make parsing fail with a stmtb label. The recovery expression of this !!EQ(EQ[RelExp ] )*/!!LT(LT[AddExp ] )*/(!RPAR.)* label can synchronize with the beginning of the next statement, or ‘}’. In our example, this will skip the erroneous ‘;’ at the end of line 8 and The double negation is an and syntactic predicate, and is a way of then continue parsing the rest of the block. guarding an expression so it will only be tried if its beginning matches Finally, we have the full power of PEGs inside recovery expressions, the guard. and can use it for more elaborate recovery strategies. Going back to the In the next section, we discuss how to automatically annotate a PEG error in the condition of a while loop earlier in this section, we can, with labels, making it possible to automate some of the error recovery strategies that we discussed in this section.

21 S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28

Prog ← PUBLIC CLASS NAME LCUR PUBLIC STATIC VOID MAIN LPAR STRING LBRA RBRA NAME RPAR BlockStmt RCUR BlockStmt ← LCUR (Stmt)∗ [RCUR]rcblk Stmt ← IfStmt / WhileStmt / PrintStmt / DecStmt / AssignStmt / BlockStmt IfStmt ← IF [LPAR]lpif [Exp]condi [RPAR]rpif [Stmt]then (ELSE [Stmt]else / ε) WhileStmt ← WHILE [LPAR]lpw [Exp]condw [RPAR]rpw [Stmt]body DecStmt ← INT [NAME]ndec (ASSIGN [Exp]edec / ε)[SEMI]semid AssignStmt ← NAME [ASSIGN]assign [Exp]rval [SEMI]semia PrintStmt ← PRINT [LPAR]lpp [Exp]eprint [RPAR]rpp [SEMI]semip Exp ← RelExp (EQ [RelExp]relexp)∗ RelExp ← AddExp (LT [AddExp]addexp)∗ AddExp ← MulExp ((PLUS / MINUS)[MulExp]mulexp)∗ MulExp ← AtomExp ((TIMES / DIV)[AtomExp]atomexp)∗ AtomExp ← LPAR [Exp]parexp [RPAR]rpe / NUMBER / NAME rcblk ← (!(RCUR / LCUR / WHILE / INT / IF / ELSE / NAME / PRINT / EOF) .)∗ lpif ← (!(NAME / NUMBER / LPAR) .)∗ condi ← (!RPAR .)∗ rpif ← (!(LCUR / WHILE / INT / IF / NAME / PRINT) .)∗ then ← (!(RPAR / LPAR / WHILE / INT / IF / ELSE / NAME / PRINT) .)∗ lpw ← (!(NAME / NUMBER / LPAR) .)∗ rpw ← (!(LCUR / WHILE / INT / IF / NAME / PRINT) .)∗ ndec ← (!(ASSIGN / SEMI) .)∗ edec ← (!SEMI .)∗ assign ← (!(NAME / NUMBER / LPAR) .)∗ rval ← (!SEMI .)∗ lpp ← (!(NAME / NUMBER / LPAR) .)∗ rpp ← (!SEMI .)∗ relexp ← (!(SEMI / RPAR) .)∗ addexp ← (!(EQ / SEMI / RPAR) .)∗ mulexp ← (!(LT / EQ / SEMI / RPAR) .)∗ atomexp ← (!(PLUS / MINUS / LT / EQ / SEMI / RPAR) .)∗ rpe ← (!(TIMES / DIV / PLUS / MINUS / LT / EQ / SEMI / RPAR) .)∗

Fig. 5. A PEG with labels and recovery rules for a small subset of Java.

4. A parsing machine for labeled PEGs next instruction to execute, and one register holds the current subject (an input suffix). The stack holds call and backtrack frames. Acall This section discusses how we can extend an implementation of frame is just a return address for the program counter, and a backtrack PEGs based on a parsing machine to provide error reporting and recovery frame is an address for the program counter and a subject to backtrack facilities. Parsing machines are a way to efficiently implement PEG- to. The semantics of each instruction is given as its effect on the reg- based parsers that can be constructed dynamically at runtime [7,15]. In isters and on the stack. parsing machine-based implementations each parsing expression is Formally, the program counter register, the subject register, and the compiled to instructions of a virtual parsing machine; the program for a stack form a machine state. We represent it as a tuple ×T* × Stack.A compound expression is a combination of the programs for its sub- machine state can also be a failure state, represented by Fail⟨e⟩, expressions. The result program can be interpreted using the same where e is the stack. Stacks are lists of (T× *), where × T* techniques that virtual machines for programming languages use, and represents a backtrack frame and represents a call frame. even Just-In-Time compiled to machine code. The performance of a The behavior of each instruction is straightforward: Call pushes a parsing machine interpreter is comparable to the performance of par- call frame with the address of the next instruction and jumps to an sers generated by other approaches [7,15,16]. instruction with a given label3, Return pops a call frame and jumps to In its abstract definition [7], the parsing machine has two registers and a stack. One register holds the program counter, used to address the 3 When discussing the parsing machine we will also use the term label to refer

22 S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28 its address, Jump is an unconditional jump, Char tries to match a from labels to the (absolute) program counter addresses of recovery character with the first character of the subject, consuming it ifsuc- programs, where each program terminates in a Return instruction. We cessful and entering a failure state otherwise. When the machine enters also add a list E of recovered errors, as we had in the parsing expres- a failure state it pops call frames from the stack until reaching a sions. The failure state also records both R and E. backtrack frame, then pops this frame and resumes execution with the As the machine program associated with parsing expression !p will subject and address stored in it. Any consumes the first character of the use an empty map for R and an empty list E of recovered errors, it needs subject (failing if the subject is ε), Choice pushes a backtrack frame to push on the stack the current values of R and E in a new kind of with the current subject and the address associated with a given label, backtrack frame. The machine’s failure state halts the unwinding of the Commit discards the backtrack frame in the top of the stack and jumps machine stack when it finds this kind of frame, thus correctly im- to a given label, and Fail unconditionally enters a failure state. plementing rule not.2 of the semantics of PEGs with labels and re- As an example, the following PEG (that matches a sequence of zero covery. or more as, followed by any character other than a, followed by a b) Fig. 7 presents the new semantics of instructions Throw and uses all basic instructions of the parsing machine, where ‘.’ is a parsing Commit, as well as a new PredChoice instruction that pushes the new expression that matches any character: kind of backtrack frame for syntactic predicates. In the program that we gave earlier in this section, the Choice A!./ a B aA PredChoice B b instruction on the first line would become a . The se- mantics for Commit is extended to take into account this new kind of This PEG compiles to the following program: backtrack frame, by restoring the original values of R and E when leaving the extent of a syntactic predicate. A: ChoiceA 1 The semantics of the Throw l instruction also change, to either fail ChoiceA 2 Char a with the label l or try to recover, depending on whether the recovery CommitA 3 map R has an entry for l. In case the machine tries to recover it also logs A3: Fail the error in the E list. Notice that the address of the instruction that A2: Any follows Throw is pushed on the frame stack, so parsing can continue if Call B the recovery program finishes successfully by reaching a Return in- CommitA 4 struction. A1: Char a When the machine is in a failure state for any label, and there is Jump A backtrack frame on top of the stack that was pushed by a previous A4: Return PredChoice instruction, it uses all the information available in the B: Char b backtrack frame to construct a new machine state. Return The formal definition of the parsing machine defines the translation We want to extend the parsing machine to support error reporting of a PEG into a program of the parsing machine with a translation and recovery. Let us start by adding error reporting features. The ma- function Π, where Π(G, i, p) is the translation of parsing expression p in chine state becomes ×T* ×Stack × (T* {nil}), where the extra the context of the PEG G, with i being the position where the program (T* {nil}) term keeps the information about the farthest failure. We starts relative to the start of the translated PEG. We use the notation then update instructions Char and Any to track the farthest failure |Π(G, i, p)| to mean the number of instructions in the program Π(G, i, p). position. The only change to Π that we need is for the translation of syntactic PredChoice As we also want to keep information about the label associated with predicates, as it has to use the new instruction. When an error we will represent a failure state as Fail〈l, v?, e〉, where l is a extending Π to translate labeled PEGs we do not need to take in account label, v? represents (possibly nil) suffix of the subject associated with the recovery expressions, as we assume that the grammar G has been the farthest failure, and e is the stack. We also rename the Fail in- extended with non-terminals l for each label l with a recovery ex- P() = p struction to Throw, where Throw takes a label l as an operand. With the pression pl, where l l. It is straightforward to build the recovery exception of the Throw instruction, all instructions that can fail will map R from the offsets of the program fragments corresponding tothe fail non-terminals. produce a failure state where l is . l Fig. 6 presents the operational semantics of the parsing machine as a The common pattern p / ⇑ , which transforms a plain failure into an

relation between machine states. The program that the machine error label, becomes the following program, where represents the Instruction program associated with p: executes is implicit. The relation relates two states when pc in the first state addresses a instruction matching the name above the→, ChoiceL 1 and the guard (if present) is valid.

When the Char instruction fails, in case the subject is not ε it uses CommitL 2 L1: Throw l smallest to update the suffix associated with the farthest failure po- L2: sition, otherwise, the ε itself becomes the suffix associated with the farther failure position. The Any instruction also updates the suffix The use of labels may have an impact on the performance of the associated with the farthest failure position accordingly. resulting program for the parsing machine, because some optimizations The Throw instruction sets the current position as the position of used before the introduction of labels may not be valid anymore. We the farthest failure. Although the current position may not represent the will not discuss this topic here, as our previous paper on a parsing farthest failure, it indicates the position where label f was thrown. It is machine for labeled PEGs (without recovery) already has a detailed easy to see that right now the machine can only recover from a failure discussion [8]. state if its label is fail and it finds a backtrack frame on the top ofthe stack. 4.1. Correctness of the translation To add error recovery, we extend the machine state with a map R The following lemma gives the correctness condition for the trans- (footnote continued) formation Π with regards to the semantics of PEGs with labeled failures to a name associated with the position of an instruction in a program. Context but without recovery, by proving that a program for the parsing ma- should be enough to disambiguate the two uses of label. chine obtained using the translation function Π always produces the

23 S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28

Char a hpc, ax, e, v?i −−−−−→ hpc + 1, x, e, v?i Char a hpc, bx, e, v?i −−−−−→ Failhfail, smallest(bx, v?), ei, a 6= b Char a hpc, ε, e, v?i −−−−−→ Failhfail, ε, ei Any hpc, ax, e, v?i −−−−−→ hpc + 1, x, e, v?i Any hpc, ε, e, v?i −−−−−→ Failhfail, ε, ei Choice i hpc, x, e, v?i −−−−−→ hpc + 1, x, (pc + i, x):e, v?i Jump i hpc, x, e, v?i −−−−−→ hpc + i, x, e, v?i Call i hpc, x, e, v?i −−−−−→ hpc + i, x, (pc + 1):e, v?i Return hpc1, x, pc2 :e, v?i −−−−−→ hpc2, x, e, v?i Commit i hpc1, y, (pc2, xy):e, v?i −−−−−→ hpc1 + i, y, e, v?i Throw l hpc, x, e, v?i −−−−−→ Failhl, x, ei Failhl, v?, pc:ei −−−−−→ Failhl, v?, ei Failhl, v?, (pc, x):ei −−−−−→ Failhl, v?, ei, where l =6 fail Failhfail, v?, (pc, x):ei −−−−−→ hpc, x, e, v?i

Fig. 6. Operational semantics of the parsing machine with labels.

PredChoice i hpc, x, e, v?,R,Ei −−−−−−−→ hpc + 1, x, (pc + i, x, R, E):e, v?, {}, {}i Fig. 7. Operational semantics of the parsing machine with labels and re- Commit i hpc2, y, (pc1, xy, R1,E1):e, v?,R2,E2i −−−−−→ hpc2 + i, y, e, v?,R1,E1i covery. Throw l hpc, x, e, v?,R,Ei −−−−−→ Failhl, x, e, R, Ei, where l∈ / Dom(R) Throw l hpc, x, e, v?,R,Ei −−−−−→ hR(l), x, (pc + 1):e, s, R, (x, l):Ei, where l ∈ Dom(R)

Failhl, v?, (pc, x, R1,E1):e, R2,E2i −−−−−→ hpc, x, e, v?,R1,E1i

same result as the parsing expression that produced it: induction hypothesis on the antecedent of rules not.1 and not.2, while the behavior of Commit for backtrack frames extended with R Lemma 4.1 (Correctness of . Πwith labeled failures) Given a PEG with and E gives us the conclusion for rule not.1, and the behavior of failure labels G, a labeled parsing expression p, and a subject xy, we have that PEG (,,)G i p states with these extended backtrack frames gives us the conclusion for G[] p xy y implies that pcxye,, pc+ | ( Gip , , )|, ye , , l PEG rule not.2; for ⇑ , the conclusion rule throw.1 follows directly from the and we have that G[](,) p xy l y implies that definition of Throw l, while throw.2 and throw.3 follow from the (,,)G i p pc,, xy e Fail l,,, y e where pc is the address of the first induction hypothesis on .□ instruction of Π(G, i, p).

Proof. Given in the paper that describes the parsing machine for 5. Evaluation labeled PEGs [8]. □ In order to establish the correctness of our extended machine, we In this section, we evaluate our syntax error recovery approach for need to prove the following equivalent lemma, which takes furthest PEGs using a complete parser for an existing programming language in failure tracking and recovery expressions into account: two different contexts, first in isolation and then by comparison witha parser generated by ANTLR, a mature parser generator that uses pre- Lemma 4.2 (Correctness of . Πwith labeled failures) Given a PEG with dictive parsing. labels G, a map RG where each label l that has a recovery expression is associated with a non-terminal l that has P ()l as the recovery expression, 5.1. Error recovery in a Lua parser the corresponding map R from the same labels to the program counter address of each non-terminal l, a labeled parsing expression p, and a PEG It seems there is not a consensus about how to evaluate an error subject xy, we have that G[] p RG xy(,?,) y v E implies that (,,)G i p recovery strategy. Ripley and Druseiks [17] collected a set of syntactic pc, xy , e , nil, R , [] pc+ | (,,)|,,,?, G i p y e v R , E , and we invalid Pascal programs that was used to evaluate some error recovery PEG have that G[] p RG xy(,?,) l v E implies that strategies [18–21]. However, as far as we know, this set of programs is (,,)G i p pc, xy , e , nil, R , [] Fail l,?,,,, v e R E where pc is the not publicly available. address of the first instruction of Π(G, i, p). Another issue related to the evaluation of an error recovery strategy is how to measure its quality. Pennelo and DeRemmer [18] proposed a Proof. By induction on the height of the proof trees for the antecedents. criteria based on the similarity of the program got after recovery with Most cases are similar to the ones in the proofs of the previous lemma the intended program (without syntax errors). This quality measure was and similar lemma for the original parsing machine [7]. The ones that used to evaluate several strategies [21–23], although it is arguably deviate are the cases where the expression is a syntactic predicate !p subjective [23]. l and for ⇑ , but those are still straightforward and have been elided for We will evaluate our strategy following Pennelo and DeRemmer’s brevity: for the syntactic predicate PredChoice lets us use the approach, however instead of comparing programming texts we will

24 S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28 compare the AST from an erroneous program after recovery with the recovery mechanism. In a general way, each program should cause the AST of what would be an equivalent correct program. For the leaves of throwing of a specific label, to test whether the associated recovery our AST we do not require their contents to be the same, just the expression recovers well. We usually wrote more than one erroneous general type of the node, so we are comparing just the structure of the Lua program to test each label. ASTs. Table 1 shows for how many programs the recovery strategies we Based on this strategy, a recovery is excellent when it gives us an AST implemented were considered excellent, good, poor, or failed. As we can equal to the intended one. A good recovery gives us a reasonable AST, see, the use of labels plus the recovery operator enabled us to imple- i.e., one that captures most information of the original program, does ment PEG parsers for the Lua language with a robust recovery me- not report spurious errors, and does not miss other errors. A poor re- chanism. In our evaluation approach, for all three parsers more than covery, by its turn, produces an AST that loses too much information, 90% of the recovery done was considered acceptable, i.e., it was rated results in spurious errors, or misses errors. Finally, a recovery is rated as at least good. In Section 5.2, we will discuss the row named . failed whenever it fails to produce an AST at all. Our parsers were always able to build an AST, given that no re- To illustrate how we rated a recovery, let us consider the following covery expression raised an unrecoverable error, or entered a loop. syntactically invalid program, where the range start of the for loop These properties can be conservatively checked, as indicated by Ford was not given at line 2: [1]. A recovery would be excellent in case the AST has all the information We have also compared the number of error messages generated by associated with this program, where the AST may have a dummy ex- each error recovery approach for the 180 syntactically invalid Lua pression node to represent the range start. A recovery would be good in programs. Table 2 shows that the Lua parsers based on a FOLLOW-set case the resulting AST misses only the loop range end and no spurious error recovery usually report more errors than the other Lua parser. In error is reported. By its turn, a recovery would by rated as poor either in the next section we discuss the row named antlr. case the resulting AST misses the statements inside the for (lines 3 and A possible explanation for this resides in the fact that our custom 4), or it produces spurious error messages. Lastly, we would rate a re- recovery expressions use a few tokens to synchronize the parser, while covery as failed in case it would have not even produced an AST for the corresponding FOLLOW set usually has more tokens. The use of a the previous program. richer set of tokens seems to slightly improve synchronization in gen- To evaluate our error recovery strategy, we adapted an existing PEG eral, at the cost of sometimes reporting more spurious error messages. parser for the Lua programming language [24] using the LPegLabel tool For most programs, all approaches report the same amount of error [25]. Our parser was based on the parser available at https://github. messages. com/andremm/lua-parser , which targets the syntax defined in the Lua 4 5.3 reference manual , and builds the AST associated with a given 5.2. Comparison with ANTLR program. The Lua grammar of our parser based on LPegLabel has 75 different ANTLR [27,28] is a popular tool for generating top-down parsers. labels, which were added manually, and 80 expressions involving the ⇑ The repository of ANTLR at GitHub contains the implementation of operator (most labels are thrown only once). When implementing error several parsers, including a parser for Lua 5.35. We used ANTLR 4 for recovery for our Lua parser we used three different error recovery our comparison. strategies, one that required manual intervention and two others based Unlike LPegLabel, ANTLR automatically generates from a grammar on FOLLOW-set error recovery [26], which could be implemented au- description a parser with error reporting and recovery mechanisms, so tomatically. The difference between these implementations resides only the user does not need to annotate the grammar. The error recovery on the recovery expression each one associates to a label, once that all mechanism of ANTLR is based on early ideas of Niklaus Wirth [29], of them are based on the same Lua grammar annotated with 75 dif- though it has its own distinctive features. ferent labels. The implementation of our error recovering Lua parser is In general, after an error an ANTLR parser attempts to re- available at https://github.com/sqmedeiros/lua-parser/releases/tag/ synchronize by trying single token insertion and deletion. In case the comlan. remaining input can not be matched by any production of the current Initially, we defined a small set of default recovery expressions, non-terminal, the parser consumes the input “until it finds a token that based on what would be good recovery tokens for the Lua grammar, could reasonably follow the current non-terminal” [10]. ANTLR allows to and we associated one of these expressions with each label of our modify its default error recovery approach, and also allows to define a grammar. Then, while testing our recovery strategy we wrote some recovery strategy for a particular grammar rule, although this does not custom recovery expressions in order to avoid spurious error messages seem usual in ANTLR 4. A more detailed description of ANTLR 4 re- or to build a better AST. We will refer to this version of the Lua parser as covery mechanism is given in [28]. In our evaluation we used the de- custom. This version uses around 25 different recovery expressions. fault ANTLR error recovery strategy. After we implemented recovery strategies based on the use of Based on the available ANTLR parser for Lua, we implemented a FOLLOW sets, by using these sets to build the recovery expressions visitor pattern in order to get the AST associated with a given Lua associated to each label. These recovery expressions try to synchronize program. This implementation is available at https://github.com/ by using the tokens in the FOLLOW set of the current non-terminal the sqmedeiros/antlrlua. parser was matching when the label was thrown. We name as global the In Table 1, we can see that for our set of 180 syntactic invalid Lua version of the Lua parser that builds recovery expressions based on programs the error recovery done by the ANTLR parser was acceptable global FOLLOW sets, and as local to the version based on local FOLLOW for around 76% of them, while for one quarter of these files the re- sets. A non-terminal A has only one global FOLLOW set associated with covery was rated as poor. it, which takes into consideration all occurrences of A in the right-hand An analysis of the test files where the Lua parsers based on side of the grammar productions. On the other hand, a non-terminal A LPegLabel built a richer AST seems to indicate that the ANTLR parser may have several local FOLLOW sets, where each set only has the does not recover well when there is a missing token or expression in a symbols that can follow A in the right-hand side of a specific production for or if statement. For example, in the following program the ex- where A occurs. pression after the comma is missing: We used 180 syntactically invalid Lua programs to test our error For this example the LPegLabel parser builds an AST with a dummy

4 https://www.lua.org/manual/5.3/ 5 https://github.com/antlr/grammars-v4/blob/master/lua/Lua.g4

25 S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28

Table 1 rest of the if statement. Evaluation of recovery strategies applied to parsers of Lua. We also compared the performance of the Lua parser generated by Parser Excellent Good Poor Failed ANTLR with the performance of our PEG-based parser. We used the following tools in our comparison: Custom 100 ( ≈ 56%) 63 ( ≈ 35%) 17 ( ≈ 9%) 0 Global 104 ( ≈ 58%) 60 ( ≈ 33%) 16 ( ≈ 9%) 0 ANTLR 4.6 and 4.7, with Java OpenJDK 9 Local 106 ( ≈ 59%) 58 ( ≈ 32%) 16 ( ≈ 9%) 0 • LPegLabel 1.4, with Lua 5.3 interpreter Antlr 75 ( ≈ 42%) 59 ( ≈ 33%) 46 ( ≈ 26%) 0 • The test machine was an Intel i7-4790 CPU with 16G RAM, running Table 2 Ubuntu 16.04 LTS desktop. Number of error messages reported by each parser. We made two tests. In the first test, we created an invalid Lua program broke.lua that was formed by concatenating almost all the Parser Number of error messages 180 erroneous programs that we have used before. This file has around Custom 208 550 lines, and both parsers report more than 200 syntax errors while Global 277 parsing it. This file was used to measure the performance of parsers ina Local 252 syntactic invalid program. The Lua parser generated by ANTLR 4.7 Antlr 242 crashed when parsing this file, so for this comparison we used aLua parser generated by ANTLR 4.6. expression node. By its turn, as ANTLR first tries token deletion, its We also used both parsers to parse the test files from the Lua 5.3.4 6 parser discards token do and matches print(i) as the missing ex- distribution . The test comprises 28 syntactic valid Lua programs, pression, then ANTLR parser reports a second error when it sees end, which toghether have more than 12k source lines. We needed to change main.lua since it was expecting the do keyword that should follow the second the first line of test file , because the Lua parser generated by expression. ANTLR could not recognize it. The default ANTLR recovery strategy also did not produce a good We ran both parsers 20 times and collected the time reported by System.nanoTime os.time AST for most of the tests related to an invalid statement. The ANTLR for ANTLR and by for Lua. For ANTLR, @init @after Lua grammar defines a block as stat* retstat?, i.e., zero or more state- we measured the time by using and actions in the start ments followed by an optional return statement. Let us consider the rule of the grammar. In the case of LPegLabel, we measured the time invalid Lua program below: before and after calling the main function of the parser. Tables 3 and 4 According to Lua’s syntax, the first line should be local i = 0. The show our results. We can see that the PEG-based parser was sig- ANTLR parser correctly reports an error when it sees the :, but discards nificantly faster (by approximately a factor of six) than the ANTLR the remaining of the input when building the AST, it does not match parser in both tests. any valid statements after the error. The ANTLR parsers builds a richer AST when the previous error 6. Related work happens inside the block of another statement, such as below: In this case, the AST has information related to the statement local In this section, we discuss some error reporting and recovery ap- x = 42, but the block related to the do statement ends when the parser proaches described in the literature or implemented by parser gen- sees the :, what causes an spurious error when trying to match end. erators. Grune and Jacobs [11] also present an overview of some error In Table 2 we saw that ANTLR parser reports a number of error handling approaches. messages similar to the LPegLabel parsers that use a FOLLOW set to Swierstra and Duponcheel [30] show an implementation of parser synchronize with the input. This is according to ANTLR default re- combinators for error recovery, but is restricted to LL(1) grammars. The covery strategy, which uses a kind of enriched local FOLLOW set. recovery strategy is based on a noskip set, computed by taking the FIRST Moreover, after a syntax error, an ANTLR parser only reports the next set of every symbol in the tails of the pending rules in the parser stack. one when it consumes at least one token. The parsers based on LPe- Associated with each token in this set is a sequence of symbols (in- gLabel do not have this behavior, but we could implement it through a cluding non-terminals) that would have to be inserted to reach that separate post-processing step, for example. point in the parse, taken from the tails of the pending rules. Tokens are Since the synchronization strategy of the PEG-based parser is then skipped until reaching a token in this set, and the parser then takes manually designed by the user, it is expected that it would synchronize actions as if it found the sequence of inserted symbols for this token. better after an error. Nevertheless, this is still seems an evidence that Our approach cannot simulate this recovery strategy, as it relies on our approach based on labels and recovery expressions is effective, as the path that the parser dynamically took to reach the point of the having the full power of parsing expression grammars available when error, while our recovery expressions are statically determined from the writing recovery expressions makes it easy to tailor the recovery stra- label. But while their strategy is more resistant to the introduction of tegies for each kind of error, and also to associate a proper error mes- spurious errors than just using the FOLLOW set it still can introduce sage to an error. those. For example, let us consider the following Lua program where the A popular error reporting approach applied for bottom-up parsing is user did not type the condition between an if and the corresponding based on associating an error message to a parse state and a lookahead then: token [31]. To determine the error associated to a parse state, it is The ANTLR parser gives us the following error messages: necessary first to manually provide a sequence of tokens that leadthe The first message correctly indicates the error position, but doesnot parser to that failure state. We can simulate this technique with the use help much to fix the error, as the programmer has to infer thatthe of labels. By using labels we do not need to provide a sample invalid fifteen tokens that the error message lists are tokens that begin ex- program for each label, but we need to annotate the grammar properly. pressions. The second error message is spurious, a side effect of the The error recovery approach for predictive top-down parsers pro- parser skipping then and using print(”that”) as the condition. posed by Wirth [29] was a major influence for several tools, such as Our PEG-based parser reports a single error with the error message ANTLR. In Wirth’s approach, when there is an error during the “syntax error, expected a condition after ’if’” at column 4, which seems more helpful to the programmer, and correctly parses the 6 https://www.lua.org/tests/lua-5.3.4-tests.tar.gz

26 S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28

Table 3 hoc, implementation-specific way. Time (in ms) to parse broke.lua. Several PEG implementations such as Parboiled8, Tatsu9, and 10 Avg Median SD PEGTL provide features that facilitate error recovery. The previous version of Parboiled used an error recovery strategy LPegLabel 14 15 0.4 based on ANTLR’s one, and requires parsing the input two or three ANTLR 89 85 11 times in case of an error. Similar to ANTLR, the strategy used by Parboiled was fully automated, and required neither manual interven- tion nor annotations in the grammar. Unlike ANTLR, it was not possible to modify the default error strategy. The current version of Parboiled11 Table 4 does not has an error recovery mechanism. Time (in ms) to parse Lua 5.3 tests. Tatsu uses the fallback alternative technique for error recovery, Avg Median SD with the addition of a skip expression, which is a syntactic sugar for defining a pattern that consumes the input until the skip expression LPegLabel 94 94 1 ANTLR 647 624 76 succeeds. PEGTL allows to define for each rule R a set of terminator tokens T, so when the matching of R fails, the input is consumed until a token t ∈ T is matched. This is also similar to our approach for recovery ex- matching of a non-terminal A, we try to synchronize by using the pressions, but with coarser granularity, and lesser control on what can symbols that can follow A plus the symbols that can follow any non- be done after an error. terminal B that we are currently trying to match (the procedure asso- Rüfenacht [3] proposes a local error handling strategy for PEGs. ciated with B is on the stack). Moreover, the tokens which indicate the This strategy uses the farthest failure position and a record of the parser beginning of a structured element (e.g., while, if) or the beginning of state to identify an error. Based on the information about an error, an a declaration (e.g., var, function) are used to synchronize with the appropriate recovey set is used. This set is formed by parsing expres- input. sions that match the input at or after the error location, and it is used to Our approach can simulate this recovery strategy just partially, determine how to repair the input. because similarly to [30] it relies on information that will be available The approach proposed by Rüfenacht is also similar to the use of a only during the parsing. We can define a recovery expression for a non- recovery expression after an error, but more limited in the kind of re- terminal A according to Wirth’s idea, however, as we do not know covery that it can do. When testing his approach in the context of a statically how will be the stack when trying to match A, the recovery JSON grammar, Rüfenacht noticed long running test cases and men- expression of A would use the FOLLOW sets of all non-terminals whose tions the need to improve memory use and other performance issues. right-hand side have A, and could possibly be on the stack. Coco/R [32] is a tool that generates predictive LL(k) parsers. As the parsers based on Coco/R do not backtrack, an error is signaled when- 7. Conclusions ever a failure occurs. In case of PEGs, as a failure may not indicate an error, but the need to backtrack, in our approach we need to annotate a We have presented a conservative extension of PEGs that is well- grammar with labels, a task we tried to make more automatic. suited for implementing parsers with a robust mechanism for re- In Coco/R, in case of an error the parser reports it and continues covering from syntax errors in the input. Our extension is based on the until reaching a synchronization point, which can be specified in the use of labels to signal syntax errors, and differentiates them from reg- grammar by the user through the use of a keyword SYNC. Usually, the ular failures, together with the use of recovery expressions associated beginning of a statement or a semicolon are good synchronization with those labels. points. When signaling an error with a label that has an associated recovery Another complementary mechanism used by Coco/R for error re- expression, the parser logs the label and the error position, then pro- covery is weak tokens, which can be defined by a user though the WEAK ceeds with the recovery expression. This recovery expression is a reg- keyword. A weak token is one that is often mistyped or missing, as a ular parsing expression, with access to all the parsing rules that the comma in a parameter list, which is frequently mistyped as a semicolon. grammar provides. We showed how to use the information provided by When the parser fails to recognize a weak token, it tries to resume FIRST and FOLLOW sets when building a recovery expression. parsing based also on tokens that can follow the weak one. We also gave an extension to the formal semantics of a parsing Labeled failures plus recovery expressions can simulate the SYNC machine for labeled PEGs that adds both furthest failure tracking and and WEAK keywords of Coco/R. Each use of SYNC keyword would error recovery to the parsing machine. Parsing-machine based seman- correspond to a recovery expression that advances the input to that tics form the core of several efficient PEG implementations. We proved point, and this recovery expression would be used for all labels in the that the extended parsing machine is correct with regards to our ex- parsing extent of this synchronization point. A WEAK can have a re- tended definition of PEGs. covery expression that tries also to synchronize on its FOLLOW set. We extended one of these virtual machine-based implementations Coco/R avoids spurious error messages during synchronization by with our error recovery mechanism, and tested several parsers for the only reporting an error if at least two tokens have been recognized Lua programming language on a suite of 180 programs with syntax correctly since the last error. This is easily done in labeled PEG parsers errors to assess how close the syntax trees produced by the parsers with through a separate post-processing step. error recovery are to the trees we get from manually fixing the syntax A common way to implement error recovery in PEG parsers is to add errors present in the programs. We concluded that error recovery gives an alternative to a failing expression, where this new alternative works at least good results for 91% of our test programs, and excellent results as a fallback. Semantic actions are used for logging the error. This for 56% of them. strategy is mentioned in the manual of Mouse [33] and also by users of We also compared one of our parsers with a Lua parser with LPeg7. These fallback expressions with semantic actions for error log- ging are similar to our labels and recovery expressions, but in an ad- 8 https://github.com/sirthias/parboiled/wiki 9 https://tatsu.readthedocs.io 10 https://github.com/taocpp/PEGTL 7 See http://lua-users.org/lists/lua-l/2008-09/msg00424.html 11 https://github.com/sirthias/parboiled2/

27 S. Queiroz de Medeiros, F. Mascarenhas Journal of Visual Languages and Computing 49 (2018) 17–28 automatic error recovery generated by ANTLR, a popular parser gen- [12] R.R. Redziejowski, Applying classical concepts to parsing expression grammar, erator tool. The comparison shows that our PEG-based parser has better Fundam. Inform. 93 (2009) 325–336. [13] R.R. Redziejowski, More about converting BNF to PEG, Fundam. Inform. 133 error recovery, error messages, and better performance than the (2014) 257–270, https://doi.org/10.3233/FI-2014-1075. ANTLR-generated one. [14] F. Mascarenhas, S. Medeiros, R. Ierusalimschy, On the relation between context-free Labeled PEGs with recovery expressions give the grammar writer grammars and parsing expression grammars, Sci. Comput. Program. 89 (2014) 235–250, https://doi.org/10.1016/j.scico.2014.01.012. URL http://www. great control over the error recovery strategy, at the cost of an anno- sciencedirect.com/science/article/pii/S0167642314000276 . tation burden that we judge to not be too onerous. This annotation [15] R. Ierusalimschy, A text pattern-matching tool based on parsing expression gram- burden can be greatly reduced when the grammar has mostly LL(1) mars, Software 39 (3) (2009) 221–258. choices. We are currently working on an algorithm that automatically [16] S. Medeiros, F. Mascarenhas, R. Ierusalimschy, From regexes to parsing expression grammars, Sci. Comput. Program. 93 (2014) 3–18, https://doi.org/10.1016/j.scico. annotates a PEG with labels [34]. 2012.11.006. Finally, our evaluation did not try to take into account errors that [17] G. Ripley, F.C. Druseikis, A statistical analysis of syntax errors, Comput. Lang. 3 (4) are more frequent while writing Lua programs from scratch in the (1978) 227–240, https://doi.org/10.1016/0096-0551(78)90041-3. [18] T.J. Pennello, F. DeRemer, A forward move algorithm for lr error recovery, context of an IDE or text editor. Such a study can also be explored in Proceedings of the 5th ACM SIGACT-SIGPLAN Symposium on Principles of future work. Programming Languages, POPL ’78, ACM, New York, NY, USA, 1978, pp. 241–254, https://doi.org/10.1145/512760.512786. [19] M.G. Burke, G.A. Fisher, A practical method for lr and ll syntactic error diagnosis References and recovery, ACM Trans. Program. Lang. Syst. 9 (2) (1987) 164–197, https://doi. org/10.1145/22719.22720. [1] B. Ford, Parsing expression grammars: A recognition-based syntactic foundation, [20] J.A. Dain, A practical minimum distance method for syntax error handling, Comput. Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Lang. 20 (4) (1994) 239–252, https://doi.org/10.1016/0096-0551(94)90006-X. Programming Languages, POPL ’04, ACM, New York, NY, USA, 2004, pp. 111–122. [21] R. Corchuelo, J.A. Pérez, A. Ruiz, M. Toro, Repairing syntax errors in lr parsers, [2] B. Ford, Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking, ACM Trans. Program. Lang. Syst. 24 (6) (2002) 698–710, https://doi.org/10.1145/ Massachusetts Institute of Technology, 2002 Master’s thesis. 586088.586092. [3] M. Rüfenacht, Error Handling in PEG Parsers, University of Berne, 2016 Master’s [22] P. Degano, C. Priami, Comparison of syntactic error handling in lr parsers, Softw. thesis. Pract. Exper. 25 (6) (1995) 657–679, https://doi.org/10.1002/spe.4380250606. [4] A.M. Maidl, F. Mascarenhas, R. Ierusalimschy, Exception Handling for Error [23] M. de Jonge, L.C.L. Kats, E. Visser, E. Söderberg, Natural and flexible error recovery Reporting in Parsing Expression Grammars, in: A.R. Du Bois, P. Trinder (Eds.), for generated modular language environments, ACM Trans. Program. Lang. Syst. 34 Programming Languages, LNCS, 8129 Springer, Berlin, Heidelberg, Germany, 2013, (4) (2012), https://doi.org/10.1145/2400676.2400678. pp. 1–15, , https://doi.org/10.1007/978-3-642-40922-6_1. [24] R. Ierusalimschy, Programming in Lua, 4th, Lua.Org, Rio de Janeiro, Brazil, 2016. [5] A.M. Maidl, F. Mascarenhas, S. Medeiros, R. Ierusalimschy, Error reporting in [25] S. Medeiros, LPegLabel - an extension of LPeg that supports labeled failures, 2014, parsing expression grammars, Sci. Comput. Program. 132 (P1) (2016) 129–140, (https://github.com/sqmedeiros/lpeglabel). [Visited on September 2017] https:// https://doi.org/10.1016/j.scico.2016.08.004. github.com/sqmedeiros/lpeglabel. [6] S. Medeiros, F. Mascarenhas, Syntax error recovery in parsing expression grammars, [26] C. Stirling, Follow set error recovery, Software 15 (3) (1985) 239–257, https://doi. Proceedings of the 33rd Annual ACM Symposium on Applied Computing, SAC ’18, org/10.1002/spe.4380150303. ACM, New York, NY, USA, 2018, pp. 1195–1202, https://doi.org/10.1145/ [27] T. Parr, ANTLR, 2017, (http://www.antlr.org). Visited on September 2017. 3167132.3167261. [28] T. Parr, The definitive ANTLR 4 reference, 2nd, Pragmatic Bookshelf, Raleigh, NC, [7] S. Medeiros, R. Ierusalimschy, A parsing machine for pegs, Proceedings of the 2008 USA, 2013. Symposium on Dynamic Languages, DLS ’08, ACM, New York, NY, USA, 2008, pp. [29] N. Wirth, Algorithms + data structures = programs, Prentice Hall PTR, Upper 2:1–2:12, https://doi.org/10.1145/1408681.1408683. Saddle River, NJ, USA, 1978. [8] S. Medeiros, F. Mascarenhas, A parsing machine for parsing expression grammars [30] S.D. Swierstra, L. Duponcheel, Deterministic, error-correcting combinator parsers, with labeled failures, Proceedings of the 31st Annual ACM Symposium on Applied Advanced Functional Programming, Lecture Notes in Computer Science 1129 Computing, SAC ’16, ACM, New York, NY, USA, 2016, pp. 1960–1967, https://doi. Springer, Berlin, Heidelberg, Germany, 1996, pp. 184–207. org/10.1145/2851613.2851750. [31] C.L. Jeffery, Generating LR syntax error messages from examples, ACM Trans. [9] T. Parr, K. Fisher, Ll(*): The foundation of the antlr parser generator, Proceedings of Program. Lang. Syst. (TOPLAS) 25 (5) (2003) 631–640. the 32Nd ACM SIGPLAN Conference on Programming Language Design and [32] H. Mössenböck, The Compiler Generator Coco/R, 2010. URL http://www.ssw.uni- Implementation, PLDI ’11, ACM, New York, NY, USA, 2011, pp. 425–436, https:// linz.ac.at/Coco/Doc/UserManual.pdf. doi.org/10.1145/1993498.1993548. [33] R.R. Redziejowski, Mouse: from Parsing Expressions to a Practial Parser, 2017. URL [10] T. Parr, S. Harwell, K. Fisher, Adaptive ll(*) parsing: The power of dynamic analysis, http://mousepeg.sourceforge.net/Manual.pdf. Proceedings of the 2014 ACM International Conference on Object Oriented [34] S.Q. de Medeiros, F. Mascarenhas, Towards automatic error recovery in parsing Programming Systems Languages & Applications, OOPSLA ’14, ACM, New York, expression grammars, Proceedings of the XXII Brazilian Symposium on NY, USA, 2014, pp. 579–598, https://doi.org/10.1145/2660193.2660202. Programming Languages, SBLP ’18, ACM, New York, NY, USA, 2018, pp. 3–10, [11] D. Grune, C.J. Jacobs, Parsing techniques: A Practical guide, 2nd, Springer, Berlin, https://doi.org/10.1145/3264637.3264638. Heidelberg, Germany, 2010.

28