A Parsing Machine for PEGs

Sérgio Medeiros∗  [email protected]
Roberto Ierusalimschy  [email protected]
Department of Computer Science, PUC-Rio, Rio de Janeiro, Brazil

ABSTRACT

Parsing Expression Grammar (PEG) is a recognition-based foundation for describing syntax that renewed interest in top-down parsing approaches. Generally, the implementation of PEGs is based on a recursive-descent parser, or uses a memoization algorithm.

We present a new approach for implementing PEGs, based on a virtual parsing machine, which is more suitable for pattern matching. Each PEG has a corresponding program that is executed by the parsing machine, and new programs are dynamically created and composed. The virtual machine is embedded in a scripting language and used by a pattern-matching tool.

We give an operational semantics of PEGs used for pattern matching, then describe our parsing machine and its semantics. We show how to transform PEGs to parsing machine programs, and give a correctness proof of our transformation.

Categories and Subject Descriptors
D.3.3 [Programming Languages]: Language Constructs and Features; D.3.1 [Programming Languages]: Formal Definitions and Theory

General Terms
Operational semantics, scripting languages

Keywords
parsing machine, Parsing Expression Grammars, pattern matching

1. INTRODUCTION

The Parsing Expression Grammar (PEG) formalism for language recognition [2] has renewed interest in top-down parsing approaches. The PEG formalism gives a convenient syntax for describing top-down parsers for unambiguous languages. The parsers it describes can parse strings in linear time, despite backtracking, by using a memoizing algorithm called Packrat [1]. Although Packrat has guaranteed worst-case linear time complexity, it also has linear space complexity, with a rather large constant. This makes Packrat impractical for parsing large amounts of data.

LPEG [7, 11] is a pattern-matching tool for Lua, a dynamically typed scripting language [6, 8]. LPEG uses PEGs to describe patterns instead of the more popular Perl-like "regular expressions" (regexes). Regexes are a more ad-hoc way of describing patterns; writing a complex regex is more a matter of trial and error than a systematic process, and their performance is very unpredictable. PEGs offer a way of writing patterns that is simple for small patterns, while providing better organization for the more complex ones.

As pattern-matching tools usually have to deal with much larger inputs than parsing tools, this precluded the use of a Packrat-like algorithm for LPEG. Therefore, LPEG adopts a new approach, using a virtual parsing machine, where each pattern translates to a program for this machine. These programs are built at runtime, and can also be dynamically composed into bigger programs (that represent bigger patterns). This piecemeal construction of programs fits both the dynamic nature of Lua programming and the composability of the PEG formalism.

As each PEG pattern is compiled to a sequence of instructions of our parsing machine at runtime, the model of LPEG is much better suited to a dynamic language than a more traditional approach, which uses static compilation and linking of programs generated from PEG grammars.

The parsing machine is also fast, with the performance of LPEG being similar to the performance of other parser and pattern-matching tools. Moreover, the machine has a very simple model, which makes its implementation easy, and seems a very good candidate for a Just-In-Time (JIT) compiler.

In a forthcoming paper [7], one of the authors presents the LPEG tool more in depth, giving several examples of its use.

∗Supported by CNPq.
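To make this piecemeal construction concrete, the fragment below builds a pattern incrementally with LPEG's published Lua API. It is a minimal illustration of first-class, composable patterns written by us, not an excerpt from the paper's own examples:

    local lpeg = require "lpeg"

    -- Patterns are ordinary Lua values, built and combined at runtime;
    -- each combination compiles a new program for the parsing machine.
    local digit  = lpeg.R("09")                  -- any character in the range 0-9
    local number = digit^1                       -- one or more digits (repetition)
    local pair   = number * lpeg.P(",") * number -- concatenation of patterns

    -- lpeg.match returns the position after the match, or nil on failure.
    print(lpeg.match(pair, "12,345"))            --> 7
    print(lpeg.match(pair, "12;345"))            --> nil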
That paper also gives an operational semantics of the parsing machine instructions, plus an informal description of how the machine executes a program and some benchmarks comparing LPEG with POSIX regex and PCRE. The main contribution of this paper resides in the full formal specification of the parsing machine, and in the proof of correctness of the transformation between PEG patterns and programs of our machine.

The present paper also discusses some optimizations applicable to our machine which reduce the amount of backtracking without using memoization, and gives the associated transformations. As our machine is explicitly designed for PEGs, its design relies on the fact that PEGs only use limited backtracking.

The rest of this paper is organized as follows: Section 2 reviews some PEG concepts and describes the virtual machine and its operational semantics; Section 3 proves that the transformation between simple PEGs (those consisting of a single production) and their corresponding programs is correct; Section 4 augments the virtual machine to enable some optimizations, and proves the correctness of these optimizations; Section 5 extends the previous proofs to cover PEGs in general; Section 6 discusses some related work and presents a benchmark comparing LPEG with other tools; finally, Section 7 summarizes our results.

2. A PARSING MACHINE FOR PEGS

Parsing Expression Grammars (PEGs) are a recognition-based formal foundation for language syntax [2]. Although PEGs are similar to Context Free Grammars (CFGs), a key difference is that, while a CFG generates a language, a PEG describes a parser. PEGs have limited backtracking through an ordered choice operator, and unlimited lookahead through syntactic predicates.

The original PEG paper uses a grammar-like formal definition of PEGs. We are going to start with a simpler, less powerful definition, which is equivalent to the right-hand side of a PEG production (minus any non-terminals). We call these simple expressions patterns. Later in the paper we extend patterns to reintroduce grammars, but in a way that preserves their composability. Table 1 has the abstract syntax of patterns.

    ε ∈ Pattern
    '.' ∈ Pattern
    'c' ∈ Pattern
    If p ∈ Pattern then p?, p∗, p+, &p, and !p ∈ Pattern
    If p1 and p2 ∈ Pattern then p1 p2 and p1/p2 ∈ Pattern

    Table 1: Definition of PEG patterns

We have that ε represents the empty string, which is a pattern that never fails. The pattern '.' matches any character, while the pattern 'c' matches a given character c. The pattern p? represents an optional match and it always succeeds. The repetition pattern p∗ always succeeds too; it matches the pattern p zero or more times. The other repetition pattern, p+, matches p one or more times and can fail.

The symbols ! and & are predicate operators, and do not consume any input. The pattern !p succeeds if the matching of pattern p fails, and fails otherwise. The pattern &p is the opposite: it succeeds when pattern p succeeds, and fails when p fails.

The concatenation of two patterns p1 and p2 is indicated simply by the juxtaposition of both patterns, so p1 p2 indicates the concatenation of p1 and p2.

The / is the ordered choice operator and allows a limited form of backtracking. A pattern like p1/p2 first tries to match the pattern p1, and if this match succeeds the whole pattern succeeds. Otherwise, we try pattern p2, and if it succeeds the whole pattern succeeds too, otherwise the whole pattern fails.

Figure 1 has an operational semantics for patterns, where match : Pattern × Σ∗ × N → N ∪ {nil} is the semantics of a pattern, given a subject and a position (position 1 is the beginning of the subject). If a match succeeds, the result is the position in the subject after the match; otherwise, if the match fails, the result is nil.

The looping rule (for repetition) of the semantics precludes us from doing a straightforward structural induction on patterns. We have to do induction on a measure of pattern complexity that we will state as |p| + |s| − i, that is, the size of the pattern plus the length of the subject, minus the current position. For all semantic rules but the looping rule, it is trivial to see that the antecedent has a smaller measure than the consequent, as the pattern itself is smaller and the position does not decrease. For the looping rule, the induction will hold only when the position advances by at least one; in other words, induction is only possible when j > 0. Informally, repetition of a pattern that succeeds but does not advance leads to an infinite loop.

Usually, each PEG pattern would be associated with a recursive-descent parser, but we will take a different approach. Instead, each PEG pattern translates to a program, which executes on a virtual parsing machine.

Our parsing machine for PEGs has more in common with virtual machines for imperative programming languages than with abstract machines in language theory. It executes programs made up of atomic instructions that change the machine's state. The machine has a program counter that addresses the next instruction to be executed, a register that holds the current position in the subject, and a stack that the machine uses for pushing both return addresses and backtrack entries. A return address is just a new value for the program counter, while a backtrack entry holds both an address and a position in the subject. The machine has the following basic instructions:

Char x: tries to match the character x against the current subject position, advancing one position if successful.
Any: advances one position if the end of the subject was not reached; it fails otherwise.
Choice l: pushes a backtrack entry on the stack, where l is the offset of the alternative instruction.
Jump l: relative jump to the instruction at offset l.
Call l: pushes the address of the next instruction on the stack, and jumps to the instruction at offset l.
Return: pops an address from the stack and jumps to it.
Commit l: commits to a choice, popping the top entry from the stack, throwing it away, and jumping to the instruction at offset l.
Fail: forces a failure. When any failure occurs, the machine pops the stack until it finds a backtrack entry, then uses that entry plus the rest of the stack as the new machine state.

Figure 2 presents the operational semantics of the parsing machine as a relation among machine states. The program P that the machine is executing, and the subject S, are implicit. The state is either a triple N × N × Stack, with the next instruction to execute (pc), the current position in the subject (i), and a stack (e), or Fail⟨e⟩, a failure state with stack e. Stacks are lists of N ∪ (N × N), where a stack position of form N represents a return address, while a stack position of form N × N represents a backtrack entry.
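The machine is small enough to prototype directly. The Lua sketch below interprets the basic instruction set following the semantics of Figure 2; it is an illustrative reference implementation (the instruction encoding and helper names are ours), not LPEG's actual C machine:

    -- Programs are arrays of instructions; 'l' fields are relative offsets.
    -- A number on the stack is a return address; a table {pc=..., i=...}
    -- is a backtrack entry.  Running past the end of the program means
    -- the whole pattern matched; the result is the final subject position.
    local function run(program, subject)
      local pc, i, stack = 1, 1, {}
      while true do
        local inst = program[pc]
        if inst == nil then return i end      -- success
        local op = inst.op
        if op == "Char" then
          if subject:sub(i, i) == inst.c then pc, i = pc + 1, i + 1
          else op = "Fail" end
        elseif op == "Any" then
          if i <= #subject then pc, i = pc + 1, i + 1
          else op = "Fail" end
        elseif op == "Choice" then
          stack[#stack + 1] = {pc = pc + inst.l, i = i}   -- backtrack entry
          pc = pc + 1
        elseif op == "Jump" then
          pc = pc + inst.l
        elseif op == "Call" then
          stack[#stack + 1] = pc + 1                      -- return address
          pc = pc + inst.l
        elseif op == "Return" then
          pc = table.remove(stack)
        elseif op == "Commit" then
          table.remove(stack)                             -- discard backtrack entry
          pc = pc + inst.l
        end
        if op == "Fail" then
          local entry = table.remove(stack)
          while entry ~= nil and type(entry) ~= "table" do
            entry = table.remove(stack)                   -- pop return addresses
          end
          if entry == nil then return nil end             -- no alternative left
          pc, i = entry.pc, entry.i                       -- backtrack
        end
      end
    end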

Matching a character:
  (ch.1)  s[i] = 'c'  ⟹  match 'c' s i = i + 1
  (ch.2)  s[i] ≠ 'c'  ⟹  match 'c' s i = nil

Matching any character:
  (any.1)  i ≤ |s|  ⟹  match . s i = i + 1
  (any.2)  i > |s|  ⟹  match . s i = nil

Not predicate:
  (not.1)  match p s i = nil  ⟹  match !p s i = i
  (not.2)  match p s i = i + j  ⟹  match !p s i = nil

And predicate:
  (and.1)  match p s i = i + j  ⟹  match &p s i = i
  (and.2)  match p s i = nil  ⟹  match &p s i = nil

Concatenation:
  (con.1)  match p1 s i = i + j,  match p2 s (i + j) = i + j + k  ⟹  match p1 p2 s i = i + j + k
  (con.2)  match p1 s i = i + j,  match p2 s (i + j) = nil  ⟹  match p1 p2 s i = nil
  (con.3)  match p1 s i = nil  ⟹  match p1 p2 s i = nil

Ordered choice:
  (ord.1)  match p1 s i = nil,  match p2 s i = nil  ⟹  match p1/p2 s i = nil
  (ord.2)  match p1 s i = i + j  ⟹  match p1/p2 s i = i + j
  (ord.3)  match p1 s i = nil,  match p2 s i = i + k  ⟹  match p1/p2 s i = i + k

Repetition:
  (rep.1)  match p s i = i + j,  match p∗ s (i + j) = i + j + k  ⟹  match p∗ s i = i + j + k
  (rep.2)  match p s i = nil  ⟹  match p∗ s i = i

Figure 1: Operational Semantics of Patterns
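These rules translate almost one-for-one into a recursive reference implementation. The Lua sketch below, with a table encoding of patterns that is ours, is meant only as an executable restatement of Figure 1, not as an efficient matcher:

    local function match(p, s, i)
      if p.op == "char" then                       -- (ch.1) / (ch.2)
        return s:sub(i, i) == p.c and i + 1 or nil
      elseif p.op == "any" then                    -- (any.1) / (any.2)
        return i <= #s and i + 1 or nil
      elseif p.op == "seq" then                    -- (con.1) - (con.3)
        local j = match(p[1], s, i)
        return j and match(p[2], s, j)
      elseif p.op == "choice" then                 -- (ord.1) - (ord.3)
        return match(p[1], s, i) or match(p[2], s, i)
      elseif p.op == "not" then                    -- (not.1) / (not.2)
        return match(p[1], s, i) == nil and i or nil
      elseif p.op == "and" then                    -- (and.1) / (and.2)
        return match(p[1], s, i) and i or nil
      elseif p.op == "rep" then                    -- (rep.1) / (rep.2)
        local j = match(p[1], s, i)
        if j == nil then return i end
        if j == i then return i end                -- guard: the semantics itself
        return match(p, s, j)                      -- loops if p does not advance
      end
    end

    print(match({op = "rep", {op = "char", c = "a"}}, "aab", 1))   --> 3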

The relation --Instruction--> relates two states when the instruction addressed by pc in the antecedent state matches the label, and the guard (if present) is valid. The transitive closure of this relation is an execution of the machine.

3. TRANSFORMING PATTERNS TO PROGRAMS

As was mentioned earlier, each PEG pattern translates to a program that executes on the parsing machine. The process of translating a pattern into a program is bottom-up, and done at runtime. The simplest patterns are translated to simple programs, and then programs are combined according to rules specific to each PEG operation. Programs are opaque entities for the translation process: the pattern that originated them is not important, as long as the programs are valid. In our implementation the process is fully incremental, and combining programs is a simple matter of concatenating their texts.

To represent this compilation process, we will use a transformation function Π, from Pattern to Program. Given a pattern p, we have that Π(p) represents the program associated with pattern p. We will also use the notation |Π(p)|, which means the number of instructions of the program Π(p).

The next step is to prove the correctness of the transformation that converts a PEG pattern to a corresponding program for the virtual parsing machine. All transformations use concatenation to build bigger programs from smaller ones. Formally, given programs P1 and P2 such that

    ⟨pc1, i, e⟩  --P1-->  ⟨pc1 + |P1|, i + j, e⟩
    ⟨pc2, i + j, e⟩  --P2-->  ⟨pc2 + |P2|, i + j + k, e⟩

the concatenation P1 P2 has the following property:

    ⟨pc, i, e⟩  --P1P2-->  ⟨pc + |P1| + |P2|, i + j + k, e⟩

This property holds when all the jumps of a program are to instructions of the program itself, or to the first instruction of the next program.

Considering ⟨pc, i, e⟩ as the machine's initial state, where pc points to the first instruction of Π(p), we have that a pattern p is equivalent to a program Π(p) if, when p succeeds, the following relation is true:

    If match p s i = i + j then ⟨pc, i, e⟩ --Π(p)--> ⟨pc + |Π(p)|, i + j, e⟩

We have also that, if p fails, then the following relation should hold:

    If match p s i = nil then ⟨pc, i, e⟩ --Π(p)--> Fail⟨e⟩

It can be noticed that in both cases the stack remains the same after the execution of the program Π(p).

We start by proving that the transformation Π is correct for simple patterns, and afterwards we prove that the transformation is also correct for more complex patterns.

Let us then take a look at a PEG pattern that matches a certain character c. The transformation here is Π('c') ≡ Char c, such that we can express the pattern 'c' directly by a Char instruction. Considering first the case when the match function succeeds, what we want to prove is:

    If match 'c' s i = i + j then ⟨pc, i, e⟩ --Char--> ⟨pc + 1, i + 1, e⟩

In this case, as we want to match just one character, we have that j must be 1. Looking at Figure 1, we can see that the rule (ch.1) is the only one involving the successful matching of a character. By this rule, we know that s[i] = 'c'. It is trivial to see that the semantics of the Char instruction guarantees that our program will advance the current subject position by 1, and will not change the stack, when the matching of 'c' succeeds.

Considering now the case where the matching fails, we need to prove that:

    If match 'c' s i = nil then ⟨pc, i, e⟩ --Char--> Fail⟨e⟩

The semantic rule (ch.2) is the only rule related to the unsuccessful matching of a character. By this rule, we know that s[i] ≠ 'c'. Again, we can rely on the semantics of the Char instruction, which leads the parsing machine to a state indicating that a fail occurred when the pattern does not match. This completes the proof.

The case of the Any instruction is very similar to the case of the Char instruction, because there is also a direct transformation from a pattern that matches any character to a program that consists only of an Any instruction. The correctness of this transformation can be proved with the help of the semantic rules (any.1) and (any.2).

The next proof is about the concatenation of two PEG patterns, p1 and p2. We state the following transformation for the concatenation of two patterns:

    Π(p1 p2) ≡ Π(p1) Π(p2)

In other words, when transforming p1 p2 to a program, we will first execute the program Π(p1), and then the program Π(p2). Considering the case where match succeeds, what we are going to prove is:

    If match p1 p2 s i = i + j + k then ⟨pc, i, e⟩ --Π(p1p2)--> ⟨pc + |Π(p1)| + |Π(p2)|, i + j + k, e⟩

The rule (con.1) is the one related to the successful concatenation of two patterns. By this rule, we know that p1 and p2 matched, with the matching of p1 advancing j positions of the input string, and the matching of p2 advancing k more positions.

Let ⟨pc, i, e⟩ be the initial virtual machine state. The corresponding program Π(p1 p2) of our PEG machine will execute the following sequence of transitions, where in both transitions we use the induction hypothesis:

    --Π(p1)-->  ⟨pc + |Π(p1)|, i + j, e⟩
    --Π(p2)-->  ⟨pc + |Π(p1)| + |Π(p2)|, i + j + k, e⟩

This proves the first case.
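Under the interpreter sketched in Section 2, these two transformations already suffice to run tiny programs. The hand-compiled program below is Π('a' 'b'), that is, the concatenation of two Char programs (encoding as in our earlier sketch):

    -- Π('a') Π('b'): program concatenation implements pattern concatenation.
    local ab = {
      {op = "Char", c = "a"},
      {op = "Char", c = "b"},
    }
    print(run(ab, "abc"))   --> 3   (matched "ab"; next position is 3)
    print(run(ab, "ax"))    --> nil (Char 'b' fails with an empty stack)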

    Char x:    ⟨pc, i, e⟩  →  ⟨pc + 1, i + 1, e⟩,   if S[i] = x
    Char x:    ⟨pc, i, e⟩  →  Fail⟨e⟩,              if S[i] ≠ x
    Any:       ⟨pc, i, e⟩  →  ⟨pc + 1, i + 1, e⟩,   if i + 1 ≤ |S|
    Any:       ⟨pc, i, e⟩  →  Fail⟨e⟩,              if i + 1 > |S|
    Choice l:  ⟨pc, i, e⟩  →  ⟨pc + 1, i, (pc + l, i) : e⟩
    Jump l:    ⟨pc, i, e⟩  →  ⟨pc + l, i, e⟩
    Call l:    ⟨pc, i, e⟩  →  ⟨pc + l, i, (pc + 1) : e⟩
    Return:    ⟨pc0, i, pc1 : e⟩  →  ⟨pc1, i, e⟩
    Commit l:  ⟨pc, i, h : e⟩  →  ⟨pc + l, i, e⟩
    Fail:      ⟨pc, i, e⟩  →  Fail⟨e⟩
    any instruction:  Fail⟨pc : e⟩  →  Fail⟨e⟩
    any instruction:  Fail⟨(pc, i1) : e⟩  →  ⟨pc, i1, e⟩

    Figure 2: Operational Semantics of the Parsing Machine

The other case is when p1 p2 fails. What we have to prove is:

    If match p1 p2 s i = nil then ⟨pc, i, e⟩ --Π(p1p2)--> Fail⟨e⟩

From Figure 1, we can see there are two rules related to concatenation which result in a fail. Let us first consider the rule (con.3). In this case, the pattern p1 fails, and the machine will do the following transition, with Π(p2) not being executed:

    ⟨pc, i, e⟩  --Π(p1)-->  Fail⟨e⟩

Let us now consider the rule (con.2), in which p1 succeeds, advancing j positions of the input, but the pattern p2 fails. This is represented by the following sequence of transitions:

    --Π(p1)-->  ⟨pc + |Π(p1)|, i + j, e⟩
    --Π(p2)-->  Fail⟨e⟩

This proves that the transformation for the pattern p1 p2 is correct.

Let us now see the case of the ordered choice operator. Given the PEG pattern p1/p2, we have the following transformation:

    Π(p1/p2) ≡ Choice |Π(p1)| + 2
               Π(p1)
               Commit |Π(p2)| + 1
               Π(p2)

The transformation says that Π(p1/p2) is equivalent to a program where the first instruction is a Choice, followed by the program Π(p1), then by a Commit instruction, and finally by the program Π(p2).

Considering the case when p1/p2 succeeds, what we are going to prove is:

    If match p1/p2 s i = i + j then ⟨pc, i, e⟩ --Π(p1/p2)--> ⟨pc + |Π(p1)| + |Π(p2)| + 2, i + j, e⟩

There are two rules related to the success of an ordered choice. Let us first consider the rule (ord.2). We know that p1 matched, advancing j positions of the input subject. The corresponding sequence of transitions, assuming ⟨pc, i, e⟩ as the initial state, where pc points to the first instruction of Π(p1/p2), is shown below:

    --Choice-->  ⟨pc + 1, i, (pc + |Π(p1)| + 2, i) : e⟩
    --Π(p1)-->   ⟨pc + |Π(p1)| + 1, i + j, (pc + |Π(p1)| + 2, i) : e⟩
    --Commit-->  ⟨pc + |Π(p1)| + |Π(p2)| + 2, i + j, e⟩

The first thing the program does is to save the current state of the machine, because if p1 fails we have to backtrack. As p1 succeeds, we just discard the backtrack entry and jump to the end of the program, using the Commit instruction. We can notice that after the complete execution of the program the stack remains the same, with the backtrack entry being discarded.

Analyzing now the rule (ord.3), we have that p1 fails and p2 succeeds, matching k characters, such that j = k in this case. The associated sequence of transitions is:

    --Choice-->  ⟨pc + 1, i, (pc + |Π(p1)| + 2, i) : e⟩
    --Π(p1)-->   Fail⟨(pc + |Π(p1)| + 2, i) : e⟩
    --Fail-->    ⟨pc + |Π(p1)| + 2, i, e⟩
    --Π(p2)-->   ⟨pc + |Π(p1)| + |Π(p2)| + 2, i + k, e⟩

The program first pushes a backtrack entry and tries to match p1. As p1 fails, the machine backtracks and then matches p2, which advances k positions of the input string, as expected.

Considering now the case where the match function fails, what we have to prove is:

    If match p1/p2 s i = nil then ⟨pc, i, e⟩ --Π(p1/p2)--> Fail⟨e⟩

The rule (ord.1) is the only one related to this case. By this rule, we know that neither p1 nor p2 match. The sequence of transitions associated with this case is shown below:

    --Choice-->  ⟨pc + 1, i, (pc + |Π(p1)| + 2, i) : e⟩
    --Π(p1)-->   Fail⟨(pc + |Π(p1)| + 2, i) : e⟩
    --Fail-->    ⟨pc + |Π(p1)| + 2, i, e⟩
    --Π(p2)-->   Fail⟨e⟩

This sequence is similar to the previous one, but now, as p2 also fails, the machine goes to a fail state. This completes the proof of the ordered choice.
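The interpreter from Section 2 can replay these derivations. Below, Π('a'/'b') is written out by hand following the transformation above; the three subjects exercise rules (ord.2), (ord.3), and (ord.1), respectively:

    -- Π('a'/'b') = Choice 3; Char 'a'; Commit 2; Char 'b'
    local a_or_b = {
      {op = "Choice", l = 3},   -- on failure, resume at instruction 4
      {op = "Char",   c = "a"},
      {op = "Commit", l = 2},   -- p1 matched: discard entry, skip over Π(p2)
      {op = "Char",   c = "b"},
    }
    print(run(a_or_b, "a"))   --> 2   (ord.2: no backtracking)
    print(run(a_or_b, "b"))   --> 2   (ord.3: one backtrack)
    print(run(a_or_b, "c"))   --> nil (ord.1: overall failure)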

We will now consider the transformation of the not predicate. Let p be a pattern; we have that:

    Π(!p) ≡ Choice |Π(p)| + 3
            Π(p)
            Commit 1
            Fail

Considering the case where the pattern !p fails, we have that:

    If match !p s i = nil then ⟨pc, i, e⟩ --Π(!p)--> Fail⟨e⟩

The semantic rule that corresponds to this case is (not.2). By this rule, we know that the matching of p succeeds, although the pattern !p fails. Therefore, we have the following sequence of transitions:

    --Choice-->  ⟨pc + 1, i, (pc + |Π(p)| + 3, i) : e⟩
    --Π(p)-->    ⟨pc + |Π(p)| + 1, i + j, (pc + |Π(p)| + 3, i) : e⟩
    --Commit-->  ⟨pc + |Π(p)| + 2, i + j, e⟩
    --Fail-->    Fail⟨e⟩

The program first saves the current machine state, and next tries to match p. As p succeeds, the following Commit discards the backtrack entry, and the Fail instruction is then executed.

The other case, when pattern !p succeeds, corresponds to the semantic rule (not.1). In this case, we have that j = 0. By (not.1), we know the matching of p fails, and the machine restores the entry pushed by the initial Choice instruction, which leads to the final state ⟨pc + |Π(p)| + 3, i, e⟩.

With the transformation of the not predicate proved correct, given a pattern p, we can trivially define the and predicate &p as !!p. In the same way, as the concatenation and ordered choice transformations of patterns were already proved correct too, we can define the optional pattern p? as &p p / !p.

The next step corresponds to the transformation of the repetition pattern. Let p be a PEG pattern, such that the construction p∗ means zero or more repetitions of p. In this case, we have the following transformation from pattern to virtual machine instructions:

    Π(p∗) ≡ Choice |Π(p)| + 2
            Π(p)
            Commit −(|Π(p)| + 1)

What we are going to prove first is:

    If match p∗ s i = i then ⟨pc, i, e⟩ --Π(p∗)--> ⟨pc + |Π(p∗)|, i, e⟩

This case is related to the semantic rule (rep.2), where p does not match input s at position i. We have the following sequence of transitions:

    --Choice-->  ⟨pc + 1, i, (pc + |Π(p)| + 2, i) : e⟩
    --Π(p)-->    Fail⟨(pc + |Π(p)| + 2, i) : e⟩
    --Fail-->    ⟨pc + |Π(p)| + 2, i, e⟩

We can see that the current subject position did not change with the execution of Π(p∗).

The other case we have to prove is:

    If match p∗ s i = i + j + k then ⟨pc, i, e⟩ --Π(p∗)--> ⟨pc + |Π(p∗)|, i + j + k, e⟩

This case is related to the semantic rule (rep.1), where p∗ matches input s from position i. The sequence of transitions is presented below:

    --Choice-->  ⟨pc + 1, i, (pc + |Π(p)| + 2, i) : e⟩
    --Π(p)-->    ⟨pc + 1 + |Π(p)|, i + j, (pc + |Π(p)| + 2, i) : e⟩
    --Commit-->  ⟨pc, i + j, e⟩
    --Π(p∗)-->   ⟨pc + |Π(p∗)|, i + j + k, e⟩

As we can see, p succeeds the first time, matching j characters, where we are considering j > 0. After this matching, we have a less complex pattern, considering our measure of pattern complexity |p| + |s| − i. The machine then executes Commit, coming back to the beginning of the program, and we use induction on the pattern complexity, reaching the final state ⟨pc + |Π(p∗)|, i + j + k, e⟩.

Since we have proved that the transformation from a pattern p∗ to a program of our PEG machine is correct, and as we have also made the corresponding proof for the concatenation of patterns, we can simply define the pattern p+ as p p∗.
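Putting the transformations of this section together gives a complete compiler for simple patterns. The Lua sketch below emits programs for the interpreter shown earlier; the pattern encoding and helper names are ours, and offsets are computed from instruction positions exactly as the transformations prescribe:

    local function compile(p, prog)
      prog = prog or {}
      local function emit(inst) prog[#prog + 1] = inst; return #prog end
      if p.op == "char" then                     -- Π('c') = Char c
        emit{op = "Char", c = p.c}
      elseif p.op == "any" then                  -- Π(.) = Any
        emit{op = "Any"}
      elseif p.op == "seq" then                  -- Π(p1 p2) = Π(p1) Π(p2)
        compile(p[1], prog); compile(p[2], prog)
      elseif p.op == "choice" then               -- Choice; Π(p1); Commit; Π(p2)
        local c1 = emit{op = "Choice"}
        compile(p[1], prog)
        local c2 = emit{op = "Commit"}
        compile(p[2], prog)
        prog[c1].l = c2 + 1 - c1                 -- Choice |Π(p1)| + 2
        prog[c2].l = #prog + 1 - c2              -- Commit |Π(p2)| + 1
      elseif p.op == "not" then                  -- Choice; Π(p); Commit 1; Fail
        local c1 = emit{op = "Choice"}
        compile(p[1], prog)
        emit{op = "Commit", l = 1}
        emit{op = "Fail"}
        prog[c1].l = #prog + 1 - c1              -- Choice |Π(p)| + 3
      elseif p.op == "rep" then                  -- Choice; Π(p); Commit back
        local c1 = emit{op = "Choice"}
        compile(p[1], prog)
        local c2 = emit{op = "Commit"}
        prog[c2].l = c1 - c2                     -- Commit −(|Π(p)| + 1)
        prog[c1].l = #prog + 1 - c1              -- Choice |Π(p)| + 2
      elseif p.op == "and" then                  -- &p defined as !!p
        compile({op = "not", {op = "not", p[1]}}, prog)
      elseif p.op == "opt" then                  -- p? defined as &p p / !p
        compile({op = "choice",
                 {op = "seq", {op = "and", p[1]}, p[1]},
                 {op = "not", p[1]}}, prog)
      end
      return prog
    end

    -- Example: ('a'/'b')* compiled and executed with 'run' from Section 2.
    local pat = {op = "rep", {op = "choice", {op = "char", c = "a"},
                                             {op = "char", c = "b"}}}
    print(run(compile(pat), "abba!"))            --> 5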

4. OPTIMIZATIONS

As we have just proved, the set of instructions that we have been using is enough to define a virtual parsing machine for PEGs. But it is convenient to define some extra instructions, so we can have more efficient implementations of the patterns. In short, we will define new instructions to get faster programs. Figure 3 presents the semantics of the new set of instructions. To illustrate their use, we will now redefine some of the patterns.

    PartialCommit l:  ⟨pc0, i0, (pc1, i1) : e⟩  →  ⟨pc0 + l, i0, (pc1, i0) : e⟩
    FailTwice:        ⟨pc, i, h : e⟩  →  Fail⟨e⟩
    BackCommit l:     ⟨pc0, i0, (pc1, i1) : e⟩  →  ⟨pc0 + l, i1, e⟩

    Figure 3: Operational Semantics of Extra Instructions

Let us start by redefining the not predicate. In the new definition, we use the instruction FailTwice, which discards the top entry of the stack (in the case of the not predicate, it discards the entry pushed by the previous Choice instruction) and then fails, like the Fail instruction. The new transformation is:

    Π(!p) ≡ Choice |Π(p)| + 2
            Π(p)
            FailTwice

With the new transformation, the sequence of transitions when !p fails, where the corresponding semantic rule is (not.2), becomes:

    --Choice-->     ⟨pc + 1, i, (pc + |Π(p)| + 2, i) : e⟩
    --Π(p)-->       ⟨pc + |Π(p)| + 1, i + j, (pc + |Π(p)| + 2, i) : e⟩
    --FailTwice-->  Fail⟨e⟩

Using FailTwice, the number of transitions is smaller in this case. Comparing with the previous sequence of transitions of the not predicate, we can see that this transformation is also correct.

Another pattern that benefits from the extra instructions is the repetition pattern p∗. One problem of the previous definition of p∗ is that we are always pushing a new backtrack entry and discarding it when the pattern p matches. We can notice from the previous definition of p∗ that, when we push a new backtrack entry, the only thing that changes is the current subject position; thus, instead of discarding the backtrack entry, we could have just updated it. This is exactly what the PartialCommit instruction does. By using it, we have the following transformation:

    Π(p∗) ≡ Choice |Π(p)| + 2
            Π(p)
            PartialCommit −|Π(p)|

As an example, we show below the new sequence of transitions for the repetition pattern when p∗ matches an input s from position i, which corresponds to the semantic rule (rep.1):

    --Choice-->         ⟨pc + 1, i, (pc + |Π(p)| + 2, i) : e⟩
    --Π(p)-->           ⟨pc + 1 + |Π(p)|, i + j, (pc + |Π(p)| + 2, i) : e⟩
    --PartialCommit-->  ⟨pc + 1, i + j, (pc + |Π(p)| + 2, i + j) : e⟩
    --Π(p∗)-->          ⟨pc + |Π(p∗)|, i + j + k, e⟩

Notice that the Choice instruction executes just once, as the PartialCommit instruction leads the machine back to the beginning of Π(p). The machine does fewer transitions, and avoids the use of an expensive Choice instruction at each repetition step.

The BackCommit instruction is a variant of the Commit instruction. Like Commit, the BackCommit instruction pops a backtrack entry, but it does not jump to the position pc1 stored in the entry (as indicated in Figure 3). Instead, BackCommit receives a label l and the machine goes to position pc0 + l, restoring the subject position i1 saved in the entry. This instruction is used in a more efficient implementation of the and predicate.

The current implementation of the parsing machine also has some specific instructions for character sets. There is a Charset instruction, which matches a character if it belongs to a set of characters. Another instruction related to character sets is Span, which is used to implement the repetition of a pattern that represents a character set. These two instructions improve the performance of the machine when dealing with character sets, and can be easily translated to an equivalent program that uses the Char instruction.

4.1 Head-fail instructions

Although we have already extended the set of instructions of the virtual machine, and improved the definition of the patterns accordingly, we can still go further. Our PEG machine is designed for pattern matching, so it should be fast when dealing with lexical structures, which sometimes are a bottleneck of PEG implementations [13].

Because of this, there is a third group of instructions: the head-fail optimization instructions, presented in Figure 4. This group of instructions was created to address the problem of head fails: if a pattern fails when doing its first check, we say it was a head fail. As an example, if we search for the word "edge" in an English language text, almost 90% of all fails may be head fails, given the typical frequency of around 11% for the letter 'e'. The head-fail instructions avoid the backtracking associated with a head fail.

The TestChar instruction, for example, checks whether a given character is equal to the character at the current subject position. If it is, the machine matches the character and then executes the next instruction. If it is not, TestChar jumps to the given offset. The advantage of using the TestChar instruction is that we save the cost of backtracking when a head fail occurs.

As a consequence of the introduction of the head-fail optimization instructions, we need to modify the Choice instruction, adding an offset to it. Now, when this instruction saves the current machine state, it saves the current subject position minus a given offset. Thus, the new operational semantics of the Choice instruction is:

    Choice l o:  ⟨pc, i, e⟩  →  ⟨pc + 1, i, (pc + l, i − o) : e⟩

All the previous uses of Choice have an offset of zero.
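In the interpreter sketch from Section 2, the extra and head-fail instructions amount to a few additional branches. The fragment below, in the same encoding as before, follows Figures 3 and 4 and the revised Choice; it is meant to be spliced into the dispatch chain of the earlier 'run' function, and is our illustration rather than LPEG's implementation:

    -- Additional branches for the dispatch loop of the earlier 'run' sketch:
        elseif op == "PartialCommit" then
          local entry = stack[#stack]       -- keep the entry, update its position
          entry.i = i
          pc = pc + inst.l
        elseif op == "FailTwice" then
          table.remove(stack)               -- discard our own backtrack entry
          op = "Fail"                       -- then fail as usual
        elseif op == "BackCommit" then
          local entry = table.remove(stack) -- restore subject position, keep going
          i = entry.i
          pc = pc + inst.l
        elseif op == "TestChar" then
          if subject:sub(i, i) == inst.c then pc, i = pc + 1, i + 1
          else pc = pc + inst.l end         -- head fail: jump, no backtracking
        elseif op == "Choice" then          -- revised: offset 'o' on the position
          stack[#stack + 1] = {pc = pc + inst.l, i = i - (inst.o or 0)}
          pc = pc + 1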

    TestChar x l:  ⟨pc, i, e⟩  →  ⟨pc + 1, i + 1, e⟩,   if S[i] = x
    TestChar x l:  ⟨pc, i, e⟩  →  ⟨pc + l, i, e⟩,       if S[i] ≠ x
    TestAny n l:   ⟨pc, i, e⟩  →  ⟨pc + 1, i + n, e⟩,   if i + n ≤ |S|
    TestAny n l:   ⟨pc, i, e⟩  →  ⟨pc + l, i, e⟩,       if i + n > |S|

    Figure 4: Operational Semantics of Head-fail Instructions

Making use of the TestChar instruction, we can optimize the ordered choice, as well as other patterns. Given patterns p1 and p2, let us consider an ordered choice of the form 'c' p1/p2, i.e., we first try to match a character 'c', and if we succeed then we try p1, otherwise we try pattern p2. Instead of generating a trivial ordered choice code, we can generate the following optimized code:

    Π('c' p1/p2) ≡ TestChar c |Π(p1)| + 3
                   Choice |Π(p1)| + 2 1
                   Π(p1)
                   Commit |Π(p2)| + 1
                   Π(p2)

(The offset 1 in the Choice makes the saved backtrack entry point to the subject position before 'c', so that backtracking into Π(p2) starts from the right place.) The sequence of transitions below illustrates the case when 'c' fails, considering that p2 matches:

    --TestChar-->  ⟨pc + |Π(p1)| + 3, i, e⟩
    --Π(p2)-->     ⟨pc + |Π(p1)| + |Π(p2)| + 3, i + k, e⟩

As the matching of the character 'c' fails, the machine jumps directly to the beginning of Π(p2), without the need to execute the Choice instruction only to backtrack later. We can do similar optimizations for other patterns, such as the not predicate.

Taking again the ordered choice as an example, let us suppose that we have an ordered choice of the form 'c1' p1/'c2' p2, where c1 and c2 are different characters. In this case, if c1 matches, we know that 'c2' will not match, and we do not need to push a backtrack entry at all. The corresponding sequence of instructions is:

    Π('c1' p1/'c2' p2) ≡ TestChar c1 |Π(p1)| + 2
                         Π(p1)
                         Jump |Π(p2)| + 2
                         Π('c2')
                         Π(p2)

The following sequence of transitions shows the case when 'c1' and p1 match:

    --TestChar-->  ⟨pc + 1, i + 1, e⟩
    --Π(p1)-->     ⟨pc + |Π(p1)| + 1, i + j + 1, e⟩
    --Jump-->      ⟨pc + |Π(p1)| + |Π(p2)| + |Π('c2')| + 2, i + j + 1, e⟩

First the TestChar instruction is executed, and then the machine executes Π(p1). After this, as there is no backtrack entry to discard, the machine simply jumps to the end of the program.

5. GRAMMARS

In the previous sections we considered only simple patterns in our formalization. Using only the abstract syntax in Table 1 we cannot build patterns that reference themselves; the only way to use another pattern is to include it. We will now correct this by extending our previous definition of patterns with grammars.

We first introduce a countably infinite set of variables, or non-terminals, called V. We will refer to members of this set with the notation Ak, and extend our set of patterns to include all variables. We now define a grammar as a partial function from the set of variables to the set of patterns, such that g(Ak) represents the pattern associated with non-terminal Ak of grammar g. Finally, we extend the domain of our match function to Grammar × Pattern × Σ∗ × N. Figure 5 shows the semantics of the extended match.

We also introduce another new kind of pattern, the closed grammar, a pair Grammar × V. Informally, a closed grammar matches a subject if the pattern referenced by the variable matches the subject, with all variables resolved in the grammar. Thus, if Ak is a non-terminal of grammar g, such that all variables of pattern g(Ak) are resolved in g, then (g, Ak) is a closed grammar. Closed grammars build patterns from grammars, and allow different grammars to be composed using the standard PEG operators. Figure 5 also has the semantics of match for closed grammars.
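In LPEG, closed grammars surface as Lua tables whose rules reference each other through lpeg.V, and the resulting pattern composes like any other. The balanced-parentheses grammar below is a small illustration of ours using the published API:

    local lpeg = require "lpeg"
    local P, V = lpeg.P, lpeg.V

    -- A closed grammar: the table resolves all its own variables.
    -- The first element ("B") names the initial rule.
    local balanced = P{ "B",
      B = P"(" * ((1 - lpeg.S"()") + V"B")^0 * P")",
    }

    -- The closed grammar is an ordinary pattern, so it still composes:
    local exact = balanced * P(-1)        -- must consume the whole subject
    print(lpeg.match(exact, "(a(b)c)"))   --> 8
    print(lpeg.match(exact, "(a(b)c"))    --> nil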

Matching a character:
  (ch.1)  s[i] = 'c'  ⟹  match g 'c' s i = i + 1
  (ch.2)  s[i] ≠ 'c'  ⟹  match g 'c' s i = nil

Matching any character:
  (any.1)  i ≤ |s|  ⟹  match g . s i = i + 1
  (any.2)  i > |s|  ⟹  match g . s i = nil

Not predicate:
  (not.1)  match g p s i = nil  ⟹  match g !p s i = i
  (not.2)  match g p s i = i + j  ⟹  match g !p s i = nil

And predicate:
  (and.1)  match g p s i = i + j  ⟹  match g &p s i = i
  (and.2)  match g p s i = nil  ⟹  match g &p s i = nil

Concatenation:
  (con.1)  match g p1 s i = i + j,  match g p2 s (i + j) = i + j + k  ⟹  match g p1 p2 s i = i + j + k
  (con.2)  match g p1 s i = i + j,  match g p2 s (i + j) = nil  ⟹  match g p1 p2 s i = nil
  (con.3)  match g p1 s i = nil  ⟹  match g p1 p2 s i = nil

Ordered choice:
  (ord.1)  match g p1 s i = nil,  match g p2 s i = nil  ⟹  match g p1/p2 s i = nil
  (ord.2)  match g p1 s i = i + j  ⟹  match g p1/p2 s i = i + j
  (ord.3)  match g p1 s i = nil,  match g p2 s i = i + k  ⟹  match g p1/p2 s i = i + k

Repetition:
  (rep.1)  match g p s i = i + j,  match g p∗ s (i + j) = i + j + k  ⟹  match g p∗ s i = i + j + k
  (rep.2)  match g p s i = nil  ⟹  match g p∗ s i = i

Variables:
  (var.1)  match g g(Ak) s i = i + j  ⟹  match g Ak s i = i + j
  (var.2)  match g g(Ak) s i = nil  ⟹  match g Ak s i = nil

Closed grammars:
  (cg.1)  match g g(Ak) s i = i + j  ⟹  match g′ (g, Ak) s i = i + j
  (cg.2)  match g g(Ak) s i = nil  ⟹  match g′ (g, Ak) s i = nil

Figure 5: Operational Semantics of PEG Patterns with Grammars

Now we will present how to translate these concepts to our parsing machine, and prove the correctness of our translation. Our transformation Π will now operate on the domain Grammar × N × Pattern, where Π(g, i, p) is the translation of pattern p in the context of grammar g, with position i relative to the beginning of the closed grammar which contains it (if the pattern is not part of a closed grammar then the value can be arbitrary). We will show shortly how both g and i are used to resolve variable references, representing the "linking" step of the transformation from patterns to programs.

Extending the transformations given in the previous sections is straightforward, a matter of passing g to subprograms and keeping i correctly updated so as to keep its intended meaning. Appendix A gives the extended transformation. This extension is conservative, so the previous proofs will remain valid once we prove the correctness of the transformation for the patterns we introduce in this section.

To give the results of Π for variables and closed grammars, we first define a transformation Π′ from Grammar × N to Program. This is the invariant part of the transformation of all closed grammars (g, An) based on the grammar g. We define Π′(g, i) to be

    Π(g, i, g(A1))
    Return
    ...
    Π(g, i + Σ_{j=1..k−1} (|Π(g, x, Aj)| + 1), g(Ak))
    Return
    ...
    Π(g, i + Σ_{j=1..n−1} (|Π(g, x, Aj)| + 1), g(An))
    Return

where A1, ..., Ak, ..., An are all variables where g has an image, ordered according to any arbitrary (but consistent) ordering. The x as the position argument in the sizes of subprograms means that the size of a program is independent of the position.

Given Π′ we can define a function o : Grammar × V → N as o(g, Ak) = Σ_{j=1..k−1} (|Π(g, x, Aj)| + 1) (where the 1 accounts for the Return instruction), which takes a grammar g and a variable Ak defined in g, and gives the position of the program derived from Ak in Π′(g, i), relative to i.

With Π′ and o, it is easy to extend Π to variables and closed grammars:

    Π(g, i, Ak) ≡ Call o(g, Ak) − i

    Π(g′, i, (g, Ak)) ≡ Call o(g, Ak)
                        Jump |Π′(g, x)| + 1
                        Π′(g, 2)

The 2 in Π′(g, 2) keeps the invariant that all positions are relative to the first instruction of the closed grammar.

Now we will prove the correctness of our transformations, beginning with closed grammars. We first have to prove that:

    If match g′ (g, Ak) s i = i + j then ⟨pc, i, e⟩ --Π(g′,l,(g,Ak))--> ⟨pc + |Π(g′, l, (g, Ak))|, i + j, e⟩

This case is covered by the semantic rule (cg.1), and we have the following sequence of transitions:

    --Call-->             ⟨pc + o(g, Ak), i, (pc + 1) : e⟩
    --Π(g,o(g,Ak),Ak)-->  ⟨pc + o(g, Ak) + |Π(g, o(g, Ak), Ak)|, i + j, (pc + 1) : e⟩
    --Return-->           ⟨pc + 1, i + j, e⟩
    --Jump-->             ⟨pc + |Π′(g, x)| + 2, i + j, e⟩

The second transition follows from the way we constructed the o function, with pc + o(g, Ak) addressing the first instruction of Π(g, o(g, Ak), Ak); here we use our induction hypothesis, via the semantic rule (cg.1). The last transition uses the identity |Π′(g, x)| + 2 = |Π(g′, l, (g, Ak))|. The proof for the failure case is analogous, via rule (cg.2).

For variables, we are only interested in the use of variables that are embedded in a closed grammar; other uses are left unspecified. This is not a problem in practice, as the semantics of match give us that match g Ak s i = match g (g, Ak) s i. We have to prove that:

    If match g Ak s i = i + j then ⟨pc, i, e⟩ --Π(g,l,Ak)--> ⟨pc + |Π(g, l, Ak)|, i + j, e⟩

This case is covered by the semantic rule (var.1), and the derivation for it is straightforward:

    --Call-->             ⟨pc + o(g, Ak) − l, i, (pc + 1) : e⟩
    --Π(g,o(g,Ak),Ak)-->  ⟨pc + o(g, Ak) − l + |Π(g, o(g, Ak), Ak)|, i + j, (pc + 1) : e⟩
    --Return-->           ⟨pc + 1, i + j, e⟩

Again, given how o is defined, o(g, Ak) − l is the offset from the current position to the first instruction of Π(g, o(g, Ak), Ak). From there the proof follows by induction, via the semantic rule (var.1). The failure case is analogous, via rule (var.2).
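Extending the earlier compiler sketch to closed grammars follows the same recipe: emit each rule's program followed by a Return, record the rule offsets (playing the role of the o function), and link the Call instructions afterwards. The Lua fragment below is our own illustrative rendering of this linking step; it assumes a variant of 'compile' that records pending variable references in a 'calls' list:

    -- Compile a closed grammar {start = "A1", rules = {A1 = p, ...}} to a
    -- program: Call; Jump past the rules; then each rule's code plus Return.
    local function compile_grammar(g)
      local prog = {{op = "Call"}, {op = "Jump"}}
      local offset = {}                 -- the 'o' function: rule name -> position
      local calls = {}                  -- Call sites waiting to be linked
      for name, pat in pairs(g.rules) do  -- any consistent ordering will do
        offset[name] = #prog + 1
        compile(pat, prog, calls)       -- assumed variant recording {site, name}
        prog[#prog + 1] = {op = "Return"}
      end
      prog[1].l = offset[g.start] - 1   -- link the initial Call
      prog[2].l = #prog - 1             -- Jump to just past the program
      for _, c in ipairs(calls) do      -- resolve variable references
        prog[c.site].l = offset[c.name] - c.site
      end
      return prog
    end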

6. RELATED WORK

The main influence on our parsing machine was Knuth's parsing machine [9]. But unlike Knuth's machine, where calls to non-terminals and backtrack entries are the same thing, our machine has a clear distinction between backtrack entries and calls. This distinction is essential to be able to compile patterns without rewriting them first.

One formal basis for pattern matching is regular expressions [14]. Pure regular expressions, however, have several limitations that make the description of some patterns difficult. Because of this, most pattern-matching tools make ad-hoc extensions to the original regular expressions (the extended regular expressions are popularly called regexes), losing their formal basis. As we did not want such a mixture of formal and ad-hoc features, we decided to use a PEG-based approach instead of regexes.

If we compare our PEG machine with other PEG-based approaches, a key difference is that most of them are more focused on parsing instead of pattern matching. One well-known approach is Packrat [1], which uses a memoization algorithm. Packrat is used in several parsing applications, such as Rats! [4], a parser generator based on PEGs. As we mentioned earlier, the linear space cost of Packrat makes it unsuitable for a pattern-matching application, where we have to deal with larger inputs than in parsing applications.

Another use of PEGs for parsing is described by [13], which implements a Java parser directly from the PEG definitions of the Java syntax. The author of [13] also points out that, during the implementation of the parser, he noticed that much of the backtracking and repetition was done when dealing with lexical structures, which suggests that some optimizations should be done to improve the performance of the parser (Rats!, for example, does not use PEGs to define identifiers, keywords, and operators [13]). Our machine addresses this problem by defining instructions like PartialCommit, which is used in repetitions, and the head-fail optimization instructions, which avoid the backtracking when a pattern fails at its first character.

A PEG-based approach that is more similar to ours, in the sense that it also focuses on pattern matching, is OMeta [15]. OMeta extends the definition of PEGs to deal with arbitrarily structured data instead of character strings. OMeta, however, does not define any new approach to parse the data, but relies on a recursive-descent parser without memoization.

Another related work is parser combinators [5], a popular approach for building recursive-descent parsers in the world of functional programming. By using parser combinators it is possible to define a parser through the combination of several parsers, which is similar to our approach of combining programs. But parser combinators use unrestricted backtracking, with a large space and time cost, and implementations in strict languages are harder [5]. One drawback of parser combinators is that most implementations are not efficient in space or time [10]. Some approaches that address the performance issue depend mainly on lazy evaluation [10], which implies the use of memoization.

Our PEG machine, on the other hand, is easily implemented in a low-level language, with good performance. To show this point, we compare the performance of LPEG with two other tools. One of the tools is Lex/Yacc, and the other one is leg [12], a PEG-based tool that generates a recursive-descent parser. Both tools produce a C program, which is statically compiled. LPEG, on the other hand, compiles its patterns dynamically, thus it is possible to modify a pattern or create a new one during runtime.

When using Lex/Yacc, we took the traditional approach, where Lex deals with the lexical part and Yacc with the syntactic one. To eliminate the overhead of reading the input files, all tests read all the input before doing the parsing. To measure the parsing time we used the clock function in C, and the os.clock function in Lua. Each program was run three times against each input, and the mean time was considered.

The first parser that we defined was a parser of arithmetic expressions, whose PEG definition is shown below:

    start    <- (exp '\n')+
    exp      <- factor (factorOp factor)*
    factor   <- term (termOp term)*
    term     <- number space / '(' exp ')' space
    factorOp <- [+-] space
    termOp   <- [*/] space
    number   <- '-'?[0-9]+
    space    <- [\t ]*

Table 2 presents the parsing time (in milliseconds) of each tool for different input sizes. As we can see, the performances of LPEG and Lex/Yacc were similar, and both were faster than leg.

    Size of input   LPEG   Lex/Yacc   leg
    470k            27     20         27
    2460k           107    100        147
    4950k           217    207        297

    Table 2: Time (ms) for Parsing Arithmetic Expressions

The second parser was a parser of lists, where each element of the list was a number or another list, and the elements were separated by spaces. The PEG definition of the parser is shown below:

    start  <- (list '\n')+
    list   <- '(' space (term (space term)*)? ')' space
    term   <- list / number
    number <- '-'?[0-9]+
    space  <- [\t ]*

The performance of the parsers is presented in Table 3. For this simple definition of a list, LPEG presented the best performance.

    Size of input   LPEG   Lex/Yacc   leg
    610k            20     20         27
    3050k           93     113        150
    6130k           187    227        303

    Table 3: Time (ms) for Parsing Lists

The last parser was a parser for a simple language (similar to Scheme), which is presented in [3]. We considered only the syntax that appears on page 71, plus the if-then-else on page 80. The grammar is presented below:

    list      <- (program '\n')+
    program   <- exp
    exp       <- number / if exp then exp else exp / id
                 / primitive space '(' exp (',' space exp)* ')' space
    primitive <- '+' / '-' / '*' / "add1" / "sub1"
    number    <- [0-9]+ space
    space     <- [ \t]*
    space1    <- [ \t]+
    letter    <- [a-zA-Z_]
    if        <- "if" space1
    then      <- "then" space1
    else      <- "else" space1
    reserved  <- ("if" / "then" / "else" / "add1" / "sub1")
                 !(letter / number)
    id        <- !reserved letter (letter / [0-9])* space

The performance of each parser can be seen in Table 4. In this case, Lex/Yacc had the best performance, followed by LPEG and leg, respectively.

    Size of input   LPEG   Lex/Yacc   leg
    620k            33     23         40
    2440k           130    110        147
    3680k           187    173        227

    Table 4: Time (ms) for Parsing a Simple Language

From the benchmarks, we can see that LPEG and Lex/Yacc have very similar performance, and we consider LPEG the better choice for simple tasks: LPEG is a simpler tool than Lex/Yacc, and should be easier to use.
7. CONCLUSIONS

We presented a new approach for implementing PEGs, by compiling them to programs for a virtual parsing machine. Our motivation was to use PEGs as a formal basis for pattern matching, which needed an implementation with good space and time costs.

We gave the operational semantics of our parsing machine, and a transformation from patterns to programs for the machine, along with the corresponding proof of correctness. In our implementation, this transformation is a compilation process that is done during runtime. The compiler builds more complex programs from simpler ones, following rules specific to each PEG operator.

The parsing machine's performance is competitive with the performance of popular parsing tools. As the machine has a very simple execution model, somewhat similar to a real CPU, it seems a perfect candidate for a Just-In-Time (JIT) compiler. A JIT compiler should increase considerably the performance of our implementation.

Another future work is to extend the execution model of our machine to handle infinite loops, and to make the use of patterns with left recursion possible (without rewriting the patterns).

One point that was not discussed here was the use of semantic actions during the matching process. They do not affect the matching process, but are very useful in practice. The use of semantic actions is presented in [7]. We will extend the formal model of the machine to include actions.

8. ACKNOWLEDGMENTS

We would like to thank Fabio Mascarenhas for many suggestions and useful comments about early versions of this paper.

9. REFERENCES

[1] B. Ford. Packrat parsing: simple, powerful, lazy, linear time, functional pearl. SIGPLAN Not., 37(9):36-47, 2002.
[2] B. Ford. Parsing expression grammars: a recognition-based syntactic foundation. In POPL '04: Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 111-122, New York, NY, USA, 2004. ACM.
[3] D. P. Friedman, C. T. Haynes, and M. Wand. Essentials of Programming Languages (2nd ed.). Massachusetts Institute of Technology, Cambridge, MA, USA, 2001.
[4] R. Grimm. Better extensibility through modular syntax. In PLDI '06: Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation, pages 38-51, New York, NY, USA, 2006. ACM.
[5] G. Hutton and E. Meijer. Monadic parsing in Haskell. Journal of Functional Programming, 8(4):437-444, July 1998.
[6] R. Ierusalimschy. Programming in Lua, Second Edition. Lua.Org, 2006.
[7] R. Ierusalimschy. A text pattern-matching tool based on Parsing Expression Grammars. Software - Practice and Experience, 2008 (to appear).
[8] R. Ierusalimschy, L. H. de Figueiredo, and W. C. Filho. Lua - an extensible extension language. Software - Practice and Experience, 26(6):635-652, 1996.
[9] D. E. Knuth. Top-down syntax analysis. Acta Inf., 1:79-110, 1971.
[10] D. J. P. Leijen and H. J. M. Meijer. Parsec: Direct style monadic parser combinators for the real world. Technical Report UU-CS-2001-35, Department of Information and Computing Sciences, Utrecht University, 2001.
[11] LPEG: Parsing Expression Grammars for Lua. Available at http://www.inf.puc-rio.br/~roberto/lpeg, 2008. Visited on April 2008.
[12] peg/leg - recursive-descent parser generators for C. Available at http://piumarta.com/software/peg/, 2008. Visited on June 2008.
[13] R. R. Redziejowski. Parsing expression grammar as a primitive recursive-descent parser with backtracking. Fundamenta Informaticae, 79(3-4):513-524, 2007.
[14] K. Thompson. Programming techniques: Regular expression search algorithm. Commun. ACM, 11(6):419-422, 1968.
[15] A. Warth and I. Piumarta. OMeta: an object-oriented language for pattern matching. In DLS '07: Proceedings of the 2007 symposium on Dynamic languages, pages 11-19, New York, NY, USA, 2007. ACM.

APPENDIX

A. EXTENDED TRANSFORMATION FROM PATTERNS TO PROGRAMS

Matching a character:

    Π(g, i, 'c') ≡ Char c

Concatenation:

    Π(g, i, p1 p2) ≡ Π(g, i, p1)
                     Π(g, i + |Π(g, x, p1)|, p2)

Ordered choice:

    Π(g, i, p1/p2) ≡ Choice |Π(g, x, p1)| + 2
                     Π(g, i + 1, p1)
                     Commit |Π(g, x, p2)| + 1
                     Π(g, i + |Π(g, x, p1)| + 2, p2)

Not predicate:

    Π(g, i, !p) ≡ Choice |Π(g, x, p)| + 2
                  Π(g, i + 1, p)
                  FailTwice

Repetition:

    Π(g, i, p∗) ≡ Choice |Π(g, x, p)| + 2
                  Π(g, i + 1, p)
                  PartialCommit −|Π(g, x, p)|

Variables:

    Π(g, i, Ak) ≡ Call o(g, Ak) − i

Closed grammars:

    Π(g′, i, (g, Ak)) ≡ Call o(g, Ak)
                        Jump |Π′(g, x)| + 1
                        Π′(g, 2)