Parsing with Earley Virtual Machines
Total Page:16
File Type:pdf, Size:1020Kb
Communication papers of the Federated Conference on DOI: 10.15439/2017F162 Computer Science and Information Systems, pp. 165–173 ISSN 2300-5963 ACSIS, Vol. 13 Parsing with Earley Virtual Machines Audrius Saikˇ unas¯ Institute of Mathematics and Informatics Akademijos 4, LT-08663 Vilnius, Lithuania Email: [email protected] Abstract—Earley parser is a well-known parsing method used EVM is capable of parsing context-free languages with data- to analyse context-free grammars. While being less efficient in dependant constraints. The core idea behind of EVM is to practical contexts than other generalized context-free parsing have two different grammar representations: one is user- algorithms, such as GLR, it is also more general. As such it can be used as a foundation to build more complex parsing algorithms. friendly and used to define grammars, while the other is used We present a new, virtual machine based approach to parsing, internally during parsing. Therefore, we end up with grammars heavily based on the original Earley parser. We present several that specify the languages in a user-friendly way, which are variations of the Earley Virtual Machine, each with increasing then compiled into grammar programs that are executed or feature set. The final version of the Earley Virtual Machine is interpreted by the EVM to match various input sequences. capable of parsing context-free grammars with data-dependant constraints and support grammars with regular right hand sides We present several iterations of EVM: and regular lookahead. • EVM0 that is equivalent to the original Earley parser in We show how to translate grammars into virtual machine it’s capabilities. instruction sequences that are then used by the parsing algorithm. • EVM1 is an extension of EVM0 that enables the use of Additionally, we present two methods for constructing shared regular operators within grammar rule definitions, thus packed parse forests based on the parse input. easing the development of new grammars. • EVM further extends EVM by allowing the use of I. INTRODUCTION 2 1 regular-lookahead operator. Parsing is one of the oldest problems in computer science. • Finally, EVM3 is an extension of EVM2 that enables Pretty much every compiler ever written has a parser within it. general purpose computation during parsing, and as such Even in applications, not directly related to computer science allows to conditionally control the parsing process based or software development, parsers are a common occurrence. on the results of previously parsed data. Therefore, EVM3 Date formats, URL addresses, e-mail addresses, file paths are can be used to recognize data-dependant language con- just a few examples of everyday character strings that have structs that cannot be parsed by more traditional parsing to be parsed before any meaningful computation can be done methods. with them. It is probably harder to come up with everyday In section III we present two separate methods for con- application example that doesn’t make use of parsing in some structing the abstract syntax trees (or more precisely, shared way rather than list the ones that do. packed parse forests) based on the input data. Automatic Because of the widespread usage of parsers, it no surprise AST construction enables automatic construction of shared that there are numerous parsing algorithms available. Many packed parse forests, without any changes to the grammar. consider the parsing problem to be solved, but the reality On the other hand, manual AST construction requires explicit couldn’t be farther from the truth. Most of the existing parsing instructions from the user to control the AST construction. algorithms have severe limitations, that restrict the use cases of these algorithms. One of the newest C++ programming lan- II. EARLEY VIRTUAL MACHINE guage compiler implementations, the CLang, doesn’t make use A. EVM structure of any formal parsing or syntax definition methods and instead For every terminal input symbol inputi EVM creates a state use a hand-crafted recursive descent parser. The HTML5 is Si. A state Si is a tuple hinputi, F, F Si. arguably one of the most important modern standards, as it Each state Si contains a set of fibers F . Fibers in EVM defines the shape of the internet. Yet the syntax of HTML5 loosely correspond to items in Earley parser. Each fiber is a documents is defined by using custom abstract state machines, tuple hip, origini. ip is the instruction pointer of the current as none of the more traditional parsing methods are capable fiber and origin is the origin state number. The origin value of matching closing and opening XML/HTML tags. indicates the input offset where the currently parsed rule has It is clear that more flexible and general parsing methods are begun. In a away, the origin value may be considered as the needed that are capable of parsing more than only context-free return ”address” for the rule. grammars. Additionally, each state has a set of suspended fibers FS. As such, we present a new approach to parsing: the Earley Fibers are suspended when they invoke different rules/non- Virtual Machine, or EVM for short. It is a virtual machine terminal symbols. A fiber remains suspended until the appro- based parser, heavily based on the original Earley parser. priate non-terminal symbol is matched, at which point the fiber c 2017, PTI 165 166 COMMUNICATION PAPERS OF THE FEDCSIS. PRAGUE, 2017 is resumed by copying it to the set of running fibers F in the TABLE I current state. EVM0 GRAMMAR COMPILATION RULES An EVM0 grammar is a set of productions in form sym → Grammar element Instruction sequence body, where sym is a non-terminal symbol and body is a i_call_dyn main grammar expression. i_match_sym main → laccept An EVM grammar expression is defined recursively as: laccept : 0 Grammar: i_accept • a is a terminal grammar expression, where a is a terminal G = {P1, ..., Pn} i_stop code(P1) symbol. ... • A is a non-terminal grammar expression, where A is a code(Pn) code(e) non-terminal symbol. Production rule: i_reduce P • P e ǫ is an epsilon grammar expression. → i_stop • (e) is a brace grammar expression, where e is a grammar Terminal grammar expression: i_match_char a → ip expression. a next Non-terminal grammar expression • e e is a sequence grammar expression, where e and e i_call_dyn A 1 2 1 2 (dynamic): i_match_sym A → ip are grammar expressions. A next Epsilon grammar expression: Every EVM grammar program is a tuple ǫ hinstrs, rule mapi. instrs is the sequence of instructions Brace grammar expression: code(e) that represents the compiled grammar. rule map is mapping (e) from non-terminal symbols to locations in the instruction Sequence grammar expression: code(e1) e1e2 code(e2) sequence, which represents entry points for the grammar program. It is used to determine the start locations of compiled rules for specific non-terminal symbols. Every instruction in EVM completes with one of the fol- • i_match_char char1 → ip1, ..., charn → ipn. This lowing results: instruction is used to match terminal symbols. If the current input symbol in matches one of the instruc- • r_continue. This result is used to indicate that the curr char ip , origin current fiber should continue executing instructions in tion operands i then a new fiber h i i in S order. That means that after executing an instruction ip state curr+1 is created. After executing this instruction, of current fiber should be increased by 1. the current fiber is discarded (it completes with result r_discard) • r_continue_to ipnew. This behaves exactly like • i_reduce sym r_continue, however ip of current fiber is set to ip . This instruction performs the reduction new sym after completing execution of an instruction. of non-terminal symbol . It is used to indicate that a grammar rule with product sym has matched • r_discard. This result is used to indicate that the current fiber needs to be terminated. successfully. The instruction finds all suspended fibers in state S that have been previously suspended with • r_suspend. This result is used to indicate that the origin i_match_sym current fiber needs to be suspended. instruction and resumes them in state Scurr. Only those fibers are resumed, which have sym To correctly represent all Earley parser grammars, the among their operands. The fibers are resumed by creating following instructions are required: a copy of suspended fiber in the current state with updated • i_call_dyn sym. This instruction is used to begin instruction pointer. The instruction completes with result parsing of non-terminal symbol sym. It dynamically r_continue. invokes all rules with product sym. This constitutes creat- • i_stop. This instruction stops and discards the current ing new fibers in state Scurr with ip pointing to beginning fiber. It is used to destroy the current fiber when a of every rule with product sym and origin curr. The parse rule is matched successfully, usually immediately instruction completes with result r_continue. Because after executing a i_reduce instruction. The instruction the fibers of every state are stored in a set, multiple or completes with r_discard. recursive invocations of the same product have no effect. • i_accept. This instruction is used to indicate the parse • i_match_sym sym1 → ip1, ..., symn → ipn. This input is valid and can be accepted. instruction is used to detect a successful reduction of one EVM0 grammar compilation rules are shown in table I. or more non-terminal symbols sym1, ..., symn. When Productions of a grammar are compiled in sequence. code(E) symi is reduced, a new fiber is created with ip pointing represents instruction sequence for grammar element E, where to corresponding ipi. The instruction operands essentially form a jumptable. Whenever instruction i_match_sym E may be a grammar, a production rule or a grammar is executed, the current fiber is suspended and moved to expression.