Full LR(1) Parser Generator Hyacc and Study on the Performance Of

Full LR(1) Parser Generator Hyacc And Study On The Performance of LR(1) Algorithms Xin Chen David Pager Department of Information and Computer Science Department of Information and Computer Science University of Hawaii at Manoa University of Hawaii at Manoa POST Building 317, 1680 East-West Road POST Building 317, 1680 East-West Road Honolulu, HI 96822 Honolulu, HI 96822 (808) 226-3584 (808) 292-5629 [email protected] [email protected] ABSTRACT used to parser generators such as Yacc [2] and later Bison [3], Despite the popularity of LALR(1) parser generators such as which used the less powerful but more practical LALR(1) Yacc/Bison and LL parser generators such as ANTLR, robust and algorithm. LL parser generators such as ANTLR [4] started to effective LR(1) parser generators are rare due to expensive gain popularity since 1990s. However robust and practical LR(1) performance and implementation difficulty. This work employed parser generators keep scarce, not to say the more expensive case relevant algorithms, including the Knuth canonical algorithm, of LR(k). Pager’s practical general method, lane-tracing algorithm, unit To address this issue, this work has developed Hyacc [5], an production elimination algorithm and its extension, and the edge- efficient, practical and Yacc/Bison-compatible open source full pushing algorithm, implemented an efficient, practical and LR(0)/LALR(1)/LR(1) and partial LR(k) parser generation tool in Yacc/Bison-compatible open-source parser generator Hyacc, ANSI C, based on the canonical algorithm of Knuth [1], the lane- which supports full LR(0)/LALR(1)/LR(1) and partial LR(k). tracing algorithm of Pager [6][7], the practical general method of Based on the implementation, an empirical study was conducted Pager [8], the unit production elimination algorithm of Pager [9] comparing different LR(1) parser generation algorithms and and its extension [10], and a new partial LR(k) algorithm called LALR(1) algorithms. The result shows that LR(1) parser the edge-pushing algorithm [10]. generation based upon improved algorithms and carefully selected data structures can be sufficiently efficient to be of practical use Based on the implementation of Hyacc, we investigated with modern computing facilities. details in the existing LR(1) algorithms, and compared the performance of LR(1) algorithms (Knuth’s canonical algorithm [1], Pager’s lane-tracing algorithm [6][7] and Pager’s practical Categories and Subject Descriptors general method [8]) as implemented in Hyacc with the LALR(1) D.3.4 [ Programming Languages ]: Processors – code generation, algorithm as implemented in Bison with regard to the size of parsing, translator writing systems and compiler generators. parsing machine, conflict resolution, as well as running time and space. The performance study was conducted on 13 programming General Terms languages, including Ada, ALGOL60, COBOL, Pascal, Delphi, C, Algorithms, Performance. C++, Java and more. Keywords 1.2 Relevant Parser Generation Algorithms LR(1), Parser Generator, Hyacc, Algorithm, Performance. The three LR(1) algorithms used in Hyacc are actually all LR(k) algorithms. We only employed the case k = 1, because when k > 1 it is computationally expensive and also hard to implement. 1) The canonical algorithm of Knuth (1965) [1]. This 1. INTRODUCTION algorithm is more powerful than LALR(1) and LL parser generation algorithms, and has less restrictions on the structure of 1.1 Overview the grammar. However it was criticized in the past for being In 1965 Knuth proposed the canonical LR(k) algorithm [1], practically infeasible, since its worse case computational which is a very powerful parser generation algorithm for context- complexity grows exponentially. 2) The lane-tracing algorithm of free grammars. But it was criticized for being too expensive in Pager (1977) [6][7]. This algorithm first generates a LR(0) time and space to be practical at that time. Despite some relevant parsing machine, then splits those states that cause conflicts. If the research to reduce the cost of LR(k) algorithm, people soon get grammar is LR(1), such splitting would resolve all the conflicts. 3) The practical general method of Pager (1977) [8]. In contrast to the lane-tracing algorithm, the practical general method solves the Permission to make digital or hard copies of all or part of this work for conflict problem by merging instead of splitting. It generates all personal or classroom use is granted without fee provided that copies are the states in the full LR(1) parsing machine. But when it does so, not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy it merges compatible states along the way. The merging used in otherwise, or republish, to post on servers or to redistribute to lists, Hyacc is based on the concept of weak compatibility. requires prior specific permission and/or a fee. The LR(0) algorithm used in Hyacc is the traditional LR(0) C3S2E-11 2011 , May 16–18, 2011, Montreal [QC, CANADA] Editors: Abran, Desai, Mudur algorithm. The LALR(1) algorithm used in Hyacc is based on the Copyright ©2011 ACM 978 -1-4503 -0626 -3/11/05 $10.00 first phase of the lane-tracing algorithm. Both are implemented The typical viewpoints on the comparison of LR(1) and because Pager’s lane-tracing algorithm depends on these as the LALR(1) algorithms are: i) Although a subset of LR(1), LALR(1) first step. can cover most programming language grammars. ii) The size of the LALR(1) parsing machine is smaller than the LR(1) parsing Unit productions are those with only one symbol on the right machine. iii) Each shift/reduce conflict in a LALR(1) parsing hand side of the production. Removal of unit productions from a machine also exists in the corresponding LR(1) parsing machine. grammar decreases the size of the parsing machine and increases iv) “mysterious” reduce/reduce conflicts exist in LALR(1) parsing parsing speed. Parsing machines generated by Pager’s unit machines but not in LR(1) parsing machines, and “presumably” production algorithm are found to contain duplicate states. this can be handled by rewriting the grammar. Therefore an extension algorithm [10] is designed to remove the redundancy. Pager’s unit production elimination algorithm and However, the LR(1) parser generation algorithm is superior the extension algorithm are both implemented into Hyacc. in that the set of LR(1) grammars is a superset of LALR(1) grammars, and the LR(1) algorithm can resolve the “mysterious Finally, the edge-pushing LR(k) algorithm [10] is based on reduce/reduce conflicts” that cannot be resolved using LALR(1) recursively applying the lane-tracing process to split further on the algorithm. Compiler developers may spend days after days unresolved conflict states. modifying the grammar in order to remove reduce/reduce conflicts Besides the algorithms used in Hyacc, we also briefly review without guaranteed success, and the modified grammar may not a couple of relevant ones below. be the same language as initially desired. Besides, despite the general claim that LR(1) parsing machines are much bigger than Korenjak’s Partitioning Algorithm in 1969 [13] is to LALR(1) parsing machines, the actual fact is that a LR(1) parsing partition a large grammar into small parts, check each part to see machine can be of the same size as a LALR(1) parsing machine whether it is LR(1), generate Knuth canonical LR(1) parsing for LALR(1) grammars. Only for LR(1) grammars that are not table, and combine these small tables into a large LR(1) parsing LALR(1), LR(1) parsing machines are much bigger. Further, there table for the original grammar. exist algorithms that can reduce the running time and parsing Spector first proposed his Splitting Algorithm in 1981 [11], table size, such as those by Pager [6][7][8] and Spector [11][12]. based on splitting the inadequate states of an LR(0) parsing Third is the current status of LR(1) parser generators. There machine. In that sense it is similar to the lane-tracing algorithm of is a scarcity of good LR(1) parser generators, especially with Pager. He further refined his algorithm in 1988 [12]. He did not reduced-space algorithms. Many people even have no idea of the have a formal proof of the validity of the algorithm, and only gave existence of such algorithms. some examples to show how it worked. Considering all the advantages that LR(1) parser generation In addition, there is a general conception of how LR(1) can can provide, we feel it is beneficial to revisit the LR(1) parser be achieved by starting from a LR(0) parsing machine and generation problem, conduct a thorough investigation and provide splitting those states that cause conflicts. This in concept is very a practical solution, so as to bring the power of LR(1) parser similar to the lane-tracing algorithm of Pager and the splitting generation to life. algorithm of Spector. 2. THE HYACC PARSER GENERATOR 1.3 The Need For Revisiting LR(1) Parser Generation 2.1 Overview First is the inadequacy of other parsing algorithms. These Hyacc is an efficient, practical and Yacc/Bison-compatible other parsing algorithms include SLR, LALR, LL and GLR. SLR open source parser generator, written from scratch in ANSI C. It is too restrictive in recognition power. GLR often uses LR(0) or supports full LR(0)/LALR(1)/LR(1) and partial LR(k). Hyacc is LALR(1) in its engine. GLR [14] branches into multiple stacks for pronounced as “HiYacc”, means Hawaii Yacc. Current features of different parse options, eventually disregards the rest and only Hyacc include: keeps one, which is very inefficient and is mostly used on natural 1) Implements the original Knuth canonical algorithm.

Load more