Measuring and Extending Lr(1) Parser Generation A
Total Page:16
File Type:pdf, Size:1020Kb
MEASURING AND EXTENDING LR(1) PARSER GENERATION A DISSERTATION SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI‘I IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE AUGUST 2009 By Xin Chen Dissertation Committee: David Pager, Chairperson YingFei Dong David Chin David Streveler Scott Robertson We certify that we have read this dissertation and that, in our opinion, it is satisfactory in scope and quality as a dissertation for the degree of Doctor of Philosophy in Computer Science. DISSERTATION COMMITTEE Chairperson ii Copyright 2009 by Xin Chen iii To my family iv Acknowledgements I wish to thank my family for their love and support. I thank my advisor, professor David Pager, for his guidance and supervision. Without him this work would not be possible. Every time when I was stuck at an issue, it was the discussion with him that could lead me through the difficulty. I thank my PhD committee members for giving me feedback on my work, as well as occasional talks and suggestions. I would regret the unfortunate pass away of professor Art Lew and miss his passion and dedication on research. I would like to thank many people in the compiler and parser generator field that I have commu- nicated to in person, by email or via online discussion. From them I received invaluable suggestions and have learned a lot: Francois Pottier, Akim Demaille, Paul Hilfinger, Joel Denny, Paul Mann, Chris Clark, Hans Aberg, Hans-Peter Diettrich, Terence Parr, Sean O’Connor, Vladimir Makarov, Aho Alfred, Heng Yuan, Felipe Angriman, Pete Jinks and more. I would like to thank my GA supervisor Shi-Jen He for allowing me more freedom to concen- trate on research in the final phase. Thanks also go to people not mentioned here but who have nevertheless supported my work in one way or another. v ABSTRACT Commonly used parser generation algorithms such as LALR, LL and SLR all have their restric- tions. The canonical LR(1) algorithm proposed by Knuth in 1965 is regarded as the most powerful parser generation algorithm for context-free languages, but is very expensive in time and space costs and has long been considered as impractical by the community. There have been LR(1) algorithms that can improve the time and space efficiency of the canonical LR(1) algorithm, but there had been no systematic study of them. Good LR(1) parser generator implementation is also rare. LR(k) parser generation is even more expensive and complicated than LR(1), but it can be an alternative to GLR in natural language processing and other applications. To solve these problems, this work explored ways of improving data structure and algorithms, and implemented an efficient, practical and Yacc-compatible LR(0)/LALR(1)/LR(1)/LR(k) parser generator Hyacc, which has been released to the open source community. An empirical study was conducted comparing different LR(1) parser generation algorithms and with LALR(1) algorithms. The result shows that LR(1) parser generation based upon improved algorithms and carefully se- lected data structures can be sufficiently efficient to be of practical use with modern computing facilities. An extension was made to the unit production elimination algorithm to remove redundant states. A LALR(1) implementation based on the first phase of lane-tracing was done, which is an- other alternative of LALR(1) algorithm. The second phase of lane-tracing algorithm, which has not been discussed in detail before, was analyzed and implemented. A new LR(k) algorithm called the edge-pushing algorithm, which is based on recursively applying the lane-tracing process, was de- signed and implemented. Finally, a latex2gDPS compiler was created using Hyacc to demonstrate its usage. vi Contents Acknowledgements v Abstract vi List of Tables xii List of Figures xiv 1 Introduction 1 2 Background and Related Work 4 2.1 Parsing Theory . .4 2.1.1 History of Research on Parsing Algorithms . .4 2.1.2 Classification of Parsing Algorithms . .5 2.2 LR(1) Parsing Theory . .8 2.2.1 The Canonical LR(k) Algorithm of Knuth (1965) . .8 2.2.2 The Partitioning Algorithm of Korenjak (1969) . .8 2.2.3 The Lane-tracing Algorithm of Pager (1977) . .9 2.2.4 The Practical General Method of Pager (1977) . .9 2.2.5 The Splitting Algorithm of Spector (1981, 1988) . 10 2.2.6 The Honalee Algorithm of Tribble (2002) . 10 2.2.7 Other LR(1) Algorithms . 11 2.3 LR(1) Parser Generators . 11 vii 2.3.1 LR(1) Parser Generators Based on the Practical General Method . 11 2.3.2 LR(1) Parser Generators Based on the Lane-Tracing Algorithm . 12 2.3.3 LR(1) Parser Generators Based on Spector’s Splitting Algorithm . 13 2.3.4 Other LR(1) Parser Generators . 13 2.4 The Need For Revisiting LR(1) Parser Generation . 14 2.4.1 The Problems of Other Parsing Algorithms . 14 2.4.2 The Obsolete Misconception of LR(1) versus LALR(1) . 14 2.4.3 The Current Status of LR(1) Parser Generators . 15 2.5 LR(k) Parsing . 15 2.6 Conclusion . 17 3 The Hyacc Parser Generator 18 3.1 Overview . 18 3.2 Architecture of the Hyacc parser generator . 21 3.3 Architecture of the LR(1) Parse Engine . 23 3.3.1 Architecture . 23 3.3.2 Storing the Parsing Table . 25 3.3.3 Handling Precedence and Associativity . 33 3.3.4 Error Handling . 35 3.4 Data Structures . 35 4 LR(1) Parser Generation 38 4.1 Overview . 38 4.2 Knuth’s Canonical Algorithm . 39 4.2.1 The Algorithm . 39 4.2.2 Implementation Issues . 41 4.3 Pager’s Practical General Method . 50 4.3.1 The Algorithm . 50 4.3.2 Implementation Issues . 52 viii 4.4 Pager’s Unit Production Elimination Algorithm . 56 4.4.1 The Algorithm . 56 4.4.2 Implementation Issues . 59 4.5 Extension To The Unit Production Elimination Algorithm . 66 4.5.1 Introduction and the Algorithm . 66 4.5.2 Implementation Issues . 71 4.6 Pager’s Lane-tracing Algorithm . 72 4.6.1 The Algorithm . 72 4.6.2 Lane-tracing Phase 1 . 73 4.6.3 Lane-tracing Phase 2 . 76 4.6.4 Lane-tracing Phase 2 First Step: Get Lanehead State List . 76 4.6.5 Lane-tracing Phase 2 Based on PGM . 79 4.6.6 Lane-tracing Phase 2 Based on A Lane-tracing Table . 83 4.7 Framework of Reduced-Space LR(1) Parser Generation . 94 4.8 Conclusion . 95 5 Measurements and Evaluations of LR(1) Parser Generation 97 5.1 About the Measurement . 97 5.1.1 The Environment and Metrics Collection . 97 5.1.2 The Algorithms . 98 5.1.3 The Grammars . 99 5.2 LR(1), LALR(1) and LR(0) Algorithms . 100 5.2.1 Parsing Table Size Comparison . 100 5.2.2 Parsing Table Conflict Comparison . 103 5.2.3 Running Time Comparison . 105 5.2.4 Memory Usage Comparison . 107 5.3 Extension Algorithm to Unit Production Elimination . 109 5.3.1 Parsing Table Size Comparison . 109 ix 5.3.2 Parsing Table Conflict Comparison . 112 5.3.3 Running Time Comparison . 114 5.3.4 Memory Usage Comparison . 114 5.4 Comparison with Other Parser Generators . 118 5.4.1 Comparison to Dragon and Parsing . 118 5.4.2 Comparison to Menhir and MSTA . 119 5.5 Conclusion . 121 5.5.1 LR(1) and LALR(1) Algorithms . 121 5.5.2 The Unit Production Elimination Algorithm and Its Extension Algorithm . 122 5.5.3 Hyacc and Other Parser Generators . 122 6 LR(k) Parser Generation 124 6.1 LR(k) Algorithm . 125 6.1.1 LR(k) Parser Generation Based on Recursive Lane-tracing . 125 6.1.2 Edge-pushing Algorithm: A Conceptual Example . 128 6.1.3 The Edge-pushing Algorithm . 133 6.1.4 Edge-pushing Algorithm on Cycle Condition . 135 6.2 Computation of theads(α; k) ............................. 137 6.2.1 The Problem . 137 6.2.2 Literature Review on theads(α; k) Calculation . 139 6.2.3 The theads(α; k) Algorithm Used in Hyacc . 141 6.3 Storage of LR(k) Parsing Table . 145 6.4 LR(k) Parse Engine . 148 6.5 Performance . 149 6.6 Examples . 150 6.7 Lane-tracing at Compile Time . 164 6.8 More Issues . 166 7 The Latex2gDPS compiler 171 x 7.1 Introduction . 171 7.2 Design of the Latex2gDPS Compiler . 172 7.2.1 Overall Design . 172 7.2.2 Data Structures . 173 7.2.3 Use of Special Declarations . 174 7.3 Current Status . 175 7.4 An Example . 177 8 Conclusion 179 9 Future Work 181 9.1 Study of More LR(1) Algorithms . 181 9.2 Issues in LR(k) Parser Generation . ..