Clemson University TigerPrints

All Dissertations

5-2010

PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages

Joel Denny, Clemson University, [email protected]


Recommended Citation
Denny, Joel, "PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages" (2010). All Dissertations. 519.
https://tigerprints.clemson.edu/all_dissertations/519

This Dissertation is brought to you for free and open access by the Dissertations at TigerPrints. It has been accepted for inclusion in All Dissertations by an authorized administrator of TigerPrints. For more information, please contact [email protected].

PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages

A Dissertation Presented to the Graduate School of Clemson University

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Computer Science

by
Joel E. Denny
May 2010

Accepted by:
Dr. Brian A. Malloy, Committee Chair
Dr. Harold C. Grossman
Dr. Jason Hallstrom
Dr. Stephen T. Hedetniemi

Abstract

Composite languages are composed of multiple sub-languages. Examples include the parser specification languages read by parser generators like Yacc, modern extensible languages with complex layers of domain-specific sub-languages, and even traditional programming languages like C and C++. In this dissertation, we describe PSLR(1), a new scanner-based LR(1) parser generation system that automatically eliminates scanner conflicts typically caused by language composition.

The fundamental premise of PSLR(1) is the pseudo-scanner, a scanner that only recognizes tokens accepted by the current parser state. However, use of the pseudo-scanner raises several unique challenges, for which we describe a novel set of solutions. One major challenge is that practical LR(1) parser table generation algorithms merge parser states, sometimes inducing incorrect pseudo-scanner behavior including new conflicts. Our solution is a new extension of IELR(1), an algorithm we have previously described for generating minimal LR(1) parser tables. Other contributions of our work include a robust system for handling the remaining scanner conflicts, a correction for syntax error handling mechanisms that are also corrupted by parser state merging, and a mechanism to enable scoping of syntactic declarations in order to further improve the modularity of sub-language specifications. While the premise of the pseudo-scanner has been described by other researchers independently, we expect our improvements to distinguish PSLR(1) as a significantly more robust scanner-based parser generation system for traditional and modern composite languages.

Table of Contents

Title Page ...... i

Abstract ...... ii

List of Tables ...... v

List of Figures ...... vi

1 Introduction ...... 1
1.1 Problem Statement ...... 2
1.2 Contributions of the Work ...... 2
1.3 Thesis Statement and Evaluation ...... 4
1.4 Dissertation Organization ...... 4

2 Background ...... 5
2.1 Notation ...... 5
2.2 Scanner-based LR(1) Parser Generation ...... 6
2.2.1 Scanners ...... 6
2.2.2 LR(1) Parsers ...... 9
2.3 Composite Languages ...... 15
2.3.1 Parser Specifications ...... 15
2.3.2 Regular Sub-languages ...... 19
2.3.3 Subtle Sub-languages ...... 20
2.3.4 Extensible Languages ...... 22
2.4 Summary ...... 22

3 Methodology ...... 24
3.1 Pseudo-scanner ...... 24
3.2 Lexical Precedence ...... 26
3.2.1 Guiding Principles ...... 27
3.2.2 Traditional Rules ...... 29
3.2.3 Non-traditional Rules ...... 31
3.2.4 Ambiguities ...... 33
3.2.5 Self-consistency ...... 38
3.2.6 Pseudo-scanner Tables ...... 42
3.2.7 Resolver Algorithm ...... 45
3.2.8 Summary ...... 51
3.3 Lexical Ties ...... 51
3.4 Minimal LR(1) ...... 57
3.4.1 LALR(1) Versus Canonical LR(1) ...... 58
3.4.2 IELR(1) ...... 59

3.4.3 IELR(1) Extension for PSLR(1) ...... 63
3.4.4 Summary ...... 68
3.5 Syntax Error Handling ...... 68
3.5.1 Pseudo-scanner ...... 68
3.5.2 Parser ...... 70
3.5.3 Summary ...... 73
3.6 Whitespace and Comments ...... 73
3.7 Scoped Declarations ...... 77
3.8 Summary ...... 79

4 Studies and Evaluation ...... 81
4.1 PSLR(1) Bison ...... 81
4.2 Levine SQL ...... 82
4.3 ISO C99 ...... 84
4.4 Template Argument Lists ...... 86
4.5 Results ...... 87
4.5.1 Grammars and Parser Tables ...... 87
4.5.2 Readability and Maintainability ...... 91

5 Related Work ...... 97
5.1 Nawrocki ...... 97
5.2 Keynes ...... 99
5.3 Van Wyk and Schwerdfeger ...... 103
5.4 Scannerless GLR ...... 107

6 Merits of the Work ...... 110

Bibliography ...... 114

List of Tables

2.1 Parser Tables for Template Argument Lists ...... 12
2.2 A Template Argument List Parse ...... 14

3.1 Unambiguous Sequential Lexical Precedence Function ...... 35
3.2 Opposing Lexical Precedence ...... 41
3.3 IELR(1) Case Studies ...... 60
3.4 IELR(1) Parser Tables ...... 61
3.5 IELR(1) Action Corrections ...... 62

4.1 Grammar Sizes ...... 88
4.2 Parser Tables ...... 88
4.3 Lexical Reveals ...... 88
4.4 Specification Sizes ...... 92
4.5 Complexity Counts ...... 95
4.6 Complexity Frequencies ...... 95

List of Figures

2.1 Scanner-based Parsing ...... 6
2.2 Scanner Conflict Resolution ...... 6
2.3 Template Argument Lists ...... 10
2.4 Yacc Parser Specification Language ...... 16
2.5 Lex Start Conditions ...... 18
2.6 Scoped Declarations ...... 21

3.1 Traditional Lexical Precedence Rules ...... 29
3.2 Non-traditional Lexical Precedence Rules ...... 31
3.3 Intransitive Lexical Precedence ...... 36
3.4 Token Precedence Mixed with Lexeme Precedence ...... 36
3.5 Shortest Match Mixed with Longest Match ...... 36
3.6 Lexical Ties ...... 52

Chapter 1

Introduction

Grammar-dependent software is omnipresent in software development [21]. For example, compilers, document processors, browsers, import/export tools, and generative programming tools are used in all phases of software development, including comprehension, analysis, maintenance, reverse-engineering, code manipulation, and visualization of the application program under study. However, construction of these tools relies on the correct recognition of the language constructs specified by the grammar.

Some aspects of grammar engineering are reasonably well understood. For example, the study of grammars as definitions of formal languages, including the study of LL, LR, LALR, and SLR algorithms and the Chomsky hierarchy, forms an essential part of most computer science curricula.

Nevertheless, parsing as a disciplined study must be reconsidered from an engineering point of view [21, 22]. Many parser developers eschew the use of parser generators because it is too difficult to customize the generated parser or because the generated parser requires considerable modification to incorporate sufficient power to handle complex grammars such as the C++ and C# grammars. Thus, industrial-strength parser development requires considerable effort, and many approaches to parser generation are ad hoc [35, 36].

In this dissertation, we address difficulties in the generation of scanner-based LR(1) parsers for composite languages. Composite languages are composed of multiple sub-languages, each of which may have a wholly different syntax. For example, the language of the parser specification file read by the parser generator Yacc is composed of grammar productions and declarations but also passages of C code [9, 19]. Most traditional programming languages like C or C++ also contain minor sub-languages in the form of comments and literal strings [8, 10]. The need for general techniques for parsing composite languages is growing with the popularity of modern extensible languages, which may contain complex layers of sub-languages composed arbitrarily for the requirements of specific domains [12, 13, 18, 40].

1.1 Problem Statement

Traditional scanner-based LR(1) parser generation encounters a serious difficulty for composite languages. Specifically, there are often points in a parse at which the input character sequence matches token definitions from multiple sub-languages. However, a traditional scanner has no automatic mechanism for determining which sub-languages are currently valid and thus which tokens to recognize. The result is one or more scanner conflicts that must be manually resolved or eliminated.

Unfortunately, traditional scanner-generator tools like Lex [9, 23] provide only weak or cumbersome formal mechanisms for handling scanner conflicts, sometimes leading tool users to develop complex ad-hoc solutions.

1.2 Contributions of the Work

In this dissertation, we describe PSLR(1) (Pseudo-scannerless Minimal LR(1)), a new scanner-based LR(1) parser generation system that automatically eliminates many of the scanner conflicts typically caused by language composition. Our PSLR(1) generator reads a unified scanner and parser specification and then generates a deterministic minimal LR(1) parser with a tightly coupled scanner that we call a pseudo-scanner. Unlike a traditional scanner, whose token recognition power is limited to an FSA (finite state automaton), a pseudo-scanner’s power is augmented by the parser’s stack. Specifically, the pseudo-scanner examines the current parser state, which indicates the syntactic left context of the current point in the parse and thus indicates which sub-languages are currently valid. By only recognizing tokens accepted by the current parser state, the pseudo-scanner automatically eliminates conflicts with tokens from other sub-languages.
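The paragraph above can be illustrated with a minimal sketch of the pseudo-scanner idea. This is not the PSLR(1) implementation; the token names, regular expressions, and the `acceptable` set standing in for a parser state are all hypothetical:

```python
import re

# Hypothetical token definitions in declaration order: (name, regex).
TOKENS = [("INT", r"int"), ("ID", r"[a-zA-Z]+")]

def pseudo_scan(text, acceptable):
    """Sketch of one pseudo-scanner step: consider only tokens that the
    current parser state accepts, then fall back to the traditional
    rules (longest match, then earliest declaration) among those."""
    candidates = []
    for order, (name, regex) in enumerate(TOKENS):
        if name not in acceptable:
            continue  # the current parser state rules this token out
        m = re.match(regex, text)
        if m and m.group(0):
            # Sort key: longer lexeme wins, then earlier declaration.
            candidates.append((len(m.group(0)), -order, m.group(0), name))
    if not candidates:
        return None  # no acceptable token matches: a syntax error
    _, _, lexeme, name = max(candidates)
    return (lexeme, name)
```

For the input `int x;`, a state accepting both tokens yields the keyword, while a state accepting only `ID` yields an identifier instead, which is exactly the reserved-word behavior a C scanner must avoid.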

The premise of the pseudo-scanner has been described independently by several other researchers [20, 27, 40]. As some of them note, use of the pseudo-scanner raises several new challenges.

For example, for many programming languages like C, keywords are reserved words. That is, a scanner is expected not to mistake a keyword for an identifier even if the keyword is erroneous in the current syntactic left context. Unlike a traditional scanner, a pseudo-scanner makes this mistake by default. For some of these challenges, we observe that existing solutions are similar to the solutions we have devised for PSLR(1). In other cases, we feel that existing solutions are unintuitive for tool users or have undesirable side-effects, so in this dissertation we describe alternative solutions.

No existing technique that we have reviewed handles the most difficult challenge of the pseudo-scanner. All existing parser generation systems based on the premise of the pseudo-scanner that we have found employ LALR(1), a popular LR(1) parser table generation technique that merges parser states for efficiency. It is well known that merging parser states can render parser tables less powerful than canonical LR(1) and thus less intuitive to the parser developer [15, 16, 17, 32, 33].

We observe that merging parser states also merges those states’ sets of acceptable tokens and thus can induce incorrect pseudo-scanner behavior including new conflicts, undermining the pseudo-scanner’s advantage over traditional scanners. As part of our preliminary work, we described and implemented IELR(1), a minimal LR(1) parser table generation algorithm that eliminates parser table inadequacies induced by LR(1) state merging but which does not consider the effect of LR(1) state merging on a pseudo-scanner [15, 16]. In this dissertation, we describe an IELR(1) extension to eliminate incorrect behavior induced in the pseudo-scanner as well.

Our work includes many other contributions to the field of parser generation. For example, there can exist scanner conflicts that cannot be eliminated by the basic behavior of the pseudo-scanner or by our IELR(1) extension. We describe a robust system for detecting such conflicts, reporting them, and resolving them according to user declarations. We also describe syntax error handling mechanisms for PSLR(1). One such mechanism that is relevant to both PSLR(1) and traditional scanner-based LR(1) is an LR(1) parsing algorithm extension that we call LAC (lookahead correction), which overcomes syntax error handling problems caused by parser state merging. We also describe a mechanism that enables scoped declarations, declarations like precedence and associativity that are rendered effective only for particular portions of a grammar in order to improve the modularity of sub-language specifications.

1.3 Thesis Statement and Evaluation

We have implemented our PSLR(1) generator as an extension of Bison [2], the GNU implementation of Yacc. In this dissertation, we refer to our extended Bison as PSLR(1) Bison. Internally, Bison employs a traditional scanner-based LR(1) parser for analyzing its input parser specifications. Thus, we initially implemented PSLR(1) Bison to employ traditional scanner-based LR(1) internally as well. The language of these parser specifications is a challenging composition of multiple sub-languages, especially in the case of PSLR(1) Bison because it unifies scanner and parser specifications. Thus, PSLR(1) Bison’s internal parser is itself a candidate for PSLR(1).

As part of the evaluation of our work, we have implemented PSLR(1) Bison’s internal parser a second time using PSLR(1) instead of traditional scanner-based LR(1). We have also implemented a PSLR(1) version of a traditional scanner-based LR(1) parser for SQL written by John R. Levine for his textbook, flex & bison [24]. For each of these parsers, we collect and compare readability and maintainability statistics for the PSLR(1) specification versus the traditional scanner-based LR(1) specification. In addition, we introduce a metric that we call lexical reveals, which measures the corrections made by PSLR(1)’s IELR(1) extension, and we use this metric to examine four case studies. The results from all of these studies support our thesis that PSLR(1) is a significantly more robust parser generation system for composite languages than traditional scanner-based LR(1) parser generation or the approach of coupling LALR(1) with a pseudo-scanner.

1.4 Dissertation Organization

We organize the remainder of this dissertation as follows. In chapter 2, we provide a more detailed background on scanner-based LR(1) parser generation and composite languages. In chapters 3 and 4, we describe our PSLR(1) system and our evaluation of its success. In chapter 5, we review the literature and compare PSLR(1) to several other modern systems for generating parsers for composite languages. Finally, in chapter 6, we select the most well known and robust of these systems, scannerless GLR, and analyze its advantages and disadvantages in order to explain how PSLR(1) is more robust.

Chapter 2

Background

In this chapter, we explain fundamental concepts on which PSLR(1) is based. In section 2.1, we clarify a few aspects of the notation that we employ. We discuss traditional scanner-based LR(1) parser generation in section 2.2. In section 2.3, we discuss composite languages in terms of traditional scanner-based LR(1) parser generation. We summarize in section 2.4.

2.1 Notation

In this paper, we italicize the first occurrence of a formal term that has already been established by existing literature. We use bold italic to indicate original terminology that this paper introduces. This distinction should help the reader determine when we are referencing known concepts and when we are introducing new concepts.

We employ a mathematical notation that we feel communicates our formal models precisely and succinctly. Most of our notation is standard and should be familiar to the reader. However, we now disambiguate a few symbols that are not always used consistently in existing literature:

1. The symbol “:” is consistently read “such that.”

2. “{i : c}” is read “the set of all i such that c is true.”

3. “∃i : c” is read “there exists an i such that c is true.”

4. “∀i : c, s” is read “for every i such that c is true, s is true.”

[Figure 2.1 diagram: Character Sequence → Scanner (FSA) → token sequence → Parser (PDA) → Parse Tree, with the parser requesting each token via getToken().]

Figure 2.1: Scanner-based Parsing. In scanner-based parsing, a scanner employs a finite state automaton to convert a character sequence into a token sequence, and a parser employs a pushdown automaton to convert that token sequence into a parse tree.

(a) A Lex specification with overlapping rules:

    int        return INT;
    [a-zA-Z]+  return ID;

(b) Input: integer    (c) Input: int

Figure 2.2: Scanner Conflict Resolution. (a) In this Lex specification, the regular expressions for the two tokens overlap. (b) First, the scanner selects the longest matching tokens, leaving only ID in this case. (c) Second, the scanner selects the first token declared, INT in this case.

5. “∀i ∈ I, s” is read “for every i in I, s is true.”

6. “σ” indicates the same sequence as “σ[1..|σ|]”, but the latter notation explicitly indexes the range of all elements.

2.2 Scanner-based LR(1) Parser Generation

As illustrated in Figure 2.1, scanner-based parsing splits lexical and syntactic analysis of an input character sequence into two loosely coupled phases. The lexical analysis is performed by a scanner, which we discuss in section 2.2.1. The syntactic analysis is performed by a parser, which we discuss in section 2.2.2. In some contexts, the term parser is also used to refer to the combination of the scanner and parser, and lexical analysis is considered part of syntactic analysis.

In this paper, we assume deterministic scanners and parsers, so the output either corresponds to a single parse tree or is a report of a syntax error. We also assume that the parsing technique is from the LR(1) family as in the case of traditional parser generation tools like Yacc.

2.2.1 Scanners

A scanner generator tool, such as Lex, can be employed to generate a scanner from a formal scanner specification. This scanner specification consists primarily of a lexical syntax specification, which defines tokens as regular expressions. The scanner specification also usually contains literal code in a general-purpose programming language like C, some of which is an integral part of the lexical syntax specification. For example, the scanner specification for Lex in Figure 2.2a defines two tokens, the keyword int and the identifier. The return statements specify formal C names for the tokens, INT and ID.

As illustrated in Figure 2.1, the generated scanner employs an FSA (finite state automaton) to read the input character sequence, match contiguous subsequences against the specified regular expressions, and produce a token sequence. We introduce the following definitions to facilitate our discussion of this behavior.

Definition 2.2.1 (Input Character Set)

Throughout this paper, we refer to the input character set as Ξ. That is, the character sequence read by the scanner is from the set Ξ*.

Definition 2.2.2 (Lexeme)

Given a character sequence λ ∈ Ξ+ and a token t, then the relation t ≅ λ holds iff λ in its entirety matches the regular expression for t. λ is then called a lexeme for t.

Definition 2.2.3 (Prefix)

Given two character sequences ξ ∈ Ξ* and ξ′ ∈ Ξ*, then the relation ξ′ ⪯ ξ holds iff ∃ϱ ∈ Ξ* : ξ = ξ′ϱ. In this case, ξ − ξ′ = ϱ. Also, the relation ξ′ ≺ ξ holds iff ξ′ ⪯ ξ, ξ′ ≠ ξ, and thus |ξ − ξ′| > 0.

Definition 2.2.4 (Character Sequence Matches)

Given a character sequence ξ ∈ Ξ* and a set of tokens T, then the set of matches for ξ over T is the set M(ξ, T) = {(λ, t) ∈ (Ξ+, T) : λ ⪯ ξ ∧ t ≅ λ}. Notice that, because a lexeme cannot be the empty string, M(ξ, T) = ∅ if |ξ| = 0.
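As a concrete companion to Definitions 2.2.3 and 2.2.4, the following sketch enumerates M(ξ, T) by brute force over all nonempty prefixes of ξ; the token names and regular expressions are illustrative only, not part of the formal model:

```python
import re

def is_prefix(prefix, seq):
    """The relation from Definition 2.2.3: prefix is a prefix of seq."""
    return seq.startswith(prefix)

def matches(xi, tokens):
    """M(xi, T) from Definition 2.2.4: all pairs (lexeme, token) where
    the lexeme is a nonempty prefix of xi and matches, in its entirety,
    the token's regular expression. `tokens` maps names to regexes."""
    result = set()
    for i in range(1, len(xi) + 1):
        lam = xi[:i]  # a candidate lexeme; lam is a prefix by construction
        for name, regex in tokens.items():
            if re.fullmatch(regex, lam):  # the lexeme relation of Def. 2.2.2
                result.add((lam, name))
    return result
```

For ξ = "integer" and the two rules of Figure 2.2a, the match set contains the keyword match plus one identifier match per nonempty prefix, eight matches in total.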

In order to progress through an input character sequence ξ, a scanner must select a single match (λ, t) from the set M(ξ, T), and the next match then comes from the set M(ξ − λ, T). How to select a single match is the main focus of our work.

Definition 2.2.5 (Scanner Conflict)

Given a character sequence ξ ∈ Ξ* and a set of tokens T, then a scanner conflict for ξ over T is any set C ⊆ M(ξ, T) such that |C| > 1. We now identify several types of scanner conflicts:

1. Iff C = M(ξ, T), then C is the complete scanner conflict for ξ over T.

2. Iff |C| = 2, then C is a pairwise scanner conflict for ξ over T. Let C = {(λ, t), (λ′, t′)}. There are two types of pairwise scanner conflicts:

(a) An identity conflict occurs when λ = λ′ and thus t ≠ t′.

(b) We introduce the term length conflict for the case when λ ≠ λ′ regardless of whether t = t′. We have chosen this term because, by Definition 2.2.4, (λ ⪯ ξ) ∧ (λ′ ⪯ ξ), and thus (λ ≺ λ′) ∨ (λ′ ≺ λ), so |λ| ≠ |λ′|. For the subtype of length conflict where t = t′, we introduce the term autolength conflict.

The term identity conflict comes from Nawrocki [27]. Nawrocki also defines a subset of length conflicts that he calls LM-conflicts (longest match conflicts), which we explain in section 5.1. We dislike Nawrocki’s term in this case as it implies the traditional view that the longest lexeme is always preferable.

For example, both the keyword’s regular expression and the identifier’s regular expression in Figure 2.2a can match the first three characters of the character sequence of Figure 2.2b. Thus, the keyword token has an identity conflict with the identifier token for this character sequence. However, the identifier’s regular expression can also match any prefix of the character sequence. Thus, the identifier has length conflicts with the keyword and with itself. The identifier’s length conflicts with itself are autolength conflicts.

Definition 2.2.6 (Traditional Lexical Precedence)

Given a character sequence ξ ∈ Ξ* and a set of tokens T, then a scanner generated by a traditional scanner generator like Lex defines a highest precedence match from M(ξ, T) and thus resolves the complete scanner conflict for ξ over T using two lexical precedence rules in the following order:

1. The scanner resolves all length conflicts by selecting the lexeme λ : ∃t : (λ, t) ∈ M(ξ, T) ∧ ∀(λ′, t′) ∈ M(ξ, T), |λ| ≥ |λ′|. λ is said to have precedence over other lexemes that prefix ξ. This rule is usually called longest match or maximal munch.

2. Given the selected λ, if |{t : (λ, t) ∈ M(ξ, T)}| > 1, then the scanner resolves the resulting identity conflicts by selecting the token t whose matching regular expression is declared earliest in the scanner specification. t is said to have precedence over other tokens for which λ is a lexeme.

For example, the keyword from Figure 2.2a matches only the first three characters of the character sequence in Figure 2.2b, but the identifier matches the entire character sequence. Thus, the traditional scanner selects the entire character sequence as the lexeme and selects the identifier as the token. Both the keyword and the identifier match the entire character sequence of Figure 2.2c. The traditional scanner selects the keyword in this case because its regular expression is declared first in the scanner specification.
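A brute-force sketch of Definition 2.2.6's two rules, applied to the rules of Figure 2.2a, reproduces this behavior; the function name is ours, not Lex's:

```python
import re

# Figure 2.2a's rules in declaration order (earlier wins identity conflicts).
RULES = [("INT", r"int"), ("ID", r"[a-zA-Z]+")]

def traditional_scan(xi):
    """Resolve the complete scanner conflict for xi as a traditional
    scanner does (Definition 2.2.6): longest match first, then the
    earliest-declared rule among matches of that length."""
    best = None
    for order, (name, regex) in enumerate(RULES):
        m = re.match(regex, xi)  # each rule's own longest match from xi
        if not m or not m.group(0):
            continue
        key = (len(m.group(0)), -order)  # longer lexeme, then earlier rule
        if best is None or key > best[0]:
            best = (key, (m.group(0), name))
    return best[1] if best else None
```

On "integer" the identifier's longer lexeme wins; on "int" both rules match the same lexeme and declaration order selects the keyword, matching Figures 2.2b and 2.2c.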

2.2.2 LR(1) Parsers

A parser generator tool, such as Yacc, can be employed to generate a parser from a formal parser specification. In this paper, we assume the parsing technique is from the LR(1) family, including LALR(1), canonical LR(1), or some form of minimal LR(1). The parser specification consists primarily of a syntax specification, which contains a context-free grammar and related declarations like precedence and associativity. The parser specification may also contain semantic declarations and literal code in a general-purpose programming language like C. We discuss the various components of a Yacc parser specification in more detail in section 2.3.1. In this section, we discuss how the parser interacts with the scanner, and we formally model the LR(1) parser tables generated from the context-free grammar.

Definition 2.2.7 (Context-free Grammar)

A grammar is a tuple G = (V, T, P, S), such that V is a set of nonterminals, T is a set of terminals, P is a set of productions, and S is the start symbol. The set {V ∪ T} is the set of all the grammar’s symbols, and S ∈ V. In this paper, we are concerned only with context-free grammars, so ∀p ∈ P, ∃ℓ ∈ V : ∃ϱ ∈ {V ∪ T}* : p = (ℓ → ϱ).

[Figure 2.3 content, partially reconstructed from the extraction:]

(a) vector<vector<int>> v;    (b) a >> b;

(c) The unified scanner and parser specification:

    %token-re ID ([a-zA-Z_][a-zA-Z_0-9]*)
    %token-re YYLAYOUT ([\t\n\r ])
    %lex-no-tie '>' '>>'
    %%
    start : decl ';'          /*1*/
          | expr ';'          /*2*/
          ;
    decl  : type id           /*3*/
          ;
    type  : id '<' type '>'   /*4*/
          | ID                /*5*/
          ;
    expr  : id '>>' id        /*6*/
          ;
    id    : ID                /*7*/
          ;

[(d) and (e) are the parse trees for the sentences in (a) and (b); not reproduced.]

Figure 2.3: Template Argument Lists. (a) In C++0x, the character sequence “>>” can close two template argument lists such that one list is nested within the other. (b) It can also be the bitwise right shift operator. (c) The PSLR(1) parser for this unified scanner and parser specification accepts (d) the first example sentence with this parse tree and (e) the second example sentence with this parse tree.

For example, consider the specification in Figure 2.3c. For now, ignore the declarations preceding the “%%” as they would not be permitted by a traditional LR(1) parser generator, which expects the scanner to be generated separately. Instead, consider only the grammar appearing after the “%%”. All grammar symbols appearing immediately before a “:”, such as start or decl, are the grammar’s nonterminals. All other grammar symbols, such as ID or ’>>’, are the grammar’s terminals. Productions are grouped by their LHS’s, and the RHS’s for each LHS are separated with a “|”. Thus, the LHS of productions 1 and 2 is “start”, the RHS of production 1 is “decl ’;’”, and the RHS of production 2 is “expr ’;’”. By default, the start symbol is the first LHS, which is start in our example.

As illustrated in Figure 2.1, the generated parser employs a PDA (pushdown automaton) to read the token sequence produced by the scanner and, using the tokens as terminals in the specified grammar, construct a parse tree in a bottom-up fashion. The parse tree derives the token sequence from the grammar’s start symbol via the grammar productions. For example, the parser for the grammar in Figure 2.3c constructs the parse tree in Figure 2.3d for the character sequence in Figure 2.3a as long as the scanner transforms that character sequence into the token sequence formed by the leaves of the tree. When performed as an action in a bottom-up parser, a production is referred to as a reduction, abbreviated with an “R” in the figure.

Definition 2.2.8 (Augmented Context-free Grammar)

Given a context-free grammar G = (V, T, P, S), then its augmented grammar is G(G) = (V′, T′, P′, S′) such that V′ = V ∪ {S′}, T′ = T ∪ {#}, P′ = P ∪ {(S′ → S #)}, S′ ∉ {V ∪ T}, # ∉ {V ∪ T}, and # is the end token, a special token marking the end of any token sequence.

Given the grammar G from the parser specification, a parser generator like Yacc often employs G(G) instead of G to make detecting the end of the token sequence easier. Moreover, scanners generated by Lex return the # token upon reaching the end of the input character sequence. Thus, the generated parser has successfully parsed and accepted the token sequence as soon as it performs the reduction (S′ → S #).
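Under the representation assumptions noted in the comments (a grammar as a 4-tuple of Python sets and a string start symbol, which is our encoding, not the dissertation's), the augmentation of Definition 2.2.8 can be sketched as:

```python
def augment(grammar):
    """Build the augmented grammar of Definition 2.2.8: add a fresh
    start symbol S', the end token '#', and the production S' -> S #.
    `grammar` is a hypothetical 4-tuple (V, T, P, S), where P is a set
    of (lhs, rhs-tuple) pairs."""
    V, T, P, S = grammar
    S_prime = S + "'"  # any symbol not already in V or T would do
    assert S_prime not in V | T and "#" not in V | T
    return (V | {S_prime}, T | {"#"}, P | {(S_prime, (S, "#"))}, S_prime)
```

For instance, augmenting a one-production grammar over start symbol `start` yields the extra production `start' -> start #`, whose reduction signals acceptance.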

The parser’s PDA is encoded in a set of parser tables. For example, the first column of Table 2.1 shows the canonical LR(1) parser tables for the grammar of Figure 2.3c. Each numbered cell describes an LR(1) parser state. Each row within a state represents an LR(1) item, which consists of a dotted production, a lookahead set, and an LR(1) action. We now present a formal model for LR(1) parser tables.

Definition 2.2.9 (LR(1) Item)

Given a context-free grammar G : G(G) = (V′, T′, P′, S′), then an LR(1) item for G is a tuple m = (p, d, K) such that:

1. ∃ℓ : ∃ϱ : p = (ℓ → ϱ) ∈ P′.

2. d is an integer such that 1 ≤ d ≤ |ϱ| + 1 to specify the index before which to insert a dot in the sequence ϱ. The dot indicates a position in a parse.

3. The tuple (p, d) is the core of m.

4. K is a set of terminals called the lookahead set.

Canonical LR(1):

 0. start′ → · start #  {}  G2
    start → · decl ’;’  {#}  G3
    start → · expr ’;’  {#}  G5
    decl → · type id  {’;’}  G4
    expr → · id ’>>’ id  {’;’}  G6
    type → · id ’<’ type ’>’  {ID}  G6
    type → · ID  {ID}  S1
    id → · ID  {’>>’, ’<’}  S1
 1. type → ID ·  {ID}  R5
    id → ID ·  {’>>’, ’<’}  R7
 2. start′ → start · #  {}  S7
 3. start → decl · ’;’  {#}  S8
 4. decl → type · id  {’;’}  G10
    id → · ID  {’;’}  S9
 5. start → expr · ’;’  {#}  S11
 6. expr → id · ’>>’ id  {’;’}  S13
    type → id · ’<’ type ’>’  {ID}  S12
 7. start′ → start # ·  {}  Acc
 8. start → decl ’;’ ·  {#}  R1
 9. id → ID ·  {’;’}  R7
10. decl → type id ·  {’;’}  R3
11. start → expr ’;’ ·  {#}  R2
12. type → id ’<’ · type ’>’  {ID}  G14
    type → · id ’<’ type ’>’  {’>’}  G15
    type → · ID  {’>’}  S18
    id → · ID  {’<’}  S18
13. expr → id ’>>’ · id  {’;’}  G16
    id → · ID  {’;’}  S9
14. type → id ’<’ type · ’>’  {ID}  S17
15. type → id · ’<’ type ’>’  {’>’}  S19
16. expr → id ’>>’ id ·  {’;’}  R6
17. type → id ’<’ type ’>’ ·  {ID}  R4
18. type → ID ·  {’>’}  R5
    id → ID ·  {’<’}  R7
19. type → id ’<’ · type ’>’  {’>’}  G20
    type → · id ’<’ type ’>’  {’>’}  G15
    type → · ID  {’>’}  S18
    id → · ID  {’<’}  S18
20. type → id ’<’ type · ’>’  {’>’}  S21
21. type → id ’<’ type ’>’ ·  {’>’}  R4

LALR(1) (states 0–17 only; states not listed are identical to the canonical LR(1) states above):

 1. type → ID ·  {ID, ’>’}  R5
    id → ID ·  {’>>’, ’<’}  R7
12. type → id ’<’ · type ’>’  {ID, ’>’}  G14
    type → · id ’<’ type ’>’  {’>’}  G15
    type → · ID  {’>’}  S1
    id → · ID  {’<’}  S1
14. type → id ’<’ type · ’>’  {ID, ’>’}  S17
15. type → id · ’<’ type ’>’  {’>’}  S12
17. type → id ’<’ type ’>’ ·  {ID, ’>’}  R4

Table 2.1: Parser Tables for Template Argument Lists. These are the canonical LR(1) and LALR(1) parser tables for the grammar of Figure 2.3c. The LALR(1) listing shows only the states that differ from the canonical LR(1) listing.

Definition 2.2.10 (LR(1) Action)

Given a context-free grammar G : G(G) = (V′, T′, P′, S′), then an LR(1) action for G is a tuple (a_t, a_p, a_s) such that any one of the following is true:

1. a_t = “S” to indicate a shift action, which removes a token from the input and pushes the LR(1) state a_s onto the parser stack. In this case, a_p is left undefined. We model LR(1) states below in Definition 2.2.11 to contain LR(1) actions.

2. a_t = “G” to indicate a goto action, which removes a nonterminal from the input and pushes the LR(1) state a_s onto the parser stack. Again, a_p is left undefined.

3. a_t = “R” to indicate a reduce action such that a_p ∈ P′ is the production by which to reduce the parser stack. That is, given that a_p = (ℓ → ϱ), then |ϱ| states are popped from the stack, and ℓ is inserted at the front of the input. In this case, a_s is left undefined.

4. During parser conflict resolution, a may be given other values, but that topic is beyond the scope of this paper.
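The shift, goto, and reduce mechanics above can be sketched as a transition on a configuration of (stack, input). The action strings mirror the “S”/“G”/“R” encoding of Table 2.1; the `productions` map passed in is a hypothetical fragment, and the accept action is omitted for brevity:

```python
def apply_action(action, stack, input_syms, productions):
    """Apply one LR(1) action (Definition 2.2.10) to a configuration.
    `action` is a string like "S1", "G6", or "R7"; `stack` is a list of
    state numbers; `input_syms` is the remaining input; `productions`
    maps production numbers to (lhs, rhs-tuple)."""
    kind, arg = action[0], action[1:]
    if kind in ("S", "G"):
        # Shift/goto: consume the next token or nonterminal and push
        # the target state onto the stack.
        return stack + [int(arg)], input_syms[1:]
    if kind == "R":
        # Reduce by production p: pop one state per RHS symbol and
        # insert the LHS nonterminal at the front of the input.
        lhs, rhs = productions[int(arg)]
        return stack[: len(stack) - len(rhs)], [lhs] + input_syms
    raise ValueError("unsupported action: " + action)
```

Replaying the first steps of Table 2.2: shifting ID from state 0 pushes state 1, and reducing by production 7 (id → ID) pops that state and puts `id` back at the front of the input.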



Definition 2.2.11 (LR(1) State)

0 0 0 0 Given a context-free grammar G : G (G) = (V ,T ,P ,S ), then we model an LR(1) state sp for G as a set of LR(1) items augmented with associated LR(1) actions. Thus, ∀m = (p, d, K, a) ∈ sp, given that p = (` → %) ∧ a = (at, ap, as), then:

1.( p, d, K) is an LR(1) item for G.

0 0 0 0 0 0 0 0 2. ∀m = (p , d ,K , a ) ∈ sp : m 6= m , (p, d) 6= (p , d ).

3. a is the LR(1) action for G that is associated with (p, d, K) in that, before conflict resolution

is performed on sp, a is determined by m’s core as follows:

(a) Iff d < |ϱ| + 1, then:

i. ϱ[d] ∈ T′ ⇔ at = “S”.

ii. ϱ[d] ∈ V′ ⇔ at = “G”.

iii. ϱ[d] is the input symbol upon which the action should be performed, so the chosen

parser table generation algorithm computes as based on ϱ[d].

(b) Iff d = |ϱ| + 1, then at = “R” ∧ ap = p. K contains the input tokens upon which this action should be performed.

Given an LR(1) state sp, then the set {(p, d): ∃(p, d, K, a) ∈ sp} is the core of sp. 

Table 2.2 illustrates how the canonical LR(1) parser tables of Table 2.1 parse the input token sequence formed by the leaves of the parse tree in Figure 2.3d. The parser stack is initialized by pushing the start state, state 0, as shown in the first row of the parse. The parser’s next action

Stack                Input                                Action
0                    ID ’<’ ID ’<’ ID ’>’ ’>’ ID ’;’ #    S1
0,1                  ’<’ ID ’<’ ID ’>’ ’>’ ID ’;’ #       R7
0                    id ’<’ ID ’<’ ID ’>’ ’>’ ID ’;’ #    G6
0,6                  ’<’ ID ’<’ ID ’>’ ’>’ ID ’;’ #       S12
0,6,12               ID ’<’ ID ’>’ ’>’ ID ’;’ #           S18
0,6,12,18            ’<’ ID ’>’ ’>’ ID ’;’ #              R7
0,6,12               id ’<’ ID ’>’ ’>’ ID ’;’ #           G15
0,6,12,15            ’<’ ID ’>’ ’>’ ID ’;’ #              S19
0,6,12,15,19         ID ’>’ ’>’ ID ’;’ #                  S18
0,6,12,15,19,18      ’>’ ’>’ ID ’;’ #                     R5
0,6,12,15,19         type ’>’ ’>’ ID ’;’ #                G20
0,6,12,15,19,20      ’>’ ’>’ ID ’;’ #                     S21
0,6,12,15,19,20,21   ’>’ ID ’;’ #                         R4
0,6,12               type ’>’ ID ’;’ #                    G14
0,6,12,14            ’>’ ID ’;’ #                         S17
0,6,12,14,17         ID ’;’ #                             R4
0                    type ID ’;’ #                        G4
0,4                  ID ’;’ #                             S9
0,4,9                ’;’ #                                R7
0,4                  id ’;’ #                             G10
0,4,10               ’;’ #                                R3
0                    decl ’;’ #                           G3
0,3                  ’;’ #                                S8
0,3,8                #                                    R1
0                    start #                              G2
0,2                  #                                    S7
0,2,7                                                     R0

Table 2.2: A Template Argument List Parse. This table shows how the canonical LR(1) parser tables of Table 2.1 parse the input token sequence shown in the initial configuration above. A scanner might return this token sequence for the character sequence in Figure 2.3a, for example.

is always dictated by the state at the top of the stack and the next symbol in the input, ID in this case. State 0 has an item with its dot before ID, and the item’s action is S1, shift to state 1. Thus, the parser removes ID from the input and pushes state 1 onto the stack. The dot in the second item of state 1 is at the end of the RHS of production 7, and the item’s action is thus R7, reduce by production 7. The item’s lookahead set contains the token ’<’, which is the next symbol in the input. Thus, the parser pops one state off the parser stack for each RHS symbol of production

7 and inserts the LHS symbol at the front of the input. Such reduce actions construct the edges of the parse tree. Now state 0 is at the top of the stack again and the next symbol is id. State

0 has an item with its dot before id, and the item’s action is G6, goto state 6. Thus, the parser removes id from the input and pushes state 6 onto the stack. In this paper, we use the term goto such that a goto always follows a reduce and thus is always performed upon seeing a nonterminal in the input. The only exception is that, upon reaching the action R0, the parser terminates with a success status.
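The shift/goto/reduce cycle just described can be summarized as a small table-driven loop. The following Python sketch is our own illustration, not the dissertation's implementation; the table encoding (state → symbol → action) and the tiny demonstration grammar (production 0: start0 → start ’#’, production 1: start → ID) are assumptions made for brevity.

```python
# A minimal LR(1) driver following the model in this chapter: shift and goto
# both push a state and consume an input symbol, and a reduce pops |RHS|
# states and inserts the LHS at the front of the input.
def lr_parse(table, productions, tokens):
    stack = [0]                        # push the start state, state 0
    inp = list(tokens) + ['#']         # '#' marks the end of the input
    while True:
        look = inp[0] if inp else None
        kind, arg = table[stack[-1]][look]   # KeyError here = syntax error
        if kind in ('S', 'G'):         # shift a token / goto on a nonterminal
            inp.pop(0)
            stack.append(arg)
        else:                          # 'R': reduce by production arg
            if arg == 0:               # R0: terminate with a success status
                return True
            lhs, rhs = productions[arg]
            del stack[len(stack) - len(rhs):]
            inp.insert(0, lhs)

# Demonstration grammar and tables (ours, not the dissertation's):
#   0. start0 -> start '#'        1. start -> ID
productions = {0: ('start0', ['start', '#']), 1: ('start', ['ID'])}
table = {
    0: {'ID': ('S', 1), 'start': ('G', 2)},
    1: {'#': ('R', 1)},
    2: {'#': ('S', 3)},
    3: {None: ('R', 0)},
}
```

Running `lr_parse(table, productions, ['ID'])` shifts ID, reduces by production 1, performs the goto on start, shifts ’#’, and accepts, mirroring the final rows of the parse in Table 2.2.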

Definition 2.2.12 (Accepts)

Given the LR(1) parser state sp, then acc(sp) = {t : ∃((ℓ → ϱ), d, K, (at, ap, as)) ∈ sp : (at = “S” ∧ ϱ[d] = t) ∨ (at = “R” ∧ t ∈ K)}.

When the parser is in state sp, if the scanner returns a token that is not in the set acc(sp), the parser reports a syntax error because there is no appropriate parser action on that token. The scanner and parser always stay in sync, so it would seem the scanner could examine the current acc(sp) before selecting the next token in order to avoid inducing a syntax error. Unfortunately, when using traditional scanner-based LR(1) parser generation tools like Lex and Yacc, the only tool-supported communication between the scanner and parser is the token sequence produced by the scanner and the parser’s requests for new tokens. Any other desired communication must be manually coded by the user of the tools. Thus, unless the user manually extends the scanner’s capabilities, the scanner’s recognition power is limited to regular languages. In the next section, we further examine this shortcoming of traditional scanner-based LR(1) parsing in the context of composite languages.
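Definition 2.2.12 translates directly into code. The following Python sketch is illustrative only; the item representation (RHS tuple, 1-based dot position as in the definitions above, lookahead set, action pair) is our own assumption, with state 18 of Table 2.1 reconstructed as test data.

```python
# Compute acc(sp) per Definition 2.2.12: the tokens on which the state has
# some shift or reduce action.  Each item is (rhs, dot, lookaheads,
# (kind, target)) where kind is 'S', 'G', or 'R'.
def acc(state_items):
    tokens = set()
    for rhs, dot, lookaheads, (kind, _target) in state_items:
        if kind == 'S':                 # shift: the token after the dot
            tokens.add(rhs[dot - 1])    # dot is 1-based, as in Definition 2.2.11
        elif kind == 'R':               # reduce: every lookahead token
            tokens |= set(lookaheads)
    return tokens                       # goto items contribute no tokens

# State 18 of Table 2.1:  type -> ID . {'>'} R5   and   id -> ID . {'<'} R7
state18 = [(('ID',), 2, {"'>'"}, ('R', 5)),
           (('ID',), 2, {"'<'"}, ('R', 7))]
```

Here `acc(state18)` yields {’>’, ’<’}, which is why a ’>>’ token returned in that state induces a syntax error.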

2.3 Composite Languages

Most practical languages are the product of some degree of language composition. In section

2.3.1, we explore one of the most obvious examples, the parser specification languages accepted by parser generators like Yacc. In section 2.3.2, we present several examples of regular sub-languages from traditional programming languages like C and C++. In section 2.3.3, we present some less obvious examples of sub-languages. In section 2.3.4, we discuss the complex form of language composition employed by modern extensible languages.

2.3.1 Parser Specifications

As depicted in Figure 2.4a, Yacc’s internal parser parses a parser specification, pyacc, written in the language Lyacc in order to generate the source code, pcode, that is compiled into the specified parser, pgen. Figure 2.4b presents an example pyacc containing an expression grammar. As this example demonstrates, Lyacc is a composite language with several easily distinguishable sub-languages, passages of which are boxed and labeled in the figure. The primary sub-language is Ldecl, which

(a) [Diagram: Yacc reads pyacc ∈ Lyacc and generates pcode ∈ Lcode, which is compiled into the parser pgen.]

(b)
 1  %token NUM                                   Ldecl
 2  %{
 3  /* For printf & fprintf. */
 4  #include <stdio.h>                           Lcode
 5  void yyerror (char const *msg);
 6  %}
 7  %left '|'
 8  %left '&'
 9  %left '+' '-'
10  %left '*' '/' '%'                            Ldecl
11  %union {
12    int num;                                   Lcode
13  }
14  %type <num> expr NUM                         Ldecl
15  %%
16  start
17    : start expr {printf ("%d\n", $2);}        Lact
18    | expr {printf ("%d\n", $1);}
19    ;
20  expr
21    : NUM {$$ = $1;}                           Lact
22    | expr '|' expr {$$ = $1 | $3;}
23    | expr '&' expr {$$ = $1 & $3;}
24    | expr '+' expr {$$ = $1 + $3;}
25    | expr '-' expr {$$ = $1 - $3;}
26    | expr '*' expr {$$ = $1 * $3;}
27    | expr '%' expr {$$ = $3?($1%$3):$1;}
28    | expr '/' expr {
29        if ($3 == 0) {
30          yyerror ("division by zero");        Lact
31          YYERROR;
32        }
33        else $$ = $1 / $3;
34      }
35    ;                                          Ldecl
36  /* FIXME: % should probably handle
37     division by 0 as / does. */
38  %%
39  void
40  yyerror (char const *msg)                    Lcode
41  {
42    fprintf (stderr, "%s\n", msg);
43  }

(c) [Diagram: the composition relationships among Lyacc, Ldecl, Lcode, and Lact.]

Figure 2.4: Yacc Parser Specification Language. (a) Lyacc is the language of a parser specification, pyacc, read by Yacc’s internal parser in order to generate the code, pcode, for the specified parser, pgen. (b) Each passage within this example pyacc is written in the sub-language Ldecl, Lcode, or Lact. (c) Thus, Lyacc is a composite language.

consists of grammar productions and other declarations. The other two main sub-languages are

Lcode and Lact. Lcode consists of literal code in the programming language C to be printed verbatim to pcode. Lact is used for semantic actions and is an extension of Lcode with “$” tokens to refer to semantic values. These composition relationships are depicted in Figure 2.4c.

The syntax of Ldecl is quite different from the syntax of Lcode and Lact, and Yacc must parse each sub-language correctly. For example, in Ldecl, the character “:” is a token that follows the LHS of a grammar production as on lines 17 and 21 of Figure 2.4b. In Lcode or Lact, “:” may appear in a number of places, such as in the C ternary operator on line 27, but there is no grammar production syntax in these sub-languages. There are also many notable differences at the lexical level. For example, in Ldecl, the character sequence “%left” is a single token used to declare left associativity as on lines 7-10. In Lcode or Lact, “%” is the modulo operator token as on line 27, and so “%left” would be the modulo operator token followed by an identifier token, “left”. In Ldecl, the character “|” is a token separating RHS’s of productions that are paired with the same LHS as on lines 18-28, and so two adjacent occurrences of “|” would be two tokens with an empty RHS in between. In Lcode or Lact, the character “|” is the bitwise OR operator token as on line 22, and two adjacent occurrences of “|” would be the single logical OR operator token.

In this section, we assume a Yacc implementation whose internal parser is a traditional scanner-based LR(1) parser. Thus, it is the scanner’s job to recognize character sequences as tokens using an FSA without the help of the parser. Because the lexical rules differ in Ldecl, Lcode, and

Lact, the scanner can encounter conflicts when trying to determine which tokens from which of these sub-languages should be matched for character sequences like “%left”. While the traditional lexical precedence rules from Definition 2.2.6 might resolve a conflict appropriately for one sub-language, they cannot be appropriate for all sub-languages because different sub-languages require different tokens to be selected.

The scanner can completely avoid conflicts among tokens from different sub-languages by recognizing the boundaries between passages of the sub-languages. To illustrate, consider again the example pyacc in Figure 2.4b. At the beginning of pyacc on line 1, Yacc’s internal scanner must start in a state where it recognizes only tokens from Ldecl. Upon recognizing a “%{” token or upon recognizing a “%union” token followed by a “{” token in Ldecl as on lines 2 and 11, the scanner must enter a special state where it recognizes only tokens from Lcode. Upon recognizing a “{” token after a grammar production’s RHS in Ldecl as on lines 17-28, the scanner must enter a special state

 1  <L_DECL>{
 2  "%token"                  return PERCENT_TOKEN;
 3  [a-zA-Z._][a-zA-Z._0-9]*  yylval = strdup (yytext); return NAME;
 4  "%{"                      BEGIN L_CODE; return CODE_START;
 5  [ \t\r\n]                 /* Discard whitespace. */
 6  }
 7  <L_CODE>{
 8  "%}"                      BEGIN L_DECL; return CODE_END;
 9  .|\n                      yylval = strdup (yytext); return CODE;
10  }

Figure 2.5: Lex Start Conditions. This Lex specification handles some of the transitions between the Ldecl and Lcode sub-languages of Lyacc as well as a few tokens.

where it recognizes only tokens from Lact. When the scanner later recognizes a matching “%}” or

“}” token, it must reenter the initial Ldecl state. Also, upon recognizing the second occurrence of the “%%” token in the Ldecl state as on line 38, the scanner must leave the Ldecl state and enter the

Lcode state for the remainder of pyacc.

These Ldecl, Lcode, and Lact states can be implemented as start conditions when using Lex. A start condition is simply an FSA state to which the scanner transitions by default at the end of each token to prepare for the beginning of the next token. Thus, the use of start conditions does not inherently extend the language recognition power of the scanner beyond a pure FSA. For example,

Figure 2.5 shows a Lex specification that handles some of the transitions between Ldecl and Lcode as well as a few tokens. Consider how the generated scanner would behave for lines 1-2 in Figure

2.4b. The scanner starts in the Ldecl start condition and transitions back to it at the end of the “%token” and at the end of the “NUM” because the rules for these tokens on lines 2 and 3 in Figure

2.5 do not specify a transition to a different start condition. However, at the end of the “%{” token, the scanner transitions to the Lcode start condition. As shown on line 4 in Figure 2.5, the Lex user must explicitly specify such a transition to a different start condition by using the BEGIN macro in

C.

Look at the semantic action for the “/” operator on lines 28-34 in Figure 2.4b. Within a passage of Lact between a pair of “{” and “}”, Yacc’s internal scanner must balance nested occurrences of “{” and “}” so that it can determine when to transition back to the Ldecl start condition. However, it is well known that a language containing nested constructs is not a regular language, and thus an FSA is not powerful enough to recognize it [11]. In cases like this, the Lex user often resorts to ad-hoc code that uses a variable to maintain the count of unclosed braces.

The code increments the count for each “{” and decrements the count for each “}”. This code is

equivalent in power to a stack that pushes for each “{” and pops for each “}”. Use of a stack is so often appealing in a scanner that Flex [3], the BSD implementation of Lex, provides a start condition stack, which can also be used for recognizing nested structures. However, like any transition among start conditions, manipulation of this stack must be explicitly coded in C by the Flex user.

An FSA combined with a stack is a PDA. Thus, when the Lex or Flex user adds code to emulate, implement, or manipulate a stack, he is really just reimplementing machinery the parser already possesses. Moreover, regardless of whether the scanner requires a stack because of nested structures, the scanner often returns tokens like “{”, “}”, “%{”, and “%}” to the parser so that the parser can track the transitions between sub-languages for the purposes of syntactic analysis as specified by the grammar. In this case, the contextual information recorded by the scanner’s current start condition or on the scanner’s stack is duplicated on the parser stack. Unfortunately, when using traditional scanner and parser generator tools like Lex and Yacc, the scanner has no way to access the parser’s stack, and so the user must code and maintain the same contextual logic in both the scanner specification and the parser specification.
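The brace-counting code described above might look like the following Python sketch. `find_action_end` is a hypothetical stand-in for the ad-hoc C code a Lex user would write; a real version would also have to skip braces that appear inside comments, character literals, and string literals.

```python
# Counter-based brace balancing, equivalent in power to a stack that pushes
# for each '{' and pops for each '}'.
def find_action_end(text, start):
    """Return the index just past the '}' matching the '{' at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == '{':
            depth += 1        # "push"
        elif text[i] == '}':
            depth -= 1        # "pop"
            if depth == 0:
                return i + 1  # the matching close brace was found
    raise SyntaxError("unbalanced braces in semantic action")
```

For the semantic action on lines 28-34 of Figure 2.4b, this loop correctly skips the inner braces of the `if` statement and stops only at the brace that closes the whole action.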

2.3.2 Regular Sub-languages

In Figure 2.4b, consider the C-style comments on line 3 and lines 36-37, the single-quoted character literals on lines 22-28, and the double-quoted string literals on lines 17-18. Lyacc permits

such comments and literals to appear in each of Ldecl, Lcode, and Lact.¹ Within a comment or literal, characters like “%” do not represent the same tokens they do outside the comment or literal.

For example, any occurrence of “%}” or “}” embedded in a comment or literal does not indicate a transition from the Lcode or Lact start condition back to the Ldecl start condition. Thus, even though Yacc can treat most characters in Lcode and Lact as inert text to be printed verbatim to pcode, it must fully parse comments and literals in Lcode and Lact in order not to misinterpret the characters contained within.

Because the syntax of the comment, character literal, and string literal each is regular, each can be specified as a single token with a single regular expression within each sub-language of Lyacc. When using a scanner-generator like Lex, this approach requires that the scanner recognize each comment and literal as a whole without the possibility of separate actions for individual parts of the comment or literal. For example, the syntax of the string literal includes a set of escape sequences,

¹ Yacc does not actually permit string literals to appear within Ldecl, but Bison does.

such as the “\n” in the "%d\n" shown on line 17 in Figure 2.4b. If the string literal is specified as a single token with a single regular expression, then "%d\n" must be fully recognized and later rescanned in order to convert the contained “\n” into a newline. Also, if the closing quotes were missing on line 17, the scanner would reject the entire string literal with a generic lexical error message with no acknowledgement that the text is the beginning of a string literal. However, a user-friendly scanner should instead report different error messages for the missing quotes and for each invalid escape sequence or other invalid character within the string literal.

The developers of Bison’s internal scanner specification apparently realized that the structure of the comment, character literal, and string literal each is complex enough to be handled as a separate sub-language in the scanner. Thus, Bison’s internal scanner specification assigns a unique start condition to each and breaks up each regular expression into multiple parts with separate actions. From the viewpoint of a robust scanner then, most traditional programming languages, such as C or C++, exhibit at least a simple composition of languages because they contain regular sub-languages like comments and literals. Lex’s start conditions are sufficient to handle these regular sub-languages without a stack, but adding the associated start conditions further complicates the scanner specification and worsens maintainability.
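The difference between recognizing a literal as one token and scanning it piecewise can be sketched as follows. This Python sketch is our own illustration; the escape table and the error messages are assumptions, but it shows how per-part actions permit both immediate escape conversion and targeted error reports.

```python
# Scanning a double-quoted string literal piecewise rather than with one
# regular expression, so each escape sequence has its own action and each
# failure mode has its own error message.
ESCAPES = {'n': '\n', 't': '\t', '\\': '\\', '"': '"'}   # illustrative subset

def scan_string_literal(text, i):
    """text[i] must be the opening quote; return (value, index past close)."""
    out, i = [], i + 1
    while i < len(text) and text[i] != '"':
        if text[i] == '\\':                     # a separate action per escape
            if i + 1 >= len(text) or text[i + 1] not in ESCAPES:
                raise SyntaxError(f"invalid escape sequence at index {i}")
            out.append(ESCAPES[text[i + 1]])    # convert immediately, no rescan
            i += 2
        else:
            out.append(text[i])
            i += 1
    if i == len(text):                          # a targeted error message
        raise SyntaxError("unterminated string literal")
    return ''.join(out), i + 1
```

Scanning the literal "%d\n" from line 17 of Figure 2.4b this way yields the converted value in one pass, and a missing closing quote or a bad escape produces its own specific diagnostic.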

2.3.3 Subtle Sub-languages

Ldecl, Lcode, and Lact are clear examples of distinct sub-languages within Lyacc. Because comments, character literals, and string literals can be expressed with regular expressions, the motivation to classify them as sub-languages was less obvious until we considered them from the viewpoint of a robust scanner. In this section, we explore examples of sub-languages that are perhaps even less obvious than these.

Consider extending the Lex specification in Figure 2.5 to handle all transitions between

Ldecl and Lcode. If we ignore the issue of comments and literals, these lexical rules already handle the code between lines 2 and 6 in Figure 2.4b, where a “%}” cannot appear because it indicates a transition from Lcode back to Ldecl. However, these lexical rules do not handle the code between lines 11 and 13. In this case, “}” instead indicates the transition back to Ldecl. In general, “{” and “}” must be balanced here as in Lact. Moreover, after line 38, the scanner should not treat any of these character sequences specially because a transition back to Ldecl is not possible. In other words, from the viewpoint of the scanner, Lcode really consists of three sub-languages whose

(a) [The example sentence copied from the C++0x right angle bracket proposal [37], with its passages of Lc, Lt, and Lp boxed; the boxed layout is not reproducible here.]

(b) %lex-prec ’>’ -< ’>>’   for Lc and Lp

(c) %lex-prec ’>>’ -< ’>’   for Lt

(d) template_id: template_name ’<’ template_argument_list_opt ’>’ ;

(e) primary_expression: ’(’ expression ’)’ ;

Figure 2.6: Scoped Declarations. Let Lc be the main C++0x language, let Lt be the template argument list sub-language, and let Lp be the parenthesized expression sub-sub-language. (a) We have copied this example sentence from the C++0x proposal for handling right angle brackets [37]. The interpretation of the character sequence “>>” is different in (b) Lc and Lp than in (c) Lt. (d) In this C++ grammar production, template_argument_list_opt is a start symbol for the grammar of Lt. (e) In this production, expression is a start symbol for the grammar of Lp.

only distinction is how to handle the tokens that delimit passages of the sub-language. Each such sub-language requires its own start condition, which further complicates the scanner specification for Lyacc and worsens maintainability. Another subtle example of a sub-language appears in the evolving C++0x specification [37].

Figure 2.3a shows an example C++ statement involving templates. It declares the variable v to be a vector of list’s of string’s. That is, list is the argument of the template vector, and string is the argument of the nested template list. Intuitively then, the character sequence “>>” contains two distinct right angle bracket tokens such that the first closes the template argument list of list and the second closes the template argument list of vector. However, Figure

2.3b shows another example C++ statement where “>>” is a single token, the bitwise right shift operator. Previous versions of C++ required two consecutive right angle brackets to be separated by whitespace in order not to be interpreted as the single bitwise right shift operator token, so

Figure 2.3a contains a syntax error in that case. However, C++0x proposes that, within template argument lists but not elsewhere, “>>” should be interpreted as two tokens regardless of intervening whitespace.

In this way, the C++0x template argument list requires different lexical rules than other parts of a C++0x program. From the viewpoint of the scanner then, the C++0x template argument list is a distinct sub-language. A “<” indicates a transition into this sub-language. Because of nesting, a scanner must employ a stack in order to recognize which “>” indicates a transition back out.

To complicate matters further, C++0x proposes that nested parentheses and square brackets appearing in the template argument list sub-language delimit a sub-sub-language where “>>” is once

again interpreted as a single bitwise right shift operator token. Of course, to recognize the transition out of this sub-sub-language, the scanner must employ a stack to balance parentheses and square brackets. Figure 2.6a shows an example sentence copied from the C++0x proposal for handling right angle brackets [37]. This sentence employs all three layers of sub-languages, passages of which we have boxed and labeled in the figure. We discuss Figures 2.6b–2.6e in section 3.7.
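The nesting behavior described in this section can be sketched with a stack of counters: one template-nesting counter per level of parentheses or square brackets. The following Python sketch is a simplification of our own; unlike a real C++ scanner, it assumes every ’<’ opens a template argument list, a decision that actually requires the syntactic left context that PSLR(1) supplies automatically.

```python
# Decide whether each '>>' (as a traditional scanner would return it) is one
# bitwise right shift operator or two closing angle brackets.
def split_right_shift(toks):
    out, depths = [], [0]
    for tok in toks:
        if tok == '<':
            depths[-1] += 1               # enter a template argument list
        elif tok == '>' and depths[-1] > 0:
            depths[-1] -= 1               # leave one level
        elif tok in ('(', '['):
            depths.append(0)              # parens/brackets suspend the context
        elif tok in (')', ']') and len(depths) > 1:
            depths.pop()                  # resume the template context
        elif tok == '>>' and depths[-1] > 0:
            depths[-1] = max(depths[-1] - 2, 0)
            out.extend(['>', '>'])        # two closing angle brackets
            continue
        out.append(tok)
    return out
```

For the token sequence of Figure 2.3a the ’>>’ is split into two ’>’ tokens, for Figure 2.3b it is left as one shift operator, and inside parentheses within a template argument list it is again left alone.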

2.3.4 Extensible Languages

A composite language like those we have discussed so far is often specified along with all of its sub-languages in a single scanner specification and a single parser specification. However, there is a modern movement to specify sub-languages in separate modules so that both language experts and non-experts can more easily compose selected sub-languages into arbitrarily complex layers in order to generate custom parsers for specific domains. Such languages are often called extensible languages [12, 13, 18, 40]. For example, Van Wyk and Schwerdfeger embed SQL schemas, SQL queries, and condition tables into an extensible version of Java [40]. Grim extends C with an AOP

(aspect-oriented programming) notation and embeds C within Java [18]. Bravenboer et al. develop an extensible specification of AspectJ, which is in turn an AOP extension of Java [12]. Their goal is to facilitate the specification of new Java language extensions as researchers experiment with evolving AOP concepts and syntaxes.

Because extensible languages permit a complex and arbitrary style of language composition, they exacerbate the difficulties that traditional scanner-based LR(1) parser generators already pose for simpler composite languages. Specifically, the user’s task of manually replicating each newly developed composite grammar’s sub-language transition logic into a separate scanner specification erodes the modularity of sub-language specifications and thus impedes the development and maintenance of new composite language specifications. The difficulty of this task may even be preventative if the user is not a language expert.

2.4 Summary

A composite language is composed of multiple sub-languages. Language composition is obvious when some sub-languages are already specified independently. For example, the parser specification language read by a parser generator like Yacc contains grammar productions and declarations as well as passages of the programming language C. When languages are composed in this way, sub-language transitions during parsing usually entail corresponding transitions in the lexical rules. Thus, the scanner must recognize sub-language transitions in order to avoid scanner conflicts among the various sub-languages. The scanner for a traditional programming language like C or

C++ must sometimes recognize similar transitions in its lexical rules. From the viewpoint of the scanner then, such languages are also composite languages.

Scanner-generator tools like Lex provide start conditions for implementing sub-language transitions in the scanner, but the user must specify these transitions explicitly in the scanner specification. In a parser specification, the user often must specify the same sub-language transitions implicitly in the grammar. The contextual information recorded by the scanner’s current start condition at run time is then duplicated on the parser’s stack. Nevertheless, traditional scanner and parser generators attempt to generate loosely coupled scanners and parsers, so the user must maintain these tightly coupled scanner and parser specifications separately but consistently. Scanner and parser specifications would be significantly more maintainable if all sub-language transitions were instead computed from a grammar by a parser generator and recognized automatically by the scanner using the parser’s stack. The need to automate sub-language transitions in the scanner in this way is growing with the popularity of modern extensible languages, which may contain complex layers of sub-languages composed arbitrarily for the requirements of specific domains.

Chapter 3

Methodology

In this chapter, we detail the functionality of our PSLR(1) generator and the tightly coupled minimal LR(1) parsers and pseudo-scanners that it constructs. We also describe the lexical formalisms that we add to the language of Bison parser specifications to form PSLR(1)’s unified specification language. In section 3.1, we explain how a pseudo-scanner generated from such a PSLR(1) specification automatically resolves scanner conflicts among tokens from multiple sub-languages. In section 3.2, we describe a lexical precedence system for resolving remaining scanner conflicts. In sections 3.3–3.6, we describe mechanisms to address a number of new challenges that are not exhibited by traditional scanners. In section 3.7, we describe a mechanism to permit scoped syntactic declarations. We summarize in section 3.8.

3.1 Pseudo-scanner

Consider Figure 2.3c on page 10, which presents a PSLR(1) specification for a scanner and parser that accept the sentences in Figures 2.3a and 2.3b with the parse trees in Figures 2.3d and

2.3e, respectively. The tokens within it are named ’;’, ’<’, ’>’, ’>>’, and ID. The tokens whose names are quoted are literal character sequences, and so their regular expressions are implicit. For example, the token ’;’ matches only the semicolon. The token regular expression directive, denoted %token-re, specifies regular expressions for other tokens. For example, the character sequences in Figures 2.3a and 2.3b that match the regular expression specified for the token ID are “vector”, “list”, “string”, “v”, “a”, and “b”. The token YYLAYOUT is special in that it

specifies characters that the pseudo-scanner should discard as whitespace. We ignore YYLAYOUT for the remainder of this section and discuss it in detail in section 3.6. We discuss the %lex-no-tie directive in section 3.3.

In section 2.2.2, we assumed that a scanner would return the proper token sequence for the character sequence in Figure 2.3a, and we discussed how an LR(1) parser for the grammar of Figure

2.3c would then parse that token sequence. However, a traditional scanner actually encounters a length conflict upon reaching “>>” in Figure 2.3a as well as in Figure 2.3b. That is, the scanner could recognize the first “>” as a ’>’ token, or it could recognize the entire “>>” as a ’>>’ token.

For both sentences, the traditional scanner resolves this conflict using longest match by default and thus selects the ’>>’ token. Consider again the first column of Table 2.1, which shows the canonical

LR(1) parser tables for the grammar of Figure 2.3c. When the parser reaches the “>>” in Figure

2.3b, the parser’s current state, sp, is state 1, and so ’>>’ is in acc(sp) but ’>’ is not. Because the traditional scanner returns ’>>’, the parser proceeds without error. However, as illustrated in

Table 2.2, when the parser reaches the “>>” in Figure 2.3a, sp is state 18, and so ’>’ is in acc(sp) but ’>>’ is not. Because the traditional scanner again returns ’>>’, the parser reports a syntax error even though Figure 2.3c specifies that this sentence is acceptable.

As we explained for C++0x in section 2.3.3, we say that the “>>” appears in a different sub-language in Figure 2.3a than in Figure 2.3b. With a traditional scanner generator tool like Lex, the user must manually specify transitions between the sub-languages in order to eliminate the conflict and select the correct token at the “>>” in each sentence. However, as we just demonstrated for both sentences, acc(sp) already contains the correct token for the current sub-language at this point without the conflicting token from the other sub-language. Thus, if the scanner were to recognize and return only tokens that are in acc(sp), it would detect sub-language transitions automatically based on syntactic left context as indicated by sp. This exploitation of the parser’s stack is the simple premise of the pseudo-scanner, which we now formalize.

Definition 3.1.1 (Pseudo-scanner)

Given an LR(1) parser whose state set is Σp, then a scanner behaves as a pseudo-scanner for that parser iff, ∀ξ ∈ Ξ*, ∀sp ∈ Σp : M (ξ, acc(sp)) ≠ ∅, when the input character sequence is ξ and the parser is in state sp, the scanner selects its match from M (ξ, acc(sp)).

According to Definition 3.1.1, a pseudo-scanner for the canonical LR(1) parser in Table 2.1

returns a ’>’ token at the “>>” in Figure 2.3a, and it returns a ’>>’ token at the “>>” in Figure

2.3b. In other words, the pseudo-scanner eliminates the traditional scanner’s conflict automatically, it returns the correct token for both sentences, and the parser proceeds without error.
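Definition 3.1.1 can be illustrated with a few lines of code. The following Python sketch reconstructs the token set of Figure 2.3c and resolves any remaining tie by longest match; both the token table and the tie-breaking rule are our own assumptions (PSLR(1)'s actual tie breaking is the lexical precedence system of section 3.2).

```python
import re

# Tokens of the grammar of Figure 2.3c; the regular expressions for the
# quoted tokens are implicit.
TOKENS = {"';'": r';', "'<'": r'<', "'>'": r'>', "'>>'": r'>>',
          'ID': r'[A-Za-z_][A-Za-z_0-9]*'}

def matches(text, allowed):
    """M(xi, allowed): each (lexeme, token) pair matching a prefix of text."""
    found = set()
    for tok, rx in TOKENS.items():
        m = re.match(rx, text)
        if tok in allowed and m:
            found.add((m.group(), tok))
    return found

def pseudo_scan(text, accepted):
    """Select a match from M(xi, acc(sp)), per Definition 3.1.1."""
    ms = matches(text, accepted)
    if not ms:
        raise SyntaxError("no token acceptable to the current parser state")
    return max(ms, key=lambda m: len(m[0]))  # placeholder longest-match tie break
```

With acc(sp) from canonical state 18 of Table 2.1, scanning “>>” yields a ’>’ token; with acc(sp) from state 1, it yields a ’>>’ token, exactly as described above.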

3.2 Lexical Precedence

In Definition 3.1.1, it is possible that |M (ξ, acc(sp))| > 1. That is, while a pseudo-scanner helps to eliminate scanner conflicts between tokens from different sub-languages, some scanner conflicts may remain. We refer to these remaining scanner conflicts as pseudo-scanner conflicts.

Definition 3.2.1 (Pseudo-scanner Conflict)

Given a character sequence ξ ∈ Ξ* and an LR(1) parser state sp, then a pseudo-scanner conflict for ξ in sp is a scanner conflict for ξ over acc(sp). 

The cause of a pseudo-scanner conflict is that syntactic left context as tracked on the parser’s stack is not powerful enough to invalidate some of the sub-languages that have conflicting tokens.

Moreover, the conflict may be induced by an ambiguity within the PSLR(1) specification. In any case, the user can try to rewrite his PSLR(1) specification to eliminate the conflict, or he can depend on lexical precedence rules to define a highest precedence match from M (ξ, acc(sp)) and thus resolve the conflict.

PSLR(1) specifications can employ the lexical precedence directive, denoted %lex-prec, for explicitly declaring several kinds of lexical precedence rules to resolve pseudo-scanner conflicts without the need for traditional start conditions or other ad-hoc code. In section 3.2.1, we explain the guiding principles we followed while developing the %lex-prec directive. In section 3.2.2, we explain how the %lex-prec directive can be used to declare traditional lexical precedence rules in accordance with these guiding principles. In section 3.2.3, we describe support for several non-traditional rules. In sections 3.2.4 and 3.2.5, we discuss unambiguous and ambiguous uses of the various lexical precedence rules. In section 3.2.6, we describe the pseudo-scanner tables that the

PSLR(1) generator constructs to encode the lexical precedence rules selected by the user. In section

3.2.7, we describe the generator’s algorithm for discovering, resolving, and reporting pseudo-scanner conflicts in those tables. We summarize in section 3.2.8.

3.2.1 Guiding Principles

A traditional scanner generator like Lex resolves all scanner conflicts using the lexical precedence rules given in Definition 2.2.6 without warning the user. This approach is dangerous especially in the face of software evolution. As our example in section 3.1 demonstrates, when a conflict arises for an unambiguous scanner and parser specification because the scanning and parsing method is not quite powerful enough to implement that specification, any resolution of that conflict can eliminate one or more sentences from the language. If, instead, the specification is ambiguous, any resolution of the conflict can at least eliminate one possible parse tree. Thus, if the user makes a change to an existing specification and unexpectedly induces a new conflict that the generator resolves quietly, there may be severe and undesirable changes in behavior without warning, and exhaustive testing is vital to reveal these changes. This problem is even more dangerous in the case of modern extensible languages for which sub-languages may be composed and reorganized arbitrarily. Unlike traditional scanner generators and scanner conflicts, traditional LR(1) parser generators do report parser conflicts to the user. Our PSLR(1) generator extends the latter practice for pseudo-scanner conflicts.

Recall the distinction between a complete scanner conflict and a pairwise scanner conflict from Definition 2.2.5, and recall that a pseudo-scanner conflict is a kind of scanner conflict as described in Definition 3.2.1. A pseudo-scanner’s behavior is not fully defined unless all complete pseudo-scanner conflicts are resolved. That is, for every character sequence ξ ∈ Ξ* and for every LR(1) parser state sp such that |M(ξ, acc(sp))| > 1, the complete pseudo-scanner conflict for ξ in sp is the set of matches M(ξ, acc(sp)), which must be resolved by defining a highest precedence match from that set. Lexical precedence rules declared with %lex-prec in a PSLR(1) specification can define this highest precedence match by resolving pairwise pseudo-scanner conflicts contained within that complete pseudo-scanner conflict. However, it is not always necessary to resolve all such pairwise pseudo-scanner conflicts. For example, given some lexeme λ matched from ξ and given a set of three unique tokens {t, t′, t″} ⊆ acc(sp), assume that M(ξ, acc(sp)) = {(λ, t), (λ, t′), (λ, t″)}. To use %lex-prec to specify that (λ, t″) is the highest precedence match, it is sufficient to specify rules for identity conflicts such that t < t″ and t′ < t″ without specifying any relationship between t and t′. Thus, the complete pseudo-scanner conflict {(λ, t), (λ, t′), (λ, t″)} is resolved without resolving the pairwise pseudo-scanner conflict {(λ, t), (λ, t′)}.
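This resolution strategy can be sketched in a few lines: a complete conflict is resolved whenever the declared pairwise rules single out a unique highest precedence match, even if some pairs remain unrelated. The sketch below is illustrative only, not the dissertation’s implementation; the function name and the token labels t, t1, t2 are our own.

```python
# Sketch (not from the dissertation): resolving a complete pseudo-scanner
# conflict from a partial set of pairwise rules.  Declaring only t < t2 and
# t1 < t2 suffices to single out the match for t2; t vs t1 stays unresolved.
def highest_precedence(matches, less_than):
    """Return the unique match that beats every other match, else None."""
    for m in matches:
        if all(m == m2 or (m2, m) in less_than for m2 in matches):
            return m
    return None

matches = [("x", "t"), ("x", "t1"), ("x", "t2")]   # same lexeme, three tokens
rules = {(("x", "t"), ("x", "t2")),                # t  < t2
         (("x", "t1"), ("x", "t2"))}               # t1 < t2
print(highest_precedence(matches, rules))          # the match for t2 wins
```

With no rules at all, the same function returns None, modeling an unresolved complete conflict.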

Our PSLR(1) generator reports every complete pseudo-scanner conflict that the user has not explicitly resolved with %lex-prec. Moreover, once all complete pseudo-scanner conflicts are resolved, our generator reports any useless lexical precedence rule the user has declared using %lex-prec. A useless lexical precedence rule is a lexical precedence rule that does not specify any pairwise pseudo-scanner conflict resolution that is employed in at least one complete pseudo-scanner conflict resolution. In this way, when a new complete pseudo-scanner conflict arises, rather than quietly resolving it and producing potentially undesirable behavior, our PSLR(1) generator alerts the user immediately.

For the same reasons, lexical precedence relations are not implicitly transitive. For example, if for identity conflicts the user declares that token t has lower precedence than token t′ and declares that token t′ has lower precedence than token t″, then the only identity conflicts these rules resolve are identity conflicts between t and t′ and identity conflicts between t′ and t″. These rules do not imply any relationship between t and t″, for which there might not yet be any identity conflict that needs to be resolved in order to resolve complete pseudo-scanner conflicts. If such an identity conflict between t and t″ does arise and there is no separate rule that resolves it, our generator reports to the user the complete pseudo-scanner conflicts in which the identity conflict appears.

We make one exception to the above principles. When a token has an unresolved pairwise pseudo-scanner conflict with itself, our generator resolves the conflict using longest match without warning the user. For the case of identity conflicts, the justification is simple. By Definition 2.2.5, a token cannot have an identity conflict with itself. For the case of autolength conflicts, the justification is threefold. First, by far the most common rule we have found useful in this case is the traditional rule, longest match. Second, in our experience, autolength conflicts are usually no surprise to the user. Autolength conflicts are often obvious from a repetition operator appearing in the token’s regular expression. For example, in Figure 2.2a, the “+” in the identifier’s regular expression is the source of the identifier’s autolength conflicts. Third, there are often many tokens in a scanner specification that have autolength conflicts. Declaring lexical precedence rules for all of these tokens would be uselessly tedious. However, we have found a few cases where a rule of shortest match is useful for autolength conflicts, so longest match is simply the default rule.

%token-re ID ([a-zA-Z]+)      %token-re OCTAL (0[0-7]+)    %token-re NUM ([0-9])
%lex-prec ID <∼ ’int’         %lex-prec OCTAL -∼ ’0’       %lex-prec NUM <- ’0’

(a)                           (b)                          (c)

Figure 3.1: Traditional Lexical Precedence Rules. (a) Traditional rules involving token precedence and longest match can be specified together in a PSLR(1) specification. However, when only (b) length conflicts or (c) identity conflicts need to be resolved, then only the corresponding component of the %lex-prec operator should be specified.

Definition 3.2.2 (Default Lexical Autolength Precedence)

Given any two matches (λ, t) and (λ′, t) for the same character sequence such that |λ| < |λ′| and such that no lexical autolength precedence rule is declared for t, the relative precedence of these matches is (λ, t) < (λ′, t). 

3.2.2 Traditional Rules

Consider again the Lex specification in Figure 2.2a. As we explained in section 2.2.1, the identifier has both an identity conflict and length conflicts with the keyword, and the identifier has autolength conflicts. We can rewrite the specification from Figure 2.2a as a PSLR(1) specification while specifying a traditional scanner’s resolution of these conflicts. The new specification appears in Figure 3.1a. The “<∼” is an example of a %lex-prec operator. The first character in a %lex-prec operator always specifies how identity conflicts between the operands should be resolved, and the second character always specifies how length conflicts between the operands should be resolved. In this case, the “<” specifies that identity conflicts between the identifier and keyword should be resolved by selecting the keyword. The “∼” specifies that length conflicts between them should be resolved using longest match. However, if there exists no parser state that accepts both the identifier and the keyword, then these scanner conflicts are not pseudo-scanner conflicts according to Definition 3.2.1, and so there exists no complete pseudo-scanner conflict that these rules help to resolve. In that case, our PSLR(1) generator reports that both rules specified by this %lex-prec declaration are useless. Our generator quietly resolves the identifier’s autolength conflicts using longest match.

The specification in Figure 3.1b contains two tokens, ’0’ and OCTAL. It assumes that no identity conflict between those tokens needs to be resolved, and it specifies that length conflicts between them should be resolved using longest match. The order of operands for “-∼” is irrelevant.

Assume there exists a parser state, sp, that accepts both tokens and no other tokens. Thus, there exists a complete pseudo-scanner conflict that this rule helps to resolve so that the rule is useful.

As this specification evolves, a user might naively change the “+” in the OCTAL token’s regular expression to a “*”, creating an identity conflict with the ’0’ token. A traditional scanner generator would quietly resolve the new conflict based on the order in which the tokens happen to be declared, and so the generated scanner would then match the OCTAL token instead of the ’0’ token for the single character “0”. Our PSLR(1) generator instead reports that there is now a complete pseudo-scanner conflict for “0” in sp that is unresolved. The specification in Figure 3.1c contains two tokens, ’0’ and NUM. It specifies that an identity conflict between those tokens should be resolved by selecting the ’0’ token, and it assumes that no length conflict between them needs to be resolved. Assume there exists a parser state, sp, that accepts both tokens and no other tokens. Thus, there exists a complete pseudo-scanner conflict that this rule helps to resolve so that the rule is useful. As this specification evolves, a user might naively append a “+” to the NUM token’s regular expression, creating length conflicts. A traditional scanner generator would quietly resolve all the new conflicts using longest match. Our PSLR(1) generator does the same for the NUM token’s autolength conflicts. However, it reports that there are now unresolved complete pseudo-scanner conflicts involving the NUM and ’0’ tokens in sp. We now define the %lex-prec operators from this section formally.
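The new identity conflict in this evolution scenario is mechanically detectable: two token regular expressions have an identity conflict on a lexeme exactly when both fully match it. The sketch below is our illustration using Python regular expressions, not the PSLR(1) generator’s conflict detection.

```python
import re

# Sketch (not the PSLR(1) implementation): detect an identity conflict
# between two token regexes by testing whether both fully match the same
# lexeme.
def identity_conflict(re_a, re_b, lexeme):
    return bool(re.fullmatch(re_a, lexeme)) and bool(re.fullmatch(re_b, lexeme))

# Original spec: OCTAL = 0[0-7]+ does not match the single character "0".
print(identity_conflict(r"0[0-7]+", r"0", "0"))   # False: no identity conflict
# After naively changing "+" to "*", OCTAL also matches "0" alone.
print(identity_conflict(r"0[0-7]*", r"0", "0"))   # True: new identity conflict
```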

Definition 3.2.3 (Traditional Lexical Precedence Operators)

Given any two matches (λ, t) and (λ′, t′) for the same character sequence, the lexical precedence operators “<∼”, “<-”, and “-∼” can specify the relative precedence of those matches as follows:

1. (t <∼ t′) ⇔ (t <- t′) ∧ (t -∼ t′).

2. (t <- t′) ⇒ (t ≠ t′).

3. (t <- t′) ∧ (λ = λ′) ⇒ (λ, t) < (λ′, t′).

4. (t -∼ t′) ∧ (|λ| < |λ′|) ⇒ (λ, t) < (λ′, t′).

5. (t -∼ t′) ∧ (|λ′| < |λ|) ⇒ (λ′, t′) < (λ, t).

Notice that (t -∼ t′) ⇔ (t′ -∼ t). 
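Definition 3.2.3 can be paraphrased as a pairwise comparison function. The sketch below is our reading of the definition, not code from the PSLR(1) generator; the match encoding and the token names ID and int_kw are assumptions for illustration.

```python
# Sketch (our reading of Definition 3.2.3, not the dissertation's code):
# pairwise comparison under the traditional operators.  Returns the higher
# precedence of two matches for the same character sequence, or None when
# the declared rule does not relate them.
def compare(m1, m2, ident=None, length=False):
    """ident=(low_tok, high_tok) resolves identity conflicts ("<-");
    length=True resolves length conflicts by longest match ("-~");
    passing both models "<~"."""
    (lex1, tok1), (lex2, tok2) = m1, m2
    if lex1 == lex2 and ident:                 # identity conflict
        if (tok1, tok2) == ident:
            return m2
        if (tok2, tok1) == ident:
            return m1
    if len(lex1) != len(lex2) and length:      # length conflict
        return m1 if len(lex1) > len(lex2) else m2
    return None

# ID <~ 'int' from Figure 3.1a: identity -> keyword, length -> longest.
assert compare(("int", "ID"), ("int", "int_kw"),
               ident=("ID", "int_kw")) == ("int", "int_kw")
assert compare(("int", "int_kw"), ("inte", "ID"),
               length=True) == ("inte", "ID")
```

When neither component of an operator applies, the comparison is undefined (None), matching the fact that “<-” and “-∼” each leave one kind of conflict unresolved.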

%token-re WORD ([a-zA-Z](-?[a-zA-Z])*)
%token-re NON (non-)
%lex-prec WORD -< NON
%%
word
  : WORD     {$$ = new_word ($1);}
  | NON WORD {$$ = new_negated_word ($2);}
  ;

(a)

%token-re WORD ([a-zA-Z](-?[a-zA-Z])*)
%token-re NON (non-?)
%lex-prec WORD << NON
%%
word
  : WORD     {$$ = new_word ($1);}
  | NON WORD {$$ = new_negated_word ($2);}
  ;

(b)

%token-re COM_START ("/*"([^*]|\*+[^*/])*\**)
%token-re COM ({COM_START}"*/")
%lex-prec COM_START -< COM // or -∼

(c)

%token-re COM_START ("/*"(.|\n)*)
%token-re COM ({COM_START}"*/")
%lex-prec COM -s COM
%lex-prec COM_START << COM

(d)

Figure 3.2: Non-traditional Lexical Precedence Rules. (a) PSLR(1) supports declarations for resolving length conflicts by assigning precedence to tokens. (b) When there are identity conflicts as well, they must be resolved using the same precedence relationship. (c) When autolength conflicts are resolved using the rule of longest match, the syntax of C’s multiline comment is difficult to express as a regular expression, but (d) the rule of shortest match makes it easier.

3.2.3 Non-traditional Rules

So far in this section, we have resolved all length conflicts using longest match. In other words, we have always assigned precedence to the longest lexeme. In contrast, we have always resolved identity conflicts by assigning precedence to tokens instead, and there are cases where it is useful to do the same for length conflicts. For example, consider the PSLR(1) specification in Figure 3.2a. According to this specification, the WORD token matches any series of letters such that any consecutive pair of letters may be separated by a single hyphen. The NON token matches only the character sequence “non-”. There is no identity conflict between the WORD token and the NON token, but there are length conflicts between them. If these length conflicts were resolved using longest match, the first grammar production would match a character sequence like “non-euclidean” in its entirety. However, this specification says that these length conflicts should instead be resolved by selecting the NON token. Thus, the pseudo-scanner splits a character sequence like “non-euclidean” into two tokens, which are matched by the second grammar production. Because the grammar does not accept the NON token alone, it is a syntax error if the character sequence “non-” appears alone.
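The splitting behavior can be sketched as a small tokenizer loop. This is an illustrative simulation of Figure 3.2a’s rules, not the generated pseudo-scanner, and it hard-codes the WORD -< NON resolution.

```python
import re

# Sketch (an illustrative tokenizer, not the generated pseudo-scanner):
# WORD -< NON from Figure 3.2a resolves length conflicts in favor of NON,
# so "non-euclidean" splits into NON then WORD.
TOKENS = [("WORD", r"[a-zA-Z](-?[a-zA-Z])*"), ("NON", r"non-")]

def scan(text):
    out, pos = [], 0
    while pos < len(text):
        cands = [(name, m.group()) for name, rx in TOKENS
                 if (m := re.match(rx, text[pos:]))]
        # Length conflicts resolved by token precedence: prefer NON when
        # both tokens match (WORD -< NON); otherwise take the sole match.
        name, lexeme = next(((n, l) for n, l in cands if n == "NON"), cands[0])
        out.append((name, lexeme))
        pos += len(lexeme)
    return out

print(scan("non-euclidean"))   # [('NON', 'non-'), ('WORD', 'euclidean')]
```

Note that the scanner still emits NON for the input “non-” alone; rejecting it is the parser’s job, as the text explains.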

The specification in Figure 3.2b revises the specification in Figure 3.2a so that the pseudo-scanner also extracts the prefix “non” and thus splits a character sequence like “nonacid” into two tokens, which are matched by the second grammar production. Notice that this change creates an identity conflict between the WORD and NON tokens, and we resolve it by selecting the NON token just as we do for length conflicts. By requiring the user to add an explicit lexical precedence rule for the new identity conflict rather than resolving it implicitly, the PSLR(1) generator motivates the user to consider the consequences of the new conflict and its resolution. Specifically, WORD and thus the first grammar production no longer match the character sequence “non”, so it is now also a syntax error if the character sequence “non” appears alone.

The specification in Figure 3.2c defines the COM token for C’s multiline comment, and it defines the COM_START token for such a comment that is not properly closed. Because the regular expression of COM starts with exactly the regular expression of COM_START, the regular expression of COM_START is referenced as “{COM_START}” to avoid repetition. Of course, there are length conflicts between these two tokens. Resolving them with longest match is equivalent to giving COM higher precedence because, for any given input character sequence, any match for COM is always longer than any match for COM_START.

Figure 3.2c’s regular expressions are surprisingly complex given the simplicity of C’s multiline comment syntax, and those regular expressions become even worse for any similar syntactic structure whose closing character sequence is longer than “*/”. In contrast, the regular expressions appearing in Figure 3.2d are simple, and the closing character sequence can easily be extended.

However, in the latter case, COM has autolength conflicts. If those autolength conflicts were resolved using the default rule of longest match, then COM would incorrectly match the following character sequence up to the last “*/”:

/*com*/ str = "*/";

Instead, this specification explicitly resolves autolength conflicts for COM using the %lex-prec operator “-s”, which specifies the rule of shortest match, also called minimal munch. Thus, COM matches only up until the first “*/”, as desired. For the autolength conflicts of COM_START, the specification accepts the default rule of longest match because shortest match would cause it to always match exactly “/*” and nothing else. All conflicts between COM and COM_START are resolved by giving COM higher precedence regardless of which token has the shortest or longest match. For this reason, COM_START is guaranteed not to match beyond the first “*/” either.
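The effect of shortest versus longest match on comment scanning mirrors non-greedy versus greedy repetition in everyday regex engines. The sketch below uses Python’s re module as an analogy to the “-s” operator; it is not PSLR(1) itself.

```python
import re

# Sketch: the effect of shortest vs longest match on C comments, shown with
# Python's greedy and non-greedy repetition (an analogy to "-s", not PSLR(1)).
text = '/*com*/ str = "*/";'

longest = re.match(r"/\*(.|\n)*\*/", text)    # greedy: runs to the last */
shortest = re.match(r"/\*(.|\n)*?\*/", text)  # non-greedy: stops at first */

print(longest.group())    # '/*com*/ str = "*/'
print(shortest.group())   # '/*com*/'
```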

We now formally define the %lex-prec operators from this section as well as the operator “<s”.

Definition 3.2.4 (Non-traditional Lexical Precedence Operators)

Given any two matches (λ, t) and (λ′, t′) for the same character sequence, the lexical precedence operators “<<”, “-<”, “<s”, and “-s” can specify the relative precedence of those matches as follows:

1. (t << t′) ⇔ (t <- t′) ∧ (t -< t′).

2. (t -< t′) ⇒ (t ≠ t′).

3. (t -< t′) ∧ (|λ| ≠ |λ′|) ⇒ (λ, t) < (λ′, t′).

4. (t <s t′) ⇔ (t <- t′) ∧ (t -s t′).

5. (t -s t′) ∧ (|λ| < |λ′|) ⇒ (λ′, t′) < (λ, t).

6. (t -s t′) ∧ (|λ′| < |λ|) ⇒ (λ, t) < (λ′, t′).

Notice that (t -s t′) ⇔ (t′ -s t). 

We have introduced our non-traditional lexical precedence operators in order to further minimize the ad-hoc coding required in scanner specifications. The full power of these operators becomes more apparent in section 3.6, where we introduce a mechanism to handle whitespace and comments, and in section 3.7, where we introduce scoped declarations.

3.2.4 Ambiguities

As we demonstrated in the previous sections, the lexical precedence rules for a given PSLR(1) specification consist of (1) all rules the user specifies in %lex-prec declarations and (2) the rule of longest match for autolength conflicts that are not resolved by the user’s %lex-prec declarations.

In this section, we introduce a formal model to reveal ambiguity in the way the user can interpret these rules.

Definition 3.2.5 (Lexical Precedence Function)

Any set of lexical precedence rules R defines a lexical precedence function, ∆, such that, given any set of matches M for some character sequence, then:

1. Iff there exists some m ∈ M such that ∀m′ ∈ M : m ≠ m′, m′ < m according to R, then ∆(M) = m.

2. Iff there exists no such m, then ∆(M) = undefined.



Definition 3.2.6 (Set Ordering)

Given a sequence σ and a set S, then σ is an ordering of S iff |σ| = |S| and ∀s ∈ S, s ∈ σ. 

Definition 3.2.7 (Sequential Lexical Precedence Function)

Any lexical precedence function ∆ defines a sequential lexical precedence function, F, such that, given any sequence of matches µ for some character sequence, there exists a sequence ρ : |ρ| = |µ| ∧ ρ[1] = µ[1] ∧ ρ[|ρ|] = F(µ) ∧ ∀i : 1 < i ≤ |ρ|, ρ[i] = ∆({ρ[i − 1], µ[i]}). 
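These two definitions can be sketched directly: ∆ compares a pair of matches, and F folds ∆ over an ordering. This is our illustrative reading, not the generator’s code; the rules below hard-code Figure 3.1a’s ID <∼ ’int’, and the token label int_kw is our own.

```python
# Sketch (our reading of Definitions 3.2.5 and 3.2.7): delta picks the
# higher precedence match of a pair under Figure 3.1a's rules (identity
# ID <- 'int', length conflicts by longest match); F folds delta over an
# ordering of matches.
def delta(m1, m2):
    (lex1, tok1), (lex2, tok2) = m1, m2
    if lex1 == lex2:                        # identity conflict: keyword wins
        return m2 if (tok1, tok2) == ("ID", "int_kw") else m1
    return m1 if len(lex1) > len(lex2) else m2   # longest match

def F(mu):
    best = mu[0]                            # rho[1] = mu[1]
    for m in mu[1:]:                        # rho[i] = delta(rho[i-1], mu[i])
        best = delta(best, m)
    return best

prefixes = ["i", "in", "int", "inte", "integ", "intege", "integer"]
matches = [(p, "ID") for p in prefixes] + [("int", "int_kw")]
# Any ordering yields the same highest precedence match: ("integer", "ID").
print(F(matches))
print(F(list(reversed(matches))))
```

For this rule set, F is unambiguous: every ordering of the matches for “integer” produces (“integer”, ID), as Table 3.1 illustrates for two particular orderings.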

We have introduced the concept of a sequential lexical precedence function to model how we assume that users will intuitively interpret lexical precedence rules. For example, let’s say the user has written the specification in Figure 3.1a and wishes to determine what match the pseudo-scanner should select for the character sequence “integer” when the parser is in a state that accepts exactly the tokens ID and ’int’. To do so, the user must compare all matches from M(“integer”, {ID, ’int’}) using the lexical precedence rules from his specification in order to determine the highest precedence match. To perform all these comparisons, we assume that the user will arbitrarily choose a particular ordering of the matches, µ, and then simply compare the matches sequentially so that he can finish in linear time. The result of such a sequential comparison is F(µ) such that F is the sequential lexical precedence function defined by the specification’s lexical precedence rules. However, there are many possible orderings of the matches to choose from. The user might, for example, choose the ordering µ shown in Table 3.1a. According to Definition 3.2.7, F computes each element in ρ in order by applying the specification’s lexical precedence function, ∆, to the previous element in ρ and the current element in µ. In other words, the current element in

    µ                  rule           ρ
1   (“i”, ID)                         (“i”, ID)
2   (“in”, ID)         ID -∼ ID       (“in”, ID)
3   (“int”, ID)        ID -∼ ID       (“int”, ID)
4   (“int”, ’int’)     ID <- ’int’    (“int”, ’int’)
5   (“inte”, ID)       ID -∼ ’int’    (“inte”, ID)
6   (“integ”, ID)      ID -∼ ID       (“integ”, ID)
7   (“intege”, ID)     ID -∼ ID       (“intege”, ID)
8   (“integer”, ID)    ID -∼ ID       (“integer”, ID)

(a)

    µ′                 rule           ρ
1   (“i”, ID)                         (“i”, ID)
2   (“in”, ID)         ID -∼ ID       (“in”, ID)
3   (“int”, ID)        ID -∼ ID       (“int”, ID)
4   (“inte”, ID)       ID -∼ ID       (“inte”, ID)
5   (“integ”, ID)      ID -∼ ID       (“integ”, ID)
6   (“intege”, ID)     ID -∼ ID       (“intege”, ID)
7   (“integer”, ID)    ID -∼ ID       (“integer”, ID)
8   (“int”, ’int’)     ID -∼ ’int’    (“integer”, ID)

(b)

Table 3.1: Unambiguous Sequential Lexical Precedence Function. For the specification in Figure 3.1a and the character sequence “integer”, the order in which matches are compared does not affect the highest precedence match.

ρ is the highest precedence match found so far. The highest precedence match found after iterating the entire sequence is then the last element of ρ, which is F(µ) = (“integer”, ID).

Because F can be applied to any ordering of M(“integer”, {ID, ’int’}), it is worthwhile to explore how different orderings affect the highest precedence match. For example, we adjust µ slightly to produce µ′ shown in Table 3.1b. Notice that F(µ) = F(µ′). Moreover, for any other sequence µ″ that is an ordering of M(“integer”, {ID, ’int’}), F(µ) = F(µ″) also. In this case, we say that F is unambiguous for M(“integer”, {ID, ’int’}).

Definition 3.2.8 (Ambiguous vs Unambiguous)

Given a set of matches M for some character sequence, then a sequential lexical precedence function F is ambiguous for M iff there exists a pair of sequences µ and µ′ such that each of µ and µ′ is an ordering of M, F(µ) and F(µ′) are defined, and F(µ) ≠ F(µ′). Otherwise, F is unambiguous for M. 

If we could guarantee in general that the sequential lexical precedence functions defined by PSLR(1) specifications are always unambiguous for all complete pseudo-scanner conflicts, then the user could compare matches in linear time in any order he wished as he considered the effect of his lexical precedence declarations. Because the user wouldn’t have to consider alternate orderings, the task of writing, modifying, and understanding PSLR(1) specifications would be simplified.

The traditional lexical precedence rules from Definition 2.2.6 always define unambiguous sequential lexical precedence functions. However, the precedence rules specified using %lex-prec operators sometimes do not. We have identified three properties of the %lex-prec operators that cause this difference:

%token-re A (a)
%token-re B (a)
%token-re C (a)
%lex-prec A <- B
%lex-prec B <- C
%lex-prec C <- A
%%
start: A | B | C ;

(a)

    µ            rule       ρ
1   (“a”, A)                (“a”, A)
2   (“a”, B)     A <- B     (“a”, B)
3   (“a”, C)     B <- C     (“a”, C)

    µ′           rule       ρ
1   (“a”, C)                (“a”, C)
2   (“a”, B)     B <- C     (“a”, C)
3   (“a”, A)     C <- A     (“a”, A)

(b)

(c) [cycle diagram: (“a”, A) < (“a”, B) < (“a”, C) < (“a”, A)]

Figure 3.3: Intransitive Lexical Precedence. (a) Specifying an intransitive lexical precedence relation can define ambiguous sequential lexical precedence functions. (b) For the start state of the parser and the character sequence “a”, the order in which matches are compared affects the highest precedence match. (c) The trouble is that a cycle exists in the precedence relation over the matches.

%token-re A (a)
%token-re B (ab)
%token-re C (abc)
%lex-prec A -∼ B
%lex-prec B -∼ C
%lex-prec C -< A
%%
start: A | B | C ;

(a)

    µ              rule       ρ
1   (“a”, A)                  (“a”, A)
2   (“ab”, B)      A -∼ B     (“ab”, B)
3   (“abc”, C)     B -∼ C     (“abc”, C)

    µ′             rule       ρ
1   (“abc”, C)                (“abc”, C)
2   (“ab”, B)      B -∼ C     (“abc”, C)
3   (“a”, A)       C -< A     (“a”, A)

(b)

(c) [cycle diagram: (“a”, A) < (“ab”, B) < (“abc”, C) < (“a”, A)]

Figure 3.4: Token Precedence Mixed with Lexeme Precedence. (a) Mixing precedence of tokens with precedence of lexemes for resolving length conflicts can define ambiguous sequential lexical precedence functions. (b) For the start state of the parser and the character sequence “abc”, the order in which matches are compared affects the highest precedence match. (c) The trouble is that a cycle exists in the precedence relation over the matches.

%token-re A (a|abc)
%token-re B (ab)
%lex-prec A -s B
%%
start: A | B ;

(a)

    µ              rule       ρ
1   (“a”, A)                  (“a”, A)
2   (“ab”, B)      A -s B     (“a”, A)
3   (“abc”, A)     A -∼ A     (“abc”, A)

    µ′             rule       ρ
1   (“abc”, A)                (“abc”, A)
2   (“ab”, B)      A -s B     (“ab”, B)
3   (“a”, A)       A -s B     (“a”, A)

(b)

(c) [cycle diagram: (“a”, A) < (“abc”, A) < (“ab”, B) < (“a”, A)]

Figure 3.5: Shortest Match Mixed with Longest Match. (a) Mixing the rule of shortest match with the rule of longest match (implicit for autolength conflicts in this case) can define ambiguous sequential lexical precedence functions. (b) For the start state of the parser and the character sequence “abc”, the order in which matches are compared affects the highest precedence match. (c) The trouble is that a cycle exists in the precedence relation over the matches.

1. The %lex-prec operators do not require lexical precedence relations to be transitive.

2. The %lex-prec operators permit precedence of tokens to be mixed with precedence of lexemes for resolving length conflicts.

3. The %lex-prec operators permit the rule of shortest match to be mixed with the rule of longest match for resolving length conflicts.

For example, Figures 3.3a, 3.4a, and 3.5a present PSLR(1) specifications revealing the first, second, and third property, respectively. In each example, the start state of the parser accepts all tokens from the specification. Each of Figures 3.3b, 3.4b, and 3.5b shows two orderings of the matches for an example character sequence while the parser is in the start state such that the highest precedence match differs for the two orderings. Each of Figures 3.3c, 3.4c, and 3.5c reveals a cycle that exists in the precedence relation over those matches. Because of the third property of the %lex-prec operators, we can also create such a cycle by changing the “<<” operator to “<s” in Figure 3.2d. In general, such a cycle is the cause of an ambiguous sequential lexical precedence function.
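Such cycles can be found mechanically with an ordinary graph cycle detection over the declared precedence relation. The sketch below is our own depth-first-search illustration, not the dissertation’s algorithm; the match labels are shorthand for the matches of Figure 3.3.

```python
# Sketch (not the dissertation's algorithm): detecting a cycle in the
# precedence relation over a set of matches, the condition behind the
# ambiguities of Figures 3.3-3.5, using a depth-first search.
def has_cycle(edges):
    """edges: set of (lower, higher) precedence pairs between matches."""
    graph = {}
    for lo, hi in edges:
        graph.setdefault(lo, set()).add(hi)
    visiting, done = set(), set()

    def dfs(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        if any(dfs(nxt) for nxt in graph.get(node, ())):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(n) for n in graph)

# Figure 3.3's intransitive rules: ("a",A) < ("a",B) < ("a",C) < ("a",A).
cyclic = {("aA", "aB"), ("aB", "aC"), ("aC", "aA")}
print(has_cycle(cyclic))                         # True
print(has_cycle({("aA", "aB"), ("aB", "aC")}))   # False
```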

As we explained in section 3.2.1, to fully define the pseudo-scanner’s behavior, it is not always necessary that a PSLR(1) specification’s lexical precedence rules resolve all pairwise pseudo-scanner conflicts. In that case, for some match orderings, it is possible for ∆ and thus F to return undefined. To determine a highest precedence match in spite of this result, we assume that the user will simply alter the match ordering so that F does not return undefined. For example, let the conflict C be the set of matches {m, m′, m″, m‴} and let the only precedence relationships defined for those matches be m < m″, m″ < m′, m′ < m‴, and m‴ < m. If the user first tries the match ordering µ = (m, m′, m″, m‴), then he immediately encounters an undefined result for ∆({m, m′}). We assume that the user will simply look for the first match with which m does have a relationship, he will find m″, and then he will continue his comparisons from there. Thus, he converts the ordering µ to the ordering µ′ = (m, m″, m′, m‴), for which no undefined result is encountered, and F(µ′) = m‴. However, given the ordering µ″ = (m‴, m, m″, m′), then F(µ″) = m′, so F is actually ambiguous for C, and C is unresolved. Thus, even when there are unresolved pairwise pseudo-scanner conflicts, it is still sometimes possible to compare matches sequentially, and ambiguity can still result.
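This example can be checked mechanically. In the sketch below (ours, not the dissertation’s), m, m′, m″, m‴ are labeled m0 through m3, and F returns None when it hits an unresolved pair.

```python
# Sketch reproducing the text's example: matches m0..m3 (for m, m', m'', m''')
# with only the relationships m0 < m2, m2 < m1, m1 < m3, m3 < m0 declared.
# The result of a sequential comparison depends on the ordering chosen.
LESS = {("m0", "m2"), ("m2", "m1"), ("m1", "m3"), ("m3", "m0")}

def delta(a, b):
    if (a, b) in LESS:
        return b
    if (b, a) in LESS:
        return a
    return None                      # unresolved pairwise conflict

def F(order):
    best = order[0]
    for m in order[1:]:
        best = delta(best, m)
        if best is None:
            return None              # undefined for this ordering
    return best

print(F(["m0", "m2", "m1", "m3"]))   # m3 (the ordering mu')
print(F(["m3", "m0", "m2", "m1"]))   # m1 (the ordering mu''), so ambiguous
print(F(["m0", "m1", "m2", "m3"]))   # None: m0 vs m1 is unresolved
```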

In some cases, F returns undefined for every ordering of a set of matches. For example, let the conflict C be the set of matches {m, m′, m″} and let the only precedence relationships defined for those matches be m < m′ and m < m″. For every possible ordering µ of C, F(µ) = undefined. In this case, F is not ambiguous for C according to Definition 3.2.8 even though C is unresolved. The reason we don’t include this case in our definition of an ambiguous sequential lexical precedence function is that, when the user compares matches sequentially in this case, he can never derive a highest precedence match, and so he will never be led to the erroneous conclusion that C is resolved.

The ambiguities permitted by the %lex-prec operators must be addressed in order for the operators’ meaning to be defined in all cases. One solution would be to design an algorithm that selects a single ordering of any set of matches and thus quietly resolves all such ambiguities in order to compute the highest precedence match in linear time. However, this solution would burden the user with discovering these ambiguities and understanding their implicit resolution. Instead, we have designed an algorithm, which we describe in section 3.2.7, that can detect and report to the user whether a PSLR(1) specification defines an ambiguous sequential lexical precedence function for any complete pseudo-scanner conflict. However, the algorithm does not distinguish between ambiguous sequential lexical precedence functions and other unresolved conflicts. Instead, it actually reports any complete pseudo-scanner conflict for which the lexical precedence rules do not define a highest precedence match uniquely. In other words, our algorithm employs the lexical precedence function, ∆, from Definition 3.2.5 rather than the sequential lexical precedence function, F, from Definition 3.2.7 for resolving complete pseudo-scanner conflicts.

Definition 3.2.9 (Resolved Scanner Conflict)

Given the lexical precedence function ∆ defined by a set of lexical precedence rules R and given a scanner conflict C, then C is resolved by R iff ∆(C) ≠ undefined. 

As long as our generator reports that all complete pseudo-scanner conflicts are resolved and thus guarantees that the sequential lexical precedence function defined by the PSLR(1) specification is unambiguous for all such conflicts, then the user can select any ordering of the matches in such a conflict and compare them sequentially to correctly determine the highest precedence match in linear time.

3.2.5 Self-consistency

As we demonstrated in the previous section, %lex-prec operators can be combined in a PSLR(1) specification in such a way that ambiguous sequential lexical precedence functions are defined. However, for most of our %lex-prec operators, a single use of one of the operators alone can never do so. We say that each such operator is always self-consistent.

Definition 3.2.10 (Self-consistent vs Self-contradictory)

Given:

1. A lexical precedence operator ⊙.

2. A character sequence ξ ∈ Ξ*.

3. A pair of tokens t and t′.

Let R be the following set of lexical precedence rules:

1. t ⊙ t′.

2. The default rule of longest match for autolength conflicts.

The operator ⊙ is self-consistent for ξ over {t, t′} iff the sequential lexical precedence function that is defined by R is unambiguous for M(ξ, {t, t′}). Otherwise, ⊙ is self-contradictory for ξ over {t, t′}. Iff ⊙ is self-consistent for every possible ξ and {t, t′} described in the above given, then we say ⊙ is always self-consistent. 

Theorem 3.2.11 (Self-consistent Lexical Precedence Operators)

The following lexical precedence operators are always self-consistent:

1. “<∼”

2. “<-”

3. “-∼”

4. “<<”

5. “-<”

6. “-s” when the operands are the same token.



Theorem 3.2.11 is straightforward to prove informally. We refer to the sequential lexical precedence function defined by the set of lexical precedence rules R as F. First, consider the case where ⊙ is “<∼”. We can be sure that the lexeme for the highest precedence match computed by F is the longest lexeme regardless of the ordering of matches because the rule of longest match resolves all length conflicts. If both t and t′ match the longest lexeme, then the token for the highest precedence match computed by F is t′ regardless of the ordering of matches because t′ is chosen for all identity conflicts. If only one of the tokens matches the longest lexeme, then that token is the token for the highest precedence match computed by F regardless of the ordering of matches. Second, consider the case where ⊙ is “<<”. If there is any match for t′, then the longest match for t′ is the highest precedence match computed by F regardless of the ordering of matches because t′ is chosen in any pairwise conflict between t and t′. If there is no match for t′, then the longest match for t is the highest precedence match computed by F regardless of the ordering of matches. Third, consider the case where ⊙ is “<-”, “-∼”, or “-<”. Each of these operators is exactly the same as either “<∼” or “<<” except that either identity or length conflicts are left unresolved. In general, the only effect that resolving fewer pairwise conflicts can have on F for some ordering of matches is to cause F to return an undefined result, and an undefined result does not create an ambiguity according to Definition 3.2.8. Finally, consider the case where ⊙ is “-s” and t = t′. For the highest precedence match computed by F, there is only one possible token, and the lexeme is the shortest lexeme regardless of the ordering of matches. Thus, in all cases for all of these operators, the ordering of matches does not affect the highest precedence match computed by F, and so the operators are always self-consistent.

Consider again the example in Figure 3.5 in terms of Definition 3.2.10. That is, ⊙ is “-s”, ξ is “abc”, t is A, t′ is B, and all pairwise scanner conflicts for “abc” over {A, B} are resolved by A -s B plus the default rule of longest match for autolength conflicts. As demonstrated in Figure 3.5b, the sequential lexical precedence function that is defined by A -s B plus the default rule of longest match for autolength conflicts is ambiguous for M(“abc”, {A, B}), so the operator “-s” is not always self-consistent. If we replace “-s” with “<s”, the same ambiguity remains, so “<s” is not always self-consistent either.
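Self-consistency can also be tested by brute force: an operator is self-consistent for ξ over {t, t′} exactly when every ordering of the matches yields the same result. The sketch below (ours, not the generator’s algorithm) applies this idea to Figure 3.5’s matches.

```python
from itertools import permutations

# Sketch (not the generator's algorithm): brute-force self-consistency check
# by trying every ordering of the matches.  The rules encode Figure 3.5:
# A -s B (shortest match between A and B) plus the default rule of longest
# match for A's autolength conflicts.
def delta(m1, m2):
    (lex1, tok1), (lex2, tok2) = m1, m2
    if tok1 == tok2:                               # autolength: longest match
        return m1 if len(lex1) > len(lex2) else m2
    return m1 if len(lex1) < len(lex2) else m2     # A -s B: shortest match

def F(mu):
    best = mu[0]
    for m in mu[1:]:
        best = delta(best, m)
    return best

matches = [("a", "A"), ("ab", "B"), ("abc", "A")]
results = {F(list(mu)) for mu in permutations(matches)}
print(results)   # more than one distinct result: "-s" is ambiguous here
```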

Consider again the specification in Figure 3.2b. Recall that, when the input character sequence consists only of “non”, the pseudo-scanner returns the token NON alone, which the grammar does not accept. It may be tempting to reverse the precedence for just the identity conflict between the tokens WORD and NON so that the pseudo-scanner instead returns the token WORD, which is accepted by the first grammar production. Thus, the %lex-prec operator “<<” would be replaced by “><” in the specification. However, consider what match the sequential lexical precedence function, F, then computes for the character sequence “nonacid”. Using µ from Table 3.2a as the ordering of matches, the highest precedence match computed by F is (“non”, NON). Thus, the prefix “non”

40 µ rule ρ 1 (“n”, WORD) (“n”, WORD) 2 (“no”, WORD) WORD -∼ WORD (“no”, WORD) 3 (“non”, WORD) WORD -∼ WORD (“non”, WORD) (a) 4 (“nona”, WORD) WORD -∼ WORD (“nona”, WORD) 5 (“nonac”, WORD) WORD -∼ WORD (“nonac”, WORD) 6 (“nonaci”, WORD) WORD -∼ WORD (“nonaci”, WORD) 7 (“nonacid”, WORD) WORD -∼ WORD (“nonacid”, WORD) 8 (“non”, NON) WORD -< NON (“non”, NON)

(b)
     µ′                   rule            ρ
  1  (“n”, WORD)                          (“n”, WORD)
  2  (“no”, WORD)         WORD -∼ WORD    (“no”, WORD)
  3  (“non”, WORD)        WORD -∼ WORD    (“non”, WORD)
  4  (“non”, NON)         WORD >- NON     (“non”, WORD)
  5  (“nona”, WORD)       WORD -∼ WORD    (“nona”, WORD)
  6  (“nonac”, WORD)      WORD -∼ WORD    (“nonac”, WORD)
  7  (“nonaci”, WORD)     WORD -∼ WORD    (“nonaci”, WORD)
  8  (“nonacid”, WORD)    WORD -∼ WORD    (“nonacid”, WORD)

Table 3.2: Opposing Lexical Precedence. If we replace the “<<” operator with “><” in the specification in Figure 3.2b, then, for the character sequence “nonacid”, the ordering of matches affects the highest precedence match computed by the sequential lexical precedence function.

is extracted as we expected in Section 3.2.3. However, using µ′ from Table 3.2b as the ordering of matches, the highest precedence match computed by F is (“nonacid”, WORD). By Definition 3.2.10 then, the “><” operator is self-contradictory for “nonacid” over {WORD, NON}. In general, when “><” is self-contradictory, the highest precedence match computed by F depends on whether the last pair of matches compared between the two tokens is a length conflict, as in Table 3.2a, or an identity conflict, as in Table 3.2b. In this way, the opposing precedence within the operator is the cause of its self-contradictory nature.

Even though “><” may seem like an obvious extension of the other %lex-prec operators we have implemented, we have concluded that, because of its self-contradictory nature and because we have found no practical use for it, it is not worth the confusion it would likely cause the user.

For the same reasons, we had originally planned in general not to implement any lexical precedence operator that is not always self-consistent. However, after realizing that, as demonstrated by Figures 3.2c and 3.2d, the operator “-s” can be quite useful for autolength conflicts, for which it is always self-consistent according to Theorem 3.2.11, we decided to permit the operators “-s” and “

3.2.6 Pseudo-scanner Tables

In this section, we describe the pseudo-scanner tables that our PSLR(1) generator constructs to encode the lexical behavior defined by the user’s PSLR(1) specification. Let G be the context-free grammar from the specification such that G(G) = (V′, T′, P′, S′), and let R be the set of lexical precedence rules from the specification. Let Σp be the set of LR(1) parser states for G, and let sp0 be the index of the start state within Σp. If not all complete pseudo-scanner conflicts are resolved by R, then our generator reports an error and does not bother to construct the pseudo-scanner tables, so this section’s definitions assume that all complete pseudo-scanner conflicts are resolved.

The first set of tables our PSLR(1) generator constructs for the pseudo-scanner encodes a single deterministic FSA that matches all tokens in T′ against the regular expressions defined by the user’s PSLR(1) specification. Let Σs be that FSA’s set of states, and let ss0 be the index of the start state within Σs. To facilitate our explanations, we employ δ as a function that can examine the

transitions recorded in any FSA state. That is, ∀ss : 1 ≤ ss ≤ |Σs|, ∀s′s : 1 ≤ s′s ≤ |Σs|, ∀c ∈ Ξ, the statement δ(Σs[ss], c) = Σs[s′s] holds iff there is a transition from state Σs[ss] on c to state Σs[s′s]. Iff either (1) there is no transition from Σs[ss] on c or (2) the destination of that transition is the error state, then the statement ∄δ(Σs[ss], c) holds.

Σs is nearly appropriate as the FSA for a traditional scanner with only one start condition that recognizes all tokens in T′. Algorithms to construct such an FSA are well known and involve converting the regular expressions for all tokens in T′ to a single non-deterministic FSA and then converting the non-deterministic FSA to a deterministic FSA [11]. The only difference is that our generator does not encode the resolution of identity conflicts in Σs. That is, our generator maintains a set of accepted tokens per accepting state rather than always reducing the set to the token declared earliest in the scanner specification. For example, assume the user’s PSLR(1) specification is the specification from Figure 3.1a. If δ∗(Σs[ss0], “int”) = Σs[ss], then Σs[ss] is an accepting state for both the ID token and the ’int’ token. Thus, the identity conflict between ID and ’int’ remains. Our generator also does not encode the resolution of length conflicts in Σs, but neither

does a traditional scanner generator. Continuing our example, if δ(Σs[ss], “s”) = Σs[s′s], then Σs[s′s] accepts the ID token. When the traditional scanner reaches Σs[s′s], it forgets that Σs[ss] is an accepting state because it always resolves length conflicts using longest match, but the pseudo-scanner might decide that a match represented by Σs[ss] has higher precedence than the match represented by Σs[s′s]. Also as in a traditional scanner, Σs does not encode the basic pseudo-scanner behavior defined in Definition 3.1.1. That is, regardless of which tokens are accepted by the current parser state, Σs[ss] accepts ID and ’int’, and Σs[s′s] accepts ID.
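To make this encoding concrete, here is a small hand-built sketch of the idea: a deterministic FSA whose accepting states keep the full set of matched tokens, so the identity conflict between ID and ’int’ survives rather than being resolved by declaration order. The state encoding and names below are ours, chosen only for illustration.

```python
import string

# Toy FSA for ID = [a-z]+ and the keyword 'int'. A state records the
# longest prefix of "int" matched so far, or DEAD once the input is
# lowercase text that can no longer be exactly "int".
DEAD = "DEAD"

def delta(state, c):
    """Deterministic transition; None plays the role of "no transition"."""
    if c not in string.ascii_lowercase:
        return None
    if state != DEAD and "int".startswith(state + c):
        return state + c          # still tracking the keyword prefix
    return DEAD

def acc(state, length):
    """Set of accepted tokens, with identity conflicts left unresolved."""
    tokens = {"ID"} if length > 0 else set()
    if state == "int":
        tokens.add("'int'")       # identity conflict with ID remains
    return tokens

def run(lexeme):
    state = ""
    for c in lexeme:
        state = delta(state, c)
    return acc(state, len(lexeme))

print(run("int"))     # accepts both ID and 'int'
print(run("ints"))    # accepts only ID
```

Unlike a Lex-generated scanner, which would collapse the set for “int” to a single token, the accepting set here is reduced later, per parser state, by the scanner accepts table.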

Definition 3.2.12 (Accepts for Scanner)

Given the LR(1) parser state sp and the scanner FSA state ss, we extend Definition 2.2.12 to overload acc as follows:

1. acc(ss) is the set of tokens accepted by ss.

2. acc(sp, ss) = acc(sp) ∩ acc(ss).



Rather than encoding conflict resolution and the basic pseudo-scanner behavior directly in

Σs, our generator encodes them in separate tables, which we now define. Because these tables encode which tokens should be accepted by which states in Σs, our generator discards the sets of accepted tokens in Σs after constructing these tables.

Definition 3.2.13 (state to accepting state)

∀ss : 1 ≤ ss ≤ |Σs|, state to accepting state[ss] is either:

1. Undefined iff acc(Σs[ss]) = ∅.

2. The index that Σs[ss] would have in a Σ′s that is formed by copying Σs and then, ∀s′s : 1 ≤ s′s ≤ |Σ′s| ∧ acc(Σ′s[s′s]) = ∅, removing Σ′s[s′s].



The purpose of state to accepting state is to reduce the size of tables like scanner accepts, which we now define.

Definition 3.2.14 (scanner accepts)

∀sp : 1 ≤ sp ≤ |Σp|, ∀ss : 1 ≤ ss ≤ |Σs|, when the parser is in state Σp[sp] and the pseudo-scanner is in state Σs[ss]:

1. If the pseudo-scanner should not recognize a match for any token, then either:

(a) state to accepting state[ss] = undefined.

(b) scanner accepts[sp][state to accepting state[ss]] = undefined.

2. Otherwise, scanner accepts[sp][state to accepting state[ss]] is the token for which the pseudo-scanner should recognize a match.



scanner accepts resolves all pseudo-scanner identity conflicts by selecting a single token per

Σs accepting state, which corresponds to a column in scanner accepts. However, because of the basic pseudo-scanner behavior, it does so per parser state, which corresponds to a row in scanner accepts.

Sometimes an accepting state in Σs does not need an accepted token to be specified for a particular parser state either because no token is accepted by both it and the parser state or because there is always a shorter match that has higher precedence. In these cases, scanner accepts leaves the accepted token undefined.
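The indexing scheme of Definitions 3.2.13 and 3.2.14 can be sketched in executable form. The encodings here (Python lists and dicts, with our own names) are illustrative, not the generator's actual data structures.

```python
def build_state_to_accepting_state(accept_sets):
    """Definition 3.2.13 as code: accept_sets[ss - 1] is acc(Sigma_s[ss]).
    Non-accepting states map to None (undefined)."""
    table = []
    next_index = 1                    # 1-based, following the dissertation
    for tokens in accept_sets:
        if tokens:
            table.append(next_index)  # compact index among accepting states
            next_index += 1
        else:
            table.append(None)
    return table

# Five FSA states of which only states 3 and 5 accept anything:
accept_sets = [set(), set(), {"ID"}, set(), {"ID", "'int'"}]
s2a = build_state_to_accepting_state(accept_sets)
print(s2a)                            # [None, None, 1, None, 2]

# scanner_accepts then needs only two columns per parser state, not five.
# Hypothetical parser state "p1" resolves state 5's identity conflict:
scanner_accepts = {("p1", 1): "ID", ("p1", 2): "'int'"}
ss = 5                                # current FSA state (1-based)
print(scanner_accepts[("p1", s2a[ss - 1])])   # prints 'int'
```

The compaction matters because scanner accepts has one row per parser state; dropping non-accepting columns shrinks every row.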

Definition 3.2.15 (length precedences)

∀t ∈ T′, ∀t′ ∈ T′, length precedences[t][t′] = true iff (t-

While scanning the input, the pseudo-scanner uses length precedences to determine how the user has specified that length conflicts be resolved. In other words, length precedences[t][t′] = true iff matches for t have lower precedence than longer matches for t′ according to R. If R has no rule for lexical length precedence between t and t′, then either there are no complete pseudo-scanner conflicts whose resolution requires the resolution of those length conflicts, or our generator reports an error. In either case, the value of length precedences[t][t′] is irrelevant.

Definition 3.2.15 is the first place we have formally modeled R as a set, so we now clarify a couple of points about the contents of R. By Definition 3.2.3, (t -∼ t′) ⇔ (t′ -∼ t). Thus, we assume that (t -∼ t′) ∈ R ⇔ (t′ -∼ t) ∈ R. Likewise, by Definition 3.2.4, (t -s t′) ⇔ (t′ -s t), so we also assume that (t -s t′) ∈ R ⇔ (t′ -s t) ∈ R. The first point affects Definition 3.2.15, and both points affect definitions in the next section.

Given the above tables, the pseudo-scanner algorithm is straight-forward. Let ξ be the portion of the input character sequence that has not yet been tokenized. We call the pseudo-scanner function pseudo scan. The parser passes the index of the current parser state as an argument to

pseudo scan, and pseudo scan either returns the token it selects from all the tokens that match ξ, returns undefined if no token matches ξ, or returns # if ξ is empty.

Definition 3.2.16 (pseudo scan(sp))

 1  if (|ξ| = 0) do:
 2    return #.
 3  let ibest = 1.
 4  let tbest = undefined.
 5  let ss = ss0.
 6  for (let i = 1; i ≤ |ξ| ∧ ∃δ(Σs[ss], ξ[i]); i = i + 1) do:
 7    let s′s : Σs[s′s] = δ(Σs[ss], ξ[i]).
 8    set ss = s′s.
 9    let sa = state to accepting state[ss].
10    if (sa ≠ undefined) do:
11      let t = scanner accepts[sp][sa].
12      if (t ≠ undefined ∧ (tbest = undefined ∨ length precedences[tbest][t])) do:
13        set ibest = i + 1.
14        set tbest = t.
15  set ξ = ξ[ibest..|ξ|].
16  return tbest.
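Definition 3.2.16 transcribes naturally into executable form. In this sketch the table encodings (dicts keyed by state indices) are our own choices, and instead of mutating ξ the function returns the unconsumed input alongside the selected token.

```python
END = "#"   # stands in for the end-of-input token

def pseudo_scan(sp, xi, delta, ss0, state_to_accepting_state,
                scanner_accepts, length_precedences):
    """Return (token, rest_of_input); token is None when nothing matches."""
    if not xi:
        return END, xi
    i_best, t_best = 0, None          # 0-based analogue of ibest = 1
    ss = ss0
    for i, c in enumerate(xi):
        ss = delta.get((ss, c))
        if ss is None:                # no transition: stop extending
            break
        sa = state_to_accepting_state.get(ss)
        if sa is None:
            continue                  # not an accepting state
        t = scanner_accepts.get((sp, sa))
        # Keep the new, longer match only if no match is held yet or the
        # held token has lower length precedence than the new one.
        if t is not None and (t_best is None
                              or length_precedences.get((t_best, t), False)):
            i_best, t_best = i + 1, t
    return t_best, xi[i_best:]

# Toy tables: state 1 --'a'--> 2 (accepts A), 2 --'b'--> 3 (accepts B),
# and the longer match for B takes precedence over a held match for A.
delta = {(1, "a"): 2, (2, "b"): 3}
s2a = {2: 1, 3: 2}
accepts = {("p", 1): "A", ("p", 2): "B"}
longer_wins = {("A", "B"): True}
print(pseudo_scan("p", "abx", delta, 1, s2a, accepts, longer_wins))  # ('B', 'x')
```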

As we discussed earlier, algorithms to construct Σs are well known. Converting our definitions of the state to accepting state and length precedences tables to algorithms that construct those tables is straight-forward. Thus, we do not describe such algorithms in this paper. However, our algorithm to construct scanner accepts is more complex, and we describe it in the next section. Later in this chapter, we revise our definitions of scanner accepts, length precedences, and pseudo scan to implement features such as error handling.

3.2.7 Resolver Algorithm

Discovering, resolving, and reporting parser conflicts in LR(1) parser tables is relatively straight-forward because there is a finite number of states, a finite number of actions that can conflict per state, and thus a finite number of conflicts. For a traditional scanner, the number of states is finite also, but scanner length conflicts span multiple states across state transitions, and transition loops can exist. Thus, the number of complete scanner conflicts and the number of matches per complete scanner conflict can be infinite. Nevertheless, the traditional lexical precedence rules are straight-forward to implement because all complete scanner conflicts are resolved by resolving every pairwise scanner conflict in isolation as described in the previous section. Moreover, because

all scanner conflicts are resolved implicitly, the scanner generator never reports any of them.

For PSLR(1), the number of complete pseudo-scanner conflicts and the number of matches per complete pseudo-scanner conflict can be infinite for the same reasons as for complete scanner conflicts for a traditional scanner, but it is not sufficient to resolve pairwise pseudo-scanner conflicts in isolation. Instead, because of the complexity of PSLR(1)’s lexical precedence rules, the PSLR(1) generator must examine a complete pseudo-scanner conflict as a whole to determine if and how it is resolved, and the generator must report it to the user if it is unresolved. The only way to devise an algorithm that terminates but still manages to handle every one of the potentially infinite number of complete pseudo-scanner conflicts is to divide those conflicts into a finite number of categories such that all conflicts in a category can be discovered, resolved, and reported in the same manner.

Our primary mechanism for categorizing complete pseudo-scanner conflicts is the scanner conflict profile.

Definition 3.2.17 (Scanner Conflict Profile)

Given:

1. A set of lexical precedence rules R.

2. A set of tokens T .

3. A character sequence ξ ∈ Ξ*.

Let ξs = ξ[1..|ξ| − 1], which is the empty string if |ξ| = 1. The scanner conflict profile for ξ over T in the context of R is then the tuple (Ts, ts, T`) such that:

1. Ts = {t : ∃λ : (λ, t) ∈ M(ξs, T)}, which might be ∅. Thus, we say that Ts is the set of all tokens matching the shorter lexemes.

2. Either:

(a) ∃ms ∈ M(ξs, T) : ∀m ∈ M(ξs, T) : m ≠ ms, m < ms according to R. In this case, ts is such that ∃λs : ms = (λs, ts). Thus, we say that ts is the token with the highest precedence match for the shorter lexemes.

(b) ts = undefined iff no such ms exists.

3. T` = {t : (ξ, t) ∈ M(ξ, T)}, which might be ∅. Thus, we say that T` is the set of all tokens matching the longest lexeme.


Definition 3.2.17 only makes sense for a complete scanner conflict because it considers all matches for a character sequence, ξ, over a set of tokens, T . However, there is no restriction on T , so a scanner conflict profile can be computed for any complete scanner conflict including a complete pseudo-scanner conflict. Our PSLR(1) generator is only concerned with complete pseudo-scanner conflicts, and it examines each parser state in isolation, so T is always the set of tokens accepted by a particular parser state. As a result, |T | is finite, the number of possible sets Ts ⊆ T is finite, the number of possible tokens ts ∈ Ts is finite, and the number of possible sets T` ⊆ T is finite. Thus, the number of possible combinations for Ts, ts, and T` to form a scanner conflict profile (Ts, ts,T`) must be finite. That is, for any given PSLR(1) specification, scanner conflict profiles divide complete pseudo-scanner conflicts into a finite number of categories as desired.
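Computing a profile from Definition 3.2.17 can be sketched directly. All names and encodings below are ours: matches(xi, T) stands in for M(ξ, T), and beats(m, m2) stands in for “m2 < m according to R”.

```python
def profile(xi, T, matches, beats):
    """Return (Ts, ts, Tl) for xi over T per Definition 3.2.17 (sketch)."""
    xs = xi[:-1]                          # all but the last character
    short = matches(xs, T)
    Ts = {t for (_, t) in short}          # tokens matching shorter lexemes
    ts = None                             # undefined unless one match wins
    for m in short:
        if all(beats(m, m2) for m2 in short if m2 != m):
            ts = m[1]
            break
    Tl = {t for (lx, t) in matches(xi, T) if lx == xi}
    return Ts, ts, Tl

# Toy lexical rules mimicking Figure 3.2b: WORD = [a-z]+, NON = "non",
# longer lexemes win, and NON wins identity conflicts.
def matches(xi, T):
    out = set()
    for i in range(1, len(xi) + 1):
        lx = xi[:i]
        if "WORD" in T and lx.isalpha():
            out.add((lx, "WORD"))
        if "NON" in T and lx == "non":
            out.add((lx, "NON"))
    return out

def beats(m, m2):
    if len(m[0]) != len(m2[0]):
        return len(m[0]) > len(m2[0])     # longest match
    return m[1] == "NON"                  # identity: NON over WORD

Ts, ts, Tl = profile("non", {"WORD", "NON"}, matches, beats)
print(Ts, ts, Tl)   # Ts = {'WORD'}, ts = 'WORD', Tl = {'WORD', 'NON'}
```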

We now explore whether every complete pseudo-scanner conflict with the same profile can be discovered, resolved, and reported in the same manner. The resolution of conflicts is the key to discovering and reporting them, so we consider resolution first.

Observation 3.2.18 (Resolution for Scanner Conflict Profile)

Given a scanner conflict profile (Ts, ts,T`) in the context of the set of lexical precedence rules R, then:

1. If ts ≠ undefined and if, ∀t ∈ T`, (t -< ts) ∈ R, then the highest precedence match is the highest precedence match for the shorter lexemes extracted from the input character sequence. In Definition 3.2.17, this match is called ms. Its token is ts.

2. Otherwise, if ∃t` ∈ T` : (∀t ∈ T` : t ≠ t`, (t <- t`) ∈ R) ∧ (∀t ∈ Ts, (t -< t`) ∈ R ∨ (t -∼ t`) ∈ R), then ∃λ` : (λ`, t`) is the highest precedence match. λ` is always the entire input character sequence.

3. Otherwise, the conflict is unresolved.



The justification for Observation 3.2.18’s algorithm is straight-forward. As in Definition 3.2.17, let ξ be the input character sequence, and let T be the set of tokens such that the complete scanner conflict is M(ξ, T). In both cases where the algorithm claims the conflict is resolved, the algorithm defines the highest precedence match as the match m ∈ M(ξ, T) : ∀m′ ∈ M(ξ, T) : m ≠ m′, m′ < m according to R. The algorithm claims the conflict is unresolved iff there is no such m.

Thus, by Definitions 3.2.9 and 3.2.5, the algorithm correctly resolves complete scanner conflicts.

By resolving conflicts based on their profiles, Observation 3.2.18’s algorithm immediately yields a strategy for reporting unresolved conflicts. That is, the algorithm reveals that the profile for a conflict contains enough information to determine whether the PSLR(1) specification’s lexical precedence rules, R, resolve the conflict. Thus, to describe all deficiencies in R to the user so that he can then adjust R to resolve any remaining unresolved conflicts, we assert that it is sufficient for the PSLR(1) generator to report one unresolved conflict per profile. With this strategy, the generator’s conflict report for a PSLR(1) specification with an infinite number of conflicts can remain comprehensive and yet finite.

Observation 3.2.18 is only a vague outline of how to resolve conflicts. There are several points we must address in order to develop it into a concrete algorithm. First, Observation 3.2.18’s algorithm does not provide an obvious way to actually discover conflicts so that the generator can even consider resolving and reporting them. Second, even though a conflict’s profile is enough to determine whether the conflict is resolved, which token belongs to the highest precedence match, and how to select the lexeme for the highest precedence match, the profile alone is not enough information to compute the actual lexeme for the highest precedence match. We must also know the input character sequence, ξ. Third, the algorithm is recursive. That is, in order to select the highest precedence match for ξ, we need to know the highest precedence match for the character sequence ξs = ξ[1..|ξ| − 1]. The derivation of a conflict’s profile from the conflict is recursive in the same manner because Ts and ts from Definition 3.2.17 are computed from ξs. To address all of these points, it seems that we need to examine every possible ξ for which there are matches while making sure to examine every prefix of a ξ before examining ξ. However, there are an infinite number of possible values for ξ, so we must divide them into a finite number of categories.

Every ξ for which there are matches corresponds to some state transition path from Σs[ss0] through Σs, but the same is not true for every ξ for which there are no matches. Thus, examining all such paths rather than all possible character sequences would avoid some character sequences for which there are no matches and thus no conflicts. A depth-first iteration of the states would examine all such paths while visiting prefixes of any ξ before ξ so that the algorithm can exploit the

recursive nature of Observation 3.2.18 and Definition 3.2.17. Moreover, this iteration would visit every accepting state in Σs so that, as we discussed in the previous section, the algorithm could construct scanner accepts. However, the number of paths and the lengths of paths through Σs can still be infinite because of transition loops. We must divide this infinite number of paths into a finite number of categories, and we must choose a finite path for each category.

Our PSLR(1) generator’s resolution algorithm for complete pseudo-scanner conflicts achieves the above goals by maintaining a scanner conflict profile map, profile map, during its depth-first iteration through the states of Σs. Each key in profile map is a profile, and each value is the set of indices for the Σs states at which the corresponding profile has already been encountered. We model profile map as an associative array. When it is cleared, the value for each possible key becomes ∅. We also assume the existence of a report conflict function for reporting one complete pseudo-scanner conflict to the user.

Definition 3.2.19 (compute scanner accepts)

1  for (let sp = 1; sp ≤ |Σp|; sp = sp + 1) do:
2    for (let ss = 1; ss ≤ |Σs|; ss = ss + 1) do:
3      let sa = state to accepting state[ss].
4      if (sa ≠ undefined) do:
5        set scanner accepts[sp][sa] = undefined.
6  for (let sp = 1; sp ≤ |Σp|; sp = sp + 1) do:
7    clear profile map.
8    resolve(sp, ss0, ∅, undefined).

Definition 3.2.20 (resolve(sp, ss,Ts, ts))

 1  for (let i = 1; i ≤ |Ξ|; i = i + 1) do:
 2    if (∃δ(Σs[ss], Ξ[i])) do:
 3      let s′s : Σs[s′s] = δ(Σs[ss], Ξ[i]).
 4      let T` = acc(Σp[sp], Σs[s′s]).
 5      let entry = profile map[(Ts, ts, T`)].
 6      if (s′s ∉ entry) do:
 7        let t` = undefined.
 8        if (T` = ∅ ∨ (ts ≠ undefined ∧ ∀t ∈ T`, (t-
16        report conflict(sp, s′s, Ts, ts, T`).
17        set profile map[(Ts, ts, T`)] = entry ∪ {s′s}.
18        resolve(sp, s′s, Ts ∪ T`, t`).

The compute scanner accepts function in Definition 3.2.19 initializes scanner accepts and, for each parser state, invokes the recursive resolve function to initiate the depth-first iteration starting at Σs[ss0] through Σs. The resolve function in Definition 3.2.20 implements that depth-first recursion, applies Observation 3.2.18 in order to resolve complete pseudo-scanner conflicts, computes scanner accepts based on that resolution, and reports conflicts that are not resolved. On line 14 in Definition 3.2.20, notice that, when a new token is stored in scanner accepts, the cell in which to store it is selected uniquely for each combination of sp and s′s. T` is also computed entirely from sp and s′s, and the new token to be stored is always the token with the highest lexical identity precedence in T`. In other words, there is one possible token for each cell in scanner accepts, and once that token is stored, it is never changed. The main importance of this observation is that we do not have to worry that recording the resolution of one conflict can alter the recorded resolution of another conflict.

Throughout the depth-first iteration, profile map has two purposes. The first purpose is achieved on line 6 in Definition 3.2.20, where profile map is used to determine if, for the current parser state, the current pseudo-scanner state in the depth-first iteration has been visited previously with the same conflict profile. If so, then the iteration does not recurse any deeper. The justification is that, because the parser state, pseudo-scanner state, and profile are the same, the conflict resolution result would be the same, the arguments to the nested resolve invocation would be the same, and so all actions the algorithm would perform by continuing into the recursion have already been initiated. Moreover, the actions that were initiated last time the algorithm visited this parser state, pseudo-scanner state, and profile might not yet have completed when the current resolve invocation was encountered. That is, the last resolve invocation for this parser state, pseudo-scanner state, and profile might still be on the call stack. In that case, ending the recursion here helps to avoid an infinite loop. In general, we can be sure our algorithm always terminates because there are only a

finite number of combinations of profiles and states. In this way, we achieve our goal of dividing a potentially infinite number of complete pseudo-scanner conflicts into a finite number of categories.

That is, for each parser state, conflicts are categorized by profile plus the last pseudo-scanner state visited when the conflict is discovered.
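The termination argument can be demonstrated in miniature. The encoding below is ours: a stand-in "profile" built from token subsets, a one-state FSA with a transition loop, and a profile map memoizing (profile, state) pairs. Because token subsets form a finite lattice, the walk halts even though raw path enumeration would not.

```python
def walk(ss, Ts, delta, accepts, profile_map, visits):
    """Depth-first walk over FSA transitions, cut by (profile, state) memo."""
    for (src, c), dst in sorted(delta.items()):
        if src != ss:
            continue
        Tl = frozenset(accepts.get(dst, ()))
        key = (Ts, Tl)                      # stand-in for the conflict profile
        entry = profile_map.setdefault(key, set())
        if dst in entry:
            continue                        # same state + same profile: stop
        entry.add(dst)
        visits.append((dst, key))
        # Accumulate matched tokens into the next profile, as resolve does
        # when it passes Ts ∪ Tl down the recursion.
        walk(dst, Ts | Tl, delta, accepts, profile_map, visits)

delta = {(1, "a"): 1}                       # a transition loop
accepts = {1: {"A"}}
visits = []
walk(1, frozenset(), delta, accepts, {}, visits)
print(len(visits))                          # 2: finitely many visits despite the loop
```

The loop is entered once with an empty profile and once with profile {A}; the third attempt finds the (profile, state) pair already recorded and stops, mirroring the cut on line 6 of Definition 3.2.20.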

The second purpose of profile map is to enable us to report one unresolved conflict per profile. The condition on line 15 achieves this purpose. However, the algorithm actually reduces the conflict report further. The T` = ∅ condition on line 8 guarantees no conflict is reported when there are no matches for the full character sequence being examined. The justification is simply that the complete conflict has no more matches than the complete conflict for the character sequence with one less character, which the algorithm just examined in the predecessor pseudo-scanner state. Thus, there is no new conflict to report. Thus, the only reason to bother storing a profile in profile map when T` = ∅ is to avoid repeating the recursion the next time this parser state, pseudo-scanner state, and profile are visited.

3.2.8 Summary

The basic behavior of a pseudo-scanner automatically eliminates many of the scanner conflicts among tokens from different sub-languages. Our PSLR(1) generator supports a lexical precedence directive to resolve remaining scanner conflicts in a careful, explicit, and declarative manner in order to further avoid the start conditions, other ad-hoc coding, and implicit conflict resolution of a traditional scanner generator like Lex. These features make our lexical precedence directive an important part of our unified scanner and parser specification language. As such, the directive will facilitate the modularization and arbitrary composition of sub-languages for the sake of modern extensible languages. However, as part of this effort, we have permitted lexical precedence rules that are more flexible but more complex than traditional lexical precedence rules. Determining which rules are useful enough in practice to be worth the complexity of the user’s task of comprehending them is part of our ongoing exploration of practical applications of PSLR(1). Nevertheless, for any combination of these rules that the user specifies, we have devised an algorithm with which our

PSLR(1) generator discovers the potentially infinite number of remaining scanner conflicts, resolves as many as possible according to the user’s specified rules, and produces a comprehensive but finite report of unresolved conflicts in order to aid the user in further developing his PSLR(1) specification.

3.3 Lexical Ties

A pseudo-scanner, as specified in Definition 3.1.1, always selects a syntactically acceptable match based on left context. However, for some languages, the correct match is not always a

(a) struct int;
(b) do {} whiles;
(c) void *i = &&j;

(d) %token-re ID ([a-zA-Z_][a-zA-Z_0-9]*)
    %symbol-set keywords ’struct’ ’int’ ’do’ ’while’ ’void’
    %lex-tie ID keywords
    %lex-prec ID <∼ keywords
    %lex-tie ’&’ ’&&’
    %lex-prec ’&’ -∼ ’&&’

Figure 3.6: Lexical Ties. By default, a pseudo-scanner always returns a syntactically acceptable token even though it may not be the correct token. In C, for example, (a) a keyword like int may be mistaken for an identifier, (b) a keyword like while may be broken off of an identifier like whiles or off of a similar keyword, and (c) an operator like & may be broken off of another operator like &&. (d) To avoid this problem, tokens that should not be confused but that match similar lexemes should be declared as lexically tied.

syntactically acceptable match. For example, in a language like C, all keywords are reserved words.

Thus, even though keywords and identifiers match similar lexemes, a scanner for C should never mistake a keyword for an identifier, and it should always select the longest matching keyword or identifier even if the selected match is a syntax error in the current context. Figure 3.6 shows a few cases where a pseudo-scanner for C would thus make the wrong choice. For the character sequence

“int” in Figure 3.6a, the match (“int”,ID) is syntactically acceptable based on left context, but the correct match, (“int”,’int’), is not. For the character sequence “whiles” in Figure 3.6b, the match (“while”,’while’) is syntactically acceptable based on left context, leaving the trailing

“s” for a subsequent scanner invocation, but the correct match, (“whiles”,ID), is not. Operators with similar lexemes can cause the same trouble as identifiers and keywords. For example, for the character sequence “&&” in Figure 3.6c, the match (“&”,’&’) is syntactically acceptable based on left context, leaving the trailing “&” for a subsequent scanner invocation, but the correct match,

(“&&”,’&&’), is not.1

In contrast, a scanner generated by a traditional scanner-generator tool like Lex usually handles the above examples correctly. When the user does not employ start conditions, the scanner always recognizes all tokens regardless of the current parser state. Thus, the scanner never ignores the correct keyword, identifier, or operator match simply because it is not syntactically acceptable.

When the user does employ start conditions, he must ensure that every start condition that recognizes any keyword or identifier also recognizes every other keyword and identifier regardless of where they are syntactically acceptable. When the scanner is in such a start condition, it then never ignores

1We assume ISO C99 without the label address operator extension supported by compilers like GCC.

the correct keyword or identifier match. When the scanner is in a start condition that accepts no keyword or identifier but a keyword or identifier appears in the input, a syntax error is guaranteed to be detected as expected. Likewise, the user must ensure that similar operators are always recognized by the same start conditions.

For a pseudo-scanner then, groups of tokens like keywords and identifiers or similar operators must somehow be tied together so that the scanner always recognizes all tokens from such a group where it recognizes any of them. However, requiring users to study sub-language transitions or, worse, parser tables in order to ensure that tokens are properly tied together worsens the complexity of developing and maintaining a PSLR(1) specification. Instead, a PSLR(1) specification can employ the lexical tie directive, denoted %lex-tie, to specify groups of tokens that the generator should automatically tie together in the pseudo-scanner. To make the declarations a little more succinct and maintainable, a PSLR(1) specification can also employ a symbol set directive, denoted %symbol-set.

For example, Figure 3.6d declares all of the keywords from the examples in Figures 3.6a,

3.6b, and 3.6c to be in a symbol set bearing the user-supplied name keywords. It then declares all keywords and the identifier to be lexically tied. Next, it declares traditional lexical precedence rules for conflicts between the identifier and any keyword, giving higher precedence to the keyword in the case of identity conflicts. Finally, in a separate group, it declares two similar operators, ’&’ and ’&&’, to be lexically tied, and it declares the traditional longest match rule for the length conflict between them. Thus, because a match for the identifier token is syntactically acceptable at the “int” in Figure 3.6a, the pseudo-scanner behaves as if a match for the ’int’ token is as well.

The highest precedence match is then (“int”,’int’). Similarly, because a match for the ’while’ token is syntactically acceptable at the “whiles” in Figure 3.6b, the pseudo-scanner behaves as if a match for the identifier token is as well. The highest precedence match is then (“whiles”,ID).

Finally, because a match for the ’&’ token is syntactically acceptable at the “&&” in Figure 3.6c, the pseudo-scanner behaves as if a match for the ’&&’ token is as well. The highest precedence match is then (“&&”,’&&’). In every case, the pseudo-scanner returns the highest precedence match to the parser, which reports the expected syntax error.

Formally, lexical tie declarations modify the definition of acc from Definition 2.2.12 as follows.

Definition 3.3.1 (Lexical Ties)

Given the token t and the set T of all tokens to which t is lexically tied, then ties(t) = {t} ∪ T . 

Definition 3.3.2 (Accepts with Lexical Ties)

Given the LR(1) parser state sp, then acc(sp) = {t : ∃((ℓ → ρ), d, K, (at, ap, as)) ∈ sp : (at = “S” ∧ ρ[d] ∈ ties(t)) ∨ (at = “R” ∧ K ∩ ties(t) ≠ ∅)}. 

This change to acc affects the basic definition of pseudo-scanner behavior from Definition

3.1.1, the definition of pseudo-scanner conflicts from Definition 3.2.1, and conflict resolution as performed by the resolve function from Definition 3.2.20 to construct the scanner accepts table.

That is, the purpose of declaring lexical ties is to expand sets of matches, possibly creating new pseudo-scanner conflicts. Thus, %lex-prec declarations like those appearing in Figure 3.6d might not be necessary until the associated tokens are declared to be lexically tied.
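The effect of Definition 3.3.2 on acc can be sketched as a closure over precomputed tie sets. All names here are illustrative; ties[t] plays the role of ties(t).

```python
def acc_with_ties(base_acc, ties, all_tokens):
    """A token is acceptable iff any token it is tied to is acceptable.
    base_acc: tokens the parser state accepts directly (Definition 2.2.12);
    ties[t]: the set ties(t), which always contains t itself."""
    return {t for t in all_tokens if ties[t] & base_acc}

tokens = {"ID", "'int'", "'&'", "'&&'"}
ties = {"ID": {"ID", "'int'"}, "'int'": {"'int'", "ID"},
        "'&'": {"'&'", "'&&'"}, "'&&'": {"'&&'", "'&'"}}

# A state that directly accepts only ID now also pseudo-accepts 'int', so
# the keyword can win the identity conflict and surface a syntax error:
print(acc_with_ties({"ID"}, ties, tokens) == {"ID", "'int'"})  # True
```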

There are a few important subtleties for lexical ties that need to be clarified. First, the lexical tie relation for any PSLR(1) specification is implicitly reflexive. For example, given a token T, the declaration “%lex-tie T T” is redundant because, in Definition 3.3.1, ties(t) always contains t.

Second, the lexical tie relation for any PSLR(1) specification is implicitly symmetric. That is, the order of operands for %lex-tie is irrelevant. For example, given the tokens A and B, our PSLR(1) generator interprets the declaration “%lex-tie A B” to mean that the pseudo-scanner should recognize A in all syntactic contexts in which B is recognized and that the pseudo-scanner should recognize B in all syntactic contexts in which A is recognized. Reversing this statement to reflect the opposite order of operands in the declaration “%lex-tie B A” does not change its meaning.

Third, the lexical tie relation for any PSLR(1) specification is implicitly transitive. Thus, in Definition 3.3.1, ties(t) includes all tokens to which t is explicitly declared to be lexically tied, but ties(t) also includes all tokens to which those tokens are lexically tied, and so on. While transitivity for the lexical tie relation might seem inconsistent with our decision that the lexical precedence relation should not be implicitly transitive, transitivity is actually an unavoidable consequence of the intuitive meaning of lexical ties. Continuing our example for the A and B tokens and given the token C, our generator interprets the declaration “%lex-tie B C” to mean that the pseudo-scanner should recognize B in all syntactic contexts in which C is recognized and that the pseudo-scanner should recognize C in all syntactic contexts in which B is recognized. However, given both declarations, the only way to recognize C in all contexts in which B is recognized is to recognize C in all contexts in which A is recognized, and the only way to recognize A in all contexts in which B is recognized is to recognize A in all contexts in which C is recognized. Thus, A and C are lexically tied implicitly.
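Because the lexical tie relation is reflexive, symmetric, and transitive, it is an equivalence relation, so ties(t) can be computed as the equivalence class of t. The following Python sketch illustrates this with a union-find structure; the class and method names are our invention for illustration and do not reflect the dissertation's actual implementation:

```python
class LexTies:
    """Illustrative sketch of ties(t) from Definition 3.3.1: a reflexive,
    symmetric, transitive relation maintained with union-find."""

    def __init__(self):
        self.parent = {}

    def _find(self, t):
        self.parent.setdefault(t, t)
        while self.parent[t] != t:
            self.parent[t] = self.parent[self.parent[t]]  # path halving
            t = self.parent[t]
        return t

    def lex_tie(self, a, b):
        """Declare '%lex-tie a b'; the order of operands is irrelevant."""
        self.parent[self._find(a)] = self._find(b)

    def ties(self, t):
        """All tokens lexically tied to t; always contains t itself."""
        root = self._find(t)
        return {u for u in self.parent if self._find(u) == root} | {t}

ties = LexTies()
ties.lex_tie("A", "B")             # %lex-tie A B
ties.lex_tie("B", "C")             # %lex-tie B C
assert ties.ties("A") == {"A", "B", "C"}   # A and C tied implicitly
assert ties.ties("T") == {"T"}             # reflexivity: ties(t) contains t
```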

There are also a few subtleties when lexical ties and symbol sets are combined. Given the symbol set set and the token T, then “%lex-tie set T” or the equivalent “%lex-tie T set” lexically ties T to every token t in set iff there is a pairwise scanner conflict between T and t. Given the symbol sets set-a and set-b, then “%lex-tie set-a set-b” or the equivalent “%lex-tie set-b set-a” lexically ties every token t in set-a to every token t′ in set-b iff there is a pairwise scanner conflict between t and t′. For example, if punctuators is a symbol set containing all punctuators from a language, then “%lex-tie punctuators punctuators” ensures that punctuators like ’+’ and ’++’ are lexically tied while it has no effect on punctuators like ’.’ and ’/’. Our justification for requiring conflicts is that, given the transitivity of lexical ties, avoiding unnecessary lexical ties can potentially avoid the creation of a large set of unnecessary pseudo-scanner conflicts. However, that requirement is relaxed when both operands are tokens. For example, given the tokens A and B, “%lex-tie A B” lexically ties A and B regardless of whether A and B have any pairwise scanner conflict. We are not yet sure that there is ever a practical reason to lexically tie tokens that do not conflict, but we permit this possibility in case it proves useful as we continue to explore practical applications of PSLR(1).
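The expansion of a %lex-tie declaration over symbol sets can be sketched as follows. This is a hypothetical illustration of the rules just described, not the generator's code; the function name and the representation of conflicts as frozensets are our assumptions:

```python
def expand_lex_tie(op1, op2, sets, conflicts):
    """Expand one %lex-tie declaration into concrete token pairs.

    sets: symbol-set name -> set of tokens; a bare token is a singleton.
    conflicts: set of frozensets {t, t'} with a pairwise scanner conflict.
    Per the rules above: when either operand is a symbol set, a pair is
    tied only if it has a pairwise scanner conflict; two bare tokens are
    tied unconditionally."""
    both_tokens = op1 not in sets and op2 not in sets
    ts1, ts2 = sets.get(op1, {op1}), sets.get(op2, {op2})
    pairs = set()
    for t in ts1:
        for u in ts2:
            if t == u:
                continue
            if both_tokens or frozenset((t, u)) in conflicts:
                pairs.add(frozenset((t, u)))
    return pairs

sets = {"punctuators": {"'+'", "'++'", "'.'", "'/'"}}
conflicts = {frozenset(("'+'", "'++'"))}
# "%lex-tie punctuators punctuators" ties only the conflicting pair:
assert expand_lex_tie("punctuators", "punctuators", sets, conflicts) == \
    {frozenset(("'+'", "'++'"))}
# Two bare tokens are tied even without a pairwise scanner conflict:
assert expand_lex_tie("A", "B", sets, conflicts) == {frozenset(("A", "B"))}
```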

Any pair of tokens that have a scanner conflict is a candidate for lexical tying if one of those tokens appears without the other in any parser state’s set of acceptable tokens.

Definition 3.3.3 (Lexical Tie Candidates)

Given any two different tokens t and t′ and an LR(1) parser whose state set is Σp, then (t, t′) is a lexical tie candidate for that parser iff ∃ξ ∈ Ξ* : ∃λ : ∃λ′ : {(λ, t), (λ′, t′)} ⊆ M(ξ, {t, t′}) and ∃sp ∈ Σp : (t ∈ acc(sp)) ≠ (t′ ∈ acc(sp)).
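Definition 3.3.3 can be read operationally: a pair of conflicting tokens is a candidate whenever some parser state accepts exactly one of them. The following Python sketch makes this concrete; the inputs (a list of acc sets and a co-matchability predicate standing in for M) are our assumptions, not the generator's actual data structures:

```python
from itertools import combinations

def lex_tie_candidates(states_acc, co_matchable):
    """Sketch of the lexical tie candidate report (Definition 3.3.3).

    states_acc: list of acc(sp) sets, one per parser state.
    co_matchable(t, u): True iff some character sequence xi yields matches
    for both t and u, i.e. the tokens have a scanner conflict."""
    tokens = set().union(*states_acc)
    cands = set()
    for t, u in combinations(sorted(tokens), 2):
        if not co_matchable(t, u):
            continue
        # candidate iff one token appears without the other in some state
        if any((t in acc) != (u in acc) for acc in states_acc):
            cands.add((t, u))
    return cands

# '>' and '>>' conflict on the sequence ">>"; one state accepts only '>>'
# while another accepts only '>', so the pair is reported as a candidate.
acc_sets = [{"'>>'", "ID"}, {"'>'", "ID"}]
conflict = lambda t, u: {t, u} == {"'>'", "'>>'"}
assert lex_tie_candidates(acc_sets, conflict) == {("'>'", "'>>'")}
```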

Our PSLR(1) generator reports all lexical tie candidates to aid the user in developing a correct PSLR(1) specification. The user can disable this report for a pair of tokens that should not be lexically tied by using the %lex-no-tie directive. For example, the PSLR(1) specification in Figure 2.3c declares “%lex-no-tie ’>’ ’>>’” to specify that, when the character sequence “>>” is encountered in a syntactic context where only ’>’ is acceptable, it is indeed the intention of the specification author that the pseudo-scanner should select the match (“>”,’>’) and ignore the possibility of (“>>”,’>>’). It is not possible to override the transitivity of lexical ties using %lex-no-tie. That is, if the A and B tokens are lexically tied and the B and C tokens are lexically tied, it is nonsensical to declare that A and C should not be lexically tied. As for %lex-tie, the order of operands of %lex-no-tie is irrelevant. That is, like the lexical tie relation, the lexical no-tie relation is implicitly symmetric. However, unlike the lexical tie relation, the lexical no-tie relation is not implicitly reflexive or transitive.

For some PSLR(1) specifications, the lexical tie candidate report might prove to require more work from the user than it is worth. For example, if a language contains no reserved keywords, constantly reassuring the PSLR(1) generator that basic pseudo-scanner behavior is indeed preferred might become tedious for the user. However, our PSLR(1) generator supports yyall as a built-in symbol set that contains all symbols. Thus, the user can suppress the lexical tie candidate report entirely by declaring “%lex-no-tie yyall yyall”. More specific declarations always override more generic declarations. Thus, if the user decides that keywords should be reserved but doesn’t want a report of all other lexical tie candidates, he can override “%lex-no-tie yyall yyall” for the ID token and the keywords symbol set by also declaring “%lex-tie ID keywords”.
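The rule that more specific declarations override more generic ones can be sketched as a simple specificity ranking: a bare token is more specific than a symbol set, which is more specific than yyall. The following Python sketch is entirely hypothetical (function names, the declaration encoding, and the tie-breaking among equally specific declarations are our assumptions):

```python
def tied(pair, decls, sets):
    """Decide whether a token pair is lexically tied, letting more specific
    %lex-tie/%lex-no-tie declarations override more generic ones.

    decls: list of (kind, op1, op2) with kind in {"tie", "no-tie"}.
    sets: symbol-set name -> token set; 'yyall' covers every token."""
    def covers(op, token):
        if op == "yyall":
            return True
        return token in sets.get(op, {op})

    def specificity(op):
        if op == "yyall":
            return 0               # least specific
        return 1 if op in sets else 2   # bare token is most specific

    best, best_spec = False, -1
    for kind, a, b in decls:
        if (covers(a, pair[0]) and covers(b, pair[1])) or \
           (covers(a, pair[1]) and covers(b, pair[0])):
            spec = specificity(a) + specificity(b)
            if spec >= best_spec:
                best, best_spec = (kind == "tie"), spec
    return best

sets = {"keywords": {"'if'", "'for'"}}
decls = [("no-tie", "yyall", "yyall"),   # suppress the report entirely...
         ("tie", "ID", "keywords")]      # ...but reserve keywords against ID
assert tied(("ID", "'for'"), decls, sets)       # specific tie wins
assert not tied(("'+'", "'++'"), decls, sets)   # generic no-tie applies
```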

As future work, we are considering extending our PSLR(1) generator with asymmetric lexical ties to improve support for languages, such as PL/I, that have non-reserved keywords, and for languages, such as SQL, that have both non-reserved and reserved keywords. For example, assume the user declares the rule of longest match for all length conflicts, and assume he declares that all keywords have higher lexical identity precedence than ID. Assume the user has declared the symbol set non-reserved containing all non-reserved keywords including ’for’. The user could then declare “%lex-tie ID -> non-reserved” as an asymmetric lexical tie indicating that the pseudo-scanner should recognize the ID token in all syntactic contexts in which non-reserved keywords are recognized. This would ensure that the pseudo-scanner selects the match (“foreach”,ID) instead of (“for”,’for’) for a variable name like “foreach” when ’for’ is acceptable in the current syntactic context but ID is not. However, the reverse asymmetric lexical tie is not declared. Thus, in a syntactic context where ID is acceptable but ’for’ is not, the pseudo-scanner would select the match (“for”,ID) instead of (“for”,’for’) for the character sequence “for” as desired. Asymmetric lexical ties would be combined with symmetric lexical ties to form a reflexive and transitive relation. Thus, if ’foreach’ is actually a reserved keyword and there is also a symmetric lexical tie between ID and a symbol set containing reserved keywords, then the pseudo-scanner would select the match (“foreach”,’foreach’) for the character sequence “foreach” in any syntactic context where ID, ’for’, or ’foreach’ is acceptable.

There is one final caveat about lexical ties, whether symmetric or asymmetric. Before lexical ties expand acc(sp), there might exist some token t ∈ acc(sp) such that every match for t has lower lexical precedence than some match for some other token in acc(sp) either before or after lexical ties expand acc(sp). Thus, due to lexical precedence and possibly lexical ties, t might not actually be acceptable in the syntactic contexts represented by sp even though the grammar implies it should be. So far, there is nothing problematic about this scenario as it reflects the user’s PSLR(1) specification. However, our PSLR(1) generator evaluates lexical ties to adjust acc(sp) before discovering, resolving, and reporting complete pseudo-scanner conflicts. Thus, the generated pseudo-scanner recognizes every token t′ ∈ ties(t) in every syntactic context represented by sp even though t itself is not recognized in any of those contexts. This result seems contrary to the concept of a lexical tie. That is, if t is not recognized in a syntactic context, then why is t′ added to that context? Solving this problem appears to be a chicken-and-egg problem: lexical ties are an input to conflict resolution, so there seems to be no obvious way to allow conflict resolution to affect the computation of lexical ties. Instead of trying to change this behavior, we take the approach of documenting it. If the user does not want this behavior, he must rewrite his grammar not to permit t to appear in syntactic contexts where his lexical precedence rules and lexical ties make t unacceptable.

3.4 Minimal LR(1)

Practical LR(1) parser table generation algorithms merge canonical LR(1) parser states.

The most popular such algorithm is LALR(1). In section 3.4.1, we discuss the reasons why state merging is often desired, and we discuss the loss of language recognition power that it can cause relative to canonical LR(1). In section 3.4.2, we summarize our IELR(1) algorithm, which is a minimal LR(1) algorithm in that it achieves the best of both canonical LR(1) and LALR(1) for traditional parsers. However, IELR(1) does not address problems that state merging causes for the behavior of a pseudo-scanner. Thus, in section 3.4.3, we present an IELR(1) extension that we have implemented as part of our PSLR(1) generator to handle those problems as well. We summarize in section 3.4.4.

3.4.1 LALR(1) Versus Canonical LR(1)

Consider again the PSLR(1) specification in Figure 2.3c. Recall that the PSLR(1) parser and pseudo-scanner it specifies accept the sentences in Figures 2.3a and 2.3b with the parse trees in Figures 2.3d and 2.3e, respectively. In section 3.1, we showed that the scanner conflict between the ’>’ and ’>>’ tokens at the character sequence “>>” in both sentences is not a pseudo-scanner conflict when the parser employs canonical LR(1) parser tables. Examining the canonical LR(1) parser tables, shown in the first column of Table 2.1, it is easy to see why. When the parser reaches the “>>” in Figure 2.3b, the parser’s current state, sp, is state 1, and so ’>>’ is in acc(sp) but ’>’ is not. When the parser reaches the “>>” in Figure 2.3a, sp is state 18, and so ’>’ is in acc(sp) but ’>>’ is not. The pseudo-scanner always selects the token from acc(sp).

Now consider what happens when the parser employs LALR(1) parser tables instead. The LALR(1) algorithm merges canonical LR(1) states 1 and 18 to form LALR(1) state 1, shown in the second column of Table 2.1. As a result, at the “>>” in both of our example sentences, sp is LALR(1) state 1, acc(sp) contains both the ’>’ and ’>>’ tokens, and so there is now a pseudo-scanner conflict. In general, by merging canonical LR(1) parser states, the LALR(1) algorithm sometimes merges the left contexts that distinguish between sub-languages. When this happens, the parser can lose the power to determine which token from which sub-language it should accept. The result is sometimes a new pseudo-scanner conflict as in our example. It is also possible that existing pseudo-scanner conflicts from different left contexts may merge in such a way that, for some left contexts, the lexical precedence rules select a different match than they did before the merge.
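The way state merging creates a pseudo-scanner conflict can be shown with a small Python sketch. The function and the match table standing in for M are hypothetical illustrations of the formal definitions, not the generator's code:

```python
def pseudo_scanner_conflicts(acc, matches):
    """Return, per character sequence xi, the tokens in acc(sp) that
    compete for xi.  matches: xi -> set of tokens xi can match (a toy
    stand-in for M).  A pseudo-scanner conflict exists when more than
    one acceptable token matches the same sequence."""
    return {xi: toks & acc
            for xi, toks in matches.items()
            if len(toks & acc) > 1}

M = {">>": {"'>'", "'>>'"}}             # ">>" matches both tokens
acc_state1, acc_state18 = {"'>>'"}, {"'>'"}

# With canonical LR(1) states, no pseudo-scanner conflict exists:
assert pseudo_scanner_conflicts(acc_state1, M) == {}
assert pseudo_scanner_conflicts(acc_state18, M) == {}

# Merging the states (as LALR(1) does) makes both tokens acceptable
# at once, and the scanner conflict becomes a pseudo-scanner conflict:
assert pseudo_scanner_conflicts(acc_state1 | acc_state18, M) == \
    {">>": {"'>'", "'>>'"}}
```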

The effect that parser state merging has on a pseudo-scanner is similar to a well-known effect it has on the parser itself. That is, because of the merging of left contexts, the parser sometimes loses the power to determine the correct parser action on a given lookahead token [15, 16, 17, 28, 29, 30, 32, 33]. The result may be the creation of a new parser conflict or the merging of existing parser conflicts such that precedence rules now select a different action for some left contexts than they did before the merge. We have previously introduced the term LR(1)-relative inadequacies to refer to all such changes in the parser’s behavior [15, 16]. We now introduce the term PSLR(1)-relative inadequacies to refer collectively to such changes in the parser’s behavior and the pseudo-scanner’s behavior. That is, given a set of parser tables, if the pseudo-scanner and parser do not accept exactly the same set of sentences each with the same parse tree as they would with canonical LR(1) parser tables, then the given set of parser tables is inadequate for PSLR(1).

Our choice of canonical LR(1) as the standard against which to compare other parser tables is not arbitrary. LR parsing is “the most general [deterministic]² shift-reduce parsing method known”, and canonical LR(1) is the most general technique for generating LR(1) parser tables [11]. As a result, when the user of our PSLR(1) generator considers the general behavior of deterministic shift-reduce parsing with one token of lookahead but without the complexity of any further parsing restriction, the behavior he should expect from the generated parser and pseudo-scanner is the behavior produced by canonical LR(1) parser tables. In contrast, in order to understand the behavior produced by parser tables with PSLR(1)-relative inadequacies, the user must consider not only the behavior of deterministic shift-reduce parsing but also the complex details of parser state construction and merging. In this way, eliminating all PSLR(1)-relative inadequacies provides the user with a simpler model of how the parser and pseudo-scanner should behave and thus facilitates the development and maintenance of a correct PSLR(1) specification.

Unfortunately, canonical LR(1) parser tables tend to be an order of magnitude larger than LALR(1) parser tables for practical languages [11]. There are at least two effects. First, canonical LR(1) parser tables at one time were considered to require “too much space and time to be useful in practice” [11]. However, the validity of this statement is fading with the increasing memory capacity and processing power of modern computers. Second, the extra states often contain unnecessary duplicates of conflicts, and so the difficulty of debugging conflicts can increase an order of magnitude as well [16].

3.4.2 IELR(1)

As part of our preliminary work, we described and implemented IELR(1) [15, 16]. IELR(1) is a minimal LR(1) parser table generation algorithm because it generates parser tables that are nearly the size of LALR(1) parser tables but with the full language recognition power of canonical LR(1) when the parser is not coupled with a pseudo-scanner. It does so by computing the source of each LR(1)-relative inadequacy in the LALR(1) parser tables and then splitting states to eliminate it. Thus, at any point in the parse of a syntactically acceptable sentence, an IELR(1) parser performs the same parser action as a canonical LR(1) parser does and thus constructs the same parse tree. Other minimal LR(1) algorithms we have found handle at most LR(1) grammars [28, 29, 30]. However, IELR(1) also correctly handles non-LR(1) grammars coupled with a specification for resolving parser conflicts.

We implemented IELR(1) as an extension of Bison, and we also parameterized IELR(1) to generate full canonical LR(1) parser tables. IELR(1) and canonical LR(1) are scheduled to appear in Bison 2.5. Bison has always implemented LALR(1). Table 3.3 characterizes grammars from a series of case studies that we have previously presented in order to compare LALR(1), IELR(1), and canonical LR(1) using Bison [15, 16]. We now summarize these case studies and their results to demonstrate the success of IELR(1).

Grammar  Version       |T|  |V|  |T ∪ V|  |P|
Gawk     Gawk 3.1.0     61   45    106    163
Gpic     Groff 1.18.1  138   45    183    247
C        GCC 4.0.4      92  208    300    573
Java     GCC 4.2.1     109  164    273    516
C++      ISO 2003      117  184    301    481

Table 3.3: IELR(1) Case Studies. These counts measure the size of each case study’s grammar G = (V, T, P, S), such that V is the set of nonterminals, T is the set of terminals or tokens, P is the set of productions, and S is the start symbol. These counts include the productions and nonterminals that Bison generates implicitly for mid-rule actions.

²GLR (Generalized LR) does not employ backtracking but is more general than LR [34]. Thus, we replace the word “nonbacktracking” in this quote with the word “deterministic”, which excludes both backtracking and GLR.

The first four case studies are mature parser specifications from widely used software applications that employ LALR(1) parser generators. Gawk (GNU AWK), a text-based data processing language, was first written in 1986 but is based on the original AWK, which was written in 1977 and is standardized in SUSv3 (the Single UNIX Specification, Version 3) [9, 4]. Groff (GNU Troff) is a document formatting system for UNIX that includes Gpic (GNU Pic), a Groff preprocessor for specifying diagrams. Groff was first released in 1990 and is based on Troff, which has existed since the early 1970’s [14, 6]. We copied our C and Java parser specifications from GCC (the GNU Compiler Collection), which is a widely used collection of compilers developed by the GNU Project [5].

The latest version of the C++ programming language is C++ 2003. Annex A of the C++ 2003 specification presents a formal C++ grammar [8]. As our final case study, we formatted this grammar as a Bison parser specification file except that, for section A.2, Lexical conventions, we (1) replaced the integer literal, character literal, floating literal, and string literal nonterminals with tokens, and (2) removed all productions that only those nonterminals depend upon.

                     States                S/R                     R/R
Grammar           LA   IE  Canon     LA     IE      Canon      LA     IE      Canon
Gawk             320  329   2359     65   65-0    265-200       0    0-0        0-0
  no prec/assoc  320  320   2467    410  410-0  3209-2799       0    0-0        0-0
Gpic             423  428   4834      0    0-0        0-0       0    0-0        0-0
  no prec/assoc  426  426   4871    803  803-0  7576-6773       8    8-0      24-16
C                933  933   4108     13   13-0      29-16       0    0-0        0-0
  no prec/assoc  933  933   4108    329  329-0  3731-3402       0    0-0        0-0
Java             792  792   6161      0    0-0        0-0      62   62-0    660-598
C++              822  836   9849    407  410-3  2871-2464     135 169-34  3130-2995

Table 3.4: IELR(1) Parser Tables. In this table, we describe the parser tables that the Bison LALR(1) implementation, our IELR(1) implementation, and our canonical LR(1) implementation generate for our case studies. We report the number of states and the number of parser conflicts left unresolved by the user. For canonical LR(1) and IELR(1), we show adjustments to account for such unresolved parser conflicts that are perfectly duplicated among states with identical cores.

Table 3.4 describes the parser tables that the Bison LALR(1) implementation, our IELR(1) implementation, and our canonical LR(1) implementation generate for each of our case studies.

Because some of our case studies include precedence and associativity declarations that resolve most of their parser conflicts, we also describe their parser tables when generated without these declarations in order to better demonstrate the complexity of the parser specification analysis. For example, the “no prec/assoc” row beneath the “Gpic” row reveals that the LALR(1) and IELR(1) algorithms must actually examine 803 S/R conflicts even though all parser conflicts are ultimately resolved by user declarations.

When IELR(1) or canonical LR(1) splits an LALR(1) state, conflicts in the LALR(1) state might be perfectly duplicated among some of the new states. Bison counts the conflict separately for each such duplicate, but the multiple count is a misleading representation of complexity because the same precedence and associativity declarations would resolve all duplicates. Therefore, from each IELR(1) or canonical LR(1) unresolved conflict count in Table 3.4, we subtract all but one copy of each unresolved conflict that is perfectly duplicated among states with identical cores.

Table 3.5 describes the parser actions that are corrected in the parser tables by switching from LALR(1) to IELR(1) or to canonical LR(1). We were surprised to discover that mature parser specifications from widely used software products employing LALR(1) parser generators should suffer from any incorrect parser actions that result from the misuse of the LALR(1) algorithm. The Gawk and Gpic case studies provide strong evidence that such incorrect actions do occur in real-world parsers. Such actions are unintuitive and thus may impede the development of a correct parser.

            Actions         States       Tokens
Grammar    IE   Canon     IE   Canon    IE  Canon
Gawk      9-0   90-81    3-0   30-27     3      3
Gpic      2-0   16-14    1-0     8-7     2      2
C         0-0     0-0    0-0     0-0     0      0
Java      0-0     0-0    0-0     0-0     0      0
C++       4-0   37-33    4-0   37-33     2      2

Table 3.5: IELR(1) Action Corrections. For each of our case studies, this table reports the number of parser actions that are corrected by switching from LALR(1) to IELR(1) or to canonical LR(1), the number of parser states containing corrected parser actions, and the number of unique tokens in the grammar on which there are corrected parser actions. We also show adjustments to account for action corrections that are perfectly duplicated among states with identical cores.

Our canonical LR(1) experiments led to an interesting insight. As had been observed previously in the literature, canonical LR(1) parser tables tend to be an order of magnitude larger than LALR(1) parser tables for practical LR(1) parser specifications [11]. The state counts in Table 3.4 are consistent with this observation. However, because of the increased splitting of states in canonical LR(1) relative to IELR(1), the number of conflicts that are perfectly duplicated and the number of action corrections usually increases an order of magnitude. Thus, the difficulty of investigating conflicts while developing a parser specification increases an order of magnitude as well. Interestingly, for all of our case studies, every new conflict is a perfect duplicate, so the adjusted conflict counts are the same as for IELR(1).

The IELR(1) algorithm consists of 6 phases, which we outline here. We have previously described the algorithm in full detail [16].

• Phase 0: LALR(1). This phase computes LALR(1) parser tables, which fully merge all LR(1) parser states with identical cores and thus contain some form of every possible LR(1)-relative inadequacy that can exist after any possible combination of such merges.

• Phase 1: Compute Auxiliary Tables. From the LALR(1) parser tables, this phase computes a number of additional tables employed by later phases.

• Phase 2: Compute Annotations. This phase identifies each parser conflict in the LALR(1) parser tables, traces each conflict back through all predecessor states that contribute to the conflict, and adds annotations to the visited states to record the nature of the states’ contributions.

• Phase 3: Split States. This phase effectively splits the LALR(1) states to eliminate all LR(1)-relative inadequacies. However, the algorithm actually recomputes all parser states in a manner similar to phase 0. The main difference is that, when considering whether to merge parser states, it employs a stricter state compatibility test based on the LALR(1) states’ annotations from phase 2.

• Phase 4: Compute Reduction Lookaheads. This phase recomputes the reduction lookahead sets throughout the recomputed parser states.

• Phase 5: Resolve Remaining Conflicts. This phase resolves all remaining parser conflicts by eliminating parser actions with the lowest precedence.

3.4.3 IELR(1) Extension for PSLR(1)

As shown in Table 2.1, the LALR(1) parser tables for the PSLR(1) specification in Figure 2.3c have no parser conflicts and thus no LR(1)-relative inadequacies. For this reason, the LALR(1) parser tables are also IELR(1) parser tables. However, as we explained in section 3.4.1, these tables do contain other PSLR(1)-relative inadequacies. Thus, our IELR(1) algorithm in its original form is not always adequate for PSLR(1). In this section, we explain how we extend IELR(1) to eliminate all PSLR(1)-relative inadequacies.

Because state merging is the cause of all PSLR(1)-relative inadequacies including LR(1)-relative inadequacies, much of our original IELR(1) algorithm can be reused for PSLR(1). We extend IELR(1) for PSLR(1) in two steps. First, we extend IELR(1) phase 2 to annotate parser states based on their contributions to pseudo-scanner conflicts. Second, we extend IELR(1) phase 3 by adjusting its state compatibility test to consider these extended annotations. This state compatibility test extension dictates the form of the annotations for phase 2, so we focus our discussion on phase 3.

Our extension to the state compatibility test of IELR(1) phase 3 must determine whether two states sp and s′p can be merged without introducing a PSLR(1)-relative inadequacy. The extension does not need to consider PSLR(1)-relative inadequacies that are LR(1)-relative inadequacies because the latter are already considered by IELR(1)’s existing state compatibility test.

Definition 3.4.1 (State Compatibility Test)

Given a set of lexical precedence rules R, which defines a lexical precedence function ∆, and given two LR(1) parser states sp and s′p, then PSLR(1)’s extension to the IELR(1) state compatibility test considers sp and s′p to be compatible in the context of R iff, ∀ξ ∈ Ξ*, either:

1. M(ξ, acc(sp)) = ∅ ∨ M(ξ, acc(s′p)) = ∅.

2. ∆(M(ξ, acc(sp))) = ∆(M(ξ, acc(s′p))) = ∆(M(ξ, acc(sp) ∪ acc(s′p))).



Point 1 in Definition 3.4.1 checks for the case when pseudo-scanner conflicts for ξ are irrelevant in the syntactic contexts represented by either sp or s′p. The concept of irrelevant pseudo-scanner conflicts resembles the concept of irrelevant parser conflicts in the original IELR(1) state compatibility test [16]. For example, if M(ξ, acc(sp)) = ∅, then we say that pseudo-scanner conflicts for ξ are irrelevant in the syntactic contexts represented by sp. In other words, in the syntactic contexts represented by sp, there are no acceptable tokens that match ξ, and so any match selected for ξ would correctly be detected as a syntax error by the parser. Thus, even though merging sp and s′p causes the selected match to become ∆(M(ξ, acc(s′p))), the pseudo-scanner’s behavior is still correct.

When pseudo-scanner conflicts for ξ are relevant in the syntactic contexts represented by sp and in the syntactic contexts represented by s′p, point 2 in Definition 3.4.1 checks that merging sp and s′p does not change the highest precedence match for ξ in any of those syntactic contexts. To determine what the highest precedence match becomes, the third segment of the equality in point 2 evaluates ∆ for the merged state. However, we prove that it is actually redundant to evaluate ∆ for the merged state. In the terminology of our original IELR(1) algorithm, this means that a lexical precedence function for a PSLR(1) specification is always merge-stable [16].

Theorem 3.4.2 (Merge-stable Lexical Precedence Function)

Given a lexical precedence function ∆, given two LR(1) parser states sp and s′p, and given a character sequence ξ ∈ Ξ*, then the following two conditions are equivalent:

1. ∆(M(ξ, acc(sp))) = ∆(M(ξ, acc(s′p))).

2. ∆(M(ξ, acc(sp))) = ∆(M(ξ, acc(s′p))) = ∆(M(ξ, acc(sp) ∪ acc(s′p))).


Proof (Theorem 3.4.2)

If point 1 in Theorem 3.4.2 is false, then point 2 in Theorem 3.4.2 is trivially false because it includes point 1. For the rest of this proof then, we only need to prove that, when point 1 is true, point 2 is also true. For brevity, let M = M(ξ, acc(sp)), and let M′ = M(ξ, acc(s′p)). Thus, by Definition 2.2.4, M(ξ, acc(sp) ∪ acc(s′p)) = M ∪ M′. Let m = ∆(M), and thus m = ∆(M′) because we assume point 1 is true. Our goal then is to prove that m = ∆(M ∪ M′) so that point 2 is true:

1. Consider the case where m = undefined. That is, by Definition 3.2.5, there exists no match in M that has higher precedence than all other matches in M, and the same is true for M′. Because neither M nor M′ has a highest precedence match, then neither does M ∪ M′, so ∆(M ∪ M′) = undefined = m.

2. Consider the case where m ≠ undefined. That is, by Definition 3.2.5, m has higher precedence than all other matches in M and all other matches in M′, so m has higher precedence than all other matches in M ∪ M′. Moreover, by Definition 3.2.5, m ∈ M and m ∈ M′, so m ∈ M ∪ M′. Thus, m = ∆(M ∪ M′).


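The merge-stability property can be illustrated concretely. The following Python sketch models ∆ as "the unique match that beats all others, else undefined" and checks Theorem 3.4.2 on a small example; the function names and the precedence predicate (here, the rule of longest match) are our assumptions for illustration:

```python
def delta(matches, prec):
    """Toy lexical precedence function: the unique match with higher
    precedence than all others, or None ('undefined') when no such match
    exists.  prec(m, n) stands in for Definition 3.2.5's relation."""
    for m in matches:
        if all(prec(m, n) for n in matches if n != m):
            return m
    return None

# Rule of longest match: a longer matched lexeme has higher precedence.
prec = lambda m, n: len(m[0]) > len(n[0])

M1 = {("for", "'for'"), ("foreach", "ID")}   # matches in state sp
M2 = {("foreach", "ID")}                     # matches in state s'p

d1, d2 = delta(M1, prec), delta(M2, prec)
assert d1 == d2 == ("foreach", "ID")         # point 1 of Theorem 3.4.2 holds
assert delta(M1 | M2, prec) == d1            # so the merged state agrees too
```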

In section 3.2.4, we considered whether it would be reasonable to use the sequential lexical precedence function, F, defined by ∆ in place of ∆ so that the highest precedence match could be computed in linear time. However, Theorem 3.4.2 would not necessarily be valid if we did so. The trouble would be that, if F were ambiguous, the highest precedence match for sp, for s′p, and for the merged state would depend on how the match ordering would be chosen in each case. Fortunately, ∆ returns undefined when F is ambiguous, and our generator reports an unresolved conflict, so we can use Theorem 3.4.2 to simplify our state compatibility test.

Definition 3.4.3 (State Compatibility Test, simplified)

Given a set of lexical precedence rules R, which defines a lexical precedence function ∆, and given two LR(1) parser states sp and s′p, then PSLR(1)’s extension to the IELR(1) state compatibility test considers sp and s′p to be compatible in the context of R iff, ∀ξ ∈ Ξ*, either:

1. M(ξ, acc(sp)) = ∅ ∨ M(ξ, acc(s′p)) = ∅.

2. ∆(M(ξ, acc(sp))) = ∆(M(ξ, acc(s′p))).


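The simplified test can be sketched directly in Python. This is a minimal illustration under stated assumptions: the finite `sequences` list stands in for all of Ξ*, and the toy M and delta model the ’>’ versus ’>>’ example (the real generator analyzes the pseudo-scanner's FSA rather than enumerating sequences):

```python
def pslr_compatible(acc1, acc2, sequences, M, delta):
    """PSLR(1) extension to the IELR(1) state compatibility test, per the
    simplified Definition 3.4.3: two states may merge iff, for every
    character sequence, either one state's match set is empty (point 1)
    or both states select the same highest precedence match (point 2)."""
    for xi in sequences:
        m1, m2 = M(xi, acc1), M(xi, acc2)
        if not m1 or not m2:
            continue          # point 1: conflicts for xi are irrelevant
        if delta(m1) != delta(m2):
            return False      # point 2 violated: merging would change behavior
    return True

# Toy model: the sequence ">>" matches both '>' (length 1) and '>>'.
def M(xi, acc):
    all_matches = {">>": {(">", "'>'"), (">>", "'>>'")}}
    return {(lam, t) for lam, t in all_matches.get(xi, set()) if t in acc}

delta = lambda ms: max(ms, key=lambda m: len(m[0])) if ms else None

# States accepting only '>>' and only '>' select different matches for
# ">>", so the extension keeps them split (as canonical states 1 and 18):
assert not pslr_compatible({"'>>'"}, {"'>'"}, [">>"], M, delta)
assert pslr_compatible({"'>>'"}, {"'>>'"}, [">>"], M, delta)
assert pslr_compatible({"ID"}, {"'>'"}, [">>"], M, delta)   # point 1 applies
```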

Regardless of whether we use Definition 3.4.1 or 3.4.3, we have not yet considered whether the state compatibility test actually makes sense when the complete pseudo-scanner conflict for ξ is unresolved in either sp or s′p. Without loss of generality, we only examine the case where it is unresolved in sp. That is, ∆(M(ξ, acc(sp))) = undefined. There are two subcases to consider. First, if ∆(M(ξ, acc(s′p))) ≠ undefined, then ∆(M(ξ, acc(sp))) ≠ ∆(M(ξ, acc(s′p))), so sp and s′p are not merged. This makes sense because merging sp with s′p would have one of two effects: (1) a match from the syntactic contexts represented by s′p would become the highest precedence match in the syntactic contexts represented by sp and thus suppress the conflict report for the latter syntactic contexts, or (2) the complete pseudo-scanner conflict in the syntactic contexts represented by s′p would become unresolved even though the user’s lexical precedence rules were sufficient to resolve it. Either effect is undesirable, so refusing to merge the states is the correct decision.

The second subcase is when ∆(M(ξ, acc(s′p))) = undefined. Thus, ∆(M(ξ, acc(sp))) = ∆(M(ξ, acc(s′p))), so sp and s′p are merged. Unfortunately, this means the generator is forced to report a merged version of the unresolved complete pseudo-scanner conflict instead of reporting only the matches that are actually present in each syntactic context. To avoid this problem, we could choose instead to never merge sp and s′p in this subcase as long as sp ≠ s′p. However, if the user has just written the first draft of his PSLR(1) specification and has not yet resolved any conflicts because he wants to read the generator’s conflict report first, then states with pseudo-scanner conflicts would only be merged with identical states. Thus, the generated parser tables could be as large as canonical LR(1) parser tables. As we explained earlier in this dissertation, canonical LR(1) parser tables can severely complicate the process of debugging conflicts because the number of parser states can increase by an order of magnitude. Thus, our approach is simply to document that unresolved complete pseudo-scanner conflicts from multiple syntactic contexts are sometimes merged in the conflict report. A similar problem can occur for parser conflicts that are left unresolved by the user but happen to have the same default resolution, so this is not a new phenomenon. If the user believes that canonical LR(1) parser tables can aid in the debugging of parser and pseudo-scanner conflicts in these cases, our PSLR(1) generator accepts an option to switch the parser table generation algorithm from IELR(1) to canonical LR(1).

66 The conflict resolution algorithm presented in section 3.2.7 might be too time-consuming to perform before every merge that must be considered during IELR(1) phase 3. Thus, our PSLR(1) generator does not implement the state compatibility test exactly as written in Definition 3.4.1 or

Definition 3.4.3. We observe that, if R selects the same highest precedence match for every pairwise

0 pseudo-scanner conflict in sp as in sp, then it is guaranteed to select the same highest precedence 0 match for every complete pseudo-scanner conflict in sp as in sp because any complete scanner conflict is a collection of pairwise scanner conflicts according to Definition 2.2.5. Thus, phase 3 applies our state compatibility test from Definition 3.4.3 to pairwise pseudo-scanner conflicts instead of complete pseudo-scanner conflicts. This shift to the pairwise level improves the performance of phase 3 because, instead of using our algorithm from 3.2.7 to analyze the exact combination of tokens present in each parser state encountered during phase 3, our generator can iterate the pseudo-scanner’s FSA once before IELR(1) begins and summarize how all possible pairwise scanner conflicts for each pair of tokens are resolved. It is important to understand that we shift to the pairwise level in this manner only for the sake of parser table construction. Afterwards, complete pseudo-scanner conflicts are fully discovered, resolved, and reported for the resulting parser tables using our algorithm from section 3.2.7.

Because not all pairwise pseudo-scanner conflicts need to be resolved in order to resolve all complete pseudo-scanner conflicts, our state compatibility test is more strict than necessary, and unnecessary state splitting can occur as a result. As we discuss in section 4, this effect does not prove to be problematic for our case studies. Part of the reason is our decision, discussed earlier in this section, that unresolved pseudo-scanner conflicts should not prevent state merging. That is, when all complete pseudo-scanner conflicts are resolved, then pairwise pseudo-scanner conflicts that remain unresolved do not cause unnecessary state splitting. Of course, when we discussed this decision earlier, we discussed it at the level of complete pseudo-scanner conflicts rather than at the level of pairwise pseudo-scanner conflicts, and we said that complete pseudo-scanner conflicts might be merged in the conflict report as a result. This conflict report problem still exists after our shift to the pairwise level because, when all pairwise pseudo-scanner conflicts remain unresolved, then complete pseudo-scanner conflicts remain unresolved. Fortunately, after our shift to the pairwise level, that problem is lessened when just a few of the associated pairwise pseudo-scanner conflicts are resolved differently in s′p than in sp.

3.4.4 Summary

In this section, we have discussed the trade-offs between the LALR(1) and canonical LR(1) parser table generation algorithms. We have also explained how our IELR(1) algorithm is a minimal LR(1) algorithm in that it achieves the best of both canonical LR(1) and LALR(1) for traditional parsers. Finally, we presented an IELR(1) extension required by PSLR(1) parsers and their pseudo-scanners.

3.5 Syntax Error Handling

There are three tasks for handling syntax errors, and all are ultimately performed by the PSLR(1) parser: detecting, reporting, and recovering. However, before the parser can perform these three tasks, the pseudo-scanner must first select some match to return to the parser upon encountering the syntax error. This match has an important influence on the way in which the parser performs its three tasks, so the pseudo-scanner needs to make a reasonable selection. First, the match should be syntactically unacceptable so that the parser can detect the syntax error. Second, the match should be a reasonable guess at interpreting the erroneous input character sequence so that the parser can construct a reasonable syntax error message to report to the user of the parser. Third, if the parser implements a syntax error recovery mechanism, such a reasonable guess also improves the parser’s chances of recovering from the syntax error and correctly parsing the remaining input. Finally, in cases where there is no obvious way to make a reasonable guess, it is still helpful to document some deterministic algorithm to select a match so that the author of a PSLR(1) specification can more easily develop formal test suites for his generated PSLR(1) parsers and pseudo-scanners. In section 3.5.1, we discuss the pseudo-scanner’s mechanisms for selecting a match upon encountering a syntax error. In section 3.5.2, we discuss the parser’s mechanisms for detecting and reporting a syntax error. The choice of a syntax error recovery mechanism for the parser is orthogonal to the choice between PSLR(1) and traditional scanner-based LR(1), so it is beyond the scope of this paper. We summarize in section 3.5.3.

3.5.1 Pseudo-scanner

The basic pseudo-scanner behavior defined in Definition 3.1.1 dictates that, for the character sequence ξ and the parser state sp, the pseudo-scanner always selects a match from M(ξ, acc(sp)). This definition seems to imply that the pseudo-scanner only selects matches that are syntactically acceptable. In that case, it would be impossible for there to be a syntax error unless M(ξ, acc(sp)) = ∅. However, as we explained in section 3.4, LR(1) state merging sometimes adds to acc(sp) tokens that are not always syntactically acceptable when the parser is in state sp. Also, as we explained in section 3.3, the user can declare lexical ties, which can add other syntactically unacceptable tokens to acc(sp). Thus, the PSLR(1) parser and pseudo-scanner must have syntax error handling mechanisms for the case when M(ξ, acc(sp)) = ∅ and for the case when M(ξ, acc(sp)) ≠ ∅.
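To make the role of the match set concrete, the following is a minimal, hedged sketch in Python of M(ξ, A): the set of (lexeme, token) pairs whose pattern matches a prefix of ξ for tokens in the allowed set A (such as acc(sp)). The token names and patterns here are invented for illustration, and a real PSLR(1) pseudo-scanner runs a single combined FSA rather than per-token regular expressions.

```python
import re

# Illustrative token patterns (not from the dissertation's case studies)
TOKEN_PATTERNS = {
    'ID':  r'[A-Za-z_]\w*',
    'NUM': r'\d+',
    'SHR': r'>>',
    'GT':  r'>',
}

def M(xi, allowed):
    """Return the set of (lexeme, token) pairs where token t is in `allowed`
    and t's pattern matches a prefix of the input character sequence xi."""
    out = set()
    for tok in allowed:
        m = re.match(TOKEN_PATTERNS[tok], xi)
        if m:
            out.add((m.group(0), tok))
    return out
```

Restricting the allowed set models the pseudo-scanner: for input “>>x”, M returns both the ’>’ and ’>>’ matches when both tokens are in acc(sp), but only the ’>’ match when acc(sp) excludes ’>>’.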

If M(ξ, acc(sp)) ≠ ∅ when a syntax error is encountered, then the pseudo-scanner is not aware of the syntax error. In this case, the pseudo-scanner always selects a match (λ, t) ∈ M(ξ, acc(sp)) just as it would if there were no syntax error. If t is lexically tied to some token that is syntactically acceptable, then the pseudo-scanner’s selection of the match (λ, t) is not merely a reasonable guess. It is an exact selection according to the PSLR(1) specification’s lexical ties and lexical precedence rules. If, instead, t is not lexically tied to some token that is syntactically acceptable, then the reason for t’s appearance in acc(sp) must be LR(1) state merging. In this case, the pseudo-scanner’s selection of the match (λ, t) is a reasonable guess because state merging means that there exists some similar syntactic context in which (λ, t) is acceptable.

If M(ξ, acc(sp)) = ∅, then the pseudo-scanner is aware of the syntax error and must expand its search to tokens outside of acc(sp) in order to find a match. In this case, given the PSLR(1) specification’s grammar G : G(G) = (V′, T′, P′, S′), then the pseudo-scanner tries to select a match from M(ξ, T′). To support this mechanism, our PSLR(1) generator extends our scanner accepts table from Definition 3.2.14 with a fallback row. Our generator computes the fallback row in nearly the same way it computes a row for an actual parser state. That is, it uses the functions compute scanner accepts and resolve from Definitions 3.2.19 and 3.2.20 with two small modifications. First, in place of acc(sp), it uses T′. Second, even though it attempts to resolve all complete scanner conflicts using the lexical precedence rules specified in the PSLR(1) specification, there may exist complete scanner conflicts that are not complete pseudo-scanner conflicts and thus are not resolved by these lexical precedence rules. In this case, the generator falls back on traditional lexical precedence rules. Of course, the pseudo-scanner’s behavior must reflect this choice while scanning the input, so the generator also modifies the length precedences table from Definition 3.2.15 to specify the rule of longest match for all length conflicts that are left unresolved by the PSLR(1) specification. Finally, the pseudo scan function from Definition 3.2.16 is extended to check for matches in the fallback row of scanner accepts until it finds a match in the current parser state’s row. If it never finds a match in the current parser state’s row, it returns the best match from the fallback row.

It is also possible that both M(ξ, acc(sp)) = ∅ and M(ξ, T′) = ∅. That is, there might exist no match in the current parser state’s row or the fallback row in scanner accepts. In this case, the pseudo-scanner returns a special token that matches only the lexeme ξ[1]. Such a token is often called a character token. The next time the parser requests a token, the pseudo-scanner looks for a match for ξ − ξ[1]. In other words, the pseudo-scanner returns one character at a time until it can match a token in acc(sp) or T′.
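The full fallback cascade described above can be sketched as follows. Here M and pick_best are stand-ins for the match function and the lexical-precedence resolution of section 3.2.7, and the token names used in the tests are illustrative; this is a sketch of the control flow, not of the table-driven implementation.

```python
def pseudo_scan_with_fallback(xi, acc_sp, all_tokens, M, pick_best):
    """Select a match for input xi in three stages:
    1. the current parser state's row, M(xi, acc(sp));
    2. the fallback row over all grammar tokens, M(xi, T');
    3. a single-character token, so the scanner always advances."""
    matches = M(xi, acc_sp)
    if matches:
        return pick_best(matches), 'state-row'
    matches = M(xi, all_tokens)          # fallback row over T'
    if matches:
        return pick_best(matches), 'fallback-row'
    return (xi[0], 'CHAR'), 'char-token' # one character at a time
```

The second return value merely labels which stage produced the match, mirroring the three cases discussed in the text.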

3.5.2 Parser

In every case in which a syntax error is encountered, the pseudo-scanner mechanisms we discussed in the previous section select a match for the input character sequence. The selected match cannot be syntactically acceptable because, in the case of a syntax error, there exists no syntactically acceptable match for the input character sequence. Because the match is not syntactically acceptable, the parser is guaranteed to detect the syntax error after it receives the match from the pseudo-scanner. The parser then reports the syntax error and lists the tokens for which there are acceptable matches in the current syntactic context.

Unfortunately, the parser can sometimes experience delayed syntax error detection and can report an incorrect list of accepted tokens. There are three causes: LR(1) state merging, default reductions, and explicit error actions in the parser. In this section, we explain these problems in detail, and we explain how we fix them. The problems and our fix are relevant for both PSLR(1) parsers and traditional scanner-based LR(1) parsers, and so they are actually orthogonal to our PSLR(1) work. Nevertheless, we have chosen to fix the problems as part of our PSLR(1) work because we feel that they are too severe to be ignored in any type of parser generation system.

LR(1) state merging, default reductions, and explicit parser error actions all have a common effect that leads to delayed syntax error detection and incorrect accepted token lists: they cause parser states to accept tokens that are not actually syntactically acceptable. As we discussed in section 3.4, LR(1) state merging does so by merging lookahead sets from different syntactic contexts.

Default reductions are a parser table optimization that removes the largest lookahead set from each parser state. The reduction with which a removed lookahead set was associated becomes the default reduction in that parser state. That is, when the parser is in such a state and encounters a token that has no parser action in that state, the parser must perform the default reduction in case that token might have been a member of the removed lookahead set. Thus, all tokens become acceptable according to that parser state.
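A hedged sketch of how a generator might apply this optimization for one parser state: the reduction with the largest lookahead set becomes the default, and its lookahead set is dropped from the explicit actions. The data layout is invented for illustration and is not the dissertation tool’s representation.

```python
def apply_default_reductions(reductions):
    """reductions: {production: lookahead_set} for one parser state.
    Returns (explicit_actions, default_production), where explicit_actions
    maps each remaining lookahead token to its reduction and the default
    production is performed for any token with no explicit action."""
    if not reductions:
        return {}, None
    # The reduction with the largest lookahead set becomes the default.
    default = max(reductions, key=lambda p: len(reductions[p]))
    explicit = {tok: p for p, las in reductions.items()
                if p != default for tok in las}
    return explicit, default
```

Because the default reduction fires for any token lacking an explicit action, every token is “acceptable” in such a state, which is exactly how this optimization leads to delayed error detection.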

Explicit parser error actions are the way in which S/R parser conflicts are resolved if the conflicted token is declared to be non-associative. In Yacc and Bison, non-associative tokens are declared with the directive %nonassoc. The trouble is that parser tables are constructed before conflict resolution, and so there may still exist reduction lookahead sets that contain the non-associative token in parser states other than the conflicted parser state. Removing the non-associative token from those lookahead sets might not be possible without splitting the containing states because those states might have some syntactic contexts in which the S/R conflict for the non-associative token would never be encountered.

Regardless of the reason why a syntactically unacceptable token is acceptable according to the current parser state, when the parser encounters such a token, the parser performs reductions until it finally reaches a state that either has an explicit error action for the token or has no explicit action for the token and no default reduction. In other words, the parser’s detection of the syntax error is delayed as erroneous reductions and associated semantic actions are performed. Moreover, the state the parser has reached when it finally detects the syntax error is not the original parser state in which the token was encountered, so it might not accept the same tokens as the original parser state. Of course, due to the three causes we have been discussing, the original parser state might not accept exactly the tokens that are actually syntactically acceptable anyway. Thus, whether delayed detection happens or not, the list of tokens the parser reports based on the current state might be incorrect. It might contain syntactically unacceptable tokens, and it might omit syntactically acceptable tokens.

It might seem that lexical ties can cause the parser state to accept syntactically unacceptable tokens because lexical ties add tokens to acc(sp). However, our PSLR(1) generator does not compute parser actions for lexically tied tokens. Lexical ties only affect (1) the scanner accepts table constructed for the pseudo-scanner and (2) the state compatibility test of PSLR(1)’s IELR(1) extension because this extension must consider pseudo-scanner conflict resolution.

As we explained in section 3.4, canonical LR(1) never merges states with different lookahead sets. Though the default reduction optimization can be applied to any set of LR(1) parser tables, it is not considered part of canonical LR(1). However, explicit parser error actions like those set by %nonassoc do cause parser states to accept syntactically unacceptable tokens in the case of canonical LR(1). Thus, for non-LR(1) grammars with S/R conflicts resolved by %nonassoc, even canonical LR(1) is not immune to the problems of delayed syntax error detection and incorrect accepted token lists in syntax error reports.

We call our solution to these problems LAC (lookahead correction). LAC is a straightforward extension to the LR(1) parsing algorithm as follows. Whenever the parser requires a new token from the scanner so that it can determine what parser action to perform, we say that token is the lookahead, and we say the current parser stack represents the initial context of that lookahead. Upon receiving this lookahead from the scanner, the parser immediately performs an exploratory parse to see if the returned token is syntactically acceptable. During this exploratory parse, any reductions of the parser stack are performed on a temporary copy of the parser stack so that the initial context of the lookahead is not lost. Moreover, this exploratory parse is entirely syntactic in that the user’s semantic actions are not performed because the effect of those actions during an erroneous parse might be undesirable. If the exploratory parse reaches a shift action, then the lookahead is syntactically acceptable, so the parser discards the temporary stack and resumes a full parse on the permanent stack. If the exploratory parse reaches a syntax error instead, then the parser discards the temporary stack and reports the syntax error. To build the list of syntactically accepted tokens for the syntax error, the parser performs exploratory parses for every token in the grammar. For each of these exploratory parses, the permanent stack, which still represents the initial context of the token, is copied to a temporary stack.
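The exploratory parse at the heart of LAC can be sketched as follows for a tiny hand-built LR parser for the grammar S → '(' S ')' | 'x'. The state numbers, table layout, and helper names are illustrative, not taken from our implementation; the default reductions in states 3 and 5 are what make the exploratory parse necessary.

```python
# Productions: index -> (lhs, rhs_length); 1: S -> ( S ) ; 2: S -> x
PRODS = {1: ('S', 3), 2: ('S', 1)}

ACTION = {
    0: {'(': ('s', 2), 'x': ('s', 3)},
    1: {'$': ('acc',)},
    2: {'(': ('s', 2), 'x': ('s', 3)},
    3: {},                            # lookahead set removed: default reduction
    4: {')': ('s', 5)},
    5: {},                            # lookahead set removed: default reduction
}
DEFAULT = {3: 2, 5: 1}                # default reduction per state
GOTO = {0: {'S': 1}, 2: {'S': 4}}

def lac_acceptable(stack, tok):
    """Exploratory parse on a temporary copy of the stack: perform only
    reductions (no semantic actions) until tok can be shifted (True) or
    a syntax error is certain (False).  The permanent stack, i.e. the
    lookahead's initial context, is never modified."""
    tmp = list(stack)
    while True:
        st = tmp[-1]
        act = ACTION[st].get(tok)
        if act is None and st in DEFAULT:
            act = ('r', DEFAULT[st])
        if act is None:
            return False              # syntax error for tok
        if act[0] in ('s', 'acc'):
            return True               # tok is syntactically acceptable
        lhs, n = PRODS[act[1]]
        del tmp[len(tmp) - n:]        # pop the RHS, then goto on the LHS
        tmp.append(GOTO[tmp[-1]][lhs])

def expected_tokens(stack):
    """Accepted-token list for an error report: one exploratory parse per
    grammar token, each starting from a fresh copy of the permanent stack."""
    return sorted(t for t in ('(', ')', 'x', '$') if lac_acceptable(stack, t))
```

For input “x” followed by ’)’, the stack is [0, 3]: the default reduction in state 3 would fire before the error is detected, but the exploratory parse performs it only on the temporary stack and correctly reports that only end of input is acceptable.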

Because LAC requires many parse actions to be performed twice, it can have a performance penalty. However, not all parse actions must be performed twice. If a state has only one action and that action is a default reduction, then the parser does not need a lookahead. Thus, during a contiguous series of shift actions and such reduce actions, the parser never has to initiate an exploratory parse. Moreover, the most time-consuming tasks in a parse are often the file I/O, the lexical analysis performed by the scanner, and the user’s semantic actions, but none of these are performed during the exploratory parse. In our experience, the performance penalty of LAC has proven insignificant for practical grammars.

As we explained in section 3.4, for traditional LR(1) parsing, IELR(1) is guaranteed to perform exactly the same parse actions as canonical LR(1) for any syntactically acceptable input. For PSLR(1) parsing, we need to add PSLR(1)’s IELR(1) extension for this guarantee to hold. For traditional LR(1) parsing or PSLR(1) parsing, when we also add LAC, IELR(1) is guaranteed to perform exactly the same parse actions as canonical LR(1) for any input. If there are explicit error actions due to %nonassoc, then this guarantee only holds if LAC is added to canonical LR(1) as well.

3.5.3 Summary

There are three tasks for handling syntax errors, and all are ultimately performed by the PSLR(1) parser: detecting, reporting, and recovering. The pseudo-scanner’s selection of a match to return to the parser upon encountering a syntax error has an important influence on the way in which the parser performs its three tasks. In this section, we have discussed the pseudo-scanner’s mechanisms for selecting this match, and we have discussed the parser’s mechanisms for detecting and reporting a syntax error. We also explained how LR(1) state merging, default reductions, and explicit parser error actions cause delayed syntax error detection and incorrect accepted token lists in syntax error messages. We then described LAC, an LR(1) parsing algorithm extension that we devised to fix these problems.

3.6 Whitespace and Comments

Grammars for programming languages like C and C++ usually do not specify the acceptable locations for whitespace and comments. The trouble is that the grammar symbols for whitespace and comments would have to be inserted between nearly every pair of tokens in the grammar, severely impairing the grammar’s readability and maintainability. Thus, for traditional scanner-based parsing, whitespace and comments are usually processed entirely in the scanner, and the scanner does not return tokens for them to the parser.

In Figure 2.3c, we introduced the YYLAYOUT token as our solution for whitespace. In general, our PSLR(1) generator identifies any token whose name starts with or is equal to “YYLAYOUT” as a layout token. The generated pseudo-scanner handles layout tokens in the same way a traditional scanner handles whitespace and comments. That is, the pseudo-scanner recognizes layout tokens in all parser states but discards them rather than returning them to the parser. Thus, layout tokens require that the definition of acc from Definition 3.3.2 be modified to add all layout tokens to acc(sp) for every parser state sp. Moreover, layout tokens require that the pseudo scan function from Definition 3.2.16 be modified so that, after recognizing a layout token, pseudo scan immediately restarts itself in order to scan for a subsequent token to return to the parser.
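A minimal sketch of this layout-token loop: any token whose name starts with “YYLAYOUT” is discarded and the scan restarts. Here scan_one stands in for a single pseudo scan step and returns a (lexeme, token) pair or None at end of input; the token names are illustrative.

```python
def next_token(scan_one):
    """Return the next non-layout token (or None at end of input),
    silently discarding layout tokens as a traditional scanner would
    discard whitespace and comments."""
    while True:
        m = scan_one()
        if m is None or not m[1].startswith('YYLAYOUT'):
            return m
        # layout token: discard and immediately rescan
```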

As for lexical ties, the modifications to acc required by layout tokens can produce new pseudo-scanner conflicts that must be resolved with %lex-prec. For example, within C’s literal string, layout tokens should not be recognized. Thus, if a user chooses to specify the syntax of C’s literal string using a grammar rather than a single regular expression, he must use the lexical precedence operators “<<”, “-<”, and “<-” to declare lower precedence for layout tokens than for the conflicting tokens within that grammar. If the user needs to specify different lexical precedence rules for whitespace and for each of the various kinds of comments, he can take advantage of the ability to specify multiple layout tokens. For example, as we discussed in section 3.2.3, the regular expression for C’s multiline comment benefits from the rule of shortest match, but the rule of longest match might be desirable for other layout tokens, such as the single-line comment.

PSLR(1)’s IELR(1) extension can be optimized for layout tokens by identifying any match for a layout token as an always contribution in a pairwise pseudo-scanner conflict. That is, because every layout token appears in acc(sp) for every parser state sp, no amount of state splitting can remove a layout token from a syntactic context. Thus, it is useless for IELR(1) phase 2 to create annotations for pairwise pseudo-scanner conflicts in which a layout token’s match has higher lexical precedence. In other words, the highest precedence match in this case is considered split-stable. We introduced the concepts of always contributions, useless annotations, and split-stability as part of our original IELR(1) algorithm [16].

For a non-layout token, PSLR(1) specifications can employ the token action directive, denoted %token-action, to declare a semantic action that the pseudo-scanner should perform upon matching the non-layout token. That token action can construct the semantic value to be returned along with the non-layout token to the parser. For example, in a code transformer, the semantic value might simply contain the lexeme matched for the non-layout token. Token actions are useless for constructing semantic values for layout tokens because layout tokens are never returned to the parser because they are not syntactically acceptable according to the grammar. However, some applications, such as code transformers, need to preserve the text of whitespace and comments.

Thus, we further extend pseudo scan so that it accumulates lexemes from all contiguous layout tokens and then makes the accumulated layout text accessible in the token action for the following non-layout token. The accumulated lexemes for the layout tokens appearing after the final token from the input are made accessible in the token action for the end token, #. Thus, for any non-layout token t, the user can write a token action that attaches the text for the preceding whitespace and comments as a property on the semantic value of t, which is returned to the parser. To avoid writing individual token actions for every non-layout token in the grammar, a PSLR(1) specification can employ a form of %token-action that specifies a single token action for all tokens declared to have semantic values of a common type. As future work, we are also considering providing a mechanism so that the accumulated value for layout tokens is not limited to lexemes. The user would be able to specify a token action to construct a semantic value of any type for a layout token, that value would then be made accessible in the token action for the next layout token, and so on until the token action for a non-layout token would be reached.

We foresee one significant limitation of layout tokens, and we outline a solution that we plan to implement as part of our future work. The problem is that the syntax of whitespace and comments may not always be regular, especially if comments are required to follow a strict documentation standard. Sometimes, even if the syntax is regular, it might simply be more convenient to specify the syntax with a context-free grammar. One possible solution is to allow the user to declare layout nonterminals that our PSLR(1) generator would insert implicitly between every pair of tokens throughout the grammar. However, because parser tables would be generated from this cluttered grammar, this approach would make parser conflicts difficult to debug.

Instead, we are planning to add a lexical directive, denoted %lex, to declare a nonterminal to be a lexical nonterminal. For each lexical nonterminal, our generator would construct a lexical parser that sits between the main parser and the pseudo-scanner. The lexical parser’s job would be to parse the grammar for the lexical nonterminal and to return the lexical nonterminal as a token to the main parser. For any production for which the lexical nonterminal is the LHS, the RHS would be required to start and end with a token. These tokens would become the lexical nonterminal’s start tokens and end tokens. So that start tokens would always indicate the start of lexical nonterminals, no start token would be allowed to appear anywhere in the main grammar. So that end tokens would always indicate the end of lexical nonterminals, a lexical nonterminal’s end tokens would not be allowed to appear anywhere else within that lexical nonterminal’s grammar. For any main parser state sp that accepts a lexical nonterminal, acc(sp) would be extended with that lexical nonterminal’s start tokens in the eyes of the pseudo-scanner. Whenever the pseudo-scanner would recognize a start token of a lexical nonterminal, the main parser would become inactive and the corresponding lexical parser would become active. When the pseudo-scanner would recognize an end token of the lexical nonterminal, the lexical parser would treat it as # and would return the lexical nonterminal as a token to the main parser, which would then become active. Any lexical nonterminal named like a layout token would be treated similar to the way layout tokens are currently treated. That is, the pseudo-scanner would recognize the lexical nonterminal’s start tokens in any main parser state, and the lexical parser would discard the lexical nonterminal without returning it to the main parser. Thus, whitespace and comments would no longer be constrained to regular syntax.

Lexical nonterminals and lexical parsers would be useful for more than just whitespace and comments. For example, as discussed in section 2.3.1, the syntax for a block of C code embedded in a Yacc parser specification is not regular because of nested braces. Thus, the C block syntax needs to be specified with a grammar. However, when the parser encounters a C block’s opening brace in a context where a C block is not syntactically acceptable, the parser then rejects the C block’s opening brace, reports a syntax error for just the opening brace, and initiates syntax error recovery starting inside the C block. Specifying the C block syntax with a lexical nonterminal would allow an open brace to be treated as a start token. That start token could then become a fallback token so that the C block’s lexical parser would have a chance to parse the C block in full before returning it to the main parser, which would then report it as a syntax error and initiate syntax error recovery after the C block.

We also plan to permit a hierarchy of lexical nonterminals and lexical parsers in case a token in a lexical nonterminal’s grammar needs to be expressed in context-free form as well. During parsing, a stack would have to be maintained to record the current nesting of lexical parsers. The main parser would always be at the bottom of the stack. One interesting area of research would be to figure out how to extend modern syntax error recovery mechanisms to handle a stack of lexical parsers. It would likely involve popping the stack of lexical parsers whenever the syntax error recovery mechanism pops the start token for the corresponding lexical nonterminal.

3.7 Scoped Declarations

In our discussion of extensible languages in section 2.3.4, we mentioned the modern movement to specify sub-languages in separate modules in order to facilitate their composition into custom languages for specific domains. However, the lexical declarations we have discussed so far and the declarations supported by traditional parser generators like Yacc erode modularity because the effects of these declarations are global in scope. For example, if one sub-language’s PSLR(1) specification declares traditional lexical precedence rules for conflicts between a particular pair of tokens, then the generator uses traditional lexical precedence rules to resolve conflicts between that pair of tokens in every sub-language. Any declaration of non-traditional lexical precedence rules for conflicts between those tokens appearing in another sub-language’s PSLR(1) specification would then create a contradiction, and our PSLR(1) generator would report an error. In this section, we discuss two solutions to this problem that we plan to implement as part of our future work.

One solution to the scoping problem is to ensure that symbols from different sub-language specifications always have different names even if those symbols are otherwise identical. This solution could be automated if the user were able to declare different namespace names for different sub-languages. For example, assume two different sub-language specifications use the same token ’&’, but the namespace for one sub-language is foo, and the namespace for the other is bar. While composing the sub-languages’ specifications into the composite language specification, our PSLR(1) generator could automatically rename ’&’ to foo.’&’ throughout the first sub-language’s specification and to bar.’&’ throughout the second’s. Thus, every declaration in the computed composite language specification would be effectively scoped to the sub-language to whose namespace the declaration’s operands belonged.
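The renaming step could look roughly like the following sketch, where a grammar is represented as a mapping from LHS nonterminals to lists of RHS symbol lists and tokens are written as quoted lexemes. This representation is invented for illustration; a real composition pass would also rename tokens inside lexical declarations.

```python
def qualify(grammar, namespace):
    """Prefix every token occurrence (here, any symbol written as a quoted
    lexeme) with the sub-language's namespace, so identical tokens from
    different sub-languages become distinct symbols in the composite
    grammar.  Nonterminals are left untouched in this sketch."""
    def q(sym):
        return f"{namespace}.{sym}" if sym.startswith("'") else sym
    return {lhs: [[q(s) for s in rhs] for rhs in prods]
            for lhs, prods in grammar.items()}
```

Applying qualify with namespaces foo and bar to two sub-language grammars that both use ’&’ yields the distinct composite symbols foo.’&’ and bar.’&’, so their lexical precedence declarations no longer contradict.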

While the namespace solution might be reasonable in some cases, there are a couple of potential problems. First, it would duplicate sub-language grammar symbols, even those for which there are no contradictions among the sub-languages. This duplication would unnecessarily expand the size of composite grammars and thus the parser tables, impairing efficiency and complicating the task of debugging. Second, in the case of subtle sub-languages, scoped declarations might be needed in places where the proper partitioning of the grammar symbols into namespaces is not obvious.

The template argument list example from C++0x, which we described in section 2.3.3, is an example of a subtle sub-language where a namespace partitioning is not obvious. Consider again the example C++0x sentence in Figure 2.6a. Recall that the sub-languages Lc and Lp require that the ’>>’ token have lexical precedence over the ’>’ token as expected according to the traditional rule of longest match. However, the sub-language Lt requires the reverse precedence so that nested template argument lists can be terminated without inserting unexpected whitespace into “>>”. Figures 2.6b and 2.6c show the contradictory %lex-prec declarations that are thus required for these sub-languages. To avoid the contradiction, each declaration must be scoped to its own sub-language. However, the grammars of Lt and Lp reuse major portions of the grammar of Lc in a mutually recursive manner, and the tokens ’>’ and ’>>’ appear throughout all of the sub-languages’ grammars. Thus, it is not obvious how the C++ grammar’s symbols should be partitioned into namespaces.

Another possible solution for the scoping problem requires the user to identify the start symbols for the sub-languages’ grammars. These start symbols are specific occurrences of nonterminals in the RHS’s of grammar productions, and each sub-language may have more than one start symbol. In the case of C++, identifying the start symbols is more straightforward than partitioning the grammar symbols into namespaces. For example, in the production shown in Figure 2.6d, the symbol template argument list opt is a start symbol for Lt. In the production shown in Figure 2.6e, expression is a start symbol for Lp.

To implement scoped declarations based on sub-language grammar start symbols, our PSLR(1) generator would recursively record the scope of each item in each parser state based on the nonterminal from which the item was derived. For some items in some parser states, multiple scopes would be recorded because left context is not always powerful enough to determine a single derivation in bottom-up parsers. A parser state would have a scope conflict if contradictory declarations applied because of the presence of multiple scopes. As for pseudo-scanner conflicts and parser conflicts, some scope conflicts could result from the merging of parser states. To eliminate such scope conflicts, we could extend the LALR(1) parser table construction in IELR(1) phase 0 so that it would not merge states whose items are from different scopes. However, merging states whose items are from different scopes would not always produce a scope conflict because there might not happen to be any contradictory declarations that apply, so this approach might result in parser tables that are larger than necessary. If that proves to be problematic in practice, we could instead extend IELR(1) to detect scope conflicts that result from the merging of parser states and then to eliminate these conflicts by splitting states. Because the canonical LR(1) algorithm does not track scopes and merges states that are identical in all other ways, either the LALR(1) extension or IELR(1) extension could split even canonical LR(1) states, potentially producing more powerful tables than canonical LR(1).

Notice how we combined the “-<” lexical precedence operator with scoping in our C++ example. While the declaration in Figure 2.6b is equivalent to the traditional rule of longest match, the declaration in Figure 2.6c has no traditional equivalent. Its purpose in our example is to prevent the ’>>’ token from being recognized within Lt except within Lp. In non-composite languages, it would be schizophrenic to include a token in a grammar and then add a declaration that prevents its recognition entirely. However, in this case, the grammar of Lt, where ’>>’ must not be recognized, reuses portions of the grammar of Lc, where ’>>’ must be recognized. We predict that one of the greatest powers of our non-traditional lexical precedence operators is to disable tokens in favor of other tokens with similar lexemes for the sake of sub-grammar reuse among different syntactic contexts.

Upon a syntax error, a pseudo-scanner could employ either grammar symbol namespaces or sub-language start symbols as a means to indicate the scope from which to select a match to return to the parser. That is, multiple fallback rows could be added to the scanner accepts table, one per scope. When the pseudo-scanner is unable to select a match for a token from the current parser state’s row, it normally looks in the fallback row that includes all tokens from the composite grammar. However, it could instead look for matches using the current scope’s fallback row. Thus, the selected match might be more appropriate for the current sub-language, and so the parser might produce a more intuitive syntax error message and initiate a more successful syntax error recovery.
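The per-scope fallback lookup described above can be sketched as follows. This is a minimal illustration, not PSLR(1) Bison's actual data structures: the row keys, scope names, token names, and table layout are all hypothetical.

```python
# Sketch of pseudo-scanner match selection with per-scope fallback rows.
# The accepts table maps a row key (a parser state's row or a fallback
# row) to the set of tokens that row accepts. All names are hypothetical.
GLOBAL_FALLBACK = "fallback:all"

accepts = {
    "state7":       {"IDENTIFIER"},            # tokens acceptable in state 7
    "fallback:all": {"IDENTIFIER", "RSHIFT", "GT"},
    "fallback:Lt":  {"IDENTIFIER", "GT"},      # template-argument-list scope
}

def select_match(matches, state_row, scope_row=None):
    """Pick the first candidate match whose token the row accepts,
    trying the state row, then the scope's fallback row (if any),
    then the global fallback row."""
    for row in (state_row, scope_row, GLOBAL_FALLBACK):
        if row is None:
            continue
        for lexeme, token in matches:
            if token in accepts[row]:
                return lexeme, token
    return None

# Candidate matches for the input ">>" inside a template argument list.
candidates = [(">>", "RSHIFT"), (">", "GT")]

# Without scope information, the global fallback row selects ">>".
print(select_match(candidates, "state7"))                 # (">>", "RSHIFT")
# With the Lt scope's fallback row, ">" is selected instead, which can
# yield a more intuitive error message for the template sub-language.
print(select_match(candidates, "state7", "fallback:Lt"))  # (">", "GT")
```

The design point the sketch illustrates is only the lookup order: a scope-specific fallback row is consulted before the composite grammar's row, so the reported unexpected token is phrased in terms of the current sub-language.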

3.8 Summary

In this chapter, we have explained the design and implementation of our PSLR(1) generator tool, which generates a pseudo-scanner and a minimal LR(1) parser from a unified scanner and parser specification called a PSLR(1) specification. Unlike a traditional scanner, a pseudo-scanner examines the current parser state to determine what tokens are acceptable in the current syntactic left context, thus automating the tracking of sub-language transitions in a composite language. While this behavior automatically eliminates many of the scanner conflicts among tokens from different sub-languages, PSLR(1) specifications can employ a lexical precedence directive with which the user can resolve the remaining scanner conflicts more carefully and declaratively than with a traditional scanner generator like Lex. PSLR(1) specifications can also employ a directive to declare that tokens, such as keywords and identifiers, are lexically tied and so should never be confused with one another regardless of where each is syntactically acceptable. We have explained how the generated pseudo-scanner selects a match to return to the parser when there are no syntactically acceptable tokens.
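The core pseudo-scanner behavior summarized above can be sketched in a few lines. The token set, patterns, and state contents below are hypothetical illustrations, not PSLR(1) Bison's generated tables.

```python
import re

# Minimal sketch of pseudo-scanner behavior: of all tokens whose regular
# expressions match at the current position, consider only those the
# current parser state accepts, then apply longest match.
TOKENS = [("RSHIFT", r">>"), ("GT", r">"), ("NUM", r"[0-9]+")]

def pseudo_scan(text, pos, acceptable):
    best = None
    for name, pattern in TOKENS:
        if name not in acceptable:    # the key difference from Lex:
            continue                  # syntactically unacceptable tokens
        m = re.match(pattern, text[pos:])  # never participate in matching
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# In a state that accepts ’>>’ (e.g. a shift-expression context),
# longest match selects RSHIFT; in a template-argument-list state
# that accepts only ’>’, the scanner conflict disappears entirely.
print(pseudo_scan("A>>", 1, {"RSHIFT", "GT"}))  # ('RSHIFT', '>>')
print(pseudo_scan("A>>", 1, {"GT"}))            # ('GT', '>')
```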

We have introduced LAC, an LR(1) parsing algorithm extension that eliminates delayed syntax error detection and incorrect accepted token lists in syntax error messages for a PSLR(1) parser or traditional LR(1) parser that employs parser state merging, default reductions, or error actions for resolving S/R conflicts. We have described an extension to our original minimal LR(1) algorithm, IELR(1), to eliminate other incorrect PSLR(1) behavior induced by parser state merging. Finally, we have described mechanisms for handling the specification of whitespace and comments and the scoping of declarations to specific sub-languages. In this way, our PSLR(1) generator permits the user to specify the lexical and syntactic analysis of composite languages in a more careful, declarative, and modular fashion without the need to specify the start conditions and other ad-hoc code required by traditional scanner generators.

Chapter 4

Studies and Evaluation

In order to evaluate the benefits of PSLR(1), we examine four case studies. In section 4.1, we describe our first case study, PSLR(1) Bison’s own internal parser. In section 4.2, we describe our second case study, an SQL parser written by John R. Levine for his textbook, flex & bison. In section 4.3, we describe our third case study, ISO C99, which we choose as a representative of the C family of languages. In section 4.4, we describe our final case study, the template argument list specification from Figure 2.3c. In section 4.5, we present the results we have collected from our case studies.

4.1 PSLR(1) Bison

We have implemented our PSLR(1) generator as an extension of Bison. In this dissertation, we refer to our extended Bison as PSLR(1) Bison. Internally, Bison employs a traditional scanner-based LR(1) parser for analyzing its input parser specifications. Thus, we initially implemented PSLR(1) Bison to employ traditional scanner-based LR(1) internally as well. In both cases, the internal scanner is generated by Flex, and the internal parser is generated by Bison itself.

The language of Bison parser specifications is a challenging composition of multiple sub-languages as we described for Yacc in section 2.3.1. Because PSLR(1) Bison accepts a unified scanner and parser specification, PSLR(1) Bison must recognize regular expressions as yet another sub-language. Thus, both Bison’s and PSLR(1) Bison’s internal parsers are themselves candidates for PSLR(1).

For our first case study, we have implemented PSLR(1) Bison’s internal parser a second time using PSLR(1) instead of traditional scanner-based LR(1). We then collected and compared readability and maintainability statistics for the two versions. We describe the results in section 4.5.

To ensure that our comparisons between the two versions of PSLR(1) Bison’s internal parser are meaningful, it is vital to verify that the versions exhibit equivalent behavior. The Bison distribution contains a robust test suite with 240 test groups as of the latest release, Bison 2.4.1. The test suite for Bison 2.5, which is still under development, currently has 289 test groups. As we developed PSLR(1) Bison with a traditional scanner-based LR(1) parser internally, we extended the latter test suite to a total of 381 test groups in order to verify the ability to parse a user’s PSLR(1) specification and to generate a correct PSLR(1) parser. After converting PSLR(1) Bison to use PSLR(1) to generate its own internal parser, we then re-ran this extended test suite in order to verify that equivalent behavior had been fully retained.

Based on the extended test suite, the PSLR(1) version of PSLR(1) Bison’s internal parser successfully emulates the traditional scanner-based LR(1) version in almost all cases. All differences in behavior are related to syntax error handling. First, the PSLR(1) version employs sub-grammars involving many tokens in order to recognize some elements of the lexical syntax that the traditional scanner-based LR(1) version recognizes with single tokens. Thus, some of the behavioral differences are simply the names of the unexpected or expected tokens that the parsers report in syntax error messages. Second, the traditional scanner-based LR(1) version depends entirely on ad-hoc C code to handle syntax errors at the lexical level. In the PSLR(1) version, all levels of the syntax benefit from the same automated syntax error handling mechanisms, which we described in section 3.5, so syntax error reporting and recovery are generally more uniform. Nevertheless, both versions of the parser detect the same initial syntax error for every input parser specification in the test suite. Moreover, none of the behavioral differences detected by the test suite are evidence of incorrect behavior for either version.

4.2 Levine SQL

In his text, flex & bison, John R. Levine develops Flex and Bison specifications for translating a subset of SQL (Structured Query Language) into RPN (Reverse Polish Notation) [24]. In this dissertation, we refer to that subset of SQL as Levine SQL. As Levine explains in his text, he has made his Flex and Bison specifications available online [25]. We downloaded our copy in January 2010.

The download actually contains three versions of those specifications. The most robust version that does not employ GLR (Generalized LR) is contained in the Flex specification file sql/lpmysql.l and Bison specification file sql/lpmysql.y.

For our second case study, we used sql/lpmysql.l and sql/lpmysql.y as a basis to develop a traditional scanner-based LR(1) parser for Levine SQL. We then converted those specifications to PSLR(1). As for our first case study, we collected and compared readability and maintainability statistics for the traditional scanner-based LR(1) version versus the PSLR(1) version, and we describe the results in section 4.5.

The only differences between Levine’s original specifications and the traditional scanner-based LR(1) parser that we employ in this case study are a few corrections that we found necessary to permit reasonable comparisons with our PSLR(1) version of the parser. First, in sql/lpmysql.l, the action for the scanner’s COUNT token invokes Flex’s unput function before accessing the Flex variable yytext. According to Flex’s documentation, unput by default corrupts the contents of yytext, so yytext must be copied before invoking unput. Indeed, we found that this bug sometimes corrupts the diagnostics printed by a semantic value destructor defined in sql/lpmysql.y. Second, in sql/lpmysql.l, the action for newline does not correctly update yylloc, which is the global variable in which the parser’s lookahead location is stored. As a result, the parser sometimes reports syntax errors at locations that do not exist in the input SQL. Third, both sql/lpmysql.l and sql/lpmysql.y contain a few superficial problems for the strict compiler settings we use. For example, there is an unused variable, a missing prototype, a missing standard include, and a use of the library function strdup, which is not defined by ISO C99.
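The second correction amounts to the location bookkeeping sketched below. The Location shape is a hypothetical stand-in for Bison's yylloc; the point is only the step the original newline action omitted: advancing the line counter and resetting the column.

```python
# Sketch of the location bookkeeping a scanner's newline action must
# perform so that reported error locations stay accurate.
class Location:
    def __init__(self):
        self.line, self.column = 1, 1

def advance(loc, lexeme):
    """Update loc past lexeme, resetting the column on each newline
    (the step the original sql/lpmysql.l newline action omitted)."""
    for ch in lexeme:
        if ch == "\n":
            loc.line += 1
            loc.column = 1
        else:
            loc.column += 1

loc = Location()
advance(loc, "SELECT *\nFROM t")
print(loc.line, loc.column)  # 2 7
```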

As for our first case study, it is vital to verify that the traditional scanner-based LR(1) version and the PSLR(1) version of the parser exhibit equivalent behavior in order to ensure that our comparisons between the two versions are meaningful. Unlike our first case study, there is no existing test suite for the traditional scanner-based LR(1) version. Instead, we downloaded a copy of the SQL Test Suite, Version 6.0, from the website of NIST (U.S. National Institute of Standards and Technology) [7]. To form our Levine SQL test suite, we extracted from that download the 379 files whose names match the pattern sql/*.sql, each of which contains a series of SQL statements and SQL comments. We then wrote a script that compares the output of the traditional scanner-based LR(1) version and the PSLR(1) version of the Levine SQL parser for each of these files. That is, we used the traditional scanner-based LR(1) version as a test oracle for the PSLR(1) version.

Because Levine SQL is only a subset of SQL, both versions of the parser report syntax errors for many of the SQL statements in our test suite. However, Bison’s error recovery mechanism enables the parsers to continue parsing the remaining statements in each file so that the output from the parsers can be compared for those statements as well. For our entire Levine SQL test suite, both versions of the Levine SQL parser successfully parse 3675 SQL statements, report 11671 syntax errors, and produce identical output, which includes RPN translations and syntax error messages.

4.3 ISO C99

As we discuss in section 4.5, one of the goals of our first two case studies is to explore the effect of PSLR(1)’s IELR(1) extension. In our third case study, we explore the effect of that extension for the syntax that is common among the C family of languages, which includes C, C++, and Java. We ignore the newest complexities in the C family, such as the evolving C++0x template sub-language issue described in section 2.3.3.

Without the newest complexities, the C family syntax is unique among our case studies in that it is designed relatively well for traditional scanner-based LR(1). Specifically, unlike the language of PSLR(1) Bison parser specifications, sub-languages from the point of view of the scanner are usually just comments and literals, which are regular sub-languages. Unlike Levine SQL, all keywords are reserved words, and so their lexemes are recognized as keywords in all syntactic contexts except in comments and literals, where most characters are recognized as generic text. These differences mean that sub-languages in the C family syntax are less likely to share sub-grammars and, if they do share sub-grammars, they are less likely to require different tokenizations of the same character sequences when recognizing those sub-grammars. As a result, sub-languages are less likely to have parser states with common cores such that, once those states are merged by LALR(1), the pseudo-scanner can no longer adequately distinguish the sub-languages. Thus, by examining the C family syntax, the goal of this case study is to explore whether PSLR(1)’s IELR(1) extension splits parser states when there is no need to do so.

Rather than examining all languages from the C family individually, we choose ISO C99 as the most straightforward representative. Thus, for our third case study, we converted the lexical grammar and the phrase structure grammar from Annex A of the ISO C99 standard into a PSLR(1) specification [10]. However, unlike our first two case studies, we have not attempted to develop this PSLR(1) specification into a fully functional application. We have developed it only far enough to demonstrate the effects of PSLR(1)’s IELR(1) extension. We assert that the resulting specification represents a realistic stage of development that any PSLR(1) user might encounter while developing an ISO C99 parser, and so it is important to show that unnecessary state splitting and the conflict duplication that can result are avoided in order to facilitate further development. In the remainder of this section, we support our assertion by detailing the steps we took to convert the ISO C99 grammars to PSLR(1).

Our first step in the conversion was to reproduce in PSLR(1) form a faithful copy of the ISO C99 lexical grammar and phrase structure grammar. The only elements of these grammars that we omitted were the keyword _Imaginary, which is never used in the phrase structure grammar, and the preprocessor, which is typically handled by a separate parser. Other than those omissions, we copied the phrase structure grammar as our PSLR(1) grammar, and we defined tokens for the keywords listed in section A.1.2 of the standard, for the punctuators listed in section A.1.7, and for the remaining symbols that the lexical grammar defines and that the phrase structure grammar uses: identifier, constant, string-literal, and enumeration-constant. We defined named regular expressions for all other symbols from the lexical grammar. Finally, we defined layout tokens for whitespace, multiline comments, and single-line comments based on section 6.4 of the standard. For the resulting specification, PSLR(1) Bison reported many lexical tie candidates and pseudo-scanner conflicts. Because the way in which such errors are resolved can have a significant influence on the behavior of PSLR(1)’s IELR(1) extension, the purpose of the remaining conversion steps was to find a reasonable way to resolve those errors.

Our second step in the conversion to PSLR(1) was to simplify the task of resolving the lexical tie candidates and pseudo-scanner conflicts by eliminating the token enumeration-constant. The ISO standard’s lexical grammar defines the syntax of enumeration-constant as exactly the syntax of the token identifier, and the occurrences of enumeration-constant in the phrase structure grammar are in syntactic contexts where identifier can never appear. Thus, we were able to simplify our PSLR(1) specification without affecting the C99 syntax by merely substituting these occurrences of enumeration-constant with identifier. Also, the syntax of the token constant explicitly includes the syntax of enumeration-constant and thus of identifier, but, according to the phrase structure grammar, identifier can be recognized in every syntactic context in which constant can be recognized. This creates a syntactic ambiguity that must be resolved via semantics. Thus, without affecting the C99 syntax, we removed enumeration-constant from the definition of constant, and we assumed that semantic actions would later be added to perform a symbol table lookup to recognize when an identifier should be treated semantically as a constant in these syntactic contexts. However, actually implementing semantics is beyond the scope of this case study.

Together, the above modifications left enumeration-constant completely unused, so we removed its definition from the specification. Interestingly, enumeration-constant is the only symbol that occurs in a production RHS in the ISO standard’s phrase structure grammar, that never occurs as a production LHS in the phrase structure grammar, that is defined by the lexical grammar, but that is not listed as a token in section A.1. That is, the ISO standard uses enumeration-constant not as a real token but as an alias for identifier in certain syntactic contexts. As described above, our second step in the conversion to PSLR(1) simply eliminated this aliasing. By doing so, it also eliminated the need to resolve many redundant errors. That is, before the second step, PSLR(1) Bison reported many of the same lexical tie candidates and pseudo-scanner conflicts for enumeration-constant as for identifier. After the second step, it only reported them for identifier.

Our final step in converting the ISO C99 grammars to PSLR(1) was to resolve all remaining pseudo-scanner conflicts and lexical tie candidates by adding a set of lexical declarations. For conciseness, we declared symbol sets for keywords and punctuators. We lexically tied all pairs of conflicting punctuators. The punctuator ’.’ conflicts with constant, and identifier conflicts with constant, string-literal, and all keywords, so we lexically tied all of those tokens. The punctuator ’/’ conflicts with the layout tokens for multiline and single-line comments, so we lexically tied them as well. Using the rule of longest match, we resolved all length conflicts among all tokens mentioned in this paragraph. Finally, we resolved the identity conflict between identifier and every keyword by giving the keyword higher precedence.
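The combined effect of these declarations can be sketched as a two-stage tie-break: longest match resolves length conflicts, and identity precedence resolves ties among equal-length matches. The token names and precedence values below are hypothetical illustrations of the scheme, not the specification's actual declarations.

```python
# Sketch of conflict resolution by longest match, then identity
# precedence (keyword over identifier). Values are hypothetical.
IDENTITY_PREC = {"INT_KW": 2, "IDENTIFIER": 1}

def resolve(matches):
    """matches: list of (lexeme, token) candidates at one position."""
    longest = max(len(lexeme) for lexeme, _ in matches)
    finalists = [(l, t) for l, t in matches if len(l) == longest]
    return max(finalists, key=lambda m: IDENTITY_PREC[m[1]])

# "int" matches both the keyword and the identifier pattern; the
# keyword's higher identity precedence breaks the identity conflict.
print(resolve([("int", "INT_KW"), ("int", "IDENTIFIER")]))   # ('int', 'INT_KW')
# "ints" is matched only (and longer) by identifier, so longest match
# decides before identity precedence ever applies.
print(resolve([("int", "INT_KW"), ("ints", "IDENTIFIER")]))  # ('ints', 'IDENTIFIER')
```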

4.4 Template Argument Lists

Our final case study is the PSLR(1) specification from Figure 2.3c. We refer to the language defined by this specification as the Template Argument Lists language. The purpose of this case study is to evaluate the success of PSLR(1)’s IELR(1) extension in overcoming the PSLR(1)-relative inadequacy that we described in section 3.4.

4.5 Results

In this section, we present the results of our case studies. In the tables in this section, we identify each case study by the language it examines: PSLR(1) Bison, Levine SQL, ISO C99, or Template Argument Lists. As explained in the previous sections, all of our case studies involve PSLR(1) parsers, but only our first two case studies compare their PSLR(1) parsers with equivalent traditional scanner-based LR(1) parsers. For conciseness, we frequently abbreviate the phrase “traditional scanner-based LR(1)” as merely “traditional”. In section 4.5.1, we examine the grammars and parser tables for our case studies, and we discuss the parser table changes made by PSLR(1)’s IELR(1) extension. In section 4.5.2, we assess the readability and maintainability of the traditional specifications versus the PSLR(1) specifications.

4.5.1 Grammars and Parser Tables

Table 4.1 measures the sizes of the traditional and PSLR(1) grammars for each of our case studies. The PSLR(1) grammar for PSLR(1) Bison’s internal parser is larger than its traditional grammar because many of the tokens from the traditional scanner specification do not have a regular syntax, and so we converted those tokens to nonterminals and defined them in context-free form in the PSLR(1) grammar. The grammars for Levine SQL are the largest among our case studies, but the lexical syntax is so simple that no new nonterminals are required for PSLR(1). Interestingly, the PSLR(1) grammar for Levine SQL has three fewer productions than the traditional grammar. The reason is that those three productions depend on tokens that the traditional parser specification defines but that the traditional scanner specification does not define. While Flex and Bison do not communicate in order to detect such inconsistencies between traditional scanner and parser specifications, PSLR(1) Bison reports errors for tokens that appear in the grammar but that have not been assigned a regular expression. Thus, we were forced to remove the useless productions from the PSLR(1) grammar. PSLR(1)’s ability to enforce such strictness is one advantage of its unification of the scanner and parser specification formalisms.

In Table 4.2, we count the parser states and parser conflicts for the traditional and PSLR(1) specifications for each of our case studies. As expected based on the grammar sizes, converting the traditional specification to the PSLR(1) specification without extending IELR(1) for PSLR(1) increases the size of the parser tables for PSLR(1) Bison’s internal parser, but it slightly decreases

Language           Specification   |T|   |V|   |T ∪ V|   |P|
PSLR(1) Bison      Traditional      71    35       106   122
                   PSLR(1)         117   115       232   294
Levine SQL         Traditional     252    73       325   303
                   PSLR(1)         277    73       350   300
ISO C99            PSLR(1)          90    86       176   235
Tmplt. Arg. Lists  PSLR(1)           6     5        11     7

Table 4.1: Grammar Sizes. These counts measure the size of each case study’s grammar G = (V,T,P,S), such that V is the set of nonterminals, T is the set of terminals or tokens, P is the set of productions, and S is the start symbol. These counts include the productions and nonterminals that Bison generates implicitly for mid-rule actions.

Traditional Specification with IELR(1)

Language         States   S/R     R/R
PSLR(1) Bison       187   0-0     0-0
  no prec/assoc     187   0-0     0-0
Levine SQL          626   0-0     0-0
  no prec/assoc     626   480-0   0-0

PSLR(1) Specification

                         States              S/R                        R/R
Language           IE    PS    Canon   IE     PS       Canon        IE    PS    Canon
PSLR(1) Bison      383   383   1356    0-0    0-0      0-0          0-0   0-0   0-0
  no prec/assoc    383   383   1356    2-0    2-0      2-0          0-0   0-0   0-0
Levine SQL         610   663   8009    0-0    0-0      0-0          0-0   0-0   0-0
  no prec/assoc    610   663   8009    480-0  960-480  21600-21120  0-0   0-0   0-0
ISO C99            391   391   1773    10-0   10-0     17-7         10-0  10-0  12-2
Tmplt. Arg. Lists   18    19     22    0-0    0-0      0-0          0-0   0-0   0-0

Table 4.2: Parser Tables. In the first table, we describe the parser tables that IELR(1) generates for the traditional specifications for our first two case studies. In the second table, we describe the parser tables that IELR(1), IELR(1) with its extension for PSLR(1), and canonical LR(1) generate for the PSLR(1) specifications for all our case studies. We report the number of states and the number of parser conflicts left unresolved by the user. We also show adjustments to account for such unresolved parser conflicts that are perfectly duplicated among states with identical cores.

Language           Reveals   States   Tokens
PSLR(1) Bison            0        0        0
Levine SQL              25       25        1
ISO C99                  0        0        0
Tmplt. Arg. Lists        1        1        1

Table 4.3: Lexical Reveals. For each of our case studies, this table reports the number of lexical reveals caused by extending IELR(1) for PSLR(1), the number of parser states containing lexical reveals, and the number of unique tokens for which there are lexical reveals.

the size of the parser tables for Levine SQL. Because some of our case studies include precedence and associativity declarations that resolve their parser conflicts, we also describe their parser tables when generated without these declarations in order to better represent the complexity of the specification.

For example, in the table for traditional specifications, the “no prec/assoc” row beneath the “Levine SQL” row reveals that 480 S/R conflicts are resolved by the user.

For PSLR(1) specifications, Table 4.2 also compares the parser tables generated by IELR(1), by IELR(1) with its extension for PSLR(1), and by canonical LR(1). Only Levine SQL and Template Argument Lists experience state splitting when adding PSLR(1)’s IELR(1) extension. For Template Argument Lists, the IELR(1) tables are identical to the LALR(1) parser tables, which are shown in the second column of Table 2.1, and canonical LR(1) requires 4 more states, as shown in the first column of Table 2.1. However, PSLR(1)’s IELR(1) extension requires only 1 additional state, which is equivalent to canonical LR(1) state 18. For Levine SQL, when PSLR(1)’s IELR(1) extension splits states and duplicates conflicts as a result, the number of states increases by a factor of 1.09, and the number of S/R conflicts increases by a factor of 2. However, when canonical LR(1) is used instead of the extended IELR(1), the number of states increases by a factor of 13, and the number of S/R conflicts increases by a factor of 45. Thus, our previous conclusion comparing IELR(1) and canonical LR(1) continues to hold when IELR(1) is extended for PSLR(1): canonical LR(1) can severely worsen the developer’s burden of investigating parser conflicts in order to resolve them [16].
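The growth factors quoted above follow directly from the Levine SQL counts in Table 4.2, as this small arithmetic check shows:

```python
# Levine SQL counts from Table 4.2: states and unresolved S/R conflicts
# (without prec/assoc declarations) for IELR(1), extended IELR(1), and
# canonical LR(1).
states_ielr, states_ext, states_canon = 610, 663, 8009
sr_ielr, sr_ext, sr_canon = 480, 960, 21600

print(round(states_ext / states_ielr, 2))  # 1.09 (extended IELR(1) states)
print(sr_ext // sr_ielr)                   # 2    (extended IELR(1) S/R)
print(round(states_canon / states_ielr))   # 13   (canonical LR(1) states)
print(sr_canon // sr_ielr)                 # 45   (canonical LR(1) S/R)
```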

Table 4.3 describes the benefit of PSLR(1)’s IELR(1) extension for each of our case studies. As discussed above, PSLR(1)’s IELR(1) extension does not split any states for PSLR(1) Bison’s internal parser or for ISO C99, so these case studies show no benefit. The lack of benefit for ISO C99 comes as no surprise given the reasons we discussed in section 4.3. The reasons that PSLR(1) Bison’s internal parser does not benefit are less obvious but are similar. Stated simply, in the PSLR(1) specification language, among sub-languages that share sub-grammars, character sequences tend to be tokenized in the same manner when recognizing those sub-grammars, so there is little chance that LALR(1) state merging can create PSLR(1)-relative inadequacies. While neither the PSLR(1) Bison nor the ISO C99 case study proves that, in general, PSLR(1)’s IELR(1) extension avoids state splitting for cases when state splitting would offer no benefit, they do provide evidence of its ability to recognize such cases.

As expected based on the state splitting shown in Table 4.2, Levine SQL and Template Argument Lists do benefit from PSLR(1)’s IELR(1) extension. In order to measure the benefit, we define the concept of a lexical reveal, and in Table 4.3 we count the lexical reveals. For example, for Template Argument Lists, we can see the sole lexical reveal by examining the parser tables in Table 2.1. For LALR(1) state 1, assuming the rule of longest match, the match (“>”,’>’) is never recognized for the input character sequence “>>” because the match (“>>”,’>>’) has higher precedence. However, PSLR(1)’s IELR(1) extension splits LALR(1) state 1 into canonical LR(1) states 1 and 18, so, from the perspective of the viable prefixes of canonical LR(1) state 18, the match (“>”,’>’) is revealed because the token ’>>’ is no longer accepted. Because a token can match an infinite number of lexemes, some parser tables can experience an infinite number of revealed matches, so we define the number of lexical reveals as the number of tokens for which at least one match is revealed instead of as the number of revealed matches.

If matches are revealed for the same token in more than one parser state after state splitting, we count the lexical reveals once per state because each state represents a different set of viable prefixes. For example, for Levine SQL, PSLR(1)’s IELR(1) extension reveals matches for only one token. However, those matches are revealed for that token in 25 different parser states after state splitting, so we count 25 lexical reveals. To explain the source of these lexical reveals, we start by examining the traditional scanner and parser. In the grammar, the tokens BETWEEN and AND each appear in only one production, and it is the same production:

expr: expr BETWEEN expr AND expr %prec BETWEEN

The token AND matches the lexeme “AND”, but the sub-grammar for the nonterminal expr contains the token ANDOP, which can also match the lexeme “AND”. While the parser is recognizing the sub-grammar of expr, the scanner must have some way to decide whether to recognize “AND” as AND or as ANDOP. The traditional scanner’s solution is to enter a new start condition after the token BETWEEN and to leave that start condition upon reaching AND. While in that start condition, it recognizes “AND” as AND. Outside of that start condition, it recognizes “AND” as ANDOP. Thus, from the point of view of the scanner, BETWEEN and AND delimit a sub-language. PSLR(1)’s equivalent solution is (1) to depend on the basic behavior of the pseudo-scanner to avoid recognizing AND anywhere except in the BETWEEN sub-language because that is the only place AND is syntactically acceptable and (2) to declare AND with higher lexical identity precedence than ANDOP. However, because the sub-grammar for expr is shared by the main sub-language and the BETWEEN sub-language, some parser states with common cores are shared as well. When those parser states and thus their lookahead sets are merged, the two sub-languages become indistinguishable, and so the match (“AND”,ANDOP) becomes hidden by (“AND”,AND) in all uses of expr, including 25 different syntactic contexts within the main sub-language. PSLR(1)’s IELR(1) extension splits states so that the pseudo-scanner can distinguish between the sub-languages and reveal the match for ANDOP in those 25 syntactic contexts.
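The hiding and revealing of the ANDOP match can be sketched as follows. The per-state accepted-token sets are hypothetical stand-ins for the merged and split parser states described above; only the tie-breaking scheme (identity precedence AND over ANDOP) is taken from the text.

```python
# Sketch of the AND/ANDOP hiding described above. Each parser state's
# accepted-token set determines which matches the pseudo-scanner may
# select; AND is declared with higher identity precedence than ANDOP.
PREC = {"AND": 2, "ANDOP": 1}

def scan_word(word, acceptable):
    """Return the token selected for word given the state's accepted
    tokens, breaking identity conflicts by lexical identity precedence."""
    matches = [t for t in ("AND", "ANDOP")
               if t in acceptable and word == "AND"]
    return max(matches, key=PREC.get) if matches else None

# In the merged LALR(1) state, both sub-languages' lookaheads coexist,
# so both tokens are acceptable and AND's precedence hides ANDOP.
print(scan_word("AND", {"AND", "ANDOP"}))  # AND
# After IELR(1) splitting, the main sub-language's state accepts only
# ANDOP, revealing the match ("AND", ANDOP).
print(scan_word("AND", {"ANDOP"}))         # ANDOP
```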

4.5.2 Readability and Maintainability

Tables 4.4, 4.5, and 4.6 present our readability and maintainability statistics for the traditional versus PSLR(1) specifications of our first two case studies. The first column in the tables of Table 4.4 identifies the specification examined by each row. Each of the remaining columns presents a different assessment of the size of the specifications. For example, the second column simply counts the number of lines of code, but the third column counts characters instead. We explain the meaning of each size assessment in detail later in this section. For every size assessment, the first two rows of each table give the sizes for the traditional parser and scanner specifications. Except in the last column, the traditional scanner specification sizes for PSLR(1) Bison’s internal parser include two header files written in C that we were able to discard when converting to PSLR(1).

No files generated by Flex, Bison, or PSLR(1) Bison are included in any of the sizes for any of the specifications because those files are not maintained directly by the specification developer.

The third row gives the size for the unified PSLR(1) specification. The goal of PSLR(1) is to simplify the specification of the scanner, so we wish to compare the traditional scanner specification with the scanner portion of the unified PSLR(1) specification. However, when converting to PSLR(1), the merging of scanner and parser specifications does not leave exact boundaries between these specifications. Fortunately for our analysis, for both case studies, there were only a few changes to the code from the traditional parser specifications when converting to PSLR(1). Thus, the fourth row estimates the size of the PSLR(1) scanner specification by subtracting the size of the traditional parser specification from the size of the unified PSLR(1) specification. In each of the first four rows, we also show the C content of each specification. Thus, the fourth row estimates the C content in the PSLR(1) scanner specification by subtracting the size of the C code in the traditional parser specification from the size of the C code in the unified PSLR(1) specification and then dividing by the estimate of the total size of the PSLR(1) scanner specification. The fifth row presents the ratio of the estimated PSLR(1) scanner specification size to the traditional scanner specification size, and

PSLR(1) Bison

Specification              Lines          Chars           Norm. Chars     Norm. Chars (No External C)
Traditional Parser         836, 45% C     23276, 61% C    16190, 61% C    13514, 54% C
Traditional Scanner        1382, 65% C    39483, 71% C    21234, 83% C    14682, 76% C
PSLR(1) Scanner & Parser   1503, 44% C    46610, 56% C    32424, 55% C    26143, 44% C
PSLR(1) Scanner            667, 43% C     23334, 50% C    16234, 48% C    12629, 33% C
Scanner Ratio              48%            59%             76%             86%
Scanner C Ratio            31%            41%             44%             37%

Levine SQL

Specification              Lines          Chars           Norm. Chars     Norm. Chars (No External C)
Traditional Parser         1005, 13% C    27812, 39% C    24060, 41% C    22187, 36% C
Traditional Scanner        368, 7.6% C    10302, 58% C    9001, 60% C     8672, 59% C
PSLR(1) Scanner & Parser   1259, 13% C    36347, 33% C    28599, 39% C    26512, 34% C
PSLR(1) Scanner            254, 15% C     8535, 16% C     4539, 27% C     4325, 23% C
Scanner Ratio              69%            83%             50%             50%
Scanner C Ratio            140%           22%             23%             20%

Table 4.4: Specification Sizes. For our first two case studies, these tables show specification sizes and C content. To estimate the PSLR(1) scanner, we subtract sizes for the traditional parser from the unified PSLR(1) parser and scanner. The ratios in the last two rows show the PSLR(1) scanner divided by the traditional scanner. To normalize character counts, we removed all comments, and then we reduced contiguous whitespace to a single space. External C is C code that is not embedded in action declarations and that could have been moved to an external C library or include.

the final row presents the same ratio for the size of the C code.

We include line counts in our results because line counts are the most common first assessment of size that software engineers tend to make. However, this assessment can be very unrealistic.

The first problem we address is that line counts are too dependent on the number of characters that the developer of each specification tended to write per line. The line counts for the C code extracted from these specifications are impacted especially severely by this problem because passages of C code are often contained between braces within a single line. When those passages are concatenated together to compute the full size of the C code, no newline is seen until a passage of C code that contains multiple lines is encountered, so lines of C code tend to be very long. Of course, we could have appended a newline after the last passage of C code from each line, but this would have produced the opposite problem of extremely short lines of C code. A better assessment of code size that avoids this problem is simply a count of characters as presented in the third column of each table in Table 4.4. In some cases, this makes a dramatic difference in the numbers. For example, the scanner C ratio for Levine SQL falls from the unrealistic 140% to the realistic 22%.

A simple character count is still not the most realistic assessment of code size. Like line counts, it is too dependent on code formatting and on the amount of comments the developer of each specification tended to write. Thus, for our next assessment of code size, we normalized our character counts in two steps. First, we removed all Flex, Bison, PSLR(1) Bison, and C comments from the specifications. Second, a single whitespace character can be syntactically significant in the specifications, but there is no syntactic difference between multiple contiguous whitespace characters and a single whitespace character except in literals. Thus, we reduced all passages of contiguous whitespace not appearing in literals to a single whitespace character. The fourth column of each table in Table 4.4 presents the resulting sizes.

For our final assessment of specification sizes, we consider that a developer’s choice of when to place C code in a specification file and when to place C code in an external include or library is sometimes arbitrary. For example, the specifications for PSLR(1) Bison’s internal parser are a small portion of the total volume of code in PSLR(1) Bison, and much of the additional code is used as a library by those specifications. However, some of the C code placed within those specifications could have just as easily been placed in an external include or library so that it would not dilute our measurements. We use the term external C to refer to all C code that is placed or could have been placed in an external file and that thus does not appear in any kind of scanner or parser action. For the last column of each table in Table 4.4, we ignore all external C. In this way, the last column focuses on the C code that is inherently mixed within the scanner and parser formalisms and that thus can complicate the task of comprehending the language specified by those formalisms.

For producing a realistic assessment of specification sizes, there are merits to counting normalized characters both with and without external C. Keeping external C has the advantage of detecting changes in the amount of external C when converting to PSLR(1), but external C that does not change dilutes the ratios. Fortunately, the difference between these assessments is not large for our case studies. For PSLR(1) Bison's internal parser, converting from traditional to PSLR(1) reduces the total size of the scanner specification by nearly one fourth of its size when external C code is included in the assessment. For Levine SQL, the total size of the scanner specification is reduced by half under either assessment. More interestingly, for both case studies, we have clearly achieved our goal of reducing the need to write ad-hoc code in a general-purpose programming language like C by converting that code to formalisms like grammars, regular expressions, and precedence declarations, all of which are tailored for specifying scanners, parsers, or, more generally, languages.

For PSLR(1) Bison, the C code in actions embedded within such language formalisms was reduced to 37% of its original size, and for Levine SQL it was reduced to 20% of its original size.

So far, we have assessed the maintainability and readability of traditional versus PSLR(1) specifications by comparing only total code size and amount of C code. However, we consider that a reduction in C code to 20% of the original size is pointless if that C code’s complexity increases dramatically. Thus, in Table 4.5, we assess the complexity of the C code in the actions embedded in the specifications. We do not consider external C code because only its interface need be understood when attempting to understand the language or semantics being specified.

To assess the complexity of C code in the embedded actions, Table 4.5 counts the non-linear control structures, variables, and conditions that code contains. By non-linear, we mean we don’t count control structures appearing in idioms with linear control flow, such as:

    #define FUNCTION \
      do { \
        /* function body */ \
      } while (0)

The first column in each table in Table 4.5 identifies the specification examined by each row. The second column shows that no specification contains a goto statement. For the traditional scanner specifications, the third column counts start conditions, which we assert are analogous to goto labels while invocations of Flex’s BEGIN macro, which specifies start condition transitions, are analogous to gotos. The fourth and fifth columns count global variables and variables local to the scanner function, yylex. In both cases, we count variables defined by the user, but we do not count variables whose definitions are generated automatically. In the sixth column, we count for, while, and do-while loops. In the seventh column, we count each if, else if, and else. However, we also count the ternary operator as an if and an else. All switch statements in the specifications happen to have a simple structure that could easily have been rewritten as a single chain of if, else if, and else. Thus, we count any group of case labels associated with a single block of code as either an if or else if. We count the default label as an else.

In the final column of each table in Table 4.5, we count the number of conditions appearing in control structures in embedded C code. For example, the following if contains two conditions, A and B:

PSLR(1) Bison

Specification              Gotos   Start conds.   Global User Vars.   yylex User Vars.   Loops   if, else if, else   Conds.
Traditional Parser         0       –              6                   –                  8       8                   17
Traditional Scanner        0       21             10                  8                  0       48                  40
PSLR(1) Scanner & Parser   0       –              9                   –                  8       23                  32
PSLR(1) Scanner            0       –              3                   –                  0       15                  15
Scanner Ratio              nan     0%             30%                 0%                 nan     31%                 38%

Levine SQL

Specification              Gotos   Start conds.   Global User Vars.   yylex User Vars.   Loops   if, else if, else   Conds.
Traditional Parser         0       –              0                   –                  0       21                  20
Traditional Scanner        0       3              2                   0                  0       1                   1
PSLR(1) Scanner & Parser   0       –              0                   –                  0       21                  20
PSLR(1) Scanner            0       –              0                   –                  0       0                   0
Scanner Ratio              nan     0%             0%                  nan                nan     0%                  0%

Table 4.5: Complexity Counts. These tables count non-linear control structures and variables not appearing in external C (as defined for Table 4.4). Similar to McCabe's cyclomatic complexity, the last column counts conditions appearing in those control structures.

PSLR(1) Bison

Specification              Gotos   Start conds.   Global User Vars.   yylex User Vars.   Loops   if, else if, else   Conds.
Traditional Parser         0       –              0.444               –                  0.592   0.592               1.26
Traditional Scanner        0       1.43           0.681               0.545              0       3.27                2.72
PSLR(1) Scanner & Parser   0       –              0.344               –                  0.306   0.880               1.22
PSLR(1) Scanner            0       –              0.238               –                  0       1.19                1.19
Scanner Ratio              nan     0%             35%                 0%                 nan     36%                 44%

Levine SQL

Specification              Gotos   Start conds.   Global User Vars.   yylex User Vars.   Loops   if, else if, else   Conds.
Traditional Parser         0       –              0                   –                  0       0.947               0.901
Traditional Scanner        0       0.346          0.231               0                  0       0.115               0.115
PSLR(1) Scanner & Parser   0       –              0                   –                  0       0.792               0.754
PSLR(1) Scanner            0       –              0                   –                  0       0                   0
Scanner Ratio              nan     0%             0%                  nan                nan     0%                  0%

Table 4.6: Complexity Frequencies. These tables give the ratio of the control structure, variable, and condition counts from Table 4.5 to the character counts from the last column of Table 4.4, expressed as occurrences per 1000 normalized characters.

    if (A && B) { /* code */ }

Counting conditions in this manner is a first step in computing McCabe's widely used measure of cyclomatic complexity [26]. The only other step is to add 1 to the condition count. However, McCabe's measure is only appropriate for assessing control flow throughout a single cohesive passage of code. Instead, we are measuring the cumulative complexity of separate passages of C code that we have extracted from various locations in a larger specification. That is, we don't wish to include language formalisms like grammars, regular expressions, and precedence declarations, which define control flow among those passages of C code, because our assumption is that such language formalisms are inherently less complex to comprehend in the specification of a language than passages of a general-purpose programming language like C. Thus, we simply count conditions from passages of C embedded throughout each specification, and we observe that those conditions are the embedded C's contributions to the cyclomatic complexity of the full specification.

Table 4.6 repeats the counts from Table 4.5, but it divides them by the specification sizes we presented in the last column of Table 4.4. That is, rather than examining the total number of control structures and conditions as in Table 4.5, Table 4.6 examines the frequency with which they appear in the portion of the specifications that is not external C. In both Tables 4.5 and 4.6, the final row of each table gives the ratio of each count or frequency for the PSLR(1) scanner specification to that of the traditional scanner specification. Regardless of whether we examine counts or frequencies, the PSLR(1) scanner specification for PSLR(1) Bison's internal parser exhibits a significant drop relative to the traditional scanner specification for every assessment of embedded C complexity that we include in our tables. For Levine SQL, the traditional scanner specification's counts and frequencies are relatively small already, and the PSLR(1) scanner specification's counts and frequencies are all zero. Thus, instead of finding evidence that the complexity of the embedded C was forced to increase in order to permit the significant reduction in the size of the embedded C when switching from traditional scanner-based LR(1) to PSLR(1), we found evidence that the complexity of the embedded C actually reduced significantly as well.

Chapter 5

Related Work

Although terminology and implementations vary, several other researchers have independently discovered the basic premise of what we call a pseudo-scanner. The earliest description we have found is a 1991 publication by Nawrocki, which we discuss in section 5.1. In section 5.2, we discuss an unpublished paper from Keynes, which is the only paper we have reviewed that acknowledges that LR(1) state merging can induce incorrect pseudo-scanner behavior. The most recent publication is from 2007 from Van Wyk and Schwerdfeger, and we discuss it in section 5.3. In section 5.4, we discuss scannerless GLR, a popular nondeterministic alternative to PSLR(1) that abandons the scanner altogether in order to parse composite languages.

5.1 Nawrocki

The earliest mention of the premise of the pseudo-scanner we have found is a 1991 publication by Nawrocki [27]. Nawrocki categorizes scanner conflicts and explains how the current parser state can sometimes be examined to resolve them automatically. However, he does not address the resulting problems with tokens that should be lexically tied as we discussed in section 3.3. He assumes an LALR(1) parser, but he does not mention that LR(1) state merging can induce incorrect pseudo-scanner behavior as we explained in section 3.4. He demonstrates his techniques with examples from Modula-2 and Ada, but he does not discuss the usefulness of his techniques for composite languages in general.

Nawrocki defines two types of scanner conflicts, which are similar to the two types of pairwise scanner conflicts we defined in Definition 2.2.5. His definition for an identity conflict, which he calls an I-conflict, is equivalent to ours. His definition for an LM-conflict (longest match conflict) is similar to our definition for a length conflict. We now state Nawrocki's definition for an LM-conflict formally in terms of our scanner conflict model.

Definition 5.1.1 (LM-conflict)

As an extension of Definition 2.2.5, if the pair (λ, t) and (λ′, t′) is an LM-conflict for ξ over T, then |λ| ≠ |λ′|. Thus, an LM-conflict is a type of length conflict. Without loss of generality, we assume that |λ| > |λ′|. Nawrocki denotes such an LM-conflict as t ≻ t′. Unlike a length conflict, an LM-conflict requires that ∃(λ″, t″) ∈ (Ξ⁺, T) : (t″ ≅ λ″) ∧ (λ′ · λ″[1] ⪯ λ). In other words, a length conflict is an LM-conflict iff the shorter lexeme is followed by one lexically acceptable character.

We assert that a length conflict for which the pair (λ″, t″) from Definition 5.1.1 does not exist should not be handled differently than any other length conflict, but Nawrocki excludes this case from his definitions without justification and does not explain how it should be handled.

Nawrocki's first technique for resolving scanner conflicts is LC (left context). LC resolves an LM-conflict in a manner similar to the way the basic behavior of a pseudo-scanner resolves a length conflict. That is, given the current parser state sp and the remaining input character sequence ξ, sp indicates the syntactic left context of ξ. If only one of the tokens t and t′ is in acc(sp), then LC resolves t ≻ t′ for sp by rejecting the unacceptable token. If both t and t′ are in acc(sp), then LC fails to resolve t ≻ t′ for sp.

Nawrocki's second technique for resolving scanner conflicts is ELC (extended left context).

For the LM-conflict t ≻ t′ with conflicting matches (λ, t) and (λ′, t′), ELC is like LC in that it rejects whichever of t or t′ is not acceptable according to the syntactic left context. However, ELC is more powerful than LC because ELC also rejects t′ if it is not acceptable according to one character of syntactic right context. That is, if there does not exist any token that meets the conditions for t″ given in Definition 5.1.1 and that is also syntactically acceptable after t′, then ELC rejects t′. Nawrocki does not explain how to handle the case where ELC cannot resolve t ≻ t′ because LC fails while such a t″ does exist. In contrast, our PSLR(1) generator usually employs the rule of longest match and thus rejects t′ when LC fails regardless of the existence of such a t″. That is, our generator's effective resolution of t ≻ t′ disagrees with ELC's resolution only in two cases: (1) where ELC's resolution is indeterminate and thus useless, and (2) where the user has specified non-traditional lexical precedence rather than longest match.

Nawrocki does not explain how to resolve identity conflicts. We assume he intends to employ LC as our pseudo-scanner does. Like the rule of longest match, ELC is not defined for identity conflicts because both matches in an identity conflict are of the same length.

In this section, we have explained a generalized version of Nawrocki's scanner conflict resolution techniques in order to facilitate the comparison with our PSLR(1) generator. Given the token t and the set of all parser states Σp, Nawrocki defines LC(t) = {sp ∈ Σp : t ∈ acc(sp)}. For any pair of tokens t and t′, we have stated that the LC technique resolves each occurrence of the conflict t ≻ t′ in favor of t′ if the current parser state is a member of LC(t′) but not of LC(t). However, according to Nawrocki's exact description, only membership in LC(t′) need be tested. Thus, Nawrocki actually requires that LC(t) ∩ LC(t′) = ∅. In other words, Nawrocki requires that LC be able to resolve t ≻ t′ for every possible syntactic left context in order to resolve it for any of them. Our generalization removes this restriction. Nawrocki imposes a similarly unnecessary but more cryptic restriction for ELC.

5.2 Keynes

The only paper we have reviewed that acknowledges that LR(1) state merging can induce new conflicts in the pseudo-scanner is a paper by Keynes [20]. Unfortunately, Keynes’ paper appears to be unpublished. Our analysis is based on a copy that we downloaded in September 2008 from the URL mentioned in our bibliography. The front page bears the date November 2, 2007.

Keynes cites Nawrocki's definitions of identity conflicts and LM-conflicts. He then reformulates those definitions more clearly in terms of transitions in the scanner's FSA, but he maintains the limitation that we noted in Nawrocki's LM-conflict definition relative to our length conflict definition. Keynes also presents an interesting algorithm for resolving scanner conflicts by modifying actions in the scanner at the time of scanner and parser generation. In order to compare his approach with ours, we now extend our formal model to express Keynes' algorithm.

Definition 5.2.1 (Accepts Extended)

Given the LR(1) parser state sp, the scanner state ss, and the character c ∈ Ξ, we extend Definitions 2.2.12 and 3.2.12 to further overload acc as follows:

1. acc(ss, c) = {t : ∃s′s : ∃ρ ∈ Ξ⁺ : ρ[1] = c ∧ δ∗(ss, ρ) = s′s ∧ t ∈ acc(s′s)}.

2. acc(sp, ss, c) = acc(sp) ∩ acc(ss, c).



Definition 5.2.2 (Accepts-2)

The model for LR(2) parser states is different than the model for LR(1) parser states in that each item in a lookahead set is a sequence of two tokens instead of a single token. Given the LR(2) parser state sp, the scanner state ss, the scanner’s start state ss0, and the character c ∈ Ξ, we extend Definitions 2.2.12, 3.2.12, and 5.2.1 to overload acc further for the LR(2) case:

1. acc(sp) = {t : ∃((ℓ → ϱ), d, K, (at, ap, as)) ∈ sp : (at = “S” ∧ ϱ[d] = t) ∨ (at = “R” ∧ ∃t′ : (t, t′) ∈ K)}

2. acc(sp, ss) = acc(sp) ∩ acc(ss) as for LR(1).

3. acc(sp, ss, c) = acc(sp) ∩ acc(ss, c) as for LR(1).

4. acc2(sp) = {(t, t′) : ∃((ℓ → ϱ), d, K, (at, ap, as)) ∈ sp : (at = “S” ∧ ϱ[d] = t ∧ t′ ∈ acc(as)) ∨ (at = “R” ∧ (t, t′) ∈ K)}.

5. acc2(sp, ss, ss0, c) = acc2(sp) ∩ {(t, t′) : t ∈ acc(ss) ∧ t′ ∈ acc(ss0, c)}.



Definition 5.2.3 (Keynes Scanner Conflict Resolution)

Given the set of parser states Σp and the set of scanner states Σs with start state ss0, then, for every combination (sp, ss, c) ∈ (Σp, Σs, Ξ), Keynes records a separate action on c in ss as follows:

1. If acc(sp, ss) = ∅, then transition on c.

2. If |acc(sp, ss)| = 1 and acc(sp, ss, c) = ∅, then accept acc(sp, ss)[1] on c.

3. If |acc(sp, ss)| > 1 and acc(sp, ss, c) = ∅, then there’s an unresolvable identity conflict. (Keynes actually specifies an accept action in this case but does not specify which token to accept.)

4. If acc(sp, ss) ≠ ∅, acc(sp, ss, c) ≠ ∅, and acc2(sp, ss, ss0, c) = ∅, then transition on c.

5. If acc(sp, ss) ≠ ∅, acc(sp, ss, c) ≠ ∅, and acc2(sp, ss, ss0, c) ≠ ∅, then there's a potential LM-conflict that is unresolvable.



Keynes' algorithm from Definition 5.2.3 almost exhibits the basic pseudo-scanner behavior we described in Definition 3.1.1. That is, if every match that is accepted by the current scanner state is not accepted by the current parser state, rule 1 specifies that such matches be ignored and that the transition to the next scanner state be taken in search of a longer match. Otherwise, if every longer match that might be accepted by some later scanner state is not accepted by the current parser state, rules 2 and 3 specify that one of the current matches be accepted instead and that the next scanner transition not be taken. When both current matches and longer matches are possible, rules 4 and 5 employ a technique similar to Nawrocki's ELC. Like Nawrocki's ELC technique, rules 4 and 5 disagree with longest match only when they are indeterminate and thus useless. Other than rule 4, Keynes' algorithm does not attempt to resolve pseudo-scanner conflicts. In order for Keynes' algorithm to fully support basic pseudo-scanner behavior, rules 3 and 5 would need to select some specific match whose token appears in acc(sp), thus resolving pseudo-scanner conflicts instead of simply identifying them.

Another problem with Keynes’ algorithm is that it sometimes falsely detects LM-conflicts.

The trouble is that the set acc(sp, ss, c) is capable of identifying only the possibility of a longer match and thus only the possibility of an LM-conflict based on the next character. It does not examine the remaining character sequence to be sure a longer match and LM-conflict are actually present. To overcome this problem, the run-time scanning algorithm could implement the rule of longest match as follows. The scanner would record the most recently seen scanner state at which it identified a potential LM-conflict. If the scanner later encountered an error action before finding a longer match, it would return to the last recorded state as the final accepting state and rewind the input character sequence accordingly. In this way, the scanner would accept the shorter match when there is no actual longer match. Keynes suggests that, when his algorithm fails to resolve scanner conflicts, the scanner could fall back to Lex conflict resolution rules, so such a longest match implementation might indeed be his intention. However, implementing the rule of longest match renders rules 4 and 5 from Definition 5.2.3 useless. Fortunately, eliminating rules 4 and 5 would be beneficial because of their use of acc2, which requires LR(2) computation (specifically, he uses LALR(2)), which he notes is "one of the slower parts of the system" [20].

Because scanner transitions might need to be modified differently for each parser state, there must be some means to store the numerous FSAs that result. Keynes proposes two possibilities. First, the generator can construct the original FSA plus an auxiliary table that records, per parser state, what changes must be made to the FSA. Second, the multiple FSAs can be constructed and merged into a single FSA with multiple start states. Our PSLR(1) generator constructs the scanner FSA plus the scanner accepts table, so it most resembles Keynes' first proposal.

Unlike Nawrocki, Keynes' algorithm does handle length conflicts that are not LM-conflicts. Specifically, when ss from Definition 5.2.3 is a state that accepts the shorter lexeme in such a conflict, then acc(ss0, c) = ∅ by Definition 5.1.1, thus acc2(sp, ss, ss0, c) = ∅ by Definition 5.2.2, and so rule 4 in Definition 5.2.3 chooses the transition action on c. Also, Keynes does not impose Nawrocki's restriction that a scanner conflict between a pair of tokens must be resolvable for every parser state in order for it to be resolvable for any parser state.

Keynes discusses how the pseudo-scanner should behave when it cannot match any token acceptable by the current parser state. He defines two main approaches. An exclusive scanner returns an error instead of a specific token. As Keynes notes, an exclusive scanner is not appropriate for a parser whose error recovery strategy involves discarding syntactically unacceptable tokens as they are returned by the scanner. Moreover, error messages are usually more succinct and thus meaningful after erroneous character sequences have been tokenized. In contrast, when an inclusive scanner cannot match any token acceptable by the current parser state, it then attempts to select a token without regard to the current parser state. This approach avoids the difficulties of the exclusive scanner, but there is still no guarantee that it chooses the best possible token for error recovery. In section 3.5, we explained the fallback row of our scanner accepts table, which enables our pseudo-scanners to behave like Keynes’ inclusive scanner.

Keynes' paper is the only work we have reviewed that acknowledges that LR(1) state merging can induce incorrect behavior in the pseudo-scanner. In his terminology, the scanner cannot always determine the left context from the current parser state because LR(1) state merging can lose left context. Keynes also states that "these problem states fortunately seem to be relatively rare in practice" [20], but he offers no citation or statistical evidence to support this statement. As we described in section 4.5.1, the results from our Levine SQL case study suggest that this statement is not accurate. He suggests that a full LR(1) parser could be employed instead of LALR(1), but he cites the work of Spector who proposes a minimal LR(1) algorithm [33]. As we explained in section 3.4, minimal LR(1) algorithms like IELR(1) and Spector's are designed to avoid merging parser states when doing so would induce incorrect parser behavior. In order to avoid incorrect pseudo-scanner behavior as well, such an algorithm requires an extension like our IELR(1) extension for PSLR(1), but Keynes does not mention this requirement.

Keynes' techniques appear to be based mainly on Nawrocki's work. Like Nawrocki, Keynes does not discuss the usefulness of his techniques for composite languages in general. Unlike Nawrocki, he does discuss the value of a unified scanner and parser specification. Keynes also briefly addresses issues relative to traditional scanner-based parsing, such as how to specify whitespace and how to distinguish between keywords and identifiers.

5.3 Van Wyk and Schwerdfeger

A 2007 paper from Van Wyk and Schwerdfeger is the most recent publication we have found that discusses the premise of the pseudo-scanner, which they refer to as a context-aware scanner [40]. Instead of discussing the usefulness of the pseudo-scanner for composite languages in general as we do, they focus on its usefulness specifically for extensible languages. However, because extensible languages are one of the most complex forms of composite languages, the set of issues that they address is similar to the set of issues we have addressed in this dissertation. Many of our solutions are similar as well. In this section, rather than enumerating the similarities, we discuss the solutions from their paper that have interesting differences from ours.

Unlike the other papers we have reviewed, Van Wyk and Schwerdfeger describe an explicit lexical precedence declaration for resolving pseudo-scanner identity conflicts. Specifically, t ≻ t′ specifies that any match for t has higher precedence than a match with the same lexeme for t′. For reasons similar to those we gave in section 3.2.1, the lexical precedence relation specified by "≻" is not implicitly transitive. However, unlike our scanning algorithm, their scanning algorithm always employs the rule of longest match to resolve pseudo-scanner length conflicts, for which they discuss no other resolution mechanism. In this way, their "≻" operator merges the functionality of the "<-" and "<∼" operators from our %lex-prec directive, and "-∼" is implicit for pseudo-scanner length conflicts between other pairs of tokens. Thus, unlike our PSLR(1) generator, their generator does not warn the user when new pseudo-scanner length conflicts arise.

The "≻" operator also merges lexical precedence with lexical tying. That is, in addition to a lexical precedence relationship, t ≻ t′ declares t and t′ to be lexically tied, and Van Wyk and Schwerdfeger do not provide any declarative means to specify lexical precedence or lexical ties separately. The need to declare lexical ties separately is revealed by Van Wyk and Schwerdfeger's example of the keywords ’for’ and ’foreach’. These keywords should be declared lexically tied so that, in a context in which only the keyword ’for’ is syntactically acceptable, the pseudo-scanner selects the keyword ’foreach’ for the input character sequence "foreach" instead of selecting the keyword ’for’ and leaving the trailing "each" to be recognized as an identifier by the subsequent pseudo-scanner invocation. However, the "≻" operator is not appropriate because there is no identity conflict between ’for’ and ’foreach’. Van Wyk and Schwerdfeger's unfortunate solution here is that the user rewrite the grammar to specify that whitespace is required immediately following any occurrence of ’for’ that precedes an identifier. We observe that a distinct lexical tie declaration using our %lex-tie directive would be simpler, clearer, and more maintainable.

For some pairs of tokens, combining lexical precedence and lexical tying is not a problem. As we explained in section 3.3, lexical precedence and symmetric lexical ties are needed for reserved keywords and identifiers. However, "≻" can only declare asymmetric lexical ties. For example, if the user declares ’while’ ≻ ID, then ’while’ is lexically tied to every occurrence of ID that appears in a parser state accept set, but ID is not lexically tied to occurrences of ’while’. Thus, in the example C statement in Figure 3.6b, which we discussed in section 3.3, the pseudo-scanner would not recognize the character sequence "whiles" as an ID token because only the ’while’ token is syntactically acceptable here. Instead, it would recognize "while" as a ’while’ token, leaving the trailing "s" for a subsequent pseudo-scanner invocation. If the C grammar were to accept an ID token following a ’while’ token, the pseudo-scanner would then recognize the trailing "s" as the ID token instead of reporting the expected syntax error. The "≻" operator's behavior in this example would obviously not be correct even if ’while’ were a non-reserved keyword, for which asymmetric lexical tying is usually appropriate as we discussed in section 3.3. The trouble is that the lexical tie declared by "≻" is the reverse of what is needed for non-reserved keywords given the lexical precedence relationship declared by "≻". We assume that Van Wyk and Schwerdfeger's solution in these examples would be the same as in their ’for’ versus ’foreach’ example. That is, the user must rewrite the grammar to specify that whitespace is required immediately following each occurrence of ’while’ that precedes ID. Again, we observe that our %lex-tie directive is simpler, clearer, and more maintainable.

For any set of tokens with pseudo-scanner identity conflicts for which "≻" is not appropriate, Van Wyk and Schwerdfeger allow the user to provide a disambiguation function. In contrast to the declarative nature of the "≻" operator, this disambiguation function can contain arbitrary user code, and so it usually cannot be evaluated until run time. As a result, generation-time pseudo-scanner FSA optimizations that might make use of declarative pseudo-scanner conflict resolution mechanisms like "≻" would not always be possible.

Van Wyk and Schwerdfeger describe two scenarios in which disambiguation functions prove useful. First, the user can write a disambiguation function that always returns a specific token and thus performs exactly the task of the "≻" operator but without the implicit lexical tie. Again, we prefer that this scenario be handled declaratively as permitted by our %lex-prec directive.

Second, a disambiguation function can resolve a conflict using semantics. For example, it can look up an identifier in a symbol table in order to determine whether the pseudo-scanner should return a variable name token or a type name token.

Unfortunately, Van Wyk and Schwerdfeger's disambiguation functions can sometimes produce confusing behavior. The token returned by a disambiguation function might have been requested as a lookahead by the parser. After the parser receives the token, the parser might perform a series of reduce actions and their associated semantic actions before finally shifting the token. Those semantic actions might alter semantic properties, such as symbol table scope, and thus change which token the disambiguation function should have returned. Moreover, as the reduce actions are performed, the current parser state might change, thus the set of conflicting tokens might change, and thus which disambiguation function should have been invoked might change. To address this problem, after every such reduce action, Van Wyk and Schwerdfeger's pseudo-scanner rescans the input text and invokes whatever disambiguation function then looks appropriate. In other words, their pseudo-scanner relies on potentially incorrect interpretations of the input text to select the series of reduce actions that then alter those interpretations. Van Wyk and Schwerdfeger provide no proof that this approach results in the selection of the correct token.

We observe that a less confusing way to employ semantics to resolve conflicts is provided by the Anagram parser generator [1], which is also mentioned by Keynes [20]. For example, instead of declaring separate identifier tokens for variable names and type names, the user can declare a single generic identifier token, a variable name nonterminal, and a type name nonterminal. Each of the two nonterminals appears on the LHS of a production whose RHS contains only the generic identifier token. In any context where both a variable name and a type name are acceptable, the parser encounters a conflict between the two productions immediately after shifting the generic identifier token. The key to Anagram's solution is that the user can provide a disambiguation function that resolves this parser conflict instead of the corresponding pseudo-scanner conflict from Van Wyk and Schwerdfeger's approach. The advantage is that the generic identifier token remains the lookahead throughout the series of reduce actions before the parser shifts it. Thus, the identifier must be interpreted as a variable name or type name only after the series of reduce actions is finished and thus after semantic properties like symbol table scope are finalized. As part of our future work, we are considering extending our PSLR(1) generator to support Anagram's solution.
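Anagram's approach can be sketched as follows. This is a hypothetical illustration; the symbol table, the names var_name and type_name, and the callback interface are our own assumptions, not Anagram's actual API:

```python
# A single generic IDENT token is shifted first; only when the parser must
# choose between the var_name and type_name productions does a user callback
# consult the symbol table, whose scope is by then finalized.

symbol_table = {"x": "variable", "size_t": "type"}

def disambiguate(lexeme):
    """Resolve the var_name/type_name parser conflict for a shifted IDENT."""
    kind = symbol_table.get(lexeme, "variable")   # assumed default: variable
    return "type_name" if kind == "type" else "var_name"

def reduce_ident(lexeme):
    # The IDENT token has already been shifted, so any earlier reduce actions
    # (e.g. scope changes) have completed before we interpret the identifier.
    nonterminal = disambiguate(lexeme)
    return (nonterminal, lexeme)

print(reduce_ident("size_t"))  # ('type_name', 'size_t')
print(reduce_ident("x"))       # ('var_name', 'x')
```

The design point is the timing: the semantic decision is deferred until after the reduce actions, avoiding the rescanning problem of scanner-side disambiguation.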

Van Wyk and Schwerdfeger employ LALR(1) and overlook the trouble that LR(1) state merging can cause. Moreover, they specifically make the claim that, once a scanner returns a lookahead token that is acceptable by the current parser state, the following series of parser reduce actions on that token cannot push a parser state that does not accept that token. They use this claim to reach the conclusion that the input need not be rescanned after every reduce action except when disambiguation functions must be employed. Their claim is actually true for canonical LR(1) parser tables. However, because of the merging of LR(1) states and sets of acceptable tokens, their claim cannot be guaranteed for LALR(1) parser tables even when the grammar is LALR(1). Also, LR(1) state merging can cause the selection of an incorrect disambiguation function because it can add invalid tokens to a pseudo-scanner conflict. In section 3.4, we described PSLR(1)'s IELR(1) extension, which guarantees that LR(1) state merging never causes the pseudo-scanner to select an incorrect token.

One other interesting aspect of Van Wyk and Schwerdfeger's paper is that they optimize the pseudo-scanner by using functions like our acc function in a manner similar to the way Keynes does. Most interestingly, they compute acc(sp, ss, c) to avoid useless scanner transitions. However, they do not use this function to compute a separate pseudo-scanner FSA for each parser state at generation time as Keynes does. Instead, for every scanner state ss, they compute a function poss(ss) = acc(ss) ∪ {t : ∃c ∈ Ξ : t ∈ acc(ss, c)} at generation time. Notice that, given any pair of scanner states ss and ss′ and character c such that δ(ss, c) = ss′, then acc(ss, c) = poss(ss′) by Definition 5.2.1. Thus, at run time, their pseudo-scanner follows the transition on the next input character c from the current scanner state ss to discover the next scanner state ss′ and then computes the intersection acc(sp) ∩ poss(ss′), which is equal to acc(sp, ss, c) by Definition 5.2.1. If acc(sp, ss, c) = ∅, their pseudo-scanner does not bother to scan further. Along with Keynes' approach, we are considering extending our PSLR(1) implementation with this approach as part of our future work. If poss requires significant storage when computed for every scanner state, we might try Keynes' approach of evaluating acc(sp, ss, c) only at each scanner state ss for which acc(sp, ss) ≠ ∅, as we showed in Definition 5.2.3.
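The poss optimization can be illustrated over a toy scanner DFA. All states, tokens, and the interpretation of poss as accepting tokens at the state itself or at any reachable state are illustrative assumptions on our part, not Van Wyk and Schwerdfeger's data structures:

```python
# Toy scanner DFA: delta maps (state, char) -> state; acc_state maps each
# state to the set of tokens accepted exactly at that state.
delta = {(0, "i"): 1, (1, "f"): 2, (1, "x"): 3}
acc_state = {0: set(), 1: {"ID"}, 2: {"IF", "ID"}, 3: {"ID"}}
ALPHABET = "ifx"

def reachable(ss):
    """All scanner states reachable from ss, including ss itself."""
    seen, stack = {ss}, [ss]
    while stack:
        s = stack.pop()
        for c in ALPHABET:
            t = delta.get((s, c))
            if t is not None and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

# poss(ss): every token acceptable at ss or beyond it, precomputed once
# per scanner state at generation time.
poss = {s: set().union(*(acc_state[t] for t in reachable(s)))
        for s in acc_state}

# At run time, after following the transition on character c, intersect the
# parser state's accept set with poss of the new scanner state; an empty
# intersection means further scanning is useless.
parser_accepts = {"IF"}
ss2 = delta[(0, "i")]
print(parser_accepts & poss[ss2])   # {'IF'}: worth scanning further
```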

5.4 Scannerless GLR

The key issue addressed by the premise of the pseudo-scanner is that different sub-languages within a composite language often require different scanner specifications. While a pseudo-scanner handles this issue by automatically communicating with the parser as we explained in section 3.1, a traditional scanner generator requires the user to manually specify start conditions for the sub-languages as we explained in section 2.3. However, there exists another common solution that does not require a pseudo-scanner or start conditions. That is, the user can extract from the scanner specification all parts of the lexical syntax that are specific to only a proper subset of the sub-languages and then express those parts in the sub-languages' grammars instead. Thus, token definitions in the scanner specification are reduced to simple building blocks of lexical syntax that are generic enough to be shared by all the sub-languages' grammars. In the extreme case where the sub-languages have no common lexical syntax, every token must be reduced to only a single character. The composite grammar is then called a character-level grammar.

For character-level grammars, the scanner's functionality is reduced to merely buffering the input character sequence. Thus, the scanner's functionality can easily be incorporated into the parser. Salomon and Cormack introduced the term scannerless parsing to identify this architecture [31]. However, rather than viewing character-level grammars and scannerless parsers as an extreme case of a common approach to handling language composition, Salomon and Cormack instead employ them as the basis for unified scanner and parser specifications that completely eliminate the complexity of scanner and parser communication.

Consider a scannerless LR(1) parser, p, and a scanner-based LR(1) parser, p′, for the same language. Because every token in p is reduced to only a single character, the tokens that trigger different parser actions in any parser state in p easily become indistinguishable. As a result, p usually has significantly more conflicts than p′. One naive approach to solving this problem is to generate LR(k) parser tables for p where k is the length of the longest lexeme from the tokens in p′. This approach would allow p to see at least as far ahead in the input character stream as p′ when attempting to choose a parser action. However, for many tokens like identifiers, the length of lexemes is usually unbounded and thus would require k = ∞. Unfortunately, LR(∞) parser tables would require infinite storage.

Visser combines scannerless parsing with GLR (Generalized LR) [38, 39], which was first described by Tomita [34]. A GLR parser usually employs parser tables from the LR(1) family. However, upon encountering a conflict in those parser tables during a parse, the GLR parser branches and explores all possible parses to which the conflicting parser actions lead. For this reason, GLR is called a non-deterministic parsing algorithm. An incorrect parser action leads to a syntax error, which kills the corresponding parsing branch. If all parsing branches die, the GLR parser reports a syntax error. In this way, a GLR parser is able to look as far into the input character stream as necessary to determine which parser action, if any, leads to a syntactically correct parse. This effective LR(∞) behavior solves the unbounded lookahead problem for scannerless parsing.
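The branch-and-prune behavior described above can be sketched in a few lines. This is a minimal illustration of the idea, not Visser's graph-structured-stack implementation; the state numbers and conflict table are hypothetical:

```python
# Sketch of GLR-style branching: on a conflict, every candidate action is
# explored in parallel, and branches that hit a syntax error are discarded.

def glr_parse(tokens, step):
    """step(state, token) -> list of successor states ([] means error)."""
    branches = [0]                      # start state for every branch
    for tok in tokens:
        branches = [s2 for s in branches for s2 in step(s, tok)]
        if not branches:
            return None                 # all branches died: syntax error
    return branches                     # surviving parses

# Toy conflict: in state 0, token 'a' could lead to two different states.
table = {(0, "a"): [1, 2], (1, "b"): [3], (2, "c"): [4]}
step = lambda s, t: table.get((s, t), [])

print(glr_parse(["a", "b"], step))  # [3]: only one branch survives
print(glr_parse(["a", "x"], step))  # None: every branch dies
```

If more than one branch survives to the end of the input, the grammar is ambiguous for that input, which is exactly the situation discussed next.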

If multiple parsing branches survive in a GLR parser, then the grammar is ambiguous. That is, while a deterministic LR parser can only compute one parse tree for a given input, a GLR parser handles ambiguous grammars by computing all possible parse trees, each of which might represent a different semantic interpretation of the input. Visser employs this ability of GLR in order to implement reject productions. For example, if a keyword is a reserved word and thus should never be mistaken for an identifier, the user can write a production for the identifier that rejects the keyword's lexeme. When the keyword's lexeme appears in the input, the GLR parser must branch to recognize all possible interpretations, potentially resulting in multiple parse trees. The reject production specifies that the parse tree that represents the lexeme's interpretation as an identifier be pruned. If the interpretation of the lexeme as the keyword is not possible in the current parsing context, then no parse trees remain, and the parser reports a syntax error. Notice that reject productions are similar to Van Wyk and Schwerdfeger's “ ” operator, which combines asymmetric lexical tying with lexical precedence for identity conflicts. That is, in any context in which the identifier is syntactically acceptable, the keyword has precedence even if the keyword is not syntactically acceptable.

In section 5.2, we described a common and simple implementation of the rule of longest match for resolving scanner length conflicts. Our pseudo scan function from Definition 3.2.16 also implements the rule of longest match. For a scannerless parser, an implementation of the rule of longest match is not as obvious. Visser's solution is follow restrictions. For example, the user can declare that an identifier or keyword should never be followed immediately by a letter. Thus, in Figure 3.6b, which we described in section 3.3, the parser cannot then recognize the character sequence “whiles” as the keyword 'while' followed by a trailing “s”. Moreover, if the identifier were syntactically acceptable here, the parser could only recognize the complete “whiles” as the identifier. Because the identifier is not actually syntactically acceptable, the parser reports a syntax error as expected. Recall that, in order for the rule of longest match in a pseudo-scanner to be useful for this example, a token like the identifier must be lexically tied to the 'while' keyword so that the pseudo-scanner selects the longest matching lexeme, “whiles”, even though it is not syntactically acceptable in this context. Thus, a follow restriction is similar to combining asymmetric lexical tying with the rule of longest match. However, for a token with a complex regular expression, the correct regular expression for the follow restriction is not always straightforward.
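A follow restriction can be sketched in a few lines. This is an illustrative approximation using regular expressions rather than a character-level grammar, and the token names are hypothetical:

```python
import re

# Follow restriction: the keyword 'while' may not be followed immediately
# by a letter, so "whiles" can never be split into 'while' plus "s".
KEYWORD = re.compile(r"while")
NO_FOLLOW = re.compile(r"[A-Za-z]")     # the restricted follow set

def match_keyword(text, pos):
    m = KEYWORD.match(text, pos)
    if m and NO_FOLLOW.match(text, m.end()):
        return None                     # match violates the follow restriction
    return m.group() if m else None

print(match_keyword("while (x)", 0))  # 'while'
print(match_keyword("whiles", 0))     # None: no keyword match is possible
```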

In this section, we have summarized Visser's scannerless GLR system and compared some of its mechanisms with our PSLR(1) mechanisms. Because of the popularity of Visser's system, we conclude our dissertation in chapter 6 with a high-level comparison of the two systems' relative advantages.

Chapter 6

Merits of the Work

In chapter 5, we compared PSLR(1) with several other parser generator systems that employ a pseudo-scanner, and we explained how PSLR(1) overcomes those systems' shortcomings. We also compared PSLR(1) to Visser's popular scannerless GLR system, which employs non-deterministic parsing in order to abandon the scanner altogether. Because of the wide recognition of scannerless GLR as a robust system for parsing composite languages, we conclude our dissertation by explaining how PSLR(1) achieves Visser's original goals for scannerless GLR even though PSLR(1) does not require the complexity of non-deterministic parsing.

In section 2.1 of Visser’s original scannerless GLR paper, he provides a list of advantages of scannerless GLR parsing over traditional scanner-based LR parsing [38]. This list originally inspired us to conceive of the pseudo-scanner as a deterministic alternative. In other words, we intend for (pseudo-scanner)-based LR(1) parsing to be pseudo-(scannerless), and we abbreviate this as PSLR(1). We now explain how our PSLR(1) system addresses each such advantage:

1. “No Scanner.” Unlike scannerless GLR, PSLR(1) does require scanner generation and thus does not eliminate the “complicated interface between scanner and parser” [38]. Instead, PSLR(1) automates this interface by generating the type of scanner that we call a pseudo-scanner.

2. “Integrated Uniform Grammar Formalism.” Our PSLR(1) generator accepts an integrated scanner and parser specification. As a result, if a token outgrows its regular syntax, then the user can more easily convert it into a nonterminal so that its syntax can be specified via a context-free grammar instead, and vice-versa. Moreover, an integrated specification is easier to divide into modules, one per sub-language. However, our lexical and syntactic formalisms are not as uniform as in Visser's system. For example, for nonterminals, we have not implemented any equivalent of lexical precedence rules like longest match. It is not clear to us that such lexical declarations would actually prove useful for the kinds of constructs that are usually expressed using a context-free grammar given that traditional parser generators also do not permit any equivalent of these declarations in the grammars they accept. We expect that it will prove sufficient to apply these lexical declarations and other syntactic declarations to the tokens within a nonterminal's grammar instead.

3. “Disambiguation by Context.” In describing scannerless GLR, Visser says that “lexical analysis is guided by context-free analysis. If a token does not make sense at some position, it will not even be considered” [38]. This advantage is precisely the purpose of the basic behavior of the pseudo-scanner, which we described in section 3.1.

4. “Conservation of Lexical Structure.” At run time, it is often useful to construct a representation similar to a parse tree for the lexical structure of the input so that the lexical structure can be further analyzed in later phases of the application. Specifying this construction using context-free grammar productions and associated semantic actions is often easier than using a traditional scanner specification. Thus, even if a token's syntax is regular, the user may find it useful to convert the token into a nonterminal. Unfortunately, as Visser points out, whitespace and comments specified using a mechanism like PSLR(1)'s layout tokens would then be permitted within the token's syntax. However, in section 3.6, we explained how lexical precedence declarations can be used to prevent the recognition of layout tokens in such syntactic contexts. Moreover, in section 3.7, we discussed a mechanism by which the user can declaratively scope directives to specific nonterminals' grammars, and this mechanism could be used to restrict the scope of layout tokens.

5. “Conservation of Layout.” In section 3.6, we explained how some parsing applications require whitespace and comments not to be discarded, and we explained PSLR(1)'s mechanism for attaching whitespace and comments to other tokens so that they can be returned to the parser.

6. “Expressive Lexical Syntax.” Visser is here referring to the ability to express tokens, whitespace, and comments using a context-free grammar. Again, tokens including layout tokens can be converted to nonterminals or lexical nonterminals, which we discussed in section 3.6.

Visser alludes to the obvious disadvantages of his system relative to traditional scanner-based LR as follows:

Since ambiguity of a context-free grammar is undecidable (Floyd, 1962), it is also undecidable whether a conflict is due to an ambiguity or to a lack of lookahead. Because complete character level grammars frequently need arbitrary length lookahead, methods to solve conflicts in the table will not always succeed. [38]

In other words, it is the user’s responsibility to determine whether each conflict in the parser tables produced from the user’s grammar is due to insufficient lookahead or to an ambiguity. If the user overlooks an ambiguity, a GLR parser can unexpectedly produce multiple parse trees at run time.

As Van Wyk and Schwerdfeger state, “for building extensible languages we prefer a guarantee of determinism since languages may be composed at the direction of the programmer, not a language expert who can resolve syntactic or lexical ambiguities” [40]. The task of debugging parser tables is already especially difficult because of the character-level grammars required by the scannerless architecture. The character-level grammars expand the number of conflicts and the number of parser states, which are polluted with the extra symbols and productions needed to define the lexical syntax. In this way, the separation of concerns permitted by scanner-based parsing is completely lost. Because PSLR(1) employs a scanner and does not employ GLR, it does not suffer from these disadvantages.

The application domain of our PSLR(1) system is the deterministic parsing of composite languages. Outside of this domain, our PSLR(1) system lacks some of the features of Visser's scannerless GLR system. Specifically, we have not proposed to provide the full capabilities of follow restrictions and GLR, which enables unbounded lookahead, the discovery of all possible parse trees in the case of ambiguous grammars, and reject productions. However, there is no aspect of our PSLR(1) system that would prevent the incorporation of these mechanisms. If we were to extend our PSLR(1) generator to provide these mechanisms, the user could employ them only for the portions of the grammar that require them. Other portions of the grammar would still retain PSLR(1)'s advantages over scannerless GLR. For this reason, when scannerless GLR features are required, we predict that users can more easily develop parsers using a hybrid of PSLR(1) and scannerless GLR rather than using scannerless GLR or PSLR(1) alone.

In summary, PSLR(1) is a more robust system for generating deterministic parsers for composite languages than traditional scanner-based LR(1) parser generation, existing pseudo-scanner-based parser generation systems, or scannerless GLR. To achieve this result, PSLR(1) employs novel techniques that combine the most useful aspects of all of these systems and that overcome each of their shortcomings.

Bibliography

[1] Anagram. Parsifal Software. http://www.parsifalsoft.com/.

[2] Bison. The GNU Project. http://www.gnu.org/software/bison/.

[3] Flex. SourceForge. http://flex.sourceforge.net/.

[4] Gawk. The GNU Project. http://www.gnu.org/software/gawk/.

[5] GCC. The GNU Project. http://gcc.gnu.org/.

[6] Groff. The GNU Project. http://www.gnu.org/software/groff/.

[7] The SQL Test Suite, Version 6.0. NIST. http://www.itl.nist.gov/div897/ctg/sql_form.htm.

[8] International Standard, Programming Languages – C++. Number ISO/IEC 14882:2003(E). American National Standards Institute, October 2003.

[9] Single UNIX Specification, Version 3. The IEEE and the Open Group, April 2004. http://www.opengroup.org/bookstore/catalog/t041.htm.

[10] International Standard, Programming Languages – C. Number ISO/IEC 9899:1999/TC3. ISO and IEC, September 2007.

[11] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[12] Martin Bravenboer, Éric Tanter, and Eelco Visser. Declarative, formal, and extensible syntax definition for AspectJ. In OOPSLA '06: Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications, pages 209–228, New York, NY, USA, 2006. ACM.

[13] L. Cardelli, F. Matthes, and M. Abadi. Extensible syntax with lexical scoping. Technical Report 121, DEC SRC, 1994.

[14] Ralph Corderoy. troff.org, The Text Processor for Typesetters. http://troff.org.

[15] Joel E. Denny and Brian A. Malloy. IELR(1): Practical LR(1) Parser Tables for Non-LR(1) Grammars with Conflict Resolution. In SAC '08: Proceedings of the 2008 ACM Symposium on Applied Computing, pages 240–245, New York, NY, USA, 2008. ACM.

[16] Joel E. Denny and Brian A. Malloy. The IELR(1) Algorithm for Generating Minimal LR(1) Parser Tables for Non-LR(1) Grammars with Conflict Resolution. Science of Computer Programming, In Press, Corrected Proof, 2009.

[17] The GNU Project. Bison, 2.4.1 edition, December 2008.

[18] Robert Grimm. Better extensibility through modular syntax. In PLDI '06: Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation, pages 38–51, New York, NY, USA, 2006. ACM.

[19] S. Johnson. Yacc: Yet Another Compiler Compiler. In UNIX Programmer's Manual, volume 2, pages 353–387. Holt, Rinehart, and Winston, New York, NY, USA, 1979.

[20] Nathan Keynes. Better Parsing Through Lexical Conflict Resolution, November 2007. http://www.deadcoderemoval.net/files/honours-thesis.pdf.

[21] P. Klint, R. Lämmel, and Chris Verhoef. Toward an engineering discipline for grammarware. ACM Transactions on Software Engineering Methodology, 14(3):331–380, 2005.

[22] R. Lämmel. Grammar Testing. In FASE, pages 201–216, 2001.

[23] M. E. Lesk and E. Schmidt. Lex: A Lexical Analyzer Generator. In UNIX Programmer's Manual, volume 2, pages 388–400. Holt, Rinehart, and Winston, New York, NY, USA, 1979.

[24] J. R. Levine. flex & bison. O'Reilly Media, Inc., Sebastopol, CA 95472, 2009.

[25] J. R. Levine. flex & bison examples code, 2009. ftp://ftp.iecc.com/pub/file/flexbison.zip.

[26] Thomas J. McCabe. A complexity measure. In ICSE '76: Proceedings of the 2nd international conference on Software engineering, page 407, Los Alamitos, CA, USA, 1976. IEEE Computer Society Press.

[27] Jerzy R. Nawrocki. Conflict detection and resolution in a lexical analyzer generator. Inf. Process. Lett., 38(6):323–328, 1991.

[28] D. Pager. The lane tracing algorithm for constructing LR(k) parsers. In STOC '73: Proceedings of the fifth annual ACM symposium on Theory of computing, pages 172–181, New York, NY, USA, 1973. ACM Press.

[29] D. Pager. A Practical General Method for Constructing LR(k) Parsers. Acta Informatica, 7(3):249–268, September 1977.

[30] D. Pager. The Lane-Tracing Algorithm for Constructing LR(k) Parsers and Ways of Enhancing Its Efficiency. Information Sciences, 12(1):19–42, 1977.

[31] D. J. Salomon and G. V. Cormack. Scannerless NSLR(1) Parsing of Programming Languages. SIGPLAN Not., 24(7):170–178, 1989.

[32] D. Spector. Full LR(1) parser generation. SIGPLAN Not., 16(8):58–66, 1981.

[33] D. Spector. Efficient full LR(1) parser generation. SIGPLAN Not., 23(12):143–150, 1988.

[34] Masaru Tomita. Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems. Kluwer Academic Publishers, Norwell, MA, USA, 1985.

[35] M. van den Brand and E. Visser. Generation of formatters for context-free languages. ACM Trans. Softw. Eng. Methodol., 5(1):1–41, 1996.

[36] M. G. J. van den Brand, A. Sellink, and C. Verhoef. Current Parsing Techniques in Software Renovation Considered Harmful. In IWPC, page 108, Washington, DC, USA, 1998.

[37] Daveed Vandevoorde. Right Angle Brackets, January 2005. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1757.html.

[38] Eelco Visser. Scannerless Generalized-LR Parsing. Technical Report P9707, Programming Research Group, University of Amsterdam, August 1997.

[39] Eelco Visser. Syntax Definition for Language Prototyping. PhD thesis, University of Amsterdam, September 1997.

[40] Eric R. Van Wyk and August C. Schwerdfeger. Context-aware scanning for parsing extensible languages. In GPCE '07: Proceedings of the 6th international conference on Generative programming and component engineering, pages 63–72, New York, NY, USA, 2007. ACM.