Principled Procedural Parsing Nicolas Laurent
August 2019

Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Applied Science in Engineering

Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM)
Louvain School of Engineering (EPL)
Université catholique de Louvain (UCLouvain)
Louvain-la-Neuve, Belgium

Thesis Committee
Prof. Kim Mens, Advisor (UCLouvain/ICTEAM, Belgium)
Prof. Charles Pecheur, Chair (UCLouvain/ICTEAM, Belgium)
Prof. Peter Van Roy (UCLouvain/ICTEAM, Belgium)
Prof. Anya Helene Bagge (UIB/II, Norway)
Prof. Tijs van der Storm (CWI/SWAT & UG, The Netherlands)

Contents

1 Introduction
  1.1 Parsing
  1.2 Inadequacy: Flexibility Versus Simplicity
  1.3 The Best of Both Worlds
  1.4 The Approach: Principled Procedural Parsing
  1.5 Autumn: Architecture of a Solution
  1.6 Overview & Contributions
2 Background
  2.1 Context Free Grammars (CFGs)
    2.1.1 The CFG Formalism
    2.1.2 CFG Parsing Algorithms
    2.1.3 Top-Down Parsers
    2.1.4 Bottom-Up Parsers
    2.1.5 Chart Parsers
  2.2 Top-Down Recursive-Descent Ad-Hoc Parsers
  2.3 Parser Combinators
  2.4 Parsing Expression Grammars (PEGs)
    2.4.1 Expressions, Ordered Choice and Lookahead
    2.4.2 PEGs and Recursive-Descent Parsers
    2.4.3 The Single Parse Rule, Greed and (Lack of) Ambiguity
    2.4.4 The PEG Algorithm
    2.4.5 Packrat Parsing
  2.5 Expression Parsing
  2.6 Error Reporting
    2.6.1 Overview
    2.6.2 The Furthest Error Heuristic
    2.6.3 Associating Errors With Custom Messages
    2.6.4 Error Recovery
  2.7 Further Related Work
3 Autumn’s Basics
  3.1 Introductory Example
    3.1.1 DSL, rule, parsers and combinators
    3.1.2 Whitespace Handling & String Literals
    3.1.3 lazy and sep
    3.1.4 Launching the Parse
  3.2 Parser and Parse
  3.3 Building An Abstract Syntax Tree (AST)
    3.3.1 AST-Building Combinators
    3.3.2 Value Stack as Context
  3.4 Beyond Basics
  3.5 Java 8 Grammar
4 Infix Expression Parsing
  4.1 Outline
  4.2 Grammatical Encodings of Expression Syntax
  4.3 The Semantics of Left-Recursion
  4.4 The Semantics of Left-Associativity
    4.4.1 Naive Formulation
    4.4.2 Ambiguous Recursion
    4.4.3 Restating the Problem
    4.4.4 A Pragmatic Way Out
    4.4.5 Related Work
  4.5 Transparent Left-Recursion Handling in PEG
    4.5.1 The Left-Recursive Algorithm
    4.5.2 Transparent Left-Recursion in Autumn
    4.5.3 Performance Woes
  4.6 Automatic Left-Recursion Discovery
  4.7 Expression Clusters
  4.8 Left-Associativity via Left-Folding
    4.8.1 Syntax Trees and Left Folds
    4.8.2 Left-Folding by Hand
    4.8.3 Left-Folding in Autumn
  4.9 Defining Expression Families
  4.10 Discussion
  4.11 Related Work
5 Principled Stateful Parsing
  5.1 Context-Sensitive Parsing
    5.1.1 Recall and Context-Sensitive Features
    5.1.2 Context-Sensitive Grammars
    5.1.3 Context-Sensitivity & Parser Combinators
  5.2 State of The Art
    5.2.1 Backtracking Semantic Actions
    5.2.2 Data-Dependent Grammars
    5.2.3 Monadic Parsers
    5.2.4 Attribute Grammars
    5.2.5 Unprincipled Stateful Parsing
    5.2.6 Rats!
    5.2.7 Marpa and Ruby Slippers
  5.3 Context Transparency
  5.4 Intuition
  5.5 Formalization
    5.5.1 Parse State
    5.5.2 Parsers
    5.5.3 Primitive Operations
    5.5.4 Parser Invocation Semantics
  5.6 Implementation
    5.6.1 From Theory to Practice
    5.6.2 Primitive Operations and the Log
    5.6.3 Operationalization and Usability Concerns
    5.6.4 Examples
    5.6.5 Alternatives
  5.7 Conclusion
6 Engineering Aspects
  6.1 Performance Considerations
    6.1.1 The Conspicuous Absence of Exponentiality
    6.1.2 Inefficient Idioms in Simple PEG Parsing
    6.1.3 Packrat Parsing
    6.1.4 Memoization in Autumn
    6.1.5 Megamorphic Call Sites
  6.2 Performance Comparison
    6.2.1 Run Times and Hot Spots
    6.2.2 Virtual Machine Effects
    6.2.3 Memory Footprint
    6.2.4 Discussion
  6.3 Lexical Analysis
    6.3.1 Motivation
    6.3.2 Autumn’s Lexical Analysis Emulation
  6.4 Error Reporting & Recovery
    6.4.1 Performance & Error Reporting
    6.4.2 Autumn’s Error-Reporting Capabilities
    6.4.3 Custom Error States & Introspection
    6.4.4 Longest-Match Analysis
    6.4.5 Error-Recoverable Parsers
    6.4.6 Lexical-Level Error Reporting
  6.5 Grammar Traversal & Parser Visitors
    6.5.1 Grammar Traversal
    6.5.2 The Expression Problem
    6.5.3 A Clean But Verbose Partial Solution
    6.5.4 A User-Friendly Partial Solution
    6.5.5 Well-Formedness Checking Using Built-In Visitors
    6.5.6 Grammar Transformation Using CopyVisitor
    6.5.7 Abstract Parsers
    6.5.8 Potential Further Applications
  6.6 Debugging & Tracing
    6.6.1 Parser Invocation Stack Traces
    6.6.2 Custom Parsers for State Inspection
    6.6.3 Testing Support
    6.6.4 Performance Tracing
  6.7 Grammar Composition
  6.8 Conclusion
7 Conclusions and Future Directions
Bibliography
Appendices
A Autumn Java 8 Grammar

Chapter 1
Introduction

This thesis is concerned with the limitations of currently available parsing systems, and how to overcome them. In this chapter, we introduce parsing, some of the challenges and trade-offs we perceive in the field, and our approaches to tackle these challenges.

1.1 Parsing

Parsing is the process of analysing an input string in order to extract a structured representation of its content (a syntax tree) with respect to a specific language. In the thesis, we focus on parsing formal languages, such as programming or markup languages, as opposed to natural spoken languages. Unlike natural languages, formal languages are never ambiguous: there is only a single correct interpretation of the input.

Parsing is a pervasive activity: every time a source file must be turned into executable code, a parser is required.
Similarly, parsers are used to convert input files into relevant data structures. It is fair to say that most programs include a parser, sometimes many.

As such, making parsers easier to write, use, and modify is a broadly beneficial endeavour. In particular, parsers should be written using a simple yet expressive notation. It should be easy to modify existing parsers, as the language definition may evolve. Finally, the parsers should be able to support diverse practical non-linguistic requirements, such as the generation of intelligible error messages when the input does not conform to expectations, permissive parsing in the presence of such errors, and the ability to derive various artifacts from the parser specification.

Broadly, the practice of parsing can be divided between ad-hoc parsing, where a programmer writes code to implement a parser for a single language; and the use of grammarware [49], that is, parsing tools that derive from or use a grammar. Grammarware also includes other grammar-based tools, such as syntax highlighters and pretty-printers.

Parsing is one of the oldest disciplines in computer science, yet even today, practical challenges and inadequacies remain, whether one uses ad-hoc parsing or grammarware.

1.2 Inadequacy: Flexibility Versus Simplicity

Ad-hoc parsers have many advantages: they can be fast and can be customized to do exactly what the programmer requires. On the other hand, they are verbose and do not constitute a readable language description. Nor is it easy to verify that the parser conforms to an existing description of the language. Additionally, there is very little possibility of reusing the parsing code to perform other tasks that depend on the structure