source: https://doi.org/10.7892/boris.89195 | downloaded: 6.10.2021

Parsing For Agile Modeling

Inaugural dissertation of the Faculty of Science of the University of Bern

submitted by Jan Kurš from the Czech Republic

Supervisor: Prof. Dr. Oscar Nierstrasz, Institute of Computer Science

Accepted by the Faculty of Science.

Bern, 25.10.2016
The Dean: Prof. Dr. Gilberto Colangelo

This dissertation can be downloaded from scg.unibe.ch.

Copyright 2016 by Jan Kurš

This work is licensed under a Creative Commons Attribution-Non-Commercial-No derivative works 2.5 Switzerland license. To see the license go to http://creativecommons.org/licenses/by-sa/2.5/ch/

Attribution–ShareAlike

Abstract

Agile modeling refers to a set of methods that allow for the quick initial development of an importer and its further refinement. These requirements are not met simultaneously by current parsing technology. Problems with parsing became a bottleneck in our research on agile modeling.

In this thesis we introduce a novel approach to specifying and building parsers. Our approach allows for expressive, tolerant and composable parsers without sacrificing performance. The approach is based on a context-sensitive extension of parsing expression grammars that allows a grammar engineer to specify complex language restrictions. To ensure high parsing performance, we automatically analyze a grammar definition and choose different parsing strategies for different parts of the grammar.

We show that context-sensitive parsing expression grammars allow for highly composable, tolerant and variable-grained parsers that can be easily refined. Different parsing strategies ensure high parser performance without sacrificing the expressiveness of the underlying grammars.


Contents

1 Introduction
    1.1 Agile Modeling
    1.2 Parsing Obstacles of Agile Modeling
    1.3 Thesis
    1.4 Our Contribution

2 Overview of Parsing Technologies
    2.1 Parsing in the Wild
        2.1.1 Expressive Power
        2.1.2 Composability
        2.1.3 Tolerant Grammars and Semi-Parsing
        2.1.4 Performance
        2.1.5 Parsing Frameworks
    2.2 Existing Limitations
    2.3 Our Solution

3 Parsing Expression Grammars and PetitParser
    3.1 Parsing Expression Grammars
        3.1.1 PEG Analysis
        3.1.2 Parser Combinators
    3.2 PetitParser

4 Context Sensitivity in Parsing Expression Grammars
    4.1 Motivating Example
    4.2 Parsing Contexts
        4.2.1 Context-Sensitive Extension
        4.2.2 Indentation Stack
    4.3 Parsing Contexts in Parsing Expression Grammars
        4.3.1 Parser Combinators
        4.3.2 CS-PEG analysis
    4.4 Implementation
        4.4.1 Performance
    4.5 Case Studies
        4.5.1 Python
        4.5.2 Markdown
    4.6 Related Work
    4.7 Conclusion


5 Semi-Parsing with Bounded Seas
    5.1 Motivating Example
        5.1.1 Why not use Regular Expressions?
        5.1.2 A Naïve Island Grammar
        5.1.3 An Advanced Island Grammar
    5.2 Bounded Seas
        5.2.1 The Sea Boundary
        5.2.2 The Context Sensitivity of Bounded Seas
    5.3 Bounded Seas in Parsing Expression Grammars
        5.3.1 The Water Operator
        5.3.2 The NEXT function
        5.3.3 BS-PEG analysis
    5.4 Implementation
        5.4.1 Performance
    5.5 Java Parser Case Study
        5.5.1 Without Nested Classes
        5.5.2 With Nested Classes
        5.5.3 With Return Types
        5.5.4 Performance
    5.6 Related Work
    5.7 Conclusion

6 Adaptable Parsing Strategies
    6.1 Motivating Example
        6.1.1 Composition Overhead
        6.1.2 Superfluous Intermediate Objects
        6.1.3 Backtracking Overhead
        6.1.4 Context-Sensitivity Overhead
    6.2 A Parser Combinator Compiler
        6.2.1 Adaptable Strategies
    6.3 Parser Optimizations
        6.3.1 Regular Optimizations
        6.3.2 Context-Free Optimizations
        6.3.3 Context-Sensitive Optimizations
    6.4 Performance analysis
        6.4.1 PetitParser compiler
        6.4.2 Benchmarks
        6.4.3 Parsing Strategies Impact
        6.4.4 Scanner Impact
        6.4.5 Memoization Impact
        6.4.6 Java Parsers Comparison
        6.4.7 Smalltalk Parsers Comparison
    6.5 Related Work
    6.6 Conclusion

7 Ruby Case study
    7.1 Ruby Structure
        7.1.1 The Dangling End Problem
        7.1.2 Measurements
    7.2 Ruby Method Calls
        7.2.1 Measurements
    7.3 Performance
    7.4 Conclusion

8 Conclusion

A Formal development of PEGs

B Bounded Seas Examples
    B.1 Example of Dynamic NEXT computation
    B.2 Example of Static NEXT computation
    B.3 Overlapping Seas Example

C Implementation
    C.1 Bounded seas

D Layout Sensitivity in the Wild
    D.1 Haskell
    D.2 Python
    D.3 F#
    D.4 YAML
    D.5 OCaml
    D.6 CoffeeScript
    D.7 Grace
    D.8 SRFI 49 — Indentation-Sensitive Scheme
    D.9 Elastic Tabstops

E Scanner
    E.1 Scanners in PEG-based parsers
        E.1.1 Tokens and Scannable Parsing Expressions
        E.1.2 Scannable Choices
        E.1.3 Scanner
    E.2 Regular Parsing Expressions
    E.3 Regular Parsing Expression Languages
    E.4 Finite State Automata
        E.4.1 Construction of finite state automata from regular parsing expressions (FSA)
        E.4.2 Determinization of the automata with epsilons and priorities (D)

F Measurements
    F.1 Summary
    F.2 Strategies Details
        F.2.1 Expressions
        F.2.2 IS Expressions
        F.2.3 CF Python
        F.2.4 Python
        F.2.5 Smalltalk
        F.2.6 Java
        F.2.7 Java Sea
    F.3 Scanner Impact
        F.3.1 Expressions
        F.3.2 Smalltalk
    F.4 Memoization Details
    F.5 Smalltalk Parsers
    F.6 Java Parsers

G CommonMark Grammar Definition

1 Introduction

It is widely accepted that software developers spend more time reading code than writing it [LVD06]. Reading code not only promotes program comprehension, but also helps developers understand the impact of their changes on the existing system. Nevertheless, there are numerous questions developers ask that cannot be answered simply by reading code, such as “which code implements this feature?” or “what is the impact of this change?”, and for which dedicated analyses are needed [SMDV06, NL12].

Dedicated platforms exist to model and analyze software systems, such as Moose [NDG05] and Rascal [KvdSV09]. A prerequisite for using such tools, however, is that a model importer exists for the programming language (or languages) in which the system is developed. Constructing a model importer from scratch is a major effort, and the large up-front investment hampers initiatives for many commercial tool builders [LV01, BBC+10]. Adapting an existing parser for the host language is often not an option, especially for proprietary and legacy languages, or for sources mixed from different languages.

We use the term agile modeling to refer to a set of meth