Thesis: Dynamic Generalized Parsing and Natural Mathematical Language

Total Pages: 16

File Type: PDF, Size: 1020 KB

DISSERTATION / DOCTORAL THESIS

Titel der Dissertation / Title of the Doctoral Thesis: Dynamic Generalized Parsing and Natural Mathematical Language

verfasst von / submitted by: Mag. Kevin Kofler, Bakk.

angestrebter akademischer Grad / in partial fulfilment of the requirements for the degree of: Doktor der Naturwissenschaften (Dr. rer. nat.)

Wien, 2017 / Vienna, 2017

Studienkennzahl lt. Studienblatt / degree programme code as it appears on the student record sheet: A 091 405
Dissertationsgebiet lt. Studienblatt / field of study as it appears on the student record sheet: Mathematik
Betreut von / Supervisor: Univ.-Prof. Dr. Arnold Neumaier

Abstract

This thesis introduces the dynamic generalized parser DynGenPar and its applications. The parser is aimed primarily at natural mathematical language. It was also successfully used for several formal languages. DynGenPar is available at https://www.tigen.org/kevin.kofler/fmathl/dyngenpar/. The thesis presents the algorithmic ideas behind DynGenPar, gives a short overview of the implementation, documents the applications the parser is currently used for, and presents some efficiency results. Parts of this thesis, mainly in Chapter 2, are based on the refereed conference paper Kofler & Neumaier [56].

The DynGenPar algorithm combines the efficiency of Generalized LR (GLR) parsing, the dynamic extensibility of tableless approaches, and the expressiveness of extended context-free grammars such as parallel multiple context-free grammars (PMCFGs). In particular, it supports efficient dynamic rule additions to the grammar at any moment. The algorithm is designed in a fully incremental way, allowing parsing to resume with additional tokens without restarting the parse process, and can predict possible next tokens. Additionally, it handles constraints on the token following a rule. These allow for grammatically correct English indefinite articles when working with word tokens. They can also represent typical operations for scannerless parsing, such as maximal matches, when working with character tokens.

Several successful applications of DynGenPar are documented in this thesis. DynGenPar is a core component of the Concise project, a framework for manipulating semantic information both graphically and programmatically, developed at the University of Vienna. DynGenPar is used to parse the formal languages defined by Concise, specifying type systems, programs, and record transformations from one type system to another. Other formal languages with a DynGenPar grammar are a modeling language for chemical processes, a proof-of-concept grammar for optimization problems using dynamic rule additions, and a subset of the AMPL modeling language for optimization problems, extended to also allow intervals wherever AMPL expects a number. DynGenPar can import compiled grammars from the Grammatical Framework (GF) and parse text using them. A DynGenPar grammar also exists for the controlled natural mathematical language Naproche. The use of dynamic rule additions to support mathematical definitions was implemented in a grammar as a proof of concept. There is work in progress on a grammar for the controlled natural mathematical language MathNat. Finally, there is also a well-working DynGenPar grammar for LaTeX formulas from two university-level mathematics textbooks. The long-term goal is to computerize a large library of existing mathematical knowledge using DynGenPar.
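The dynamic rule additions described above are the feature that distinguishes DynGenPar from table-driven parsers. The toy sketch below illustrates only the idea: a grammar kept as plain data can be extended between parses, so a sentence using newly defined vocabulary becomes parseable without regenerating any tables. The grammar, the `parse` function, and the example sentence are hypothetical illustrations, not DynGenPar's actual C++/Qt API, and the naive recursive matcher merely stands in for its GLR-style algorithm.

```python
# Illustrative sketch only: a toy grammar-as-data parser showing the *idea* of
# dynamic rule additions (new productions added between parses). Names and API
# are hypothetical, not DynGenPar's interface.

from typing import Dict, List, Optional, Sequence, Tuple

Grammar = Dict[str, List[List[str]]]  # nonterminal -> list of alternatives (symbol sequences)

def parse(grammar: Grammar, symbol: str, tokens: Sequence[str], pos: int = 0) -> Optional[Tuple[object, int]]:
    """Try to derive `symbol` from tokens[pos:]; return (tree, new_pos) or None."""
    if symbol not in grammar:                      # terminal: must match the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            return symbol, pos + 1
        return None
    for alternative in grammar[symbol]:            # try each production in turn
        children, p = [], pos
        for sym in alternative:
            result = parse(grammar, sym, tokens, p)
            if result is None:
                break
            child, p = result
            children.append(child)
        else:
            return (symbol, children), p
    return None

grammar: Grammar = {
    "Sentence": [["a", "Shape", "is", "a", "Shape"]],
    "Shape":    [["triangle"], ["circle"]],
}

words = "a square is a circle".split()
print(parse(grammar, "Sentence", words))   # None: "square" is unknown vocabulary

# A definition introduces new vocabulary; we extend the grammar in place and
# re-parse without rebuilding any tables.
grammar["Shape"].append(["square"])
print(parse(grammar, "Sentence", words))   # now succeeds
```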
Keywords: dynamic generalized parser, dynamic parser, tableless parser, scannerless parser, parser, parallel multiple context-free grammars, common mathematical language, natural mathematical language, controlled natural language, mathematical knowledge management, formalized mathematics, digital mathematical library

Zusammenfassung

This dissertation presents the dynamic generalized parser DynGenPar and its applications. The parser is aimed primarily at natural mathematical language. It has also been used successfully for several formal languages. DynGenPar is available at https://www.tigen.org/kevin.kofler/fmathl/dyngenpar/. The dissertation presents the algorithmic ideas behind DynGenPar, gives a short overview of the implementation, documents the applications the parser is currently used for, and presents some efficiency results. Parts of this dissertation, above all in Chapter 2, are based on the refereed conference paper Kofler & Neumaier [56].

The DynGenPar algorithm combines the efficiency of Generalized LR (GLR) parsing, the dynamic extensibility of tableless approaches, and the expressiveness of extended context-free grammars such as parallel multiple context-free grammars (PMCFGs). In particular, it supports efficiently adding rules to the grammar at any time. The algorithm is designed to be fully incremental, i.e. it allows an interrupted parse to be resumed with additional tokens without restarting the parse process, and it can predict possible next tokens. In addition, it handles constraints on the token following a rule. These allow grammatically correct English indefinite articles when working with words as tokens. They can also represent typical operations for scannerless parsing, such as finding a match of maximal length (maximal matching), when working with characters as tokens.

Several successful applications of DynGenPar are documented in this dissertation. DynGenPar is a core component of the Concise project, a framework for manipulating semantic information both graphically and programmatically, developed at the University of Vienna. DynGenPar is used to parse the formal languages defined by Concise, which specify type systems, programs, and record transformations from one type system into another. Other formal languages with a DynGenPar grammar are a modeling language for chemical processes, a proof-of-concept grammar for optimization problems that uses dynamic rule additions, and a subset of the AMPL modeling language for optimization problems, extended to also allow intervals wherever AMPL expects a number. DynGenPar can import compiled grammars from the Grammatical Framework (GF) and parse text with them. A DynGenPar grammar also exists for the controlled natural mathematical language Naproche. The use of dynamic rule additions to support mathematical definitions was implemented in a grammar as a proof of concept. A grammar for the controlled natural mathematical language MathNat is in progress. Finally, there is also a well-working DynGenPar grammar for LaTeX formulas from two university-level mathematics textbooks.
The long-term goal is to computerize a large library of existing mathematical knowledge using DynGenPar.

Schlagwörter: dynamic generalized parser, dynamic parser, tableless parser, scannerless parser, parser, parallel multiple context-free grammars, common mathematical language, natural mathematical language, controlled natural language, mathematical knowledge management, formalized mathematics, digital mathematical library

Contents

Acknowledgements
1 Introduction
  1.1 Goals
  1.2 Related Work
  1.3 Contents
2 The Dynamic Generalized Parser DynGenPar
  2.1 The DynGenPar Algorithm
    2.1.1 Design Considerations
    2.1.2 The Initial Graph
    2.1.3 Operations
    2.1.4 Example
    2.1.5 Analysis
  2.2 Implementation
    2.2.1 Technologies and API
    2.2.2 Licensing and Availability
    2.2.3 Implementation Considerations
3 DynGenPar and Concise
  3.1 Concise
  3.2 Embedding of DynGenPar into Concise
  3.3 Concise Features based on DynGenPar
    3.3.1 Type Sheets
    3.3.2 Code Sheets
    3.3.3 Record Transformations
    3.3.4 Text Views
    3.3.5 Parsable File Views
4 Applications of DynGenPar to Formal Languages
  4.1 A Grammar for Type Sheets
  4.2 A Grammar for Code Sheets
    4.2.1 Elementary Acts
    4.2.2 Code Sheets
  4.3 A Grammar for Record Transformation Sheets
    4.3.1 Basics
    4.3.2 Document Structure
    4.3.3 Variables
    4.3.4 Expressions
    4.3.5 Path and Path Set Matches
    4.3.6 Command Blocks
    4.3.7 Command Lines
  4.4 A Grammar for Chemical Process Modeling
  4.5 An Extensible Grammar for Optimization Problems
  4.6 A Grammar for Robust AMPL
5 DynGenPar and the Grammatical Framework (GF)
  5.1 C Bindings for the GF Haskell Runtime
  5.2 PGF (Portable Grammar Format) File Import
  5.3 PgfTokenSource – A GF-compatible Lexer
  5.4 PGF GUI – Graphical Demo Application for Prediction on PGF Grammars
  5.5 A GF Application Grammar for Mathematical Language
6 Applications of DynGenPar to Natural Language
  6.1 Naproche Parser using DynGenPar
  6.2 The TextDocument Toolchain
  6.3 BasicDefinitions – A Proof of Concept for Dynamic Rule Additions through Mathematical Definitions
  6.4 BasicReasoning – A Concise Grammar based on MathNat
  6.5 A Grammar for LaTeX Formulas
7 Conclusion: Performance Results, Achievements, Extensions
  7.1 Performance Benchmark
    7.1.1 Benchmark of DynGenPar vs. Bison on the Naproche Grammar
    7.1.2 Benchmark of DynGenPar vs. the Grammatical
Recommended publications
  • Adaptive LL(*) Parsing: the Power of Dynamic Analysis
Adaptive LL(*) Parsing: The Power of Dynamic Analysis. Terence Parr (University of San Francisco), Sam Harwell (University of Texas at Austin), Kathleen Fisher (Tufts University).

Abstract. Despite the advances made by modern parsing strategies such as PEG, LL(*), GLR, and GLL, parsing is not a solved problem. Existing approaches suffer from a number of weaknesses, including difficulties supporting side-effecting embedded actions, slow and/or unpredictable performance, and counter-intuitive matching strategies. This paper introduces the ALL(*) parsing strategy that combines the simplicity, efficiency, and predictability of conventional top-down LL(k) parsers with the power of a GLR-like mechanism to make parsing decisions. The critical innovation is to move grammar analysis to parse-time, which lets ALL(*) handle any non-left-recursive context-free grammar. ALL(*) is O(n^4) in theory but consistently performs linearly on grammars used in practice, outperforming …

PEGs are unambiguous by definition but have a quirk where rule A → a | ab (meaning "A matches either a or ab") can never match ab, since PEGs choose the first alternative that matches a prefix of the remaining input. Nested backtracking makes debugging PEGs difficult. Second, side-effecting programmer-supplied actions (mutators) like print statements should be avoided in any strategy that continuously speculates (PEG) or supports multiple interpretations of the input (GLL and GLR), because such actions may never really take place [17]. (Though DParser [24] supports "final" actions when the programmer is certain a reduction is part of an unambiguous final parse.) Without side effects, actions must buffer data for all interpretations in immutable data structures or provide undo actions.
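The ordered-choice quirk mentioned in the excerpt is easy to reproduce. The following minimal PEG-style recognizer is a generic illustration, not code from the paper: the rule A <- 'a' / 'ab' commits to the first matching alternative, so the input "ab" is never matched in full, whereas a context-free rule A → a | ab would derive it.

```python
# Minimal PEG-style recognizer demonstrating ordered choice.

def match_a(s: str, pos: int):
    return pos + 1 if s[pos:pos + 1] == "a" else None

def match_ab(s: str, pos: int):
    return pos + 2 if s[pos:pos + 2] == "ab" else None

def A(s: str, pos: int):
    """PEG rule A <- 'a' / 'ab': commit to the first alternative that matches."""
    for alternative in (match_a, match_ab):
        end = alternative(s, pos)
        if end is not None:
            return end          # no reconsideration once an alternative succeeds
    return None

def parses_completely(s: str) -> bool:
    return A(s, 0) == len(s)

print(parses_completely("a"))    # True
print(parses_completely("ab"))   # False: 'a' matches first, leaving "b" unconsumed
```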
  • Disambiguation Filters for Scannerless Generalized LR Parsers
Disambiguation Filters for Scannerless Generalized LR Parsers. Mark G. J. van den Brand (CWI, Amsterdam; LORIA-INRIA, Villers-lès-Nancy), Jeroen Scheerder (Department of Philosophy, Utrecht University), Jurgen J. Vinju (CWI, Amsterdam), and Eelco Visser (Institute of Information and Computing Sciences, Utrecht University).

Abstract. In this paper we present the fusion of generalized LR parsing and scannerless parsing. This combination supports syntax definitions in which all aspects (lexical and context-free) of the syntax of a language are defined explicitly in one formalism. Furthermore, there are no restrictions on the class of grammars, thus allowing a natural syntax tree structure. Ambiguities that arise through the use of unrestricted grammars are handled by explicit disambiguation constructs, instead of implicit defaults that are taken by traditional scanner and parser generators. Hence, a syntax definition becomes a full declarative description of a language. Scannerless generalized LR parsing is a viable technique that has been applied in various industrial and academic projects.

1 Introduction. Since the introduction of efficient deterministic parsing techniques, parsing is considered a closed topic for research, both by computer scientists and by practitioners in compiler construction. Tools based on deterministic parsing algorithms such as LEX & YACC [15,11] (LALR) and JavaCC (recursive descent) are considered adequate for dealing with almost all modern (programming) languages.
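To make the idea of explicit disambiguation constructs concrete, here is a deliberately tiny sketch: all character-level segmentations of an input are enumerated, and declarative filters (a reject rule and a crude longest-match preference) pick the intended one. The token kinds, the filters, and the input are illustrative assumptions; this is not SDF/SGLR code.

```python
# Toy illustration of declarative disambiguation filters over character-level
# (scannerless) readings of the input "ifx".

import re
from typing import List, Tuple

Token = Tuple[str, str]   # (kind, text)

def readings(text: str) -> List[List[Token]]:
    """Enumerate every way to segment `text` into keyword/identifier tokens."""
    if not text:
        return [[]]
    result: List[List[Token]] = []
    for end in range(1, len(text) + 1):
        chunk = text[:end]
        kinds = []
        if chunk == "if":
            kinds.append("keyword")
        if re.fullmatch(r"[a-z]+", chunk):
            kinds.append("identifier")
        for kind in kinds:
            for rest in readings(text[end:]):
                result.append([(kind, chunk)] + rest)
    return result

def reject_if_identifier(rs):
    """Reject filter: an identifier may not be spelled like the keyword 'if'."""
    return [r for r in rs if not any(k == "identifier" and t == "if" for k, t in r)]

def prefer_longest_match(rs):
    """Crude longest-match proxy: keep the segmentations with the fewest tokens."""
    fewest = min(len(r) for r in rs)
    return [r for r in rs if len(r) == fewest]

rs = readings("ifx")
print(len(rs))                                            # several readings of the same characters
print(prefer_longest_match(reject_if_identifier(rs)))     # [[('identifier', 'ifx')]]
```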
  • Introduction to Programming Languages and Syntax
Programming Languages, Session 1 – Main Theme: Programming Languages Overview & Syntax. Dr. Jean-Claude Franchitti, New York University, Computer Science Department, Courant Institute of Mathematical Sciences. Adapted from the course textbook resources: Programming Language Pragmatics (3rd Edition), Michael L. Scott, Copyright © 2009 Elsevier.

Agenda: 1. Instructor and Course Introduction; 2. Introduction to Programming Languages; 3. Programming Language Syntax; 4. Conclusion.

Who am I? – Profile: 27 years of experience in the Information Technology industry, including thirteen years of experience working for leading IT consulting firms such as Computer Sciences Corporation; PhD in Computer Science from the University of Colorado at Boulder; past CEO and CTO; held senior management and technical leadership roles in many large IT strategy and modernization projects for Fortune 500 corporations in the insurance, banking, investment banking, pharmaceutical, retail, and information management industries; contributed to several high-profile ARPA and NSF research projects; played an active role as a member of the OMG, ODMG, and X3H2 standards committees and as a Professor of Computer Science at Columbia initially and at New York University since 1997; proven record of delivering business solutions on time and on budget; original designer and developer of jcrew.com and the suite of products now known as IBM InfoSphere DataStage; creator of the Enterprise …
  • Packrat Parsing: Simple, Powerful, Lazy, Linear Time
Packrat Parsing: Simple, Powerful, Lazy, Linear Time (Functional Pearl). Bryan Ford, Massachusetts Institute of Technology, Cambridge, MA.

Abstract. Packrat parsing is a novel technique for implementing parsers in a lazy functional programming language. A packrat parser provides the power and flexibility of top-down parsing with backtracking and unlimited lookahead, but nevertheless guarantees linear parse time. Any language defined by an LL(k) or LR(k) grammar can be recognized by a packrat parser, in addition to many languages that conventional linear-time algorithms do not support. This additional power simplifies the handling of common syntactic idioms such as the widespread but troublesome longest-match rule, enables the use of sophisticated disambiguation strategies such as syntactic and semantic predicates, provides better grammar composition properties, and allows lexical analysis to be integrated seamlessly into parsing.

1 Introduction. There are many ways to implement a parser in a functional programming language. The simplest and most direct approach is top-down or recursive descent parsing, in which the components of a language grammar are translated more-or-less directly into a set of mutually recursive functions. Top-down parsers can in turn be divided into two categories. Predictive parsers attempt to predict what type of language construct to expect at a given point by "looking ahead" a limited number of symbols in the input stream. Backtracking parsers instead make decisions speculatively by trying different alternatives in succession: if one alternative fails to match, then the parser "backtracks" to the original input position and tries another.
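The core of packrat parsing is memoizing a backtracking recursive-descent parser on (rule, input position), which is what yields the linear-time guarantee. Below is a minimal Python sketch of that idea with a toy grammar of our own choosing, not code from the paper.

```python
# A tiny packrat-flavored parser: ordinary recursive descent plus a memo table
# keyed by (rule, position), so each rule is evaluated at most once per input
# position.

from functools import lru_cache

TEXT = "1+1+1+1+1"

@lru_cache(maxsize=None)
def parse_digit(pos: int):
    """Digit <- [0-9]  — returns end position or None."""
    return pos + 1 if TEXT[pos:pos + 1].isdigit() else None

@lru_cache(maxsize=None)
def parse_sum(pos: int):
    """Sum <- Digit ('+' Sum)?  — returns end position or None."""
    end = parse_digit(pos)
    if end is None:
        return None
    if TEXT[end:end + 1] == "+":
        rest = parse_sum(end + 1)
        if rest is not None:
            return rest
    return end

print(parse_sum(0) == len(TEXT))   # True; each (rule, position) pair was computed at most once
```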
  • Compiler Construction
Compiler construction. PDF generated using the open-source mwlib toolkit (see http://code.pediapress.com/ for more information); generated on Sat, 10 Dec 2011 02:23:02 UTC.

Contents: Introduction; Compiler construction; Compiler; Interpreter; History of compiler writing; Lexical analysis; Regular expression; Regular expression examples; Finite-state machine; Preprocessor; Syntactic analysis; Parsing; Lookahead; Symbol table; Abstract syntax; Abstract syntax tree; Context-free grammar; Terminal and nonterminal symbols; Left recursion; Backus–Naur Form; Extended Backus–Naur Form; TBNF; Top-down parsing; Recursive descent parser; Tail recursive parser; Parsing expression grammar; LL parser; LR parser; Parsing table; Simple LR parser; Canonical LR parser; GLR parser; LALR parser; Recursive ascent parser; Parser combinator; Bottom-up parsing; Chomsky normal form; CYK algorithm; Simple precedence grammar; Simple precedence parser; Operator-precedence grammar; Operator-precedence parser; Shunting-yard algorithm; Chart parser; Earley parser; The lexer hack; Scannerless parsing; Semantic analysis; Attribute grammar; L-attributed grammar; LR-attributed grammar; S-attributed grammar; ECLR-attributed grammar; Intermediate language; Control flow graph; Basic block; Call graph; Data-flow analysis; Use-define chain; Live variable analysis; Reaching definition; Three address …
  • Incremental Scannerless Generalized LR Parsing
SPLASH: G: Incremental Scannerless Generalized LR Parsing. Maarten P. Sijm, Delft University of Technology, Delft, The Netherlands.

Abstract. We present the Incremental Scannerless Generalized LR (ISGLR) parsing algorithm, which combines the benefits of Incremental Generalized LR (IGLR) parsing and Scannerless Generalized LR (SGLR) parsing. The ISGLR parser can reuse parse trees from unchanged regions in the input and thus only needs to parse changed regions. We also present incremental techniques for imploding the parse tree to an Abstract Syntax Tree (AST) and syntax highlighting. Scannerless parsing relies heavily on non-determinism during parsing, negatively impacting the incrementality of ISGLR parsing. We evaluated the ISGLR parsing algorithm using file histories from Git, achieving a speedup of up to 25 times over non-incremental SGLR.

… a separate lexing (or scanning) phase, supports modelling the entire language syntax in one single grammar, and allows composition of grammars for different languages. One notable disadvantage is that the SGLR parsing algorithm of Visser [9] is a batch algorithm, meaning that it must process each entire file in one pass. This becomes a problem for software projects that have large files, as every small change requires the entire file to be parsed again. Incremental Generalized LR (IGLR) parsing is an improvement over batch Generalized LR (GLR) parsing. Amongst others, Wagner [10] and TreeSitter [8] have created parsing algorithms that allow rapid parsing of changes to large files. However, these algorithms use a separate incremental lexical analysis phase which complicates the implementation of incremental parsing [11] and does not directly allow language composition.
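The reuse criterion that incremental parsing relies on can be stated in a few lines: a subtree built for the previous version of the input can be kept if its span does not overlap the edited region. The sketch below is a simplified illustration of that criterion only, not the ISGLR algorithm; the node type, spans, and example input are assumptions.

```python
# Sketch of the subtree-reuse criterion behind incremental parsing.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    start: int          # character span in the *old* input
    end: int
    children: List["Node"] = field(default_factory=list)

def reusable(node: Node, edit_start: int, edit_end: int) -> bool:
    """True if the node's span lies entirely before or after the edited region."""
    return node.end <= edit_start or node.start >= edit_end

def collect_reusable(node: Node, edit_start: int, edit_end: int, out: List[Node]) -> None:
    if reusable(node, edit_start, edit_end):
        out.append(node)                     # keep the whole subtree, no re-parse needed
    else:
        for child in node.children:          # descend only into damaged regions
            collect_reusable(child, edit_start, edit_end, out)

# Old parse of "int x; int y;" with an edit touching characters 4..5 (the name "x").
tree = Node("Program", 0, 13, [
    Node("Decl", 0, 7, [Node("Type", 0, 3), Node("Name", 4, 5)]),
    Node("Decl", 7, 13),
])
kept: List[Node] = []
collect_reusable(tree, 4, 5, kept)
print([n.label for n in kept])   # ['Type', 'Decl'] — only the edited part must be re-parsed
```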
  • Faster Scannerless GLR Parsing
Faster Scannerless GLR Parsing. Giorgios Economopoulos, Paul Klint, Jurgen Vinju. Centrum voor Wiskunde en Informatica (CWI), Kruislaan 413, 1098 SJ Amsterdam, The Netherlands.

Abstract. Analysis and renovation of large software portfolios requires syntax analysis of multiple, usually embedded, languages and this is beyond the capabilities of many standard parsing techniques. The traditional separation between lexer and parser falls short due to the limitations of tokenization based on regular expressions when handling multiple lexical grammars. In such cases scannerless parsing provides a viable solution. It uses the power of context-free grammars to be able to deal with a wide variety of issues in parsing lexical syntax. However, it comes at the price of less efficiency. The structure of tokens is obtained using a more powerful but more time and memory intensive parsing algorithm. Scannerless grammars are also more non-deterministic than their tokenized counterparts, increasing the burden on the parsing algorithm even further. In this paper we investigate the application of the Right-Nulled Generalized LR parsing algorithm (RNGLR) to scannerless parsing. We adapt the Scannerless Generalized LR parsing and filtering algorithm (SGLR) to implement the optimizations of RNGLR. We present an updated parsing and filtering algorithm, called SRNGLR, and analyze its performance in comparison to SGLR on ambiguous grammars for the programming languages C, Java, Python, SASL, and C++. Measurements show that SRNGLR is on average 33% faster than SGLR, but is 95% faster on the highly ambiguous SASL grammar. For the mainstream languages C, C++, Java and Python the average speedup is 16%.
  • Mining Structured Data in Natural Language Artifacts with Island Parsing
Mining structured data in natural language artifacts with island parsing. Alberto Bacchelli (Department of Informatics, University of Zurich, Switzerland), Andrea Mocci (REVEAL @ Software Institute, Università della Svizzera italiana (USI), Switzerland), Anthony Cleve (Faculty of Informatics, University of Namur, Belgium), Michele Lanza (REVEAL @ Software Institute, USI, Switzerland). Science of Computer Programming 150 (2017) 31–55.

Article history: received 11 December 2015; received in revised form 18 November 2016; accepted 13 June 2017; available online 1 August 2017. Keywords: mining software repositories, unstructured data, island parsing.

Abstract. Software repositories typically store data composed of structured and unstructured parts. Researchers mine this data to empirically validate research ideas and to support practitioners' activities. Structured data (e.g., source code) has a formal syntax and is straightforward to analyze; unstructured data (e.g., documentation) is a mix of natural language, noise, and snippets of structured data, and it is harder to analyze. Especially the structured content (e.g., code snippets) in unstructured data contains valuable information. Researchers have proposed several approaches to recognize, extract, and analyze structured data embedded in natural language. We analyze these approaches and investigate their drawbacks. Subsequently, we present two novel methods, based on scannerless generalized LR (SGLR) and Parsing Expression Grammars (PEGs), to address these drawbacks and to mine structured fragments within unstructured data. We validate and compare these approaches on development emails and Stack Overflow posts with Java code fragments.
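The island-parsing idea, a precise grammar for the fragments of interest ("islands") and a catch-all for the surrounding "water", can be caricatured in a few lines. The sketch below uses a single regular expression for Java-like call statements as its island pattern; it is purely illustrative and unrelated to the PEG/SGLR grammars evaluated in the paper.

```python
# Toy "island parser": Java-ish call statements are islands, everything else is water.

import re
from typing import List, Tuple

ISLAND = re.compile(r"[A-Za-z_][\w.]*\([^()]*\)\s*;")   # e.g. list.sort();

def islands_and_water(text: str) -> List[Tuple[str, str]]:
    """Split text into ('island', code) and ('water', prose) segments."""
    segments, last = [], 0
    for m in ISLAND.finditer(text):
        if m.start() > last:
            segments.append(("water", text[last:m.start()]))
        segments.append(("island", m.group(0)))
        last = m.end()
    if last < len(text):
        segments.append(("water", text[last:]))
    return segments

email = "You should just call list.sort(); before returning, that fixes the bug."
for kind, chunk in islands_and_water(email):
    print(kind, repr(chunk))
```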
  • LNCS 9031, Pp
Faster, Practical GLL Parsing. Ali Afroozeh and Anastasia Izmaylova. Centrum Wiskunde & Informatica, 1098 XG Amsterdam, The Netherlands.

Abstract. Generalized LL (GLL) parsing is an extension of recursive-descent (RD) parsing that supports all context-free grammars in cubic time and space. GLL parsers have the direct relationship with the grammar that RD parsers have, and therefore, compared to GLR, are easier to understand, debug, and extend. This makes GLL parsing attractive for parsing programming languages. In this paper we propose a more efficient Graph-Structured Stack (GSS) for GLL parsing that leads to significant performance improvement. We also discuss a number of optimizations that further improve the performance of GLL. Finally, for practical scannerless parsing of programming languages, we show how common lexical disambiguation filters can be integrated in GLL parsing. Our new formulation of GLL parsing is implemented as part of the Iguana parsing framework. We evaluate the effectiveness of our approach using a highly-ambiguous grammar and grammars of real programming languages. Our results, compared to the original GLL, show a speedup factor of 10 on the highly-ambiguous grammar, and a speedup factor of 1.5, 1.7, and 5.2 on the grammars of Java, C#, and OCaml, respectively.

1 Introduction. Developing efficient parsers for programming languages is a difficult task that is usually automated by a parser generator. Since Knuth's seminal paper [1] on LR parsing, and DeRemer's work on practical LR parsing (LALR) [2], parsers of many major programming languages have been constructed using LALR parser generators such as Yacc [3].
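The data structure this work optimizes, the graph-structured stack, can be sketched independently of a full GLL parser: stack nodes are keyed so that alternative derivations reaching the same point share a single node, and popping a shared node resumes every merged path. The slot labels and positions below are made-up examples, and the class is a simplified illustration, not the Iguana implementation.

```python
# Minimal sketch of a graph-structured stack (GSS): nodes identified by
# (grammar slot, input position) are shared across nondeterministic paths.

from collections import defaultdict
from typing import Dict, Set, Tuple

GssNode = Tuple[str, int]          # (grammar slot label, input position)

class GSS:
    def __init__(self) -> None:
        self.edges: Dict[GssNode, Set[GssNode]] = defaultdict(set)

    def push(self, slot: str, pos: int, parent: GssNode) -> GssNode:
        node = (slot, pos)          # same (slot, pos) from different paths -> same node
        self.edges[node].add(parent)
        return node

    def pop(self, node: GssNode) -> Set[GssNode]:
        """Return all parents; popping a shared node resumes every merged path."""
        return self.edges[node]

gss = GSS()
root = ("$", 0)
# Two nondeterministic alternatives both call nonterminal E at input position 3:
a = gss.push("A ::= a . E b", 3, root)
b = gss.push("A ::= a . E c", 3, root)
shared = gss.push("E ::= . E + T", 3, a)
shared = gss.push("E ::= . E + T", 3, b)   # merged with the node created just above
print(gss.pop(shared))                     # both callers are resumed: {('A ::= a . E b', 3), ('A ::= a . E c', 3)}
```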
  • Chapter 5 Disambiguation Strategies
Eindhoven University of Technology, Master's thesis: Disambiguation mechanisms and disambiguation strategies. Alex P. ten Brink. Award date: 2013. Supervisor: prof. dr. Mark van den Brand (TU/e). Section: Model Driven Software Engineering. Date: August 23, 2013. Department of Mathematics and Computer Science, Den Dolech 2, 5612 AZ Eindhoven, The Netherlands, www.tue.nl.

Abstract. It is common practice for engineers to express artifacts such as domain-specific languages using ambiguous context-free grammars along with separate disambiguation rules, rather than to use equivalent unambiguous context-free grammars, as these ambiguous grammars often capture the intent of the engineer more directly.
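A small generic example of the thesis topic (not taken from the thesis itself): an intentionally ambiguous expression grammar produces several parse trees, and a disambiguation rule stated separately from the grammar, here left-associativity of '-', selects the intended one.

```python
# Ambiguous grammar E ::= E '-' E | number, plus a separate disambiguation rule.

from typing import List, Union

Tree = Union[str, tuple]   # either a number token or ("-", left, right)

def all_parses(tokens: List[str]) -> List[Tree]:
    """Every parse tree of the ambiguous grammar for the given token list."""
    if len(tokens) == 1:
        return [tokens[0]]
    parses = []
    for i in range(1, len(tokens), 2):       # split at every '-' operator
        if tokens[i] == "-":
            for left in all_parses(tokens[:i]):
                for right in all_parses(tokens[i + 1:]):
                    parses.append(("-", left, right))
    return parses

def left_associative(tree: Tree) -> bool:
    """Disambiguation rule, kept outside the grammar: '-' may not have '-' as its right child."""
    if isinstance(tree, str):
        return True
    _, left, right = tree
    if isinstance(right, tuple):             # reject (a - (b - c))
        return False
    return left_associative(left)

trees = all_parses("7 - 2 - 1".split())
print(len(trees))                                    # 2 parses from the ambiguous grammar
print([t for t in trees if left_associative(t)])     # [('-', ('-', '7', '2'), '1')]
```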
  • One Parser to Rule Them All
One Parser to Rule Them All. Ali Afroozeh, Anastasia Izmaylova. Centrum Wiskunde & Informatica, Amsterdam, The Netherlands.

Abstract. Despite the long history of research in parsing, constructing parsers for real programming languages remains a difficult and painful task. In the last decades, different parser generators emerged to allow the construction of parsers from a BNF-like specification. However, still today, many parsers are handwritten, or are only partly generated, and include various hacks to deal with different peculiarities in programming languages. The main problem is that current declarative syntax definition techniques are based on pure context-free grammars, while many constructs found in programming languages require context information. In this paper we propose a parsing framework that embraces context information in its core. Our framework is …

1. Introduction. Parsing is a well-researched topic in computer science, and it is common to hear from fellow researchers in the field of programming languages that parsing is a solved problem. This statement mostly originates from the success of Yacc [18] and its underlying theory that has been developed in the 70s. Since Knuth's seminal paper on LR parsing [25], and DeRemer's work on practical LR parsing (LALR) [6], there is a linear parsing technique that covers most syntactic constructs in programming languages. Yacc, and its various ports to other languages, enabled the generation of efficient parsers from a BNF specification. Still, research papers and tools on parsing in the last four decades show an ongoing effort to develop new parsing techniques.
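A miniature example of the context dependence the paper targets (a generic illustration, not the proposed framework's API): whether `T * x;` is a pointer declaration or a multiplication depends on a symbol table, i.e. on context that a pure context-free grammar cannot carry.

```python
# The classic typedef ambiguity in miniature: the same statement parses
# differently depending on contextual information (a set of known type names).

from typing import Set

def classify(stmt: str, typedefs: Set[str]) -> str:
    name, star, rest = stmt.replace(";", "").split(maxsplit=2)
    assert star == "*"
    if name in typedefs:
        return f"declaration of pointer variable '{rest}' with type '{name}'"
    return f"expression: '{name}' multiplied by '{rest}'"

print(classify("T * x;", typedefs={"T"}))   # declaration of pointer variable 'x' with type 'T'
print(classify("T * x;", typedefs=set()))   # expression: 'T' multiplied by 'x'
```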
  • Semantics and Algorithms for Data-Dependent Grammars
Semantics and Algorithms for Data-dependent Grammars. Trevor Jim (AT&T Labs-Research), Yitzhak Mandelbaum (AT&T Labs-Research), David Walker (Princeton University).

Abstract. Traditional parser generation technologies are incapable of handling the demands of modern programmers. In this paper, we present the design and theory of a new parsing engine, YAKKER, capable of handling the requirements of modern applications including (1) full scannerless context-free grammars with (2) regular expressions as right-hand sides for defining nonterminals. YAKKER also includes (3) facilities for binding variables to intermediate parse results and (4) using such bindings within arbitrary constraints to control parsing. These facilities allow the kind of data-dependent parsing commonly needed in systems applications, particularly those that operate over binary data. In addition, (5) nonterminals may be parameterized by arbitrary values, which gives the system good modularity and abstraction properties in the …

… modern times, programmer convenience and expressiveness were sacrificed for performance. Since then, for the most part, PL researchers have hung on to legacy tools because they are well-known, well-supported, taught in school and universally available, rather than because they are optimally designed. On the other hand, programmers outside the minuscule world of PL implementors almost never use parser generators. Despite the fact that they are constantly writing parsers—for data formats, networking protocols, configuration files, web scraping and small domain-specific languages—they do most of it by hand, often using regular expression libraries that support context-sensitive features like backreferencing and POSIX anchors. This is not because they are unaware of PL tools, but rather because these tools do not solve their problems.
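The variable-binding facility described in points (3) and (4) is easiest to see on length-prefixed data: a value parsed earlier constrains how much input a later nonterminal may consume, which a plain context-free rule cannot express. The function below is toy Python, not Yakker's grammar notation.

```python
# A data-dependent rule in miniature:
#   record ::= n=uint8 payload:bytes[n]
# The value bound to n is used as a constraint on the rest of the parse.

def parse_length_prefixed(data: bytes):
    if not data:
        raise ValueError("empty input")
    n = data[0]                       # bind n to the parsed length byte
    payload = data[1:1 + n]
    if len(payload) != n:             # the binding constrains the remaining parse
        raise ValueError(f"expected {n} payload bytes, got {len(payload)}")
    rest = data[1 + n:]
    return {"length": n, "payload": payload}, rest

record, rest = parse_length_prefixed(b"\x05helloworld")
print(record)   # {'length': 5, 'payload': b'hello'}
print(rest)     # b'world'
```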