DISSERTATION / DOCTORAL THESIS

Titel der Dissertation / Title of the Doctoral Thesis:
Dynamic Generalized and Natural Mathematical Language

verfasst von / submitted by Mag. Kevin Kofler, Bakk.

angestrebter akademischer Grad / in partial fulfilment of the requirements for the degree of
Doktor der Naturwissenschaften (Dr. rer. nat.)

Wien, 2017 / Vienna, 2017

Studienkennzahl lt. Studienblatt / degree programme code as it appears on the student record sheet: A 091 405

Dissertationsgebiet lt. Studienblatt / field of study as it appears on the student record sheet: Mathematik

Betreut von / Supervisor: Univ.-Prof. Dr. Arnold Neumaier

Abstract

This thesis introduces the dynamic generalized parser DynGenPar and its applications. The parser is aimed primarily at natural mathematical language, but has also been used successfully for several formal languages. DynGenPar is available at https://www.tigen.org/kevin.kofler/fmathl/dyngenpar/. The thesis presents the algorithmic ideas behind DynGenPar, gives a short overview of the implementation, documents the applications the parser is currently used for, and presents some efficiency results. Parts of this thesis, mainly in Chapter 2, are based on the refereed conference paper Kofler & Neumaier [56].

The DynGenPar algorithm combines the efficiency of Generalized LR (GLR) parsing, the dynamic extensibility of tableless approaches, and the expressiveness of extended context-free grammars such as parallel multiple context-free grammars (PMCFGs). In particular, it supports efficient dynamic rule additions to the grammar at any moment. The algorithm is designed in a fully incremental way, allowing parsing to be resumed with additional tokens without restarting the parse process, and can predict possible next tokens. Additionally, it handles constraints on the token following a rule. These allow for grammatically correct English indefinite articles when working with word tokens. They can also represent typical operations for scannerless parsing, such as maximal matches, when working with character tokens.

Several successful applications of DynGenPar are documented in this thesis. DynGenPar is a core component of the Concise project, a framework for manipulating semantic information both graphically and programmatically, developed at the University of Vienna. DynGenPar is used to parse the formal languages defined by Concise, specifying type systems, programs, and record transformations from one type system to another.
Other formal languages with a DynGenPar grammar are a modeling language for chemical processes, a proof-of-concept grammar for optimization problems using dynamic rule additions, and a subset of the AMPL modeling language for optimization problems, extended to also allow intervals wherever AMPL expects a number. DynGenPar can import compiled grammars from the Grammatical Framework (GF) and parse text using them. A DynGenPar grammar also exists for the controlled natural mathematical language Naproche. The use of dynamic rule additions to support mathematical definitions was implemented in a grammar as a proof of concept. There is work in progress on a grammar for the controlled natural mathematical language MathNat. Finally, there is also a well-working DynGenPar grammar for LaTeX formulas from two university-level mathematics textbooks. The long-term goal is to computerize a large library of existing mathematical knowledge using DynGenPar.

Keywords: dynamic generalized parser, dynamic parser, tableless parser, scannerless parser, parser, parallel multiple context-free grammars, common mathematical language, natural mathematical language, controlled natural language, mathematical knowledge management, formalized mathematics, digital mathematical library


Zusammenfassung

Diese Dissertation stellt den dynamischen verallgemeinerten Parser DynGenPar und seine Anwendungen vor. Der Parser zielt hauptsächlich auf natürliche mathematische Sprache ab. Er wurde auch erfolgreich für mehrere formale Sprachen verwendet. DynGenPar ist auf https://www.tigen.org/kevin.kofler/fmathl/dyngenpar/ verfügbar. Die Dissertation präsentiert die algorithmischen Ideen hinter DynGenPar, gibt eine kurze Übersicht über die Implementierung, dokumentiert die Applikationen, für die der Parser derzeit verwendet wird, und präsentiert einige Effizienzergebnisse. Teile dieser Dissertation, vor allem in Kapitel 2, basieren auf dem referierten Konferenzartikel Kofler & Neumaier [56].

Der DynGenPar-Algorithmus kombiniert die Effizienz verallgemeinerten LR-Parsens (GLR), die dynamische Erweiterbarkeit tabellenloser Ansätze, sowie die Ausdrucksstärke erweiterter kontextfreier Grammatiken wie paralleler mehrfacher kontextfreier Grammatiken (PMCFG). Insbesondere unterstützt er das effiziente Hinzufügen von Regeln zur Grammatik zu jedem Zeitpunkt. Der Algorithmus ist komplett inkrementell konzipiert, erlaubt also, einen unterbrochenen Parsingvorgang mit zusätzlichen Tokens fortzusetzen, ohne den Parsingvorgang neu zu starten, und kann mögliche nächste Tokens vorhersagen. Zusätzlich behandelt er vorgegebene Einschränkungen (constraints) für den auf eine Regel folgenden Token. Diese erlauben grammatikalisch korrekte englische indefinite Artikel, wenn mit Wörtern als Tokens gearbeitet wird. Sie können auch typische Operationen für scannerloses Parsen wie etwa das Finden einer Übereinstimmung maximaler Länge (maximales Matchen) darstellen, wenn mit Zeichen als Tokens gearbeitet wird.

Mehrere erfolgreiche Anwendungen von DynGenPar sind in dieser Dissertation dokumentiert.
DynGenPar ist eine Kernkomponente des Concise-Projekts, eines Frameworks zum Manipulieren semantischer Information sowohl in graphischer als auch in programmatischer Form, entwickelt an der Universität Wien. DynGenPar wird verwendet, um die von Concise definierten formalen Sprachen zu parsen, die Typsysteme, Programme, sowie Recordumwandlungen von einem Typsystem in ein anderes spezifizieren. Andere formale Sprachen mit einer DynGenPar-Grammatik sind eine Modellierungssprache für chemische Prozesse, eine als Machbarkeitsbeweis dienende, dynamisches Hinzufügen von Regeln verwendende Grammatik für Optimierungsprobleme, sowie eine Teilmenge der Modellierungssprache für Optimierungsprobleme AMPL, erweitert, um auch Intervalle zu erlauben, wo immer AMPL eine Zahl erwartet. DynGenPar kann kompilierte Grammatiken des Grammatical Framework (GF) importieren und Text mit ihnen parsen. Eine DynGenPar-Grammatik existiert auch für die kontrollierte natürliche mathematische Sprache Naproche. Die Benutzung dynamischen Hinzufügens von Regeln, um mathematische Definitionen zu unterstützen, wurde in einer Grammatik als Machbarkeitsbeweis implementiert. Eine Grammatik für die kontrollierte natürliche mathematische Sprache MathNat ist in Arbeit. Schließlich gibt es auch eine gut funktionierende DynGenPar-Grammatik für LaTeX-Formeln aus zwei Mathematik-Textbüchern auf Universitätsniveau. Das langfristige Ziel ist, eine große Bibliothek bestehenden mathematischen Wissens mittels DynGenPar zu computerisieren.

Schlagwörter: dynamischer verallgemeinerter Parser, dynamischer Parser, tabellenloser Parser, scannerloser Parser, Parser, parallele mehrfache kontextfreie Grammatiken, gewöhnliche mathematische Sprache, natürliche mathematische Sprache, kontrollierte natürliche Sprache, mathematische Wissensverwaltung, formalisierte Mathematik, digitale mathematische Bibliothek

Contents

Acknowledgements

1 Introduction
   1.1 Goals
   1.2 Related Work
   1.3 Contents

2 The Dynamic Generalized Parser DynGenPar
   2.1 The DynGenPar Algorithm
      2.1.1 Design Considerations
      2.1.2 The Initial Graph
      2.1.3 Operations
      2.1.4 Example
      2.1.5 Analysis
   2.2 Implementation
      2.2.1 Technologies and API
      2.2.2 Licensing and Availability
      2.2.3 Implementation Considerations

3 DynGenPar and Concise
   3.1 Concise
   3.2 Embedding of DynGenPar into Concise
   3.3 Concise Features based on DynGenPar
      3.3.1 Type Sheets
      3.3.2 Code Sheets
      3.3.3 Record Transformations


      3.3.4 Text Views
      3.3.5 Parsable File Views

4 Applications of DynGenPar to Formal Languages
   4.1 A Grammar for Type Sheets
   4.2 A Grammar for Code Sheets
      4.2.1 Elementary Acts
      4.2.2 Code Sheets
   4.3 A Grammar for Record Transformation Sheets
      4.3.1 Basics
      4.3.2 Document Structure
      4.3.3 Variables
      4.3.4 Expressions
      4.3.5 Path and Path Set Matches
      4.3.6 Command Blocks
      4.3.7 Command Lines
   4.4 A Grammar for Chemical Process Modeling
   4.5 An Extensible Grammar for Optimization Problems
   4.6 A Grammar for Robust AMPL

5 DynGenPar and the Grammatical Framework (GF)
   5.1 C Bindings for the GF Haskell Runtime
   5.2 PGF (Portable Grammar Format) File Import
   5.3 PgfTokenSource – A GF-compatible Lexer
   5.4 PGF GUI – Graphical Demo Application for Prediction on PGF Grammars
   5.5 A GF Application Grammar for Mathematical Language

6 Applications of DynGenPar to Natural Language
   6.1 Naproche Parser using DynGenPar
   6.2 The TextDocument Toolchain
   6.3 BasicDefinitions – A Proof of Concept for Dynamic Rule Additions through Mathematical Definitions
   6.4 BasicReasoning – A Concise Grammar based on MathNat
   6.5 A Grammar for LaTeX Formulas

7 Conclusion: Performance Results, Achievements, Extensions
   7.1 Performance Benchmark
      7.1.1 Benchmark of DynGenPar vs. Bison on the Naproche Grammar
      7.1.2 Benchmark of DynGenPar vs. the Grammatical Framework (GF)
   7.2 Performance of Dynamic Rule Addition
   7.3 Test Results on Real-World LaTeX Formulas
   7.4 Achievements
   7.5 Future Extensions

A The TypeSheets.cnt Type Sheet

B Internal Representation (BNF) of the TypeSheets Grammar

C Technical Reports

Bibliography

Index

Curriculum Vitae

Acknowledgements

Support by the Austrian Science Fund (FWF) under contract numbers P20631, P23554 and P22239 is gratefully acknowledged.

In addition, I would like to thank:
• my dear mother, Hannelore Kofler-Brugnolo, who has supported me throughout my studies in many ways,
• my late father, Gerhard Kofler, who sadly passed away in 2005, but whose love, kindness and moral support will always give me strength,
• my supervisor, Arnold Neumaier, without whom this thesis would not have been possible; he came up with the project idea and provided me with countless insights and input,
• Hermann Schichl, who supervised my diploma thesis and who also provided a lot of valuable input for this PhD thesis,
• my colleagues Peter Schodl and Ferenc Domes, who developed the applications DynGenPar is now used with, and with whom I had countless productive discussions,
• Tiago de Morais Montanher, who gave valuable feedback as a reader not directly involved with the Concise project,
• Stephan Paukner, who wrote the LaTeX template meeting the formal requirements for the title page of a PhD thesis at the University of Vienna,
• the participants of the Conferences on Intelligent Computer Mathematics (CICM) 2011 and 2012 for their valuable input,
• the Grammatical Framework (GF) developers for their insights on parsing natural language.

Chapter 1

Introduction

Parsing is the act of transforming some properly structured input text (input sentence) into a form understandable by a computer. Usually, this is a tree representation called a parse tree. A parser is a program that parses any given properly structured input text using a set of rules called a grammar. The most commonly used type of grammar is the context-free grammar (Chomsky [12, 13]). Computer scientists (e.g., Aho & Ullman [1]) define a context-free grammar as a tuple G = (N, T, P, S) (traditionally given in that order), where
• T is the alphabet, a set of symbols called tokens. In order to be parsable, the input text must consist exclusively of symbols from the alphabet T.
• N is a set of nonterminals (also called categories), disjoint from T.

• P is a set of productions (rules) of the form n → α1 α2 … αk, where n ∈ N and αi ∈ N ∪ T for all i. These rules define which input sentences can be parsed and what the resulting parse trees look like.
• S ∈ N is the start symbol (also called start category).

The produced parse tree consists of the steps to go from the start symbol S to the input text, where each step replaces the left-hand side of a production from P with its right-hand side.

The popularity of context-free grammars is due to the efficiency of parsing text with them. More general types of grammars exist (Chomsky [12]) but require exponential or even unbounded time to parse. Therefore, where context-free grammars are not powerful enough, the common way to extend them is to add additional context-sensitive constraints to some or all productions. Those constraints determine whether or not a rule can be applied in a particular context. They can be verified during or after parsing. In this way one obtains a more powerful grammar formalism while retaining the efficiency of context-free grammars.

This thesis presents DynGenPar (Dynamic Generalized Parser), a parser aimed primarily at natural mathematical language, and its applications. The applications that are presented are grammars for several formal languages and some grammars related to natural mathematical language.
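The tuple definition above can be made concrete with a small sketch. This is plain illustrative Python, not part of any parser implementation, and the toy grammar and names are invented for the example: each derivation step replaces the leftmost nonterminal (the left-hand side of a production) with the production's right-hand side, and the sequence of such steps is exactly what a parse tree records.

```python
# A context-free grammar G = (N, T, P, S) as plain Python data.
N = {"S", "NP", "VP"}                      # nonterminals (categories)
T = {"the", "cat", "sleeps"}               # alphabet of tokens
P = [                                      # productions n -> alpha_1 ... alpha_k
    ("S", ["NP", "VP"]),
    ("NP", ["the", "cat"]),
    ("VP", ["sleeps"]),
]
S = "S"                                    # start symbol

def leftmost_derive(sentential, rule):
    """Replace the leftmost nonterminal with the rule's right-hand side."""
    lhs, rhs = rule
    for i, sym in enumerate(sentential):
        if sym in N:
            assert sym == lhs, "rule does not match the leftmost nonterminal"
            return sentential[:i] + rhs + sentential[i + 1:]
    raise ValueError("no nonterminal left to expand")

# Derive "the cat sleeps" from S; each step is one node of the parse tree.
form = [S]
for rule in P:
    form = leftmost_derive(form, rule)
print(form)  # ['the', 'cat', 'sleeps']
```

A parser performs the inverse task: given the token sequence, it recovers which sequence of rule applications (i.e., which parse tree) produces it.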


DynGenPar combines
• the efficiency of table-driven parsing algorithms such as LR (Knuth [45]) or Generalized LR (GLR) (Tomita [91], Tomita & Ng [92]),
• the dynamic extensibility of tableless parsing algorithms, and
• the expressiveness of context-free grammars extended with context-sensitive constraints.

In this chapter, first, the goals of the research underlying this thesis are presented. Then, an overview of state-of-the-art related work is given. Finally, the contents of the remaining chapters of this thesis are summarized.

1.1 Goals

I have developed DynGenPar with a very ambitious primary target application in mind: the FMathL (Formal Mathematical Language) project (Neumaier [70]). FMathL is the working title for a modeling and documentation language for mathematics, suited to the habits of mathematicians, to be developed at the University of Vienna. The project complements efforts for formalizing mathematics from the computer science and automated theorem proving perspective. In the long run, the FMathL system might turn into a user-friendly automatic mathematical assistant for retrieving, editing, and checking mathematics (but also computer science and theoretical physics) in informal, partially formalized, and completely formalized mathematical form.

Unfortunately, due to limited funding and time, only small parts of FMathL have been realized so far. One of them is Concise (Schodl et al. [87], Domes [22]), a framework and GUI (graphical user interface) for viewing and manipulating, both graphically and programmatically, semantic information in graph form. Concise has been developed primarily by Ferenc Domes. Some parts have been added or improved by me. Concise is discussed in more detail in Chapter 3. The second major output of the FMathL project to have been completed is my parser DynGenPar. Based on Concise and DynGenPar, grammars for several formal languages and subsets of natural mathematical language were developed. DynGenPar and the grammars developed for it are the subject of this thesis.

Many of the design requirements for DynGenPar were studied with the long-term goals of the FMathL project in mind. In particular, the aim is to be able to parse a controlled language that is as close as possible to commonly used natural mathematical language. That long-term plan means that the current applications make only limited use of the parser's characterizing features.
However, one can expect DynGenPar to be very useful for many projects that either have goals similar to FMathL's, or simply happen to have similar requirements. In particular, the following requirements (Kofler & Neumaier [56]) were imposed as a result of the FMathL project planning. The parser should:
• allow the efficient incremental addition of new rules to the grammar (e.g., when a definition is encountered in a mathematical text) at any time, without recompiling the whole grammar;

• be able to parse more general grammars than just LR(1) (Knuth [45]) or LALR(1) (DeRemer [20]) ones: natural language is usually not LR(1). In addition, being able to parse so-called parallel multiple context-free grammars (PMCFGs) (Seki et al. [88]) is a necessity for reusing the natural language processing facilities of the Grammatical Framework (GF) (Ranta [77, 78], Ranta et al. [79]);
• exhaustively produce all possible parse trees (in a packed representation), at least as long as their number is finite, in order to allow later semantic analysis to select the correct alternative from an ambiguous parse;
• support processing text incrementally, i.e., resuming parsing with additional tokens without restarting the parse process, and predicting the next token (predictive parsing);
• be transparent enough to allow formal verification and implementation of error correction in the future;
• support both scanner-driven (for common mathematical language) and scannerless (for some other parsing tasks in the implementation) operation.

These requirements, especially the first one, rule out all efficient parsers currently in use. I solved these challenges with an algorithm loosely modeled on Generalized LR (GLR) (Tomita [91], Tomita & Ng [92]), but with an important difference: to decide when to shift a new token and which rule to reduce when, GLR uses complex LR states. In practice, these are mostly opaque entities, which have to be recomputed completely each time the grammar changes and which can grow very large for natural-language grammars. In contrast, my algorithm uses a so-called initial graph, which is easy and efficient to update incrementally as new rules are added to the grammar, along with runtime top-down information (Kofler & Neumaier [56]). The details will be presented in Chapter 2.

This approach allows my algorithm to be both dynamic:
• The grammar is not fixed in advance.
• Rules can be added at any moment, even during the parsing process.
• No tables are required. The graph I use instead can be updated very efficiently as rules are added.
and generalized:
• The algorithm can parse general PMCFGs.
• For ambiguous grammars, all possible syntax trees are produced. (Kofler & Neumaier [56])

Additionally, the algorithm handles constraints on the token following a rule. This has several applications in practice. For instance, when working with word tokens, it allows for grammatically correct English indefinite articles, e.g., a set, but an operator. When working with character tokens, it can represent typical operations for scannerless parsing such as maximal matches: e.g., when trying to match an integer, if 1234 matches, do not accept just 123.
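The next-token constraints just described can be sketched as simple predicates. This is illustrative Python, not DynGenPar's actual API, and all names here are hypothetical: a rule carries a check that is evaluated on whatever token or character follows it.

```python
# Sketch: a next-token constraint is a predicate on the token following a rule.

def article_constraint(article):
    """English indefinite articles: 'a' must not precede a vowel sound.

    The vowel-letter test is a crude approximation of "vowel sound"."""
    vowels = "aeiou"
    def check(next_token):
        starts_vowel = next_token[:1].lower() in vowels
        return starts_vowel if article == "an" else not starts_vowel
    return check

assert article_constraint("a")("set")            # "a set" is fine
assert article_constraint("an")("operator")      # "an operator" is fine
assert not article_constraint("an")("restaurant")  # "an restaurant" is rejected

# Scannerless maximal match: a rule that matched the digits "123" is rejected
# if the following character would have extended the match (as in "1234").
def maximal_match_constraint(next_char):
    return not next_char.isdigit()

assert not maximal_match_constraint("4")  # "123" followed by "4": reject
assert maximal_match_constraint("+")      # "123" followed by "+": accept
```

In the algorithm itself such checks are evaluated transparently during parsing rather than in a separate post-processing pass.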

Constraints, both those required to parse general PMCFGs and the next-token constraints described above, are handled transparently during parsing. No separate post-processing step is needed. Compared to the naive approach of first parsing exhaustively and then validating the constraints, the selected one-step approach ensures maximum efficiency.

I expect this parsing algorithm to allow parsing a large subset of common mathematical language and to help build a large database of computerized mathematical knowledge. Additionally, one can envision potential applications outside of mathematics, e.g., for domain-specific languages for special applications (Mernik et al. [66]). These are currently mainly handled by scannerless parsing using GLR (Visser [95]) for context-free grammars (CFGs), but would benefit a lot from my incremental approach. Therefore, DynGenPar is an important achievement independently of whether FMathL will ultimately succeed or not. (Kofler & Neumaier [56])

The new parser is available at https://www.tigen.org/kevin.kofler/fmathl/dyngenpar/.

1.2 Related Work

This section is based in part on Kofler & Neumaier [56]. No current parser generator combines all the partially conflicting requirements mentioned above. Ambiguous grammars are usually handled using Generalized LR (GLR) (Tomita [91], Tomita & Ng [92]), which needs the compilation of a GLR table; this can take several seconds or even minutes for large grammars, and such tables can grow extremely large for natural-language grammars. In addition, my parser also supports PMCFGs, whereas GLR only works for context-free grammars (though it may be possible to extend GLR to PMCFGs using my techniques). Costagliola et al. [16] present a predictive parser XpLR (eXtended Positional LR) for visual languages. However, in both cases, since the tables used are mostly opaque, they have to be recomputed completely each time the grammar changes. The well-known CYK (Cocke-Younger-Kasami) algorithm (Kasami [43], Younger [101]) needs no tables, but is very inefficient and handles only CFGs. Hinze & Paterson [36] propose a more efficient tableless parser; their idea has not been followed up by others.

The most serious competitor to my parser is Angelov's PMCFG parser (Angelov [3]) as found in the code of the Grammatical Framework (GF) (Ranta [77, 78], Ranta et al. [79]), which has some support for natural language and predictive parsing. Alanko and Angelov have recently developed a C version in addition to the existing Haskell implementation. Compared to Angelov's parser, I offer similar features with a radically different approach, which I hope will prove better in the long run. In addition, my implementation already supports features such as the incremental addition of PMCFG rules, which are essential for my application, are not implemented in Angelov's current code, and may or may not be easy to add to it. My parser also supports importing compiled PGF (Portable Grammar Format) (Angelov et al. [4]) files from GF, allowing the remainder of the Grammatical Framework to be reused. When doing so, as evidenced in Section 7.1.2, it reaches comparable performance. My implementation can also enforce next-token constraints; e.g., an restaurant is not allowed.

Another important feature of DynGenPar is the support for parsing both with and without a separate lexer (scanner). This is also not a new idea. Scannerless parsing was originally developed by Salomon & Cormack [81] for programming languages. Visser [95] extended the approach to GLR (Generalized LR), which makes it applicable to general context-free grammars, even ambiguous ones. Additional strategies for disambiguation are discussed in Van Den Brand et al. [93]. This allows the technique to be applied to more flexible domain-specific languages. However, the use of the table-driven LR or GLR algorithm limits these approaches to static languages. In contrast, with DynGenPar, the domain-specific language can be dynamically extended at runtime.

Research on natural language input of mathematics has been driven primarily by the proof assistant community. The formal languages of proof checkers are hard to read and write. Therefore, several ways to make use of natural language to input proofs have been attempted. Hallgren & Ranta [35] make use of GF to allow natural language input in Hallgren's proof editor Alfa (Hallgren [34]). Alfa can convert between its supported controlled subset of natural language and formal language in both directions. The limiting factor is the grammar of the controlled natural language. The authors note that in Alfa, "natural-language input is only useful for small expressions, since entering a long expression runs the risk of falling outside the grammar" (Hallgren & Ranta [35]). They work around this issue by allowing interactive input of small expressions at a time.
That approach is driven one step further by the HLM Proof Assistant (Reichelt [80]), which completely gives up on text as an input format. In HLM, proofs can only be input interactively through menu entries. A natural language representation is used purely as a visualization. Another research direction was taken by Koepke et al. [46], who designed Naproche (Cramer et al. [17], Koepke et al. [46]), a text-based controlled natural language for proof checkers. In Naproche, entire proofs can be written in text form. This approach was the most interesting for my research team, and therefore Naproche was one of the first grammars I implemented in DynGenPar. The DynGenPar implementation of Naproche will be presented in Section 6.1. However, Naproche is also a fairly small subset of natural language. More recently, Humayoun [40] introduced the MathNat language (Humayoun & Raffalli [41], Humayoun [39]), based on GF. MathNat is probably the most powerful controlled natural language for proof checkers to date. Therefore, MathNat was chosen as the starting point for the upcoming DynGenPar-based controlled natural language grammar. The state of that grammar's ongoing development will be discussed in Section 6.4.

DynGenPar was not developed in isolation. The research documented in this thesis is part of the FMathL (Formal Mathematical Language) umbrella project (Neumaier [70]), a modeling and documentation language for mathematics, suited to the habits of mathematicians, to be developed at the University of Vienna. DynGenPar is closely related to and interoperates with Concise (Schodl et al. [87], Domes [22]), a framework and GUI (graphical user interface) for viewing and manipulating, both graphically and programmatically, semantic information in graph form. The PhD thesis Schodl [84] introduces the semantic memory and the type system underlying Concise.
The output produced by DynGenPar is stored in the Concise semantic memory and conforms to the Concise type system. The technical report Schodl & Neumaier [85] documents research done by its authors on the grammar of a university-level introductory textbook on analysis and linear algebra (Neumaier [69]). The output from that research was a source of inspiration for the natural language grammars presented in this thesis. The user manual Domes et al. [25] documents the Concise GUI for potential users. The Concise package includes DynGenPar, and the ways to access DynGenPar from the Concise user interface are also documented in the Concise manual. The technical reports that I produced in the context of FMathL, Concise, and DynGenPar are listed in Appendix C.

1.3 Contents

Chapter 2, based on the published conference paper Kofler & Neumaier [56], presents the algorithm and implementation of DynGenPar. This parser is the main product of the thesis, and the central piece on which almost all the research is based. The remaining chapters mainly document applications based on DynGenPar grammars.

Chapter 3 introduces Concise (Schodl et al. [87], Domes [22]), a framework and GUI for viewing and manipulating, both graphically and programmatically, semantic information in graph form. The focus is on the interaction between DynGenPar and Concise and on the features in Concise enabled by DynGenPar.

Chapter 4 describes some formal languages on which DynGenPar is currently used. These applications constitute some successful practical uses of DynGenPar, although they make only limited use of the parser's unique features. The grammars that are featured are:
• Three formal languages used within Concise: type sheets, code sheets, and record transformation sheets
• A language for chemical process modeling
• An extensible grammar for optimization problems demonstrating the dynamic grammar extensibility of DynGenPar
• A grammar for a subset of the AMPL (Fourer et al. [27], AMPL Optimization inc. [2]) modeling language for optimization problems, extended to allow intervals wherever AMPL normally expects a number

Chapter 5 introduces the Grammatical Framework (GF) (Ranta [77, 78], Ranta et al. [79]) for natural language processing and how DynGenPar interoperates with it. In particular, it discusses how DynGenPar can operate on compiled GF grammars in the PGF (Portable Grammar Format) file format. A GUI application demonstrating the prediction functionality of DynGenPar on PGF grammars and a proof of concept for a GF application grammar for a tiny subset of natural mathematical language are also described.

Chapter 6 documents progress towards the research goal that has driven DynGenPar development.
It presents applications of DynGenPar to various controlled subsets of natural language:

• Naproche (Cramer et al. [17], Koepke et al. [46])
• A proof of concept for the use of dynamic rule additions to implement mathematical definitions
• MathNat (Humayoun & Raffalli [41], Humayoun [40, 39])
• A subset of natural LaTeX formula notation (without dedicated semantic markup)

Finally, Chapter 7 wraps up the thesis by presenting some practical results obtained with DynGenPar, summarizing what was achieved through DynGenPar, and giving an outlook on possible future extensions. In addition to several performance results, success rates for the grammar for LaTeX formulas on the formulas extracted without modification from two university-level introductory mathematics textbooks (Neumaier [69], Schichl & Steinbauer [83]) are shown.

Chapter 2

The Dynamic Generalized Parser DynGenPar

This chapter, based on the published conference paper Kofler & Neumaier [56], presents the algorithm and implementation of DynGenPar, the Dynamic Generalized Parser. This parser is the main product of the thesis, and the central piece on which almost all the research is based. The remaining chapters of this thesis mainly document applications based on DynGenPar grammars. DynGenPar combines
• the efficiency of table-driven parsing algorithms such as LR (Knuth [45]) or Generalized LR (GLR) (Tomita [91], Tomita & Ng [92]),
• the dynamic extensibility of tableless parsing algorithms, and
• the expressiveness of context-free grammars extended with context-sensitive constraints.
This chapter documents how these goals have been achieved.

The first section of this chapter describes the theoretical DynGenPar algorithm. It explains the requirements that had to be considered in the design of the algorithm, precisely describes the algorithm with the help of an example, and gives a short analysis of its efficiency properties. The second (and last) section gives an overview of the practical DynGenPar implementation. It documents the selected technologies, the licensing choices, and the tweaks that were made to the implementation to enhance efficiency and functionality. The algorithm was first presented in the technical report Kofler & Neumaier [54], an expanded version of which was later published as Kofler & Neumaier [56].


2.1 The DynGenPar Algorithm

In this section, I describe the basics of my algorithm. (Details about the implementation of some features will be presented in Section 2.2.3.) I start by explaining the design considerations which led to my algorithm. Next, I define the fundamental concept of my algorithm: the initial graph. I then describe the algorithm’s fundamental operations and give an example of how they work. Finally, I conclude the section by analyzing the algorithm as a whole.

2.1.1 Design Considerations

My design was driven by multiple fundamental considerations. My first observation was that I wanted to handle left recursion in the most natural way, which drove me to a bottom-up approach such as LR (Knuth [45]). In fact, my first attempt at solving the problems at hand used a recursive-descent algorithm. That prototype was quickly discarded because there was no satisfactory way to handle left recursion with such an approach. Performing the required grammar transformation (left recursion elimination) automatically would have led to unnecessarily complex code with doubtful efficiency, while requiring it to be done manually by the user would have meant restricting the possible parse trees that can be produced. In particular, left-associative arithmetic is naturally left-recursive, and natural language grammars contain left recursion as well. It is important for the corresponding parse trees to be produced accurately. Thus, a bottom-up approach was chosen.

My next finding was that the desired applications require supporting general context-free grammars (and even extensions such as PMCFGs). This need implies that algorithms based purely on deterministic stack automata are not suitable: such algorithms are restricted to grammar families that can be parsed deterministically (e.g., LALR(k), LR(k), or LL(k)) and cannot handle arbitrary context-free grammars, even unambiguous ones. Instead, even in the unambiguous case, an approach considering multiple alternatives at a time (corresponding to a nondeterministic stack automaton) is needed, such as Generalized LR (GLR) (Tomita [91], Tomita & Ng [92]). Another important advantage of such an algorithm is that it can also enumerate all the possibilities in the case of an ambiguous grammar, which was also a desired property for my algorithm.
However, my main requirement, i.e., allowing rules to be added to the grammar at any time, disqualifies table-driven algorithms such as GLR: recomputing the table is prohibitively expensive, and doing so while parsing is in progress is usually not possible at all. Therefore, I had to restrict myself to information that can be produced dynamically. This also required introducing some top-down steps into the bottom-up algorithm. The dynamic information used in the DynGenPar algorithm is described below.

2.1.2 The Initial Graph

To fulfill the above requirements, I designed a data structure called the initial graph. Consider a context-free grammar G = (N,T,P,S), where N is the set of nonterminals,

T the set of tokens, P the set of productions (rules), and S the start symbol. Then the initial graph corresponding to G is a directed labeled multigraph on the set of symbols Γ = N ∪ T of G, defined by the following criteria:

• The tokens T are sources of the graph.

• The graph has an edge from the symbol s ∈ Γ to the nonterminal n ∈ N if and only if the set of productions P contains a rule p: n → n1 n2 . . . nk s . . . with ni ∈ N0 for all i, where N0 ⊆ N is the set of all those nonterminals from which the empty string ε can be derived. The edge is labeled by the pair (p, k), i.e., the rule (production) p generating the edge and the number k of ni set to ε.

• In the above, if there are multiple valid (p, k) pairs leading from s to n, I define the edge as a multi-edge with one edge for each pair (p, k), labeled with that pair (p, k).

This graph serves as the replacement for precompiled tables and can easily be updated as new rules are added to the grammar.

For example, consider the following toy grammar for unconstrained polynomial optimization problems G = (N,T,P,S), with N = {S, Task, Expr, Term, Factor}, T = {min, max, +, ∗, x, NUMBER}, and where P contains the following rules:

• S → Task Expr,
• Task → min | max,
• Expr → Expr + Term | Term,
• Term → Term ∗ Factor | Factor,
• Factor → x | NUMBER.

The token NUMBER stands for a constant number, and would in practice have a value, e.g., of double type, attached. The initial graph looks as follows:

[Figure: the initial graph for the toy grammar. Its labeled edges are:
min → Task (label Task → min),  max → Task (label Task → max),
x → Factor (label Factor → x),  NUMBER → Factor (label Factor → NUMBER),
Factor → Term (label Term → Factor),  Term → Term (label Term → Term ∗ Factor),
Term → Expr (label Expr → Term),  Expr → Expr (label Expr → Expr + Term),
Task → S (label S → Task Expr).]

It shall be noted that there is no edge from, e.g., Expr to S, because Expr appears not at the beginning, but only in the middle of the rule S → Task Expr (and the Task that precedes it cannot produce the empty string ε).

Let s ∈ Γ = N ∪ T be a symbol and z ∈ N be a nonterminal (called the target). The neighborhood N (s, z) is defined as the set of edges from s to a nonterminal n ∈ N such that the target z is reachable (in a directed sense) from n in the initial graph. Those neighborhoods can be computed relatively efficiently by a graph walk and can be cached as long as the grammar does not change. In the example, we would have, e.g., N (min, S) = {Task → min}, N (x, S) = ∅ (because there is no path from x to S), N (x, Expr) = {Factor → x}, and N (Term, Expr) = {Expr → Term, Term → Term ∗ Factor}. (In the last example, we also have to consider the loop, i.e., the left recursion.)

2.1.3 Operations

Given these concepts, we define four elementary operations:

• matchε(n), n ∈ N0: This operation derives n to ε. It works by top-down recursive expansion, simply ignoring left recursion. This is possible because left-recursive rules which can produce ε necessarily produce infinitely many syntax trees, and I decided to require exhaustive parsing only for a finite number of alternatives.

• shift: This operation simply reads in the next token, just as in the LR algorithm.

• reduce(s, z), s ∈ Γ, z ∈ N: This operation reduces the symbol s to the target nonterminal z. It is based on and named after the LR reduce operation; however, it operates differently: Whereas LR only reduces a fully matched rule, my algorithm already reduces after the first symbol. This implies that my reduce operation must complete the match. It does this using the next operation:

• match(s), s ∈ Γ = N ∪ T: This operation is the main operation of the algorithm. It matches the symbol s against the input, using the following algorithm:

1. If s ∈ N0, try ε-matches first: mε := matchε(s). Now the algorithm only needs to look for nonempty matches.

2. Start by shifting a token: t := shift.

3. If s ∈ T , the algorithm just needs to compare s with t. If they match, it returns a leaf as its parse tree, otherwise it returns no matches at all.

4. Otherwise (i.e., if s ∈ N), it returns mε ∪ reduce(t, s).

Given the above operations, the algorithm for reduce(s, z) can be summarized as follows:

1. Pick a rule c → n1 n2 . . . nk s α1 α2 . . . αℓ in the neighborhood N (s, z).

2. For each ni (∈ N0 by definition of the neighborhood): Tni := matchε(ni).

3. s was already recognized; let Ts be its syntax tree.

4. For each αj ∈ Γ = N ∪ T : Tαj := match(αj). Note that this is a top-down step, but that the match operation will again do a bottom-up shift-reduce step.

5. The resulting syntax tree is:

c
├─ Tn1 … Tnk
├─ Ts
└─ Tα1 … Tαℓ

(the root c with the ε-subtrees Tni, the already recognized Ts, and the subtrees Tαj as children)

6. If c ≠ z, continue reducing recursively (reduce(c, z)) until the target z is reached. The algorithm also needs to consider reduce(z, z) to support left recursion; this is the only place in my algorithm where I need to accommodate specifically for left recursion.

In the case of a conflict between multiple possible reduce operations, the algorithm needs to consider all the possibilities. It then unifies its matched parse trees into directed acyclic graphs (DAGs) wherever possible, both to reduce storage requirements and to prevent duplicating work in the recursive reduce steps. This is described in more detail in Section 2.2.3.2.

The algorithm is initialized by calling match(S) on the start symbol S of the grammar. Conceptually, the remainder happens recursively. The exact sequence of events in my practical implementation, which allows for predictive parsing, is described in Section 2.2.3.1.

2.1.4 Example

As an example, we show how my algorithm works on the basic grammar for unconstrained polynomial optimization problems from Section 2.1.2. The example was chosen to be didactically useful rather than realistic: In later chapters, we shall use the algorithm for grammars significantly more complex than this example. It shall be noted that in this example, the set N0 of nonterminals which can be derived to ε is empty. Handling ε-productions requires some technical tricks (skipped initial nonterminals with empty derivation in rules, matchε steps), but does not impact the fundamental algorithm.

Consider the input min x * x, a valid sentence in the example grammar. We will denote the cursor position by a dot, so the initial input is .min x * x. The algorithm always starts by matching the start category, thus the initial step is match(S). The match step starts by shifting a token, then tries to reduce it to the symbol being matched. In this case, the shift step produces the token min, the input is now min.x * x, and the next step is reduce(min, S), after which the parsing is complete.

It is now the job of the reduce task to get from min to S, and to complete the required rules by shifting and matching additional tokens. To do this, it starts by looking for a way to get closer to S, by looking at the neighborhood N (min, S) = {Task → min}. In this case, there is only one rule in the neighborhood, so the algorithm reduces that rule. The right hand side of the rule is just min, so the rule is already completely matched; there are no symbols left to match. The algorithm remembers the parse tree

Task
└─ min

and proceeds recursively with reduce(Task, S). Now we have N (Task, S) = {S → Task Expr}. There is again only a single rule that matches, but this time there is an αj left to match: α1 = Expr. Thus, the algorithm needs to match(Expr); then, the reduce will be complete, because the left hand side S of the matching rule is already the goal. So far, the algorithm has recognized the parse tree

S
├─ Task
│   └─ min
└─ Expr (?)

(where ? marks a subtree still to be matched) and needs to match(Expr). The shift step produces x, leaving it with the input min x.* x. As before, the algorithm proceeds with a reduce(x, Expr) and looks at the neighborhood N (x, Expr) = {Factor → x}. It again reduces the only matching rule, and since the right hand side of the rule is just x, the rule is already completely matched. The algorithm remembers the parse tree

Factor
└─ x

and proceeds recursively with reduce(Factor, Expr). Now we have N (Factor, Expr) = {Term → Factor}. Again, there is only a single rule that matches and it is fully matched, so the algorithm reduces it, remembers the parse tree

Term
└─ Factor
    └─ x

and continues the recursion with reduce(Term, Expr). This time, the neighborhood N (Term, Expr) = {Expr → Term, Term → Term ∗ Factor} contains more than one matching rule: we have a reduce-reduce conflict. Therefore, the algorithm has to consider both possibilities, as in GLR. If it attempts to reduce Expr → Term, the parsing terminates here (or it tries reducing the left-recursive Expr → Expr + Term rule and hits an error on the unmatched + token), but the input is not consumed yet, thus it hits an error. Therefore, it retains only the option of reducing the left-recursive Term → Term ∗ Factor rule. This time, there are two remaining symbols: α1 = ∗ and α2 = Factor, thus the algorithm proceeds with match(∗) and match(Factor). The parse tree matched so far is

Term
├─ Term
│   └─ Factor
│       └─ x
├─ ∗ (?)
└─ Factor (?)

The match(∗) operation is trivial: ∗ is a token, so the algorithm only needs to shift the next token and compare it to ∗. The input is now min x *.x, and the match(Factor) step proceeds by a last shift consuming the last token and yielding the final input min x * x., and a reduce(x, Factor) which is also trivial because N (x, Factor) = {Factor → x}. This completes the subtree

Term
├─ Term
│   └─ Factor
│       └─ x
├─ ∗
└─ Factor
    └─ x

Thus the reduction of the left-recursive rule Term → Term ∗ Factor is complete and the algorithm recursively proceeds with another reduce(Term, Expr). This time, attempting to reduce the left-recursive rule again yields an error (there is no input left to match the ∗ in Term → Term ∗ Factor against) and the algorithm reduces Expr → Term. Once again, attempting to reduce the left-recursive rule Expr → Expr + Term fails because there is no input left to match the +, thus the algorithm is done. The final parse tree

S
├─ Task
│   └─ min
└─ Expr
    └─ Term
        ├─ Term
        │   └─ Factor
        │       └─ x
        ├─ ∗
        └─ Factor
            └─ x

is obtained, which is identical to what other parsing algorithms such as LR would have produced.

2.1.5 Analysis

The above algorithm stays efficient by combining enough bottom-up techniques to avoid trouble with left recursion with sufficient top-down operation to avoid the need for tables. The initial graph ensures that the bottom-up steps never try to reduce unreachable rules, which is the main inefficiency in existing tableless bottom-up algorithms such as CYK (Cocke-Younger-Kasami) (Kasami [43], Younger [101]).

One disadvantage of the algorithm is that it produces more conflicts than LR or GLR, for two reasons: It is not able to make use of any lookahead tokens, unlike common LR implementations, which are LR(1) rather than LR(0). It also already has to reduce after the first symbol, whereas LR only needs to make this decision at the end of the rule. However, this drawback is more than compensated by the fact that the algorithm needs neither states nor tables, only the initial graph, which can be dynamically updated. This allows dynamic rule changes. In addition, conflicts are not fatal because – like GLR – my algorithm is exhaustive. I designed my implementation to keep its efficiency even in the presence of conflicts. In particular, it never executes the same match step at the same text position more than once.

2.2 Implementation

In this section, I first give an overview of the technologies chosen for my implementation. Then, I describe its licensing and availability. Finally, I document the tweaks to my basic algorithm made in my implementation to enhance its efficiency and functionality.

2.2.1 Technologies and API

My implementation is written in C++ using the Qt 4 (Qt Company [74]) toolkit. I used the g++ (Free Software Foundation [29]) compiler. The application programming interface (API) is documented in Kofler [50]. It consists of C++ classes using Qt implicitly-shared data structures, i.e., object references with automatic reference counting that combine the convenient semantics of copies with the efficiency of shared pointers.

I also implemented Java bindings using the Qt Jambi (Voutilainen et al. [96]) binding generator to allow DynGenPar's usage in Java programs, such as Concise (Schodl et al. [87], Domes [22]), the main client application of DynGenPar, a GUI for semantic graph manipulation which will be described in Chapter 3. The Java API is very similar to the C++ API, but Qt data structures are converted to native Java data structures. In particular, the references are not implicitly shared as in C++ with Qt, but follow the standard Java reference model. Copies have to be requested explicitly using the clone() method, as for any other Java object. Other minor differences between the C++ and Java APIs are due to technical limitations of Qt Jambi. They are detailed in Kofler [50].

I picked version 4 of Qt because it was the latest available when development started. In addition, as of 2015, Qt Jambi still does not support Qt 5.

2.2.2 Licensing and Availability

The DynGenPar implementation is licensed under the GNU General Public License (Free Software Foundation [30, 31]), version 2 or later. In addition, a bundled version of DynGenPar is distributed as part of the Concise (Schodl et al. [87], Domes [22]) framework under the same terms as Concise. The source code is available for free download at Kofler [47].

2.2.3 Implementation Considerations

This section documents some tweaks I made to the basic algorithm from Section 2.1 to improve efficiency and provide additional desired features. I describe the modifications required to support predictive parsing, efficient exhaustive parsing, peculiarities of natural language, arbitrary rule labels, custom parse actions and next token constraints. Next, I briefly introduce my flexible approach to lexing.

2.2.3.1 Predictive Parsing

The most intuitive approach to implementing the above algorithm would be to use straight recursion with implicit parse stacks and backtracking. However, that approach does not allow incremental operation, and it does not allow discarding short matches (i.e., prefixes of the input which already match the start symbol) until the very end. Therefore, I replaced the backtracking by explicit parse stacks, with token shift operations driving the parse process: Each time a token has to be shifted, the current stack is saved and processing stops there. Once the token is actually shifted, all the pending stacks are processed, with the shift executed. If there is no valid match, the parse stack is discarded, otherwise it is updated. The implementation also remembers complete matches (where the entire start symbol S was matched) and returns them if the end of input was reached; otherwise it discards them when the next token is shifted. This method allows for incremental processing of input and easy pinpointing of error locations. It also allows changing the grammar rules for a specific range of text only.

The possible options for the next token, and the nonterminal generating it, can be predicted. This is implemented in a straightforward way by inspecting the parse stacks for the next pending match, which yields the next highest-level symbol, and, if that symbol is a nonterminal, performing a top-down expansion (ignoring left recursion) on that symbol to obtain the possible initial tokens for that symbol, along with the nonterminal directly producing them. Once a token is selected, parsing can be continued directly from where it left off using the incremental parsing technique described in the previous paragraph.

2.2.3.2 Efficient Exhaustive Parsing

In order to achieve efficiency in the presence of ambiguities, the parse stacks are organized in a directed acyclic graph (DAG) structure similar to the GLR algorithm's graph-structured stacks (Tomita [91], Tomita & Ng [92]). In particular, a match operation can have multiple parents, and my algorithm produces a unified stack entry for identical match operations at the same position, with all the parents grouped together. This prevents having to repeat the match more than once. Only once the match is completed are the stacks separated again.

Parse trees are represented as packed forests. Top-down sharing is explicit: Any node in a parse tree can have multiple alternative subtrees, which makes it possible to duplicate only the local areas where there are ambiguities and share the rest. This representation is created by explicit unification steps. This sharing also ensures that the subsequent reduce operations will be executed only once on the shared parse DAG, not once per alternative. Bottom-up sharing, i.e., multiple alternatives having common subtrees, is handled implicitly through the use of reference-counted implicitly shared data structures, and through the graph-structured stacks ensuring that the structures are parsed only once and that the same structures are referenced everywhere.

2.2.3.3 Rule Labels

My implementation allows labeling rules with arbitrary data. The labels are reproduced in the parse trees. This feature is essential in many applications to efficiently identify the rule which was used to derive the relevant portion of the parse tree.

2.2.3.4 Custom Parse Actions

The algorithm as described in Section 2.1 generates only a plain parse tree and cannot execute any other actions according to the grammar rules. But in order to efficiently support features such as mathematical definitions, I need to be able to automatically trigger the addition of a new grammar rule (which can be done very efficiently by updating the initial graph) when the definition is encountered. Therefore, the implementation makes it possible to attach an action to a rule, which will be executed when the rule is matched. This is implemented by calling the action at the end of a matchRemaining step, when the full rule has been matched.

2.2.3.5 Token Sources

The implementation can be interfaced with several different types of token sources, e.g., a Flex (Flex Project [26]) lexer, a custom lexer, a buffer of pre-lexed tokens, a dummy lexer returning each character individually, etc. The token source may or may not attach data to the tokens; e.g., a lexer will want to attach the value of the integer to INTEGER tokens.

The token source can also return a whole parse tree instead of the usual leaf node. That parse tree will be attached in place of the leaf. This feature makes hierarchical parsing possible: Using this approach, the token source can run another instance of the parser (DynGenPar is fully reentrant) or a different parser (e.g., a formula parser) on a token and return the resulting parse tree.

2.2.3.6 Natural Language

Natural language, even the subset used for mathematics, poses some additional challenges to my implementation. There are two ways in which natural language is not context-free: attributes (which have to agree, e.g., for declension or conjugation) and other context sensitivities best represented by parallel multiple context-free grammars (PMCFGs) (Seki et al. [88]).

Agreement issues are the most obvious context sensitivity in natural languages. However, they are easily addressed: One can allow each nonterminal to have attributes (e.g., the grammatical number, i.e., singular or plural), which can be inherent to the grammatical category (e.g., the number of a noun phrase) or variable parameters (e.g., the number for a verb). Those attributes must agree, which in practice means that each attribute must be inherent for exactly one category and that the parameters inherit the value of the inherent attribute. While this does not look context-free at first, it can be transformed to a CFG (as long as the attribute sets are finite) by making a copy of a given nonterminal for each value of each parameter and by making a copy of a given production for each value of each inherent attribute used in the rule. This transformation can be done automatically; e.g., the GF (Grammatical Framework) compiler does this for grammars written in the GF programming language.

A less obvious, but more difficult problem is posed by split categories, e.g., verb forms with an auxiliary and a participle, which grammatically belong together, but are separated in the text. The best solution in that case is to generalize the concept of CFGs to PMCFGs (Seki et al. [88]), which allow nonterminals to have multiple dimensions. Rules in a PMCFG are described by functions which can use the same argument more than once, in particular also multiple elements of a multi-dimensional category. PMCFGs are more expressive than CFGs, which implies that they cannot be transformed to CFGs.
They can, however, be parsed by context-free approximation with additional constraints. My approach to handling PMCFGs is based on this idea. However, it does not use the naive and inefficient approach of first parsing the context-free approximation and then filtering the result; instead, it enforces the constraints directly during parsing, leading to maximum efficiency and avoiding the need for subsequent filtering. This is achieved by keeping track of the constraints that apply, and immediately expanding rules in a top-down fashion (during the match step) if a constraint forces the application of a specific rule. The produced parse trees are CFG parse trees which are transformed to PMCFG syntax trees by a subsequent unification algorithm, but the parsing algorithm ensures that only CFG parse trees which can be successfully unified are produced, saving time both during parsing and during unification. This unification process uses DynGenPar's feature of attaching, to CFG rules, arbitrary rule labels which will be reproduced in the parse tree:

The automatically generated label of the CFG rule is an object containing a pointer to the PMCFG rule and all other information needed for the unification.

2.2.3.7 Next Token Constraints

My implementation also makes it possible to attach constraints on the token following a rule, i.e., that said token must or must not match a given context-free symbol, to that rule. I call such constraints next token constraints. This feature can be used to implement scannerless parsing patterns, in particular maximally-matched character sequences, but also to restrict the words following, e.g., “a” or “an” in word-oriented grammars. I implement this by collecting the next token constraints as rules are reduced or expanded and attaching them to the parse stacks used for predictive parsing. Each time a token is shifted, before processing the pending stacks, the implementation checks whether the shifted token fulfills the pending constraints and rejects the stacks whose constraints are not satisfied.

Chapter 3

DynGenPar and Concise

Concise (Schodl et al. [87], Domes [22]) is a framework and GUI for viewing and manipulating, both graphically and programmatically, semantic graphs. A semantic graph is a labeled directed multigraph representing semantic information. Concise is the main application in the FMathL project (Neumaier [70]). DynGenPar, as the parser of choice in the FMathL project, plays a central role in Concise.

DynGenPar enables Concise to parse input using grammars specified in external text files and loaded at runtime. Concise uses this feature internally for its own programming languages. This feature is also the way to teach Concise a new formal language or a new controlled subset of natural language. Such applications are the subject of the next chapters. The same grammars that allow parsing input using DynGenPar also enable producing output as text files or graphical text views. The prediction functionality of DynGenPar enables editable file views with autocompletion.

Concise is written in Java, whereas DynGenPar is written in C++. Therefore, DynGenPar is used through its Java bindings (see Section 2.2.1) in Concise. Those bindings enable using the C++ code of DynGenPar as if it were implemented in Java. After giving a short introduction to Concise, this chapter discusses the interaction between DynGenPar and Concise and the functionality it enables in Concise.

3.1 Concise

Concise is a framework for the manipulation of semantic graphs, i.e., labeled directed multigraphs with a semantic interpretation. It is written in Java. The core of Concise is called the semantic memory, a large semantic graph of which all the semantic graphs processed by Concise are subgraphs. It consists of objects, the nodes in the graph, and sems, the edges. Concepts are stored in the semantic memory as sems. Every sem relates three objects: a handle, a field and an entry. It is denoted handle.field = entry. In the graph interpretation, the sem is an edge from the handle to the entry, labeled with the field. The mapping from handle and field to the entry is unique, i.e., the dot operator . is a partial function. On the other hand, there may be multiple fields relating the same handle to the same entry, which makes the semantic graph a multigraph. The sems are also naturally directed, because x.f = y is a totally different sem from y.f = x.

An alternative interpretation of sems relates them to triplets (handle, field, entry), as found, e.g., in the RDF (Resource Description Framework) (W3C [98]) standard. However, unlike RDF, Concise does not allow multiple entries for the same (handle, field) pair, i.e., its . operation is a true partial function, not a multi-valued map. Therefore, Concise frequently relies on linked lists, which are used to represent concepts that would otherwise require one (handle, field) pair to carry multiple entries. Unlike multi-valued entries, linked lists are naturally ordered. They are also friendly to parsing, because they are exactly the trees produced for lists by a context-free parser.

In addition, Concise also supports external objects. Those are objects representing a value that is stored outside of the semantic memory. Concise currently supports the following types for external objects (external types): Abstract2DShape, Array, Boolean, Character, Color, Dimension, Double, DoubleInterval, EscapedCharacter, EscapedString, EscapedUniqueString, File, Font, Integer, IntegerInterval, IntegerName, Matrix, Name, Picture, Point, Rectangle, String, TextLine, Timer, UniqueString, and Vector. Each external type is implemented as a Java class inside Concise. To add support for an additional external type, one would simply implement one more such Java class.

To users, Concise presents itself as an interactive GUI for graph editing, which operates on the semantic memory as described above. The Concise GUI offers several editable views of semantic content. The primary view is the graph view (Figure 3.1), in which the semantic memory is drawn as a graph, with the nodes (objects) represented as ellipses and the edges (sems) as lines. As the entire semantic memory is a huge graph which is impractical to draw as a whole, the graph view allows interactively expanding or collapsing subgraphs, and it is also possible to bring up a graph view restricted to a subgraph.

Figure 3.1: Screenshot of Concise showing two graph views

An additional view is the record view (Figure 3.2), which represents subtrees of the graph as records. It displays fields and their corresponding entries as an expandable tree: Each entry can be expanded to show, in turn, its own fields and corresponding entries, and this can be repeated recursively. (In GUI programming, this is called the tree view pattern.)

Figure 3.2: Screenshot of Concise showing a record view

Finally, there is the text view (Figure 3.3), a textual representation of records which will be described in detail in Section 3.3.4. Non-tree structures (where multiple edges point to the same entry) are displayed in the record view as references to a common subrecord;

Figure 3.3: Screenshot of Concise showing a text view

in the text view, the text for the common subgraph is simply copied. Concise views can also operate on external objects. Most notably, there are parsable file views, operating on files. Parsable file views are embedded text editor views for text files that can be parsed with DynGenPar, i.e., that are written in a formal language for which a DynGenPar grammar is available. They make use of the prediction functionality of DynGenPar to offer autocompletion. They will be described in detail in Section 3.3.5.

Concise can also execute programs operating on the content of the semantic memory, and supports importing information from and exporting it to various types of textual files: view sheets, record sheets, type sheets, and code sheets. Those operations can be done both interactively through the GUI and automatically, using the Concise core directly.

The first type of sheets that Concise can import and export represents arbitrary semantic memory contents. There are two formats for such sheets: View sheets (.cnv) are the internal serialization format of Concise. While in a text format, they are not easily readable by humans, because they are a raw dump of the graph edges, representing the objects by their numeric IDs. Record sheets (.cnr) should instead be preferred for readability. They describe records in a human-readable, tree-structured text form. As in the interactive record view, non-tree structures such as shared subgraphs and cycles are represented through the use of labels. The default .cnr representation is a compact custom text format; an alternative XML (eXtensible Markup Language) representation exists, but is significantly (about 2 to 4 times) larger.

Another important kind of Concise sheets is the type sheet (.cnt). It describes a type system, i.e., the specification of several types, in a human-readable text form.
The specification of a type consists of required fields, optional fields, etc., e.g.:

header:
  allOf>
    title=paragraph
  optional>
    abstract=parLink
    keywords=paragraph
    authors=paragraph
    date=paragraph
    copyright=paragraph
    comment=paragraph

This specification says that a header always contains a field title of type paragraph, and that it may or may not contain fields abstract of type parLink, keywords of type paragraph, etc. (but if the fields are present, they must have the required type). The details of the FMathL type system implemented in Concise can be found in Schodl & Neumaier [86].

Type sheets can also carry grammatical information, specifying how to parse or linearize the types, turning them into context-free grammars. This will be described in more detail in the next section and in Section 4.1.

The remaining types of Concise sheets are the code sheet (.cnc), which represents programs in the semantic memory in a language documented in Section 4.2.2, and the automatically generated package sheet (.cnp), which documents a package of external functions which can be used in programs or for text formatting and is automatically exported from the help data included in the Java classes implementing those functions. Code sheets cannot be exported; package sheets cannot be imported.

3.2 Embedding of DynGenPar into Concise

In order to integrate DynGenPar into the Java-based Concise framework, as already mentioned in Section 2.2.1, the Qt Jambi (Voutilainen et al. [96]) binding generator was used to produce Java bindings for DynGenPar. A short XML file containing rules for the binding generator was written; Qt Jambi then automatically takes care of generating the complex JNI (Java Native Interface) code for the bindings. That generated code wraps C++ classes as Java classes, even allowing classes written in Java to extend the wrapped C++ classes and implement their virtual methods. This allows C++ code to call back into a method implemented in Java without even having to know that it is in fact calling Java code. It also automatically handles conversion between Qt's template data structures (e.g., QList, QHash, etc.) and the Java generic equivalents. However, some code generation bugs were encountered with nested containers which do not appear in the code of Qt itself (the code the binding generator was tested with); fixes for those bugs were submitted to the Qt Jambi project. Another feature of Qt Jambi that DynGenPar makes use of is the mapping of untyped data between Qt's QVariant and Java's java.lang.Object.

As described in the previous section, the FMathL type system (Schodl & Neumaier [86]) is represented in the form of text files called type sheets. Those type sheets not only represent a pure type hierarchy, but may also carry grammatical annotations called usages, which allow the type system to double as a grammar. A usage is attached to a type definition and can serve one of four purposes:

• input usage: a grammar rule describing how to parse the type,

• output usage: a grammar rule describing how to linearize the type for text output. A special case of output usage is the text view usage for use in the text views (see the previous section), which can contain calls to text formatting functions.
• token source usage: a rule listing the tokens that should be produced when linearizing the record (of a text-oriented type) to a stream of tokens for the parser, in order to obtain a record of a semantic-oriented type,

• help usage: a rule giving documentation for the type, in the form of text which can contain calls to text formatting functions.

E.g., the example type definition from the previous section is actually given in the TextDocument.cnt type sheet with the following token source and text view usages:

    header:
      allOf>
        title=paragraph
      optional>
        abstract=parLink
        keywords=paragraph
        authors=paragraph
        date=paragraph
        copyright=paragraph
        comment=paragraph
      #ts> #title#abstract#keywords#authors#date#copyright#comment
      #w> #!setFontSize&(19&) #!setColor&(255&, 0&, 0&) #!setBold&(&) &&
          &[#title&n&] #!resetAttribs&(&) &&
          #!toggleItalic&(&) &&
          &[Abstract:#abstract&n&] &&
          &[Keywords:#keywords&n&] &&
          &[Authors:#authors&n&] &&
          &[at #date&n&] &&
          &[Copyright:#copyright&n&] &&
          &[Comment:#comment&n&] &&
          #!toggleItalic&(&)

The token source usage #ts> specifies that, when using this record as input for parsing, the parser should simply be passed the token stream for each of the fields in sequence (but in practice, a header record will not actually be sent to the parser; see Section 6.2 for how the token source usages in TextDocument.cnt are used). The text view usage #w> instructs the text view to print the title in large, red, bold letters and the optional fields, if present, in italics, preceded by the name of the field (or by "at" for the date). The complete usage syntax can be found in Section 4.1. The example shows the usage syntax for field references (e.g., #title), function calls (e.g., #!setColor&(255&, 0&, 0&)), optional parts (&[... &]) and line continuation (&&).
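The escape conventions just listed can be illustrated with a small scanner. This is a hedged sketch of the syntax as described above only, not Concise's actual usage parser; the class and method names (UsageScan, logicalLines, fieldRefs) are hypothetical.

```java
import java.util.*;
import java.util.regex.*;

// Illustrative scanner for the usage escape conventions described above:
// && joins logical lines, #name is a field reference, #!name is a function call.
// Hypothetical sketch, not Concise's actual implementation.
public class UsageScan {
    // Split a usage body into its logical lines at && continuations.
    static List<String> logicalLines(String usage) {
        List<String> lines = new ArrayList<>();
        for (String part : usage.split("&&")) lines.add(part.trim());
        return lines;
    }

    // Collect #field references, skipping #!function calls.
    static List<String> fieldRefs(String usage) {
        List<String> refs = new ArrayList<>();
        Matcher m = Pattern.compile("#(!?)([A-Za-z]+)").matcher(usage);
        while (m.find())
            if (m.group(1).isEmpty()) refs.add(m.group(2));
        return refs;
    }

    public static void main(String[] args) {
        String w = "#!setBold&(&) && &[#title&n&] #!resetAttribs&(&) && &[Abstract:#abstract&n&]";
        System.out.println(logicalLines(w).size()); // three logical lines
        System.out.println(fieldRefs(w));           // field references only, calls skipped
    }
}
```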
But the most important type of usage for parsing is the input usage, because a type sheet with input usages is equivalent to a grammar in Backus-Naur form (BNF), augmented with a specification of the record representation of the syntax tree. I implemented a Java class concise.parser.Grammar inside Concise around the DynGenPar Java bindings. It takes as input the internal Concise representation (concise.types.system.TypeSystem) of a type sheet annotated with input usages and produces a grammar in DynGenPar's internal representation (concise.DynGenPar.Cfg), which parses a token stream and outputs valid records for the given type system. In addition to standard context-free grammars, the context-sensitive constraints supported by DynGenPar can also be represented this way: next token constraints (Section 2.2.3.7) and parallel multiple context-free grammar (PMCFG) (Seki et al. [88]) constraints (Section 2.2.3.6). My code currently supports two possible token sources:

• DynGenPar's built-in TextByteTokenSource. This simply produces one token for every character in the input (except that it strips the carriage return CR out of CR LF – carriage return + line feed – line endings), allowing for scannerless parsing of arbitrary text, and

• a special ConciseTokenSource. This linearizes a Concise type into a stream of tokens which are sent to the parser, allowing DynGenPar to be used to convert a record containing unparsed text into a record representing the same text semantically. This token source is part of the TextDocument toolchain described in Section 6.2.

However, most of the conversion process is independent of the underlying token source. In order to produce Concise records as output, the concise.parser.Grammar class attaches to every rule labels that are actually Java objects building a Concise record in its Java representation.
This is possible because DynGenPar accepts a QVariant as a label, and Qt Jambi allows storing an arbitrary Java object reference inside a QVariant. After parsing, Concise calls the concise.parser.Grammar.parseTreeToRecord method, which traverses the parse tree produced by DynGenPar and calls all the labels to build the resulting Concise record. The rule labels can be of one of the following Java classes:

• NewRecordBuilder: Creates a new record of a given type. Every item index in the rule can be either ignored or mapped to one of the following:

  – a field in the new record, or

  – a subrule, processed using the IncrementalRecordBuilder below (used where the original rule contains extended Backus-Naur form (EBNF) and must be decomposed into multiple plain BNF rules), or

  – a "leaf" field, where the matched item in the parse tree is assumed to be a token and the field in the new record is set to its attached value.

• IncrementalRecordBuilder: Like NewRecordBuilder, but instead of creating a new record, the fields are added to the record created by the parent NewRecordBuilder.

• RecordCaster: Converts a record to another type. This is used in union rules, to upcast a record to its supertype.

• TokenCollector: Recursively collects all the matched tokens in a list. If a nonterminal is matched, the token collector for that nonterminal is invoked to collect its own tokens and the whole list is inserted in the place of the nonterminal. This label is used for lexical rules, which do not produce a record.

• TokenEmitter: A special case of token collector. It ignores the matched tokens and "collects" a set of replacement tokens instead. It is used to implement substitution rules, which convert an escaped input to unescaped output.

• ExternalCaster: Recursively collects the matched tokens like a TokenCollector, builds a string from them, converts it to the given external type (see Section 3.1), and finally casts it to a given final type like a RecordCaster. This is the top-level token collector and the interface between token collection and record building. The result is an external object and thus a leaf record.

• ExternalNameBuilder: Creates an external object of type Name with the given name. It is used for rules dynamically added through definitions, which will be described in Section 6.3.

Concise can import type sheets at runtime and, using the above process, automatically convert them to grammar rules suitable for DynGenPar and then parse documents using the converted grammar. Therefore, user-written rules can be fully read into the parser at runtime, rather than being hardcoded as C++ or Java code or compiled to some other precompiled format (such as the PGF format of the Grammatical Framework (GF), which DynGenPar also supports; see Section 5.2).
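The general idea of label-driven record building, i.e., a bottom-up traversal that calls the label attached to each parse tree node, can be sketched as follows. This is a deliberately simplified analogue: the names (Label, Node, newRecord, tokenCollector) are hypothetical, and a record is modeled as a plain map rather than a semantic memory object.

```java
import java.util.*;

// Minimal sketch of label-driven record building, loosely modeled on the
// label classes listed above. Hypothetical names, not Concise's actual API.
public class Labels {
    interface Label { Object build(List<Object> children); }

    // NewRecordBuilder analogue: maps child positions to named fields of a fresh record.
    static Label newRecord(String type, String... fieldNames) {
        return children -> {
            Map<String, Object> rec = new LinkedHashMap<>();
            rec.put("@type", type);
            for (int i = 0; i < fieldNames.length; i++)
                if (fieldNames[i] != null) rec.put(fieldNames[i], children.get(i));
            return rec;
        };
    }

    // TokenCollector analogue: recursively flattens all matched tokens into one list.
    static Label tokenCollector() {
        return children -> {
            List<Object> out = new ArrayList<>();
            for (Object c : children) {
                if (c instanceof List) out.addAll((List<?>) c);
                else out.add(c);
            }
            return out;
        };
    }

    static class Node {
        final Label label; final List<Object> children;
        Node(Label label, List<Object> children) { this.label = label; this.children = children; }
        // Bottom-up traversal: build each child first, then call this node's label.
        Object toRecord() {
            List<Object> built = new ArrayList<>();
            for (Object c : children)
                built.add(c instanceof Node ? ((Node) c).toRecord() : c);
            return label.build(built);
        }
    }

    public static void main(String[] args) {
        // A lexical subtree collects its tokens; the parent rule places them in a field.
        Node title = new Node(tokenCollector(), Arrays.<Object>asList("Dyn", "Gen", "Par"));
        Node header = new Node(newRecord("header", "title"), Arrays.<Object>asList(title));
        System.out.println(header.toRecord());
    }
}
```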
Concise type sheets represent a user-friendly mechanism for specifying rules that can be easily converted to my internal representation. This feature is thus an ideal showcase for the dynamic properties of my algorithm.
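The behavior of the built-in TextByteTokenSource described earlier in this section (one token per input character, with the CR stripped out of CR LF line endings) can be sketched as follows. The class here is an illustrative analogue in Java, not DynGenPar's actual C++ class.

```java
import java.util.*;

// Sketch of the per-character token source behavior described above.
// Hypothetical analogue of DynGenPar's TextByteTokenSource, not its real API.
public class CharTokens {
    static List<Character> tokens(String input) {
        List<Character> out = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Strip the carriage return out of a CR LF line ending; keep lone CRs.
            if (c == '\r' && i + 1 < input.length() && input.charAt(i + 1) == '\n')
                continue;
            out.add(c);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Character> t = tokens("a\r\nb");
        System.out.println(t.size());        // 'a', '\n', 'b' -- the CR was dropped
        System.out.println(t.get(1) == '\n'); // the line ending survives as a single LF
    }
}
```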

3.3 Concise Features based on DynGenPar

The Concise GUI fully integrates my DynGenPar parser into its application workflow. Several features of Concise make use of DynGenPar. This section describes where and how my parser is used in Concise.

3.3.1 Type Sheets

Type sheets (as described in the previous section) not only serve as the representation of grammars in Concise; they are also themselves parsed using DynGenPar, using a grammar which will be described in Section 4.1.

Concise originally had a rudimentary parser for type sheets hardcoded in Java. That basic parser is still included for compatibility reasons and as a fallback. However, it supports only a subset of the type sheet syntax, and thus cannot parse most of the current type sheets.

Therefore, Concise now parses type sheets using a grammar that is itself a type sheet, TypeSheets.cnt. To avoid the bootstrapping issue this would otherwise cause, I manually converted the grammar to a C++ program, cnttoxml, using DynGenPar's native C++ API. That program converts type sheets to record sheets (in an XML representation, which is then converted with another C++ program to a .cnr record sheet). Thus, Concise loads type sheets by:

1. converting them to record sheets using these external programs,

2. importing the resulting record sheets, and

3. converting those records of TypeSheets type to native type systems.

3.3.2 Code Sheets

Code sheets are the language in which programs for Concise are represented. These are also read using a type sheet, CodeSheets.cnt (see Section 4.2.2), and DynGen