Surface without Structure

Word order and tractability issues in natural language analysis

..., daß Frank Julia Fred schwimmen helfen sah

...that Frank saw Julia help Fred swim

...dat Frank Julia Fred zag helpen zwemmen

Annius Groenink

Surface without Structure

Word order and tractability issues in natural language analysis

Oppervlakte zonder Structuur

Woordvolgorde en computationele uitvoerbaarheid in natuurlijke-taalanalyse

(met een samenvatting in het Nederlands)

Proefschrift ter verkrijging van de graad van doctor aan de Universiteit Utrecht, op gezag van de Rector Magnificus, Prof. dr H.O. Voorma ingevolge het besluit van het College van Decanen in het openbaar te verdedigen op vrijdag 7 november 1997 des middags te 16.15 uur

door

Annius Victor Groenink

geboren op 29 december 1971 te Twello (Voorst)

Promotoren: Prof. dr D.J.N. van Eijck
Onderzoeksinstituut voor Taal en Spraak, Universiteit Utrecht
Centrum voor Wiskunde en Informatica

Prof. dr W.C. Rounds
Department of Electrical Engineering and Computer Science, University of Michigan

The research for this thesis was funded by the Stichting Informatica-Onderzoek Nederland (SION) of the Dutch national foundation for academic research NWO, under project no. 612-317-420: incremental parser generation and context-sensitive disambiguation: a multidisciplinary perspective, and carried out at the Centre for Mathematics and Computer Science (CWI) in Amsterdam.

Copyright © 1997, Annius Groenink. All rights reserved.

Printed and bound at CWI, Amsterdam; cover: focus on non-projectivity in Dutch.

ISBN 90-73446-70-8

Contents

P Prologue

1 Phrase structure and dependency
1.1 Context-free grammar
1.2 Dependency
1.3 Some meta-theory
1.4 Overview of the thesis

I Formal Structure

2 Extraposition grammar
2.1 Definition and examples
2.2 Dutch sentence order and loose balancedness
2.3 An island rule

3 Tuple-based extensions of context-free grammar
3.1 Multiple context-free grammar
3.2 Modified head grammar
3.3 Linear MCFG
3.4 Parallel MCFG
3.5 Literal movement grammar

II Computational Tractability

4 Complexity and sentence indices
4.1 Definitions
4.2 Index LMG and the predicate descent algorithm
4.3 Deterministic bounded activity machines
4.4 Alternation and LFP calculi

5 Literal movement grammar
5.1 Complexity of simple LML
5.2 Finding the degree of the polynomial
5.3 Parsing and universal recognition
5.4 Complexity of bounded LML

6 Computational properties of extraposition grammar
6.1 Non-decidability of generic XG
6.2 Bounded and loose XG recognition

7 Trees and attributes, terms and forests
7.1 Annotated grammars
7.2 A prototype of an S-LMG parser
7.3 Off-line parsability and multi-stage analysis
7.4 Attribute evaluation
7.5 Using simple LMG for extended representations

III Principles

8 Abstract generative capacity
8.1 Weak generative capacity and the mildly context-sensitive
8.2 Strong generative capacity and the dependency domain

9 Government-Binding theory
9.1 Phrase structure and X-bar syntax
9.2 Movement and adjunction
9.3 Examples
9.4 Replacing S-structure with literal movement

10 Free head grammar
10.1 Basic setup for a HG of English
10.2 Sentential structure of Dutch
10.3 Sentential structure of Latin
10.4 Localization of relative clause dependencies

R Reference diagrams

E Epilogue: parallel and hybrid developments
E.1 Categorial grammar
E.2 Tree adjoining and head grammar
E.3 Concluding discussion

S Samenvatting in het Nederlands

C Curriculum vitae

B Bibliography

Prologue

This book is a report on four years of research for a PhD degree that I did in a project called INCREMENTAL PARSER GENERATION AND CONTEXT-SENSITIVE DISAMBIGUATION: A MULTIDISCIPLINARY PERSPECTIVE. The aim of this project was to build a bridge between knowledge about, and techniques for, language description in Software Engineering and Computational Linguistics. The train of thought to be explored was that in Computer Science, as well as in Linguistics, the languages under consideration have elements of non-context-freeness, but that the approaches to dealing with such properties are rather different. Linguistics could benefit from the knowledge of efficient parsing techniques, such as the 'incremental' GLR parsing from [Rek92]. Some elements of non-context-freeness in computer languages are the offside rule in Miranda [Tur90] and freely definable operator precedences such as in Prolog. In Computer Science it is even common practice to work with systems that are not capable of describing any context-free language: in traditional LALR-driven analysis using tools like YACC, operator precedences are treated in a rather old-fashioned, ad hoc manner. It was thought that here Software Engineering could benefit from linguistic experience. A key concept in unifying the trans-context-free methods in the two fields was that of a DISAMBIGUATION phase defined on PARSE FORESTS that are the output of a context-free analysis.

The idea then was to make a thorough assessment of the question precisely which types and which amounts of context-sensitivity are necessary to handle the non-context-free elements of both computer languages and natural languages, and to develop, based on this experience, a syntactic basis which would serve both as an improvement of the software engineering environment ASF+SDF [Kli93] and as a basis for linguistic analysis.

The project has been carried out in two lines of research: one concentrating on methods for syntax definition in a Software Engineering context, carried out by EELCO VISSER and reported on in his PhD thesis [Vis97], and the other, reported in this book, concentrating on problems of a linguistic nature. Whereas in the Software Engineering part a shift from the syntactical phase to a disambiguation phase turned out to be desirable, the linguistic counterpart showed a rather opposite movement: instead of analyzing the notion of 'disambiguation', my work can be characterized as eliminating any form of disambiguation that would be required by a system based on a form of CONTEXT-FREE BACKBONE, by increasing the power of the syntactic basis from context-free grammar to structurally more powerful mechanisms, eliminating among other things the problem of grammars not being OFF-LINE PARSABLE (section 7.3).

However, there is also an important analogy between the methods developed in Visser's perspective and mine. Both the shift in the ASF+SDF group, in several phases, from an initially typically LALR-driven analysis to a scannerless context-free analysis conceptually followed by context-sensitive filtering

operations defining such things as lexical keywords and precedence definitions, and the increase of the structural capacities of underlying linguistic formalisms studied in this thesis, are aimed at increased RE-USABILITY or MODULARITY. In Software Engineering, one thus avoids having to rewrite considerable parts of a grammar when adapting an engineering tool to a slightly different dialect of a computer language (say Cobol); in Linguistics, one has the advantage that languages with similar deep structures, but essentially different surface orders, such as Dutch (SOV) vs. English¹ (SVO), can optimally share a semantic phase based on simple expressions. Several choices made in this thesis are motivated by the wish to build natural language systems relying on equational specification methods for post-syntactic processing, and although this is probably not visible at the surface, the frequent and pleasant discussions I had with Eelco have certainly had an encouraging influence on my work.

¹Being aimed primarily at bridging gaps in the interface of various grammar systems to work in semantics by unifying syntactic structures underlying different languages, this thesis, with a small exception in section 10.4, does not go into issues in semantics itself.

*

It is good for the understanding of this text to let it begin with a brief chronological explanation of how my research has progressed; not least because, in order to present the material I have gathered over the years in a continuous story-line, it turned out to be hardly necessary to change the order of the subjects as they presented themselves. This Prologue then serves to acknowledge, in a concrete manner, the interest, help and guidance I received from various people, as well as to introduce the reader briefly to the points of departure of this work, before s/he proceeds to the introductory chapter 1, which more or less necessarily starts with a discouraging amount of formal background material before getting to the point where these points of departure have been properly illustrated and an overview of the three parts of this book can be given in some detail.

*

In March 1994, my supervisor JAN VAN EIJCK arranged for me to go on a visit to STEPHEN PULMAN at SRI Cambridge, for a short project to get acquainted with work on natural language grammars. In Cambridge, I modified a unification grammar that Pulman wrote for the pan-European FraCaS project, to accept Dutch. Later I generalized this to a single grammar that accepts both English and Dutch. This was a first point where the structural similarity of Indo-European languages, and hence the issue of re-usability and modularity, showed its importance. An example was that I had to generalize the system that described auxiliary inversion to the more general Dutch case of verb second, which required a step of considerable orthogonalization; in other words, it required a number of ad hoc descriptions to be worked out in a more general manner.

A question that puzzled me henceforth was why movement of noun phrases was done through gap-threading, while verb movement was modelled with feature-driven selection of different context-free rules. One reason for this was easily identified—the simplicity of English allows for such a solution in the case of auxiliary inversion. When, motivated by this worry, I tried to implement verb movement using gap threading, it became clear that such a move lifted not only the computational difficulty, but also the intuitive complexity of grammars, to problematic levels.

An important aspect of feature grammars that also struck me is that one realm of feature structures is used both for moving around structural representations and for describing simple finite attributes such as agreement, tense and selectional properties, but that the feature system is used in dissimilar ways for these two different purposes: unboundedly long list constructions on the one hand, and finite records of finite-range attributes on the other.

Finally, working on the FraCaS grammar gave an insight into feature-style grammar writing, in particular gap-threading methods. I picked up the notion of unprincipled feature hacking, in part because at the time I didn't understand the finesse of the principles underlying the rules that Pulman's system silently obeyed—a danger of rule based grammars falling into the wrong hands (in casu mine).

Briefly summarized, the project provided the following four hints, the last of which is a consequence I did not draw immediately at the time:

A There is more structural similarity across (Indo-European) languages than many current approaches to syntax exploit.

B For reasons of orthogonality, different forms of movement should not be described by different formal mechanisms. Because of its steep grammar complexity curve, slash-threading seems to be a cumbersome candidate for a universal mechanism underlying movement.

C Surface order should be separated from deep-structural (attribute) description; after doing so, every deep structure node will need to bear only boundedly much feature information.

D Rule-based grammars not only lack explanatory capacity; they are also a burden in practice.

The hints A–C, together with the aim to find a mechanism that would enable one to write grammars (especially for Dutch) with an output interface indistinguishable from the output of a context-free parser, led, in the second half of 1994, to an early formulation of the LITERAL MOVEMENT GRAMMARS (LMG) that play a major rôle in this thesis. At this point I wish to thank JASPER KAMPERMAN for his enthusiastic reaction to a first outline of literal movement, without which I may not have guessed that it was worth further investigation. The LMG framework claimed that the bulk of movement phenomena, an important example of which is the crossed dependency structure (1) in Dutch,

(1) ...dat  Frank  Julia  koffie  zag  drinken
       that Frank  Julia  coffee  saw  drink-INF
    '...that Frank saw Julia drink coffee'

can be described by merely moving around terminal strings, rather than moving around, or sharing, complete phrase structure trees. The need to use unboundedly deep feature structures that move around complete structural analyses then disappears, and is taken over by the structural backbone that is generated by a grammar close in style to a CFG. A short example LMG fragment in the style of [Gro95b] is (2), which describes the behaviour of the RAISING VERBS responsible for the crossed dependencies in Dutch.

 

(2)  S       →  dat NP m VP(m)
     VP(nm)  →  VR NP/n VP(m)
     VP(n)   →  VT NP/n

Such an LMG behaves very much like a CFG, but a nonterminal can have ARGUMENTS that are instantiated in an LMG derivation as terminal strings, hence the name literal movement. In this grammar, the argument of the VP is the sequence of all object NPs. Hence, the VP constituent only yields a sequence of verbs, and postpones the realization of the objects until it is used in the S production, producing such sentences as (3).

(3) dat Frank [m Julia koffie] [VP(m) zag drinken]

SLASH ITEMS like NP/n in the example fragment generate the empty string, but instantiate variables that construct the terminal strings in argument positions. The best way to illustrate the slash items at this stage is to show the derivation of the VP in example (3).

                          VT → drinken    NP/koffie
                          ──────────────────────────
(4)   VR → zag   NP/Julia   VP(koffie) → drinken
      ───────────────────────────────────────────
          VP(Julia koffie) → zag drinken

An important property of this fragment is that the tree analyses it produces are identical to the ones produced by a simple context-free grammar for English—the English VP order saw Julia drink coffee can be read at the top of derivation (4). This gave some depth to hint A formulated above. Hint B was also satisfied, because verb-second phenomena could be modelled analogously. By adding an extra argument to the VP that is instantiated with the finite verb (zag in this case), the grammars could produce the declarative (5) and interrogative (6) variants of the Dutch sentence.

(5) Frank [v zag] [m Julia koffie] [VP(m,v) drinken]
    Frank saw Julia drink coffee

(6) [v Zag] Frank [m Julia koffie] [VP(m,v) drinken]?
    Did Frank see Julia drink coffee?

Shortly after these first designs of LMG, GERTJAN VAN NOORD suggested a selection of background reading in similar "light-weight" grammatical formalisms. The ability to describe topicalization and verb second, however, seemed to lift the utility of LMG above these similar approaches, such as HEAD GRAMMAR (HG), which only gave an account of Dutch subordinate clauses in isolation that was not straightforward to extend to sentence level. In the following years I wrote several papers about LMG, and it was only much later, after several phases of simplification to the design of the LMG formalism, that I properly read the papers suggested by Van Noord, and found large similarities between LMG and existing TUPLE-BASED grammar formalisms such as LINEAR CONTEXT-FREE REWRITING SYSTEMS or MULTIPLE CONTEXT-FREE GRAMMARS (LCFRS, MCFG). The key notion connecting these formalisms to LMG is a switch from surface order generating functions to relations, or from a functional rule to something that looks much like a definite clause; in other words, a Prolog rule. The production-style LMG fragment (2) is then transformed to (7), and this is what LMGs will look like in this thesis—but this is merely a matter of presentation.²

    

(7)  S(dat nmv)   :-  NP(n), VP(m, v).
     VP(nm, vw)   :-  VR(v), NP(n), VP(m, w).
     VP(n, v)     :-  VT(v), NP(n).
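To make the string-tuple reading of (7) concrete, here is a minimal brute-force recognizer for the fragment, sketched in Python. The predicate-per-nonterminal encoding and the blind enumeration of ways to split the argument strings are my own illustrative choices; efficient recognition algorithms for such grammars are the subject of part II.

```python
def splits(s):
    """All ways to cut the tuple of words s into two adjacent parts."""
    return [(s[:i], s[i:]) for i in range(len(s) + 1)]

def NP(n):
    return n in {("Frank",), ("Julia",), ("koffie",)}

def VR(v):
    return v == ("zag",)

def VT(v):
    return v == ("drinken",)

def VP(m, v):
    # VP(n, v)   :- VT(v), NP(n).
    if VT(v) and NP(m):
        return True
    # VP(nm, vw) :- VR(v0), NP(n), VP(m2, w).
    for n, m2 in splits(m):
        for v0, w in splits(v):
            if VR(v0) and NP(n) and VP(m2, w):
                return True
    return False

def S(s):
    # S(dat nmv) :- NP(n), VP(m, v).
    if s[:1] != ("dat",):
        return False
    for n, rest in splits(s[1:]):
        for m, v in splits(rest):
            if NP(n) and VP(m, v):
                return True
    return False

print(S(tuple("dat Frank Julia koffie zag drinken".split())))  # True
print(S(tuple("dat Frank zag Julia koffie drinken".split())))  # False
```

The call for the crossed-dependency sentence succeeds because VP carries the object cluster along as an unrealized argument, exactly as in derivation (4).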

This definite clause notation abstracted a bit further away from the idea of movement (an abstraction the literature on MCFG/LCFRS had already made), to the placement of the one or more clusters into which the yield of a deep-structural constituent can be divided. For example, a Dutch verb phrase can, in a very simple setting, be thought of as divided into a noun cluster (NC) and a verb cluster (VC). In the literature, one finds both analyses which consider the VC to be the core phrase, from which the objects have been extraposed to the left, and analyses which do the opposite: extrapose the verbs to the right. A sensible conclusion seems to be that neither of these should be given the honour of calling itself a 'constituent'; at best they are surface clusters of the one phrase, with equal rights as citizens.

²The old production-style notation version of LMG will not be discussed in this thesis, but is not entirely uninteresting. Details of literal movement grammars with slash and colon items, and the definition of a property (left-bindingness) that allows for efficient left-to-right scanning parsers, can be found in my first conference paper [Gro95b]. An advantage of productions over the Prolog-clause style is that large parts of a grammar are as easy to read as a CFG. This can help translate, e.g., GB accounts of linguistic phenomena into a literal movement paradigm (but I have chosen not to do so in chapter 9).

(8) dat Frank [NC Julia koffie] [VC zag drinken]

The relational view seemed to be responsible for an amount of extra strength w.r.t. known tuple-based formalisms such as MCFG: the ability to describe SHARING at string level. An LMG can describe the intersection of two context-free languages, simply by a clause like (9)—to be read as "x is recognized as an A if it is both a B and a C" (chapter 3).

(9)  A(x)  :-  B(x), C(x).
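To see what this buys, the following sketch (the concrete languages B and C are my own choice) conjoins recognizers for two context-free languages; the conjunction accepts their intersection {aⁿbⁿcⁿ | n ≥ 0}, which is not context-free:

```python
import re

def B(x):
    # B recognizes a^i b^i c^j, a context-free language
    m = re.fullmatch(r"(a*)(b*)(c*)", x)
    return bool(m) and len(m.group(1)) == len(m.group(2))

def C(x):
    # C recognizes a^i b^j c^j, also context-free
    m = re.fullmatch(r"(a*)(b*)(c*)", x)
    return bool(m) and len(m.group(2)) == len(m.group(3))

def A(x):
    # A(x) :- B(x), C(x).  -- their intersection: a^n b^n c^n
    return B(x) and C(x)

print(A("aabbcc"), A("aabbc"))  # True False
```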

This ability was the subject of a discussion with DAVID WEIR in Sussex, who believed that the ability to describe intersections was an unfavourable property for a grammar formalism—it leads to undecidability of problems such as emptiness of the generated language, and it means that the resulting class cannot entirely satisfy the desirable properties of an ABSTRACT FAMILY OF LANGUAGES. More important for me at that time, however, was that Weir did not find my arguments for the tractability of recognition convincing. As I consequently started to put myself under more pressure to find a proof that a large class of LMG could be recognized in polynomial time, it turned out that the wheel I seemed to have more or less re-invented did have a number of favourable properties. In a period when BILL ROUNDS (Michigan) was visiting CWI, I found that what a restriction of LMG, called SIMPLE LMG, added to the existing formalisms made it exactly the ultimately 'undressed', or grammar-like, representation of Rounds' logical system iLFP, which describes precisely the languages recognizable in deterministic polynomial time (chapter 5).

The next step was to prove that there was something in the space between the existing formalisms and PTIME that was worthwhile from a linguistic perspective, and such evidence I found in a well-known paper by ALEXIS MANASTER-RAMER [MR87]. At that same time, MARCUS KRACHT and JENS MICHAELIS (Potsdam, Berlin) were investigating counter-examples to the SEMILINEARITY (thought plausible) of natural language. I met Marcus at the ESSLLI summer school in August 1995, and later at the // workshop at CWI, where Marcus invited me for a short visit to Berlin. When I saw Marcus and Jens in February 1996, they helped me broaden my views as well as properly work out the details of a number of arguments in favour of LMG versus other MILDLY CONTEXT-SENSITIVE grammar formalisms (chapter 8 and [MK96]).

I then started implementing a prototype of a parser based on simple LMG, and while writing some example grammars, I became more convinced of the 'hints' A–C, but got stuck, as among others Marcus Kracht had warned me, on the rule-based approach to grammar writing.

Having worked out these subjects to a point where it seemed that I wasn't able to squeeze out much more, I started worrying that this would be all, and that it had little original content. Jan van Eijck however convinced me that I would find the opposite if I simply started writing the book with what was available at this point. This was a good recipe, because thinking of my work in terms of a book helped make natural openings to new story-lines visible. Looking back at the resulting part III of this thesis, I wish I had started 'thinking in terms of the book' even earlier—not least because, having such a concrete project at hand, Jan and I started having more regular, more in-depth, encouraging and rewarding discussions.

At the August 1996 ESSLLI summer school in Prague I went to a class on dependency grammar by HAJIČOVÁ, PETKEVIČ and SGALL, and had another conversation with Marcus Kracht. This, together with the excellent atmosphere of the summer school I already knew from the year before in Barcelona, set the stage for an open-minded study of the principles and parameters approach to language description, and gave clues as to how I might set up a principled account of language that formalized intuitions put in the back of my mind, such as the mentioned A–C, by tractability studies and the design of LMG.

Two further 'hints', or perhaps 'desiderata', played a rôle in developing such a principled framework. Work on simple LMG descriptions, which had polynomial-time recognition, invigorated a belief that linguistic structure should be tractable; another belief that needed to be given formal support was the idea that through literal movement, one could localize a suitably chosen set of basic morphological/selectional dependencies (excluding co-ordination phenomena, but including the relative clause).

E The structure of written language is computationally tractable.

F All relevant syntactic dependencies can be localized.

In parallel to this development I was working out ideas on FERNANDO PEREIRA's extraposition grammar, whose graph-shaped structural representations helped me get a grip on the seemingly circular dependencies introduced by wh-relative clauses. A similar phenomenon occurred when I proved that if modified head grammar is used to describe Dutch sentences with relative or wh-pronouns, it either has long-distance dependencies (which I wanted to localize), or it must attach the clause via the relative or interrogative pronoun directly to the verb that the pronoun is the complement of, and not, as is traditionally assumed, attach it at S or CP level. The conclusion seemed to be that the heaviest dependency involved in an object-relative clause such as (the man) that I saw is the relation verb (saw)–object (that), so this, if anything, is the "arc" that connects the main clause and the relative clause. This provided a further motivation for putting forward the principled framework of chapter 10.

This framework, loosely based on POLLARD's HG and hence called FREE HEAD GRAMMAR, takes morphological and selectional DEPENDENCY to be the core of grammatical description, inspired by the literature on Dependency Grammar I had recently read, by hint F, and by the methods of "manually" analyzing Latin and Greek I learned at high school. Sentences of Latin clearly showed an underlying tree structure that is highly similar to that of English and Dutch, but obviously, the sentence could not be read from the trees by walking through the leaves from left to right. This was of course already not the case for the LMG descriptions of Dutch, so it was clear to me that this failure of a language to have a meaningful constituent structure was not a privilege of languages like Latin that are called NON-CONFIGURATIONAL in the literature. Summarizing as before, I now had the following two additional hints:

G Every sentence should have a tree-structured representation, but sometimes a phrase must be allowed to be dominated by a subconstituent.

H It is fruitful to drop the assumption that there should be a tree structure at whose leaves the sentence can be read from left to right.

It was a challenging goal to try to use a cluster analysis, in the form of FHG, to describe some phrases from Latin poetry, as examples of extremely free word order. This blissfully turned out to work rather well. The fact that there are five pages on the syntactic analysis of Latin in this thesis seems to stress the prophetic or educational skills, or both, of my parents, who are classical language teachers at the high school I went to; when I started studying Mathematics in 1989 they said that one day I would not be able to resist the temptation to apply whatever knowledge I would gather, to language; moreover, my father always claimed that the 'type of intelligence' needed to appreciate classical languages was no different from that of someone with a good insight into the sciences.

The emphasis on constituent structure relativized in hint H had always struck me as being inspired by properties of English rather than of language in general, and a clean look at dependency in a framework from which languages with a more free word order, such as Latin, are not expelled, made it interesting to investigate the effect of dropping the constituency principle for a while. It would at least give some more insight into a worry that has run through my research as an important motive.

J It seems that the major grammatical theories are de facto very much tuned to the description of English. Such a development must be avoided.

A proper introduction of a world of principles and parameters required a note on Government-Binding theory. Thus chapter 9 popped up naturally when I sat down to write chapter 10, which had been available on scribbler paper for a long time. This is the largest deviation from the 'chronological perspective' of this book, and it may give the false suggestion that I now have a preference for the perspective sketched in chapter 10; I also see clear benefits in an approach that remains closer to GB theory, which was one of the theories that were subject to concern J.

Amsterdam, September 1997

The author's address is

Keizersgracht 684
1017 ET Amsterdam
[email protected]/[email protected]

Chapter 1

Phrase structure and dependency

PHRASE STRUCTURE and DEPENDENCY are the two most essential notions underlying the analysis of linguistic structure. Traditionally, phrase structure, abbreviated as PS, refers to a division of a sentence into PHRASES or CONSTITUENTS that can be represented in a LABELLED BRACKETING of the sentence, as in (1.1).

(1.1) [S [NP Frank] [AUX will] [VP [V pour] [NP Julia] [NP [Det a] [N cup] [PP [P of] [NP [N tea]]]]]]

CONTEXT-FREE GRAMMAR (CFG) is a mathematical formalism that describes exactly these labelled bracketings of sentences. Important aspects of CFG are introduced in section 1.1. The choice of which words are to be grouped into constituents, and the naming (S, NP, VP) of the constituents, is directly influenced, if not defined, by the notion of dependency between words and phrases, which is introduced in section 1.2.

Traditional grammatical frameworks such as Government-Binding theory assume a tight interaction between dependency, phrase structure, and word order. This exercises a strong influence on the design of formalisms for grammatical description. At the core of the research reported in this thesis is a tendency to take a small step back with respect to a number of such traditional assumptions.

The meaning of the terms constituent, phrase structure, dependency and so forth differs considerably in various frameworks; most notably when theories built up on phrase structure (GB, chapter 9; HPSG [PS94]) are compared with theories that take dependency as the central notion (dependency grammar [Mel88], word grammar [Hud90]). Section 1.3 develops generalized definitions of phrase structure and dependency that enable one to look at these concepts from a distance, and compare their use in different grammatical frameworks without running into terminological clashes. Finally, section 1.4 gives an overview of parts I, II and III of this thesis, briefly outlining what will be covered, and what new results are to be expected.


1.1 Context-free grammar

Although the reader should probably be acquainted with context-free grammar to be able to understand this book, I will define it here for the usual purpose of fixing notation and introducing a first example of a simple grammar, but also because the extensions to CFG I will give in chapters 2 and 3 depend on three different ways of looking at grammatical derivation.

1-1 definition. A CONTEXT-FREE GRAMMAR (CFG) is a tuple (N, T, S, P) where N and T are disjoint finite sets of NONTERMINAL SYMBOLS and TERMINAL SYMBOLS, S ∈ N is the START SYMBOL, and P is a finite set of PRODUCTIONS of the form

    A → X₁X₂ ⋯ Xₘ

where m ≥ 0, A ∈ N and Xᵢ ∈ N ∪ T.

*

Some form of context-free grammar is often used at the core of linguistic theories; such grammars are especially suitable for providing a 'rough structural characterization' of languages with an elementary word order, in which compound phrases such as drank coffee generally appear unaltered, and uninterrupted, in the complete sentence.

1-2 example: CFG for English. A large amount of structure underlying English syntax can be characterized using a context-free grammar. The following is a very simple example: G = (N, T, Csub, P), where

    N = {Csub, V0, VI, VT, VR, N0, C}
    T = {swim, drank, drink, saw, see, hear, help, Frank, Julia, Fred, coffee, that}

and the set of productions P is given in figure 1.1. The nonterminal symbols have the following intuitive meanings: a VI is an INTRANSITIVE VERB, a VT is a TRANSITIVE VERB, a VR a RAISING VERB,¹ an N0 is a NOUN PHRASE, a V0 is a complete sentence, and a Csub is a subordinate clause. The rules can be read, informally, as follows: a sentence is made by concatenating a noun phrase and an intransitive verb (production [2]); swim is an intransitive verb (production [5]); but an intransitive verb can also be made by concatenating a transitive verb and a noun phrase (production [3]). So drank coffee, just like swim, is an "intransitive verb".

1-3 remark. A more traditional naming of categories is CP for Csub, S or IP for V0, NP for N0, and VP and V for VI. Multi-word verb phrases and single intransitive verbs (sometimes called VI or IV) are not distinguished in this thesis, and similarly for a verbal complex taking one object and a transitive verb, &c. It is customary to present grammars as in figure 1.1 by just listing the productions P; the start symbol is the one on the left hand side of the first rule in the grammar.

¹Called raising because of some semantic and surface-order phenomena expressed in Government-Binding theory through forms of movement that will be discussed later (chapter 9).

*

The way the grammar was interpreted in example 1-2 is informal; the traditional formal interpretation of a context-free grammar is to read its productions as REWRITE RULES that license the nonterminal symbol on the left hand side of the arrow to be replaced with the sequence on the right hand side of the arrow. It is customary to use the word SEMANTICS for a formal interpretation of a syntactical construction; and a grammar is itself a syntactical construction. This is not to be confused with the use of the word semantics as the interpretation or meaning of sentences in a natural language.

1-4 notation. I will use A, B, C to denote arbitrary nonterminal symbols, a, b, c for arbitrary terminal symbols, and X, Y, Z for symbols that are either nonterminal or terminal. The variables u, v, w will stand for strings, that is, elements of T*; finally, α, β, γ will stand for arbitrary SEQUENCES of symbols, i.e. elements of (N ∪ T)*. I will use ε for the empty string or sequence. Concrete nonterminals in example grammars will be in roman, sometimes with a superscript. Concrete terminal symbols will be rendered in a typewriter font. For any string or sequence α, |α| will be the number of symbols in α, or the length of α. When R is a production A → α, then A is called the LEFT HAND SIDE (LHS) of R, and α is its RIGHT HAND SIDE (RHS).

    [1]  Csub → C V0
    [2]  V0   → N0 VI
    [3]  VI   → VT N0
    [4]  VI   → VR V0
    [5]  VI   → swim
    [6]  VT   → drank
    [7]  VT   → drink
    [8]  VR   → saw
    [9]  VR   → see
    [10] VR   → hear
    [11] VR   → help
    [12] N0   → Frank
    [13] N0   → Julia
    [14] N0   → Fred
    [15] N0   → coffee
    [16] C    → that

Figure 1.1: Simple CFG for English.

1-5 remark. In a context-free grammar for a natural language, the elements of T are usually the words of the language. The strings w ∈ T* are sentences (either correct or incorrect), and sometimes called TERMINAL WORDS. In the literature, it is also customary to call a string a SENTENCE only if it is recognized by the grammar. I will avoid using this terminology. In a mathematical context, I will talk about TERMINAL symbols and STRINGS only. When talking about language, I will use WORD and SENTENCE in their ordinary informal meanings.

1-6 definition: rewriting semantics for CFG. Let G = (N, T, S, P) be a context-free grammar. Then the productions can be read as defining a relation ⇒: the smallest relation such that, for all α, β ∈ (N ∪ T)*, A ∈ N:

    αAβ ⇒ αγβ   if P contains A → γ

Then define ⇒* to be the reflexive, transitive closure of ⇒, that is:

1. α ⇒* α for any sequence α ∈ (N ∪ T)*
2. If α ⇒ β and β ⇒* γ then α ⇒* γ

The grammar G is now said to RECOGNIZE a string w ∈ T* if and only if S ⇒* w.
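The rewriting semantics can be executed directly. The sketch below, for the grammar of figure 1.1, searches for a rewriting sequence from the start symbol to w; the leftmost-first strategy and the pruning on length and on the terminal prefix are implementation choices of mine, sound here because no production has an empty right hand side:

```python
P = {
    "Csub": [["C", "V0"]],
    "V0":   [["N0", "VI"]],
    "VI":   [["VT", "N0"], ["VR", "V0"], ["swim"]],
    "VT":   [["drank"], ["drink"]],
    "VR":   [["saw"], ["see"], ["hear"], ["help"]],
    "N0":   [["Frank"], ["Julia"], ["Fred"], ["coffee"]],
    "C":    [["that"]],
}

def recognize(start, w):
    def derive(form):
        # Find the leftmost nonterminal; everything before it is terminal.
        i = 0
        while i < len(form) and form[i] not in P:
            i += 1
        prefix = form[:i]
        # Prune: the terminal prefix must match w, and sentential forms
        # never shrink because there are no empty productions.
        if prefix != w[:len(prefix)] or len(form) > len(w):
            return False
        if i == len(form):
            return form == w
        A, tail = form[i], form[i + 1:]
        return any(derive(prefix + rhs + tail) for rhs in P[A])
    return derive([start])

print(recognize("Csub", "that Frank saw Julia drink coffee".split()))  # True
print(recognize("V0", "Julia drank coffee".split()))                   # True
print(recognize("V0", "Julia coffee drank".split()))                   # False
```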

1-7 definition: language, CFL. Let T be an alphabet (a set). A LANGUAGE OVER T is a set L ⊆ T* of strings over T. The LANGUAGE GENERATED BY a CFG G is the language L = L(G) such that w ∈ L if and only if G recognizes w. A language is called CONTEXT-FREE (a CFL) if it is recognized by a CFG.

*

The rewriting semantics is a rather ambiguous notion of analysis, because symbols can be rewritten in several different orders, even when this order does not seem to have any significance: (1.2) and (1.3) are different derivations, according to the grammar of figure 1.1, of the same sentence.

(1.2) V0 ⇒ N0 VI
         ⇒ Julia VI
         ⇒ Julia VT N0
         ⇒ Julia drank N0
         ⇒ Julia drank coffee

(1.3) V0 ⇒ N0 VI
         ⇒ N0 VT N0
         ⇒ N0 VT coffee
         ⇒ N0 drank coffee
         ⇒ Julia drank coffee

Moreover, the derivations are linear, glossing over the possibility to view CFGs as assigning a TREE STRUCTURE to sentences (figure 1.2), or equivalently, writing them in LABELLED-BRACKET FORM (1.4).

(1.4) [V0 [N0 Julia] [VI [VT drank] [N0 coffee]]]

In this tree analysis, the two rewriting sequences (1.2) and (1.3) correspond to the same tree-shaped derivation. The formal notion of a tree analysis is the following alternative semantics for CFG, which is equivalent to the rewriting semantics (proposition 1-10).

    V0
    ├── N0 ── Julia
    └── VI
        ├── VT ── drank
        └── N0 ── coffee

Figure 1.2: Tree analysis of Julia drank coffee.

1-8 definition: derivational semantics for CFG. Let G = (N, T, S, P) be a CFG. Then define the relation ⇓ between nonterminal symbols and terminal strings as follows, by induction on what is called the DEPTH of the derivation.

Base case. Let R ∈ P be a production in P whose RHS contains no nonterminal symbols:

    A → w

Then R is called a TERMINAL PRODUCTION and

    A ⇓¹ w

Inductive step. If the RHS of a rule R contains at least one nonterminal symbol:

    A → w₀B₁w₁B₂w₂ ⋯ wₘ₋₁Bₘwₘ

where m ≥ 1, wᵢ ∈ T* and Bᵢ ∈ N, then R is called a NONTERMINAL PRODUCTION and if, for every 1 ≤ k ≤ m,

    Bₖ ⇓ⁿᵏ vₖ

and n = maxₖ nₖ is the highest value of the nₖ, then

    A ⇓ⁿ⁺¹ w₀v₁w₁v₂w₂ ⋯ wₘ₋₁vₘwₘ

Finally, we write A ⇓ w if A ⇓ⁿ w for any n.

*

The sentence Julia drank coffee is a rather trivial one—the grammar from example 1-2 is capable of generating arbitrarily long sentences by repeating the chain of productions V0 ⇒ N0 VI ⇒ N0 VR V0. The simplest such sentence is shown in figure 1.3. But in principle any number of V0s can be embedded, producing a sentence of the form (1.5).

    Csub
    ├── C ── that
    └── V0
        ├── N0 ── Frank
        └── VI
            ├── VR ── saw
            └── V0
                ├── N0 ── Julia
                └── VI
                    ├── VT ── drink
                    └── N0 ── coffee

Figure 1.3: Example of verb phrase embedding.

(1.5) Frank saw Julia hear Frank see Julia hear Frank see Julia drink coffee

It is important to stress that a simple grammar such as example 1-2 describes the underlying structure of English only; it does not look at morphology, and thus approves of syntactically incorrect sentences like Frank see coffee saw Julia. Grammars are sometimes further refined by assigning different nonterminal symbols to finite and infinitive verbs and verb phrases. However, such a grammar then neglects the structural similarity between the finite and non-finite cases: rule [2], for example, would then appear in three nearly identical forms: one for finite verbs, one for infinitives, and one for gerundives—and this is still a highly simplified example, which disregards things like person, number and case. Another refinement is to split N0 into a category of ANIMATE NOUNS that can appear as the subject of verbs like drink, and INANIMATE NOUNS like coffee that can't. For now, I will say that these properties are not structural in nature, so they should preferably not be considered when looking at the relation between underlying structure and word order; I will get back to this issue in chapter 7.

A third formal interpretation of context-free grammars is in terms of LEAST FIXED POINTS; it is again equivalent to the rewriting interpretation, but gives information about all the nonterminal symbols simultaneously, and gives a link to the treatment of COMPUTATIONAL COMPLEXITY I will give in chapters 4 and 5.



1-9 definition: fixed point semantics for CFG. Let G = (N, T, S, P) be a CFG. Let [N → ℘(T*)] be the set of ASSIGNMENTS to the nonterminals: functions σ mapping a nonterminal to a set of strings over T. The set of productions P can then be viewed as an operator Φ_G taking an assignment as an argument and producing a new assignment, defined as follows:

(1.6) Φ_G(σ)(A) = { w₀v₁w₁v₂w₂ ⋯ wₘ₋₁vₘwₘ | ∀k, 1 ≤ k ≤ m : vₖ ∈ σ(Bₖ), for some production A → w₀B₁w₁B₂w₂ ⋯ wₘ₋₁Bₘwₘ ∈ P }

Now define an order ⊑ on [N → ℘(T*)] by (1.7).

(1.7) σ₁ ⊑ σ₂  ⟺  ∀A ∈ N : σ₁(A) ⊆ σ₂(A)

Then [N → ℘(T*)] with ⊑ is a COMPLETE PARTIAL ORDER (cpo), because it has a bottom element, the empty assignment σ₀ given by (1.8), and contains all least upper bounds ⊔X of directed sets X ⊆ [N → ℘(T*)], given by (1.9).

(1.8) σ₀(A) = ∅  for all A ∈ N

(1.9) (⊔X)(A) = ⋃ { σ(A) | σ ∈ X }

It is easily seen that Φ_G is a continuous and monotonic operator on [N → ℘(T*)]; this means that { Φ_G^k(σ₀) | k ≥ 0 } is a directed set and one can take the least upper bound of this set, the LEAST FIXED POINT of Φ_G, to be the interpretation of the grammar:

    I(G) = ⊔ { Φ_G^k(σ₀) | k ≥ 0 }

This interpretation I(G) is a function which takes a nonterminal and yields a set of strings; the language recognized by the grammar in the sense of definition 1-7 is then I(G)(S).

*

Before proving formally that the three interpretations of CFG are equivalent, let's look at how the fixed-point interpretation works on the example grammar. The initial assignment σ₀ assigns the empty set to each of the nonterminals; applying Φ_G to σ₀ yields σ₁ = Φ_G(σ₀) where

(1.10) σ₁(V0) = ∅
       σ₁(VI) = {swim}
       σ₁(VT) = {drank, drink}
       σ₁(VR) = {saw, see, hear, help}
       σ₁(N0) = {Frank, Julia, Fred, coffee}

Applying Φ_G again yields σ₂ = Φ_G(σ₁) where

(1.11) σ₂(V0) = {Frank swim, Julia swim, …}
       σ₂(VI) = {swim, drank coffee, drank Frank, …}

The assignments to VT, VR and N0 don't change, because derivations for these nonterminals apply no more than a single production. A third phase yields the first complete sentences:

(1.12) σ₃(V0) = {Frank swim, Julia swim, …,
                Frank drank coffee, Frank drank Frank, …}
       σ₃(VI) = {swim, drank coffee, drank Frank, …,
                saw Julia swim, …}

A fourth phase generates more complex sentences, and so forth.

(1.13) σ₄(V0) = {Frank swim, Julia swim, …,
                Frank drank coffee, Frank drank Frank, …,
                Frank saw Julia swim, …}
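The iteration (1.10)–(1.13) can be replayed mechanically. In the sketch below, the length cutoff is my own addition so that each printed iterate stays finite; the operator Φ_G itself, as defined in (1.6), needs no such bound:

```python
from itertools import product

# The grammar of figure 1.1 again, as in the earlier sketch.
P = {
    "Csub": [["C", "V0"]],
    "V0":   [["N0", "VI"]],
    "VI":   [["VT", "N0"], ["VR", "V0"], ["swim"]],
    "VT":   [["drank"], ["drink"]],
    "VR":   [["saw"], ["see"], ["hear"], ["help"]],
    "N0":   [["Frank"], ["Julia"], ["Fred"], ["coffee"]],
    "C":    [["that"]],
}

def phi(sigma, cutoff=6):
    """One application of the operator Phi_G of (1.6)."""
    new = {A: set() for A in P}
    for A, rhss in P.items():
        for rhs in rhss:
            # For each RHS symbol: its current set, or the terminal itself.
            parts = [sigma[X] if X in P else {(X,)} for X in rhs]
            for pick in product(*parts):
                s = sum(pick, ())  # concatenate the chosen string tuples
                if len(s) <= cutoff:
                    new[A].add(s)
    return new

sigma = {A: set() for A in P}   # the bottom element sigma_0 of (1.8)
for k in range(1, 5):
    sigma = phi(sigma)
    print(k, sorted(" ".join(s) for s in sigma["V0"]))
```

At k = 4 the first raising-verb sentence Frank saw Julia swim appears, as in (1.13).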

So the fixed point interpretation sums up, per nonterminal, the derivations of depth 1, those of depth 2, of depth 3, &c. In the context-free case, it is nearly trivial to prove that the three semantics are mutually equivalent; but recall that these three versions were given separately because in more complex grammar formalisms, one type of semantics is easier to define, and hence easier to use for proofs of a formalism's properties, than another. Because the simple case of CFG serves as an example for these more complex proofs, I will go into relatively much detail here.

1-10 proposition. The fixed point, rewriting and derivational interpretations of CFG are equivalent.

Proof.

(i) If A ⇓ⁿ w then A ⇒* w.

Suppose A ⇓ⁿ w. If n = 1, then there is a production

    A → w

so it follows immediately that A ⇒* w. If n > 1, then A ⇓ⁿ w means that there is a production

    A → w₀B₁w₁B₂w₂ ⋯ wₘ₋₁Bₘwₘ

and for each 1 ≤ k ≤ m there is an nₖ < n such that

    Bₖ ⇓ⁿᵏ vₖ

and w = w₀v₁w₁ ⋯ vₘwₘ; now by induction, Bₖ ⇒* vₖ. It is easily seen that ⇒* is closed under composition and substitution; so

    A ⇒  w₀B₁w₁B₂w₂ ⋯ wₘ₋₁Bₘwₘ
      ⇒* w₀v₁w₁B₂w₂ ⋯ wₘ₋₁Bₘwₘ
      ⇒* w₀v₁w₁v₂w₂ ⋯ wₘ₋₁Bₘwₘ
         ⋮
      ⇒* w₀v₁w₁v₂w₂ ⋯ wₘ₋₁vₘwₘ

i.e.

    A ⇒* w

(ii) If S ⇒* w then S ⇓ w.

To prove this, write α ⇒ⁿ β if α rewrites to β in n steps, and prove the following stronger result by induction on n: if α ⇒ⁿ w and

    α = w₀B₁w₁B₂w₂ ⋯ wₘ₋₁Bₘwₘ

then for each 1 ≤ k ≤ m, there is a p ≤ n such that Bₖ ⇓ᵖ vₖ, and

    w = w₀v₁w₁v₂w₂ ⋯ wₘ₋₁vₘwₘ

(iii) w ∈ Φ_G^k(σ₀)(A) if and only if A ⇓ʲ w for some j ≤ k.

This follows immediately by careful examination of the definitions. ∎

1.2 Dependency

Before I proceed to developing the meta-theoretical framework and list of points of departure of the next sections, I briefly look at how context-free grammars interact with the linguistic notion of DEPENDENCY in their application to languages such as German and Dutch, whose surface structure seems to be slightly more complex than that of English. A verb and its complements (to take a representative example) are said to be in a relation of IMMEDIATE DEPENDENCY, because the verb assigns accusative case to its object, may assign infinitive or gerundive morphology to its verbal complement, &c. Typically, one says that the object and subject depend on the verb, but not vice versa. For now, I will ignore this sense of direction in the notion of dependency.² Sentence (1.14) shows the major dependencies in a sentence described by the grammar for English in the previous section.

(1.14) ...that Frank saw Julia help Fred swim

It is not a coincidence that one can, with a bit of fantasy, recognize a ‘tree’ in the dependency arcs drawn above the sentence in (1.14), and that the arcs correspond to lines in the derivations according to the context-free grammar of figure 1.1 (this derivation is shown on page 33). This correspondence between dependency arcs and branches of a context-free derivation is a way of interrelating context-free grammars for different languages, or in more general terms, for relating the underlying structure of languages without looking at their word order.

1-11 example: German, long surface distance, dependency domain. In subordinate clauses in GERMAN and DUTCH, all the objects in the verb phrase typically precede all the verbs. Sentence (1.15) is the German equivalent of the English sentence (1.14).

(1.15) ..., daß Frank Julia Fred schwimmen helfen sah

In German, as in English, there can in principle be any number of dependent verb-object pairs of type (sah, Julia). But in contrast to the English case, the dependencies in German can stretch over an arbitrarily large number of words. I will then say that the language has LONG DISTANCE DEPENDENCIES with respect to the SURFACE (WORD) ORDER.

Because the verb-object pairs are embedded, the German equivalent can be generated by a context-free grammar just like the English one, by reversing the order of the VI productions; see figure 1.4.

²Disregarding the directionality of immediate dependency does not make sense when looking at the transitivized notion of dependency, because every pair of two phrases or words in a sentence is usually connected through a chain of immediate dependency arcs. However, in those parts of this thesis where I am interested in a BOUNDED DEPENDENCY DOMAIN, the transitivization is limited and it is legitimate to look at undirectional non-immediate dependency.

    [1]  Csub → C V0
    [2]  V0   → N0 VI
    [3]  VI   → N0 VT
    [4]  VI   → V0 VR
    [5]  VI   → schwimmen
    [6]  VT   → trank
    [7]  VT   → trinken
    [8]  VR   → sah
    [9]  VR   → sehen
    [10] VR   → hören
    [11] VR   → helfen
    [12] N0   → Frank
    [13] N0   → Julia
    [14] N0   → Fred
    [15] N0   → Kaffee
    [16] C    → daß

Figure 1.4: CFG for simple German subordinate clauses.

The German grammar correctly keeps dependent words 'close to each other' in its derivation trees; that is, there are no structural long-distance dependencies. This is formally characterized in the following definition: the German CFG analysis has a BOUNDED DEPENDENCY DOMAIN.

1-12 definition. A syntactic description of a fragment of natural language (for example a context-free grammar) is said to have a BOUNDED DEPENDENCY DOMAIN if it assigns some graph-shaped structural analysis to a sentence, and there is a (typically small) number d, such that for any sentence, immediately dependent words are connected in the structural representation by a chain of no more than d edges.
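The definition suggests a direct mechanical check. In the following sketch, the structural analysis is an undirected edge list, distances are measured by breadth-first search, and the example (the tree of figure 1.2, with its two N0 nodes made unique by hand, and the bound d = 5) is my own illustrative choice:

```python
from collections import deque

def distance(edges, a, b):
    """Length of the shortest chain of edges between nodes a and b."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        u, k = queue.popleft()
        if u == b:
            return k
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append((v, k + 1))
    return float("inf")

def bounded(edges, dependencies, d):
    return all(distance(edges, u, v) <= d for u, v in dependencies)

# The tree of figure 1.2:
tree = [("V0", "N0.1"), ("V0", "VI"), ("VI", "VT"), ("VI", "N0.2"),
        ("N0.1", "Julia"), ("VT", "drank"), ("N0.2", "coffee")]
deps = [("drank", "Julia"), ("drank", "coffee")]
print(bounded(tree, deps, d=5))  # True: both chains use at most 5 edges
```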

*

For simple fragments of English and German, I have now showed that context-free grammars can properly capture EMBEDDING such as the VR construction, CONSTITUENCY or PHRASE STRUCTURE, and the right DEPENDENCY relations. Nonetheless it has been argued many times that CFGs are inadequate for giving a complete description of any particular natural language. All such arguments in the literature employ one of the following two typical ways of proving the non-context-freeness of natural language: a WEAK GENERATIVE CAPACITY (wgc) argument aims at showing that there is no context-free grammar that generates precisely the correct sentences of a given language (say English), without looking at how the grammar analyzes the constructions of the language. A STRONG GENERATIVE CAPACITY (sgc) argument assumes an amount of knowledge about the structure of the language, and shows that there is no grammar whose structural analyses respect that knowledge.

With wgc reasoning, non-context-freeness can be decided for a smaller class of languages, but in a way often thought more convincing, because an sgc argument assumes theoretical knowledge that one has to believe in order to find the argument convincing. It is perhaps confusing that a weak generative capacity argument has a stronger content than a strong generative capacity argument. The terminology wgc and sgc has the following definition as a clarifying historical background.

1-13 definition: weak/strong equivalence. Two grammars are WEAKLY EQUIVALENT if they generate the same languages in the sense of definition 1-7. They are STRONGLY EQUIVALENT if, in addition to that, they assign equivalent structural analyses.

*

When context-free grammars are under investigation, both types of argument usually lean on the well-known PUMPING LEMMA, which will be generalized later (3-10, 3-11, 8-8).

1-14 theorem: pumping lemma for CFG. Let L be a context-free language. Then there is a constant c₀ such that for any w ∈ L with |w| > c₀, there are strings u₀, u₁, u₂ and v₁, v₂ such that v₁ and v₂ are not longer than c₀, at least one of v₁ and v₂ is not empty, w = u₀v₁u₁v₂u₂, and for any p ≥ 0, u₀v₁ᵖu₁v₂ᵖu₂ ∈ L.

Proof. The full proof is in [HU79]. Let L be recognized by a context-free grammar G. Grammar G can be transformed into a CHOMSKY NORMAL FORM grammar G′ = (N′, T, S, P′) that has no EMPTY PRODUCTIONS (productions whose right hand side is ε) except possibly S → ε, and no productions with more than 2 symbols on their RHS. Take c₀ = 2^|N′|, two to the power of the number of nonterminals in G′. Then if a sentence w is in L and longer than c₀, at least one branch in its G′-derivation must be longer than the number of nonterminals, so one nonterminal A must appear on this branch more than once. So, using the rewriting semantics, w can be derived as follows:

(1.16) S ⇒* u₀ A u₂
         ⇒* u₀ v₁ A v₂ u₂
         ⇒* u₀ v₁ u₁ v₂ u₂ = w

Now observe that the recursive chain

(1.17) A ⇒* v₁ A v₂

in this rewriting sequence can be eliminated (p = 0), or iterated (p > 1). ∎

DUTCH is used abundantly in arguments that natural language is beyond the capacity of context-free grammars, both weakly and strongly. Some weak generative capacity arguments, however, also work in German, such as the following argument, done for Dutch in [MR87].

1-15 definition: regular languages. Let some alphabet T be given, and let a ∈ T. Then

– the SINGLETON a briefly denotes the language containing just the string a;

– let L₁ and L₂ be languages over T; then the CONCATENATION L₁L₂ is the language containing all concatenations w₁w₂ of any w₁ ∈ L₁ and w₂ ∈ L₂;

– the UNION L₁ ∪ L₂ denotes the language which contains all strings that are either in L₁ or in L₂;

– the ITERATION L* denotes the language { wⁿ | n ≥ 0, w ∈ L } (where wⁿ stands for n repetitions of the string w).

A language that can be defined using the operations singleton, concatenation, union and iteration is called a REGULAR LANGUAGE. An example is the language a*b(b ∪ c)*d*, which contains sentences such as aaaabcbbccbddd.

It is customary to write L⁺ to abbreviate LL* = { wⁿ | n ≥ 1, w ∈ L }. The regular languages form a proper subclass of the context-free languages.
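The four operations map directly onto regular-expression notation (union becomes a character class or alternation, iteration the star), so membership in the example language can be tested mechanically:

```python
import re

# The example language a*b(b ∪ c)*d* as a regular expression.
example = re.compile(r"a*b[bc]*d*")

print(bool(example.fullmatch("aaaabcbbccbddd")))  # True
print(bool(example.fullmatch("aadbb")))           # False: d may not precede b
```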

1-16 lemma: intersection. The class of context-free languages is closed under intersection with regular languages, that is: if a context-free language is intersected with a regular language, the result is again a context-free language [HU79].

1-17 proposition: linguistic limits of CFG, wgc and German [MR87]. There is no context-free grammar that generates the set of syntactically correct German sentences.

Proof. While the dependency structure of simple German subordinate clauses can be described by a context-free grammar, binary conjunctions as in (1.18) are a construction that cannot be generated by a CFG.

(1.18) daß  Frank  Julia  Fred  schwimmen  helfen  ließ
       that Frank  Julia  Fred  swim       help    let-3SG
       und  ertrinken  lassen   sah
       and  drown      let-INF  saw

This is proved as follows. Intersecting German with the regular language

(1.19) R = , daß Frank Julia N* schwimmen V* ließ und ertrinken V* sah

where

    N = {Fred, Dieter, Ute, Beate}
    V = {beibringen, sehen, hören, helfen, lassen}

yields the set

(1.20) L = { , daß Frank Julia Nᵏ schwimmen Vᵏ ließ und ertrinken Vᵏ sah | k ≥ 0 }

Now suppose German is context-free. Then so is L, by lemma 1-16. So the pumping

lemma 1-14 applies. Look at the sentence in L that has k = c₀; then the pumping lemma says that new sentences in L can be produced from it by pumping two substrings shorter than c₀; it is easy to see that this is impossible: at least 3 substrings will need

to be pumped. This is a contradiction, so German is not context-free. ∎

The English verb phrase structure does not seem to allow for a similar construction, because the second clause cannot leave out the objects, but replaces them with him/her.

(1.21) that Frank let Julia help Fred swim and saw her let him drown

The structure of Dutch is even 'harder' for context-free grammars. In 1976, HUYBRECHTS [Huy76] published an argument against the weak context-freeness of Dutch using a technique similar to the above for German, projecting³ a simple, co-ordination-free crossed dependency sentence onto the 2-copy language { ww | w ∈ {a, b}* }, which is known not to be context-free (use the pumping lemma). This argument was cumbersome, because the dependencies in Dutch are practically invisible in the morphological realization, so effectively, one would end up with something like { ww | w ∈ a* }, which is context-free. Later arguments either use conjunctions as in 1-17, or are based on strong generative capacity (see e.g. Huybrechts' refinement [Huy84], or the following proposition).

1-18 proposition: linguistic limits of CFG; sgc and Dutch. There is no context-free grammar that describes the set of syntactically correct Dutch sentences and whose derivation trees have a bounded dependency domain.

f j ww w a g, which is context-free. Later arguments either use conjunctions as in 1-17,or are based on strong generative capacity (see e.g. Huybrechts refinement [Huy84], or the following proposition). 1-18 proposition: linguistic limits of CFG; sgc and Dutch. There is no context-free grammar that describes the set of syntactically correct Dutch sentences and whose derivation trees have a bounded dependency domain.

3Through intersection with a regular language and a homomorphism. 1.2. Dependency 27

Proof. The Dutch equivalent of raising verb sentence shows a form of CROSSED DEPENDENCY:in

(1.22) ...dat Frank Julia Fred zag helpen zwemmen

Frank is linked to zag, and all further objects in the sentence appear between Frank and zag. Since there can be any number of objects, Dutch has inherently long distance dependencies w.r.t. the surface order. But the Dutch verb phrase doesn’t have the embedding structure of the German sentences; the pumping lemma can be used to show that in this case there is no context-free grammar that reflects the proper dependencies in its derivational analysis, i.e., that the grammar cannot have a bounded dependency domain. Suppose there is a CFG for Dutch. Then this grammar has a pumping lemma4 constant c0. Let s be a sentence longer than c0 from the following fragment:

k k (1.23) R  dat Frank N Julia zag V zwemmen

where

f g

N  Fred Pieter Ada Bea

f g V  leren zien horen helpen laten then the grammar will recognize correct Dutch sentences obtained from s by pumping two substrings. The fragment is constructed in such a way that correct Dutch sen- tences obtained by pumping from a sentence in the fragment must again belong to the fragment.5 Indeed, the two pumped strings must be of the form N p and V p. The construction of the pumping lemma shows us that these strings are generated

in an embedding structure; the overall derivation looks like

(1.24) S dat Frank u0 Au3

p p dat Frank u0 N AV u3

p p dat Frank u0 N u1 Julia zag u2 V u3

p p So every repetition of the recursive chain A N AV will increase the distance

(1.25)  S
        ├── ...dat Frank
        └── A
            ├── Nᵖ
            ├── A
            │   └── ... Julia zag ...
            └── Vᵖ

Julia zag

It can now be concluded that there is no context-free analysis of Dutch that has a

bounded dependency domain.

Chapter 8 includes a discussion of other arguments against the weak and strong adequacy of various grammatical formalisms, including CFG. In this thesis I will investigate straightforward extensions of context-free grammar in which some (chapter 3), or all (chapter 10), long-distance dependencies structurally inherent to CFG can be localized.

1.3 Some meta-theory

Computational linguists have always attached great value to drawing tree-shaped anal- yses from which the derived sentence can be read at the bottom going straight from left to right. This is partially due to the roleˆ context-free grammar has played as a fundamental basis for grammatical description since the early days of generative lin- guistics. On the other hand, this tendency has been able to survive because linguistic research has always had a considerable focus on English, which has a relatively high correlation between surface order and dependency. The hardness of the Dutch cross-serial construction (1.22) is often characterized in works on DEPENDENCY GRAMMAR as the property of being NON-PROJECTIVE according to the definitions made in this section. 1-19 deÆnitions: dependency, phrase structure, node. I coarsely divide syn- tactic theories into DEPENDENCY BASED METHODS and PHRASE STRUCTURE BASED METHODS. A dependency based theory takes words as the basic entities, whereas a phrase structure based theory recognizes phrases, that combine words or smaller phrases, as additional entities. Some phrase structure based theories also assume the existence of “empty” entities which play certain functional, non-structural, roles.ˆ For every known grammatical theory, the basic entities will appear in some form as the vertices of a graph- or tree-shaped structure. Abstracting away from the type of theory, I will use the word NODE to refer to a basic entity. * Dependency-based frameworks typically arise when so-called non-configurational languages are to be described, that is, languages with a very weak interaction between dependency and word order—think of Latin. At the other end, phrase structure based methods perform best on languages where dependency and word order are very tightly related, hence especially on English. Dutch and German are somewhere in the middle of this spectrum, and they often require “untypical” use of the syntactical constructions offered in either type of grammatical frameworks. 1-20 deÆnitions: head, dependent, hierarchy. When two nodes are in an im- mediate dependency relation to each other, one is often designated the HEAD and the other the DEPENDENT or COMPLEMENT. A complement is strictly required by the head, whereas dependent refers to the more optional case, including, e.g. adjectives, adverbs, prepositional phases, &c. I will call a theory STRICTLY HIERARCHICAL if in the analysis of a single full sentence, each node has at most one head, and there is only one node, the TOP NODE, that has no head. A theory is STRICTLY SELECTIONAL if every dependent is a complement. * Context-free grammar qualifies as a strictly hierarchical, phrase structure based syntac- tical theory. The dependency pictures drawn in the previous section were not strictly 30 Phrase structure and dependency hierarchical, because they contained cycles. The directionality of immediate dependency from definition 1-20 is often motivated by the fact that one word SELECTS for a word of another type, and not vice versa. If a transitive verb kissed is in a dependency relation with its object Julia, then the noun is there because the verb SELECTS for a noun phrase, that is, the verb can appear in a correct sentence only when combined with an object. Therefore, the verb is said to be the head in the dependency relation, and the object is the complement. When there is a more optional selection, e.g. 
between a noun and an adjective, one often allows oneself to speak of the adjective as a more general case of dependent of the head noun, which is not a complement. Another model of the noun–adjective case is the strictly selectional approach of calling the adjective a head and the noun a complement—in this case, the noun will often have more than one head, and the resulting theory cannot be strictly hierarchical.6 There has been some interest in the literature [Hud90] [Nic78] in looking seriously at the option of allowing a word or phrase to be governed by more than one head. The dependency pictures from the previous section can be viewed as an example of multi-headedness. Another example is the following relative clause in Latin taken from OVID, Metamorphoses (Pyramus & Thisbe):

(1.26) ...altera    quas      Oriens    habuit  praelata   puellis
       the-other    that      the-East  had     preferred  girls
       F-NOM-SG     F-ACC-PL                    F-NOM-SG   F-ABL-PL
       "...the other, the preferred one among the girls known in the East" (Ov. Met. IV 56)

The relative pronoun quas receives its accusative case from the verb habuit, and its feminine plural features from puellis.7 The following, directional, version of the dependency pictures is free of cycles and multi-headedness. Sentence (1.27) shows a generally accepted subset, including directions, of the dependencies in the English subordinate clause (I leave out the relations to the word that in all examples). The arcs drawn below the sentence in the previous section are considered secondary, and can in fact all be thought of as the composition of two other arcs (the object of a raising verb is the subject of its infinitive verb complement). By a convention due to HUDSON [Hud], a vertical arrow points at the top node.

(1.27) (...that) Frank saw Julia help Fred swim

6 In chapter 10, a stage is reached where all dependents are complements.
7 An alternative view, taken in section 10.4, is that the relative pronoun is an adjunct, and adjuncts are heads, so quas is the head of puellis.

The German (1.28) and Dutch (1.29) cases can be assigned identical dependency relations; the Dutch dependencies however are “tangled”,8 which makes it impossible to draw a tree analysis such as I did for English and German in the previous section. This is formalized in definition 1-21.

(1.28) (...daß) Frank Julia Fred schwimmen helfen sah

(1.29) (...dat) Frank Julia Fred zag helpen zwemmen

1-21 definition: projectivity (Hays, Lecerf) [Mel88][Hay64]. Given a selection of dependency relations, a sentence is called PROJECTIVE if, when drawing all arcs linking dependent words as arrows from head to dependent, above the sentence,

1. no arc is crossing another arc
2. no arc is covering the top node

If one draws, as above, a vertical arrow pointing at the top node, condition 1 implies condition 2, because an arc covers the top node if and only if it crosses the vertical arrow pointing at the top node. Vice versa, if condition 2 is satisfied, a vertical arrow to the top node can be drawn that doesn't cross any arc. A fragment of natural language (i.e., a language according to definition 1-7) is projective if all its member sentences are projective.
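The projectivity test is easy to mechanize. The following minimal sketch (my illustration, in Python; not part of the original text) checks both conditions for arcs given as (head, dependent) pairs of word positions:

def projective(arcs, top):
    spans = [(min(h, d), max(h, d)) for h, d in arcs]
    for i, (l1, r1) in enumerate(spans):
        if l1 < top < r1:                    # condition 2: arc covers the top node
            return False
        for l2, r2 in spans[i + 1:]:         # condition 1: two arcs cross
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

# nested arcs are fine; crossing arcs, as in the Dutch case, are not
print(projective([(1, 0), (1, 2)], top=1))   # True
print(projective([(0, 2), (1, 3)], top=0))   # False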

* The majority of dependency-based theories assume some form of the projectivity property, which takes away a great deal of the benefit of a pure dependency approach: easy description of non-projective phenomena such as free word order or crossed dependencies. One of the points of the work in this thesis is that projectivity with respect to the surface order is slightly beside the essence of the intuition one wants to capture. Still, one of the aims of this work is to bring the dependency and phrase structure views together; but since I will use the 'literal movement' paradigm to model phenomena like crossed dependencies, projectivity is not precisely the notion that will make this link.

For now, let's stay with the unproblematic case: the grammar for English. The dependency arcs readily suggest a tree representation in which the sentence can be read at the bottom of the tree going left to right. Dependency structure (1.27) is a tree (figure 1.5a), and easily transformed to a derivation tree of a context-free grammar (figure 1.5b). By convention, phrase–head arcs like V0–VR and V0–VI in a tree representation will be drawn vertically. Chains of phrase–head relations are called PROJECTIONS.

8 The term tangled is due to HUDSON, [Hud].

Figure 1.5: From dependency structures to constituent trees (1)

Figure 1.6: From dependency structures to constituent trees (2)

Note that one should by no means take the dependencies as drawn in (1.27) for granted. For example, one might be happier with a direct dependency between a raising verb like saw and the object Julia of its complement help, because arguably it is saw that assigns accusative case to Julia, and not the infinitive verb help to its subject. Such a view on the dependencies is highlighted in (1.30). In this case however, verbs have different numbers of complements depending on their morphology—the finite verb saw would select for a subject, but the infinitive verbs would not.

(1.30) (...that) Frank saw Julia help Fred swim

Figure 1.6 is a free translation of these alternative dependencies (a.) to a context-free derivation tree (b.); in the constituent paradigm one is free to put in an extra verbal level VI between the V0 and the finite VR so that each type of verbal constituent has a unique number of complements. The grammar from section 1.1 captures this generalization; its dependencies are derived from (1.27), but another verbal level is added between the two complements of a VR; see figure 1.7. Arguments in favour of dependency based theories tend to consider the high number of nodes, and the relatively high number of arcs separating heads from their complements, a disadvantage. On the other hand, a structure as in figure 1.7 has the advantage that properties such as case marking and word order constraints can be easily generalized.

Figure 1.7: Tree according to the context-free grammar on page 15.

As noted before, the dependency structures for German and Dutch are identical to the English ones. By just changing the left-to-right ordering of the branches of the tree representations, a constituent analysis can be drawn for the German sentence; if the dependencies are taken from (1.27) and extended with the same extra level as in the English example, the resulting analysis is as given by the context-free grammar in figure 1.4. Such a development is clearly impossible for the Dutch sentence; the same tangling arcs as in (1.29) will appear in the tentative context-free derivation. So given a set of dependencies, one can draw a constituent analysis if and only if the sentence is projective.

* In phrase-structure based theories, trees serve two, sometimes conflicting, purposes: they capture dependency as well as phrase structure or constituency. Dependency is a relatively straightforward notion, but phrase structure and constituency are a bit ambiguous. PROJECTIVE GRAMMAR FORMALISMS assume that a phrase or constituent, i.e. a logical group of words, is a continuous substring of the sentence; but other formalisms may refer to a 'phrase' or 'constituent' as a group of words that can be interleaved with words from other phrases. Therefore I will avoid these terms, and rather distinguish three different notions of CLUSTERS.

1-22 definitions. A SURFACE CLUSTER is a word or logical word group that appears unfragmented in the complete sentence; surface clusters can often be combined to form larger surface clusters.

A SURFACE STRUCTURE is a tree from which the sentence can be read by listing the words at its leaves from left to right. In the literature this is often, but not always, called phrase structure or constituent structure.

A DEPENDENCY CLUSTER is the word group obtained by starting with one node (the HEAD of the dependency cluster), and repeatedly following the immediate dependency arrows departing from each node.

A DEPENDENCY STRUCTURE is the minimal tree containing all the arcs connecting the words in a dependency cluster.

A DEEP CLUSTER is a word or logical word group that coincides with a theory’s notion of a phrase or constituent. A deep cluster does not necessarily appear unfragmented in the complete sentence, and is not necessarily representable as a (single) string of words. When fragmented, this is sometimes called a discontinuous constituent in the literature. Like surface clusters, deep clusters are built of smaller deep clusters.

A DEEP STRUCTURE is the tree obtained by repetitively splitting up a deep cluster into smaller deep clusters.
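A dependency cluster is, in graph terms, simply the set of nodes reachable by following the immediate-dependency arrows. A minimal sketch (my illustration, in Python) of this notion:

def dependency_cluster(arrows, head):
    # `arrows` maps each node to its immediate dependents
    cluster, stack = set(), [head]
    while stack:
        node = stack.pop()
        if node not in cluster:
            cluster.add(node)
            stack.extend(arrows.get(node, []))
    return cluster

# one plausible reading of the primary arcs in (1.27)
arrows = {"saw": ["Frank", "help"], "help": ["Julia", "swim"], "swim": ["Fred"]}
print(dependency_cluster(arrows, "help"))    # {'help', 'Julia', 'swim', 'Fred'}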

* In most theories, some of these notions will coincide. A phrase structure based theory will often identify deep structure with surface structure (the c-structure in LFG). Because every language will have long-distance dependencies w.r.t. any assigned surface structure, these notions cannot coincide with that of a dependency structure. TREE ADJOINING GRAMMAR (TAG) is an example of a theory that recognizes surface and deep structures that are clearly distinct. Theories in the GOVERNMENT-BINDING family will not just have a deep structure and a surface structure, but will have longer series of tree structures, one derived from the other. GB theory is discussed in chapter 9.

In the theories developed in this thesis, except in chapter 9 on GB theory, deep structure is related to dependency structure in varying degrees of tightness, and surface structure does not play a role. Instead, I focus on defining operations, applied recursively at each node of the deep structure, that spell out a sentence. I will now introduce some terminology to classify the different formalisms in terms of what types of analyses they make.

1-23 definitions: projective formalism. A PROJECTIVE GRAMMAR FORMALISM is a formalism assigning to each sentence a surface structure (possibly among other structures). A projective grammar formalism can be viewed as a formalism whose grammatical descriptions have, in some sense or other, a context-free grammar at their basis. This is often called the CONTEXT-FREE BACKBONE of a grammar.

* Non-projective phenomena such as the crossed dependencies in Dutch and Swiss German mean that such languages cannot be described trivially by projective formalisms, that is, not without carrying information over unboundedly long chains of arcs in the deep structure or surface structure. The analysis of the Dutch cross-serial phrase in the LFG formalism [BKPZ82], shown in figure 1.8, illustrates how information needs to be carried over arbitrarily long distances in the constituent structure.

Figure 1.8: LFG analysis of the Dutch cross-serial phrase

A formalism based on phrases with a non-continuous representation is MODIFIED HEAD GRAMMAR (MHG), discussed in chapter 3. In MHG, a deep cluster is as much as possible also a dependency cluster, and is represented as a string with a "hole", typically left or right of its head; in addition to the normal way of combining it with one or more other phrases (concatenation), there is an additional operation, WRAPPING. An example is the deep phrase easy to please. It can appear unfragmented (1.31) or fragmented (1.32), so it is not a surface cluster.

(1.31) Kim is easy to please

(1.32) Kim is an easy person to please

So easy to please can be thought of as having a hole between easy and to please, and one can say that in the second sentence, it is WRAPPED around the noun person. Wrapping is also capable of solving the crossed-dependency problem in a straightforward manner.9 Think of the Dutch verb phrase as having a hole between the OBJECT CLUSTER and the VERB CLUSTER it typically consists of.

(1.33) Zag jij Fred zwemmen? Did you see Fred swim?

(1.34) Of Julia Fred zag zwemmen? Whether Julia saw Fred swim?

1-24 definitions. The following terminology will be used informally: a language that can be felicitously described using a projective formalism is a CONFIGURATIONAL LANGUAGE.

A grammar formalism that is non-projective, but phrase structure based and preserving a bounded dependency domain, is called MILDLY PROJECTIVE. A language that can be described by a mildly projective grammar is called MILDLY CONFIGURATIONAL.

A language that is neither configurational nor mildly configurational is called NON-CONFIGURATIONAL.10

* It would be tempting to call English configurational, Dutch mildly configurational, and Latin non-configurational. Chapters 3 and 10 of this thesis will be devoted to the development of mildly projective phrase-structure based analyses.

9 In an isolated fragment containing subordinate clauses only. See chapter 3 and section 8.2 for a full discussion.
10 The notion of non-configurationality is used primarily in Government-Binding contexts to pigeonhole languages that have failed descriptive attempts through the transformational analyses known from the GB/EST frameworks.

1.4 Overview of the thesis

Now that some formal material and some meta-theoretical notions have been introduced, I can feasibly discuss the three different perspectives on the points of departure A–J listed in the Prologue that are offered in parts I, II and III of this thesis, and briefly summarize which constructions and results in this thesis are summaries of previously published research and which are new.

I. Formal Structure

In recent years there has been considerable interest in 'light' grammar formalisms aimed at describing linguistic structure only, whose descriptive capacity is minimally larger than that of a context-free grammar, but which give simple and well-motivated descriptions of non-projective word order phenomena (hints B and C). A well known example is TREE ADJOINING GRAMMAR (TAG, see e.g. [Wei88]); less common are LINEAR INDEXED GRAMMAR (LIG), HEAD GRAMMAR (HG, [Pol84]) and COMBINATORY CATEGORIAL GRAMMAR (CCG). HG, LIG, CCG and TAG form a mutually equivalent group [VSW94] in a hierarchy of languages described by a TUPLE-BASED formalism called LINEAR CONTEXT-FREE REWRITING SYSTEMS (LCFRS or MCFG, [Wei88]); this class is further extended with PARALLEL MULTIPLE CONTEXT-FREE GRAMMARS (PMCFG) in [KNSK92]. An older formalism with similar purposes, but so far not compared in the literature to the other formalisms I mentioned, is EXTRAPOSITION GRAMMAR (XG, [Per81]). XG can be thought of as formalizing the cyclic dependencies that may arise when describing a relative clause, whereas tuple-based formalisms such as HG and MHG are aimed at eliminating long distance dependency altogether.

Chapter 2 opens part I by introducing XG. New in this chapter are descriptions of Dutch, which shed light on more complex forms of extraposition than covered in [Per81]. In order to extend the coverage of these descriptions, a modification is made to the interpretation of XG grammars.

Chapter 3 starts with an extensive overview of tuple-based formalisms from the literature, which are an illustration of hint H, putting together results from various sources, and adding an occasional detail that is not covered in the literature, such as the difficulty of describing full Dutch sentences in MHG, and whether PMCFG describes a set of Dutch and German data that MANASTER-RAMER showed to be beyond the capacity of tree adjoining grammar (developing further the line started in proposition 1-17). Then, the various tuple grammar formalisms are developed into a new general form called LITERAL MOVEMENT GRAMMAR (LMG) that will play a central role in this thesis. A well-behaved subclass of this generic formalism is called SIMPLE LMG, and essentially extends MCFG with the ability to share parts of the input string between different branches in a derivation. Tuple grammars have the property that they can describe languages with very different surface orders based on identical tree structures (hint A).

II. Computational Tractability

An important property of simple grammatical formalisms (with, in a linguistic context, CFG as the extreme example) is that they allow efficient implementation of recognition and parsing procedures (hint E). Part II of this thesis looks at everything related to computer implementation, in theory and in practice.

In Computer Science, it is customary to express the theoretical performance of algorithms in terms of their COMPUTATIONAL COMPLEXITY: a function expressing how the time and space requirements of the algorithm depend on the size n of the input parameters of the algorithm. Given an abstract problem, it is often possible to make predictions about the minimum and maximum complexity of any possible algorithm that solves it. One is usually more interested in these statements about the complexity of a problem rather than the complexity of a particular algorithm. A well known example of a complexity statement is

(1.35) Fixed recognition for context-free languages can be performed in O(n³) time and O(n²) space

where the symbol O is a notation indicating the order of magnitude by which the execution time and required space of recognition algorithms grow in terms of the size of the input sentence and the size of the grammar.
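To give a concrete impression of where the n³ in (1.35) comes from, here is a minimal sketch (my illustration, not part of the thesis) of the classical CKY recognizer for a CFG in Chomsky normal form; three nested loops over sentence positions give the O(n³) time bound, and the chart itself gives the O(n²) space bound:

def cky(words, binary, lexical, start):
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = {a for (a, b) in lexical if b == w}
    for width in range(2, n + 1):                  # O(n) widths
        for i in range(n - width + 1):             # O(n) start positions
            for mid in range(i + 1, i + width):    # O(n) split points
                for (a, b, c) in binary:           # rules A -> B C
                    if b in chart[i][mid] and c in chart[mid][i + width]:
                        chart[i][i + width].add(a)
    return start in chart[0][n]

binary = [("S", "NP", "VP"), ("VP", "V", "NP")]
lexical = [("NP", "Frank"), ("NP", "Julia"), ("V", "saw")]
print(cky("Frank saw Julia".split(), binary, lexical, "S"))    # True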

Statements like (1.35) are explained in more detail in chapter 4, which also gives various ways of getting to similar results for classes of stronger formalisms like HG, MCFG, LMG and XG: two models of computation are introduced, RANDOM ACCESS MACHINES and ALTERNATING TURING MACHINES, and ROUNDS' logical language iLFP. New in chapter 4 are the definition of INDEX LMG and PREDICATE DESCENT ALGORITHMS, which formally capture the well-known complexity heuristics based on counting the number of positions in the analyzed sentence that play a role when checking a grammar rule. Index LMG can be thought of as a 'minimal form' of iLFP formulae, and the relationship between these two is investigated briefly.

Chapter 5 redevelops a proof cycle carried out in [Rou88] and [CKS81] to demonstrate that simple LMG describe precisely the languages that can be recognized in time polynomial in the size of the input. It also includes discussions of how to obtain polynomial time bounds for recognition of arbitrary grammars in the simple LMG spectrum, briefly looks at the problem of UNIVERSAL RECOGNITION that has been argued to be intractable, and at stronger forms that have an exponential time recognition.

Chapter 6, in which I try to obtain similar results for classes of XG, is completely original. The standard interpretation in Pereira's sense is first proven to be undecidable. Restrictions on the format and interpretation of XG started in chapter 2 are developed further, and lead to the definition of classes that have polynomial-time recognition.

The practical chapter 7 is the first of a series of significantly less formally minded chapters. It looks at an implementation of an LMG based parser and practically feasible methods of defining attributes such as morphology and semantic interpretation over packed tree representations output by a parser.

III. Principles

Part III investigates how a framework might be built that preserves the intuitions about underlying structure from part I, and the tractability results in part II, but in a way that respects as many motivated linguistic principles as possible. The chapters in the third part, and in a sense also chapter 7, are of a much more tentative nature than the previous chapters—they aim at a broad, liberal-minded investigation of various approaches to language description without going into detailed formal presentations.

In chapter 8, the discussion of part I is picked up in a more meta-theoretical manner. One of the motivations for the light grammar formalisms of part I is that in order to study the nature of linguistic structure in a "theory-independent" fashion, it is valuable to investigate precisely how much formal power is required for an adequate structural description of natural languages. A number of examples from the literature are listed, making up for the rather limited view of the examples on partial verb clause conjunctions in chapter 3, and the notions of CONSTANT GROWTH and MILD CONTEXT-SENSITIVITY are introduced. New are revised definitions of mild context-sensitivity and an investigation of the formal power of MHG w.r.t. the combination of topicalization, relative clauses and crossed dependencies, while demanding a bounded dependency domain.

Chapters 9 and 10 are inspired by hint D. In chapter 9, I introduce the notion of a PRINCIPLE BASED or EXPLANATORY theory; that is, a grammatical theory that is aimed primarily at explaining linguistic phenomena, as opposed to merely giving practical descriptions. The GB framework, also called the EXTENDED STANDARD THEORY (EST) or, more generally, the PRINCIPLES AND PARAMETERS (P&P) framework, and its successor, the MINIMALIST PROGRAM [Cho96]11 are well known for their broad coverage of linguistic phenomena. Some theories have tried to achieve an equivalent coverage (most notably HPSG, [PS94]), but in the field of descriptive linguistics, GB theory remains the standard framework. New to a certain extent, in this chapter, is only the final section proposing a modification of GB that abandons surface structure, and makes use of the LITERAL MOVEMENT construction from chapter 3 instead.

The final chapter 10 pays attention to hints F and G: it tries to combine research in mildly context-sensitive formalisms, head grammar, dependency grammar and so forth; like the description of GB in terms of literal movement, it is largely programmatic in nature, and keeps a large number of design decisions open. It includes a study of the capacity of a syntactic basis slightly extending that of Pollard's HG to describe demanding surface order phenomena in Dutch and Latin, and a proposal for an unorthodox way of looking at relative clause attachment.

* Part III is followed by two reference diagrams (pp. 214, 215), which may help reading this thesis with its wealth of abbreviations, and an Epilogue that connects this thesis to CG and TAG, and gives a broad concluding discussion.

11 This book, The Minimalist Program, also contains an up-to-date historical overview and an introduction to the EST. A more basic introduction is [Hae91].

Part I

Formal Structure

Chapter 2

Extraposition grammar

Of the grammar formalisms defined in this thesis, EXTRAPOSITION GRAMMAR is closest to CFG, so it is sensible to use it to open part I. XG, first defined by FERNANDO PEREIRA in [Per81], is a grammar formalism that augments context-free grammar with a dedicated construction modelling LEFTWARD EXTRAPOSITION. Pereira writes that leftward extraposition is a widely used model for describing interrogative sentences and relative clauses in most Western European languages, and these constructions are so essential, even in small real-world applications, that one would like to be able "to express them in a clear and concise manner".

Pereira’s paper concentrates on leftward extraposition of noun phrases, such as

(2.1) The mouse [that_i the cat chased ε_i] squeaks.

In the first section of this chapter, I will define XG in the form from [Per81] and explain how it can be used to describe such leftward extraposition in English. I then observe that the long-distance dependencies in these sentences have some similarity to those of Dutch, and it will turn out that indeed XG give a simple account of Dutch cross-serial relative clauses. But to describe full Dutch sentences, it is convenient to make some modifications to the semantics of XG grammars. This is done in sections 2.2 and 2.3. XG will return in chapter 6, where it turns out that the modified interpretations of XG proposed in this chapter have favourable computational properties.


2.1 Definition and examples

The following definition gives a slightly simplified1 characterization of the original form of XG as in [Per81].

2-1 definition. An EXTRAPOSITION GRAMMAR (XG) is a tuple (N, T, S, P) where N and T are disjoint sets of nonterminal and terminal symbols, S ∈ N is the start symbol, and P is a finite set of productions of the forms

1. A → X1 X2 ⋯ Xn (a CONTEXT-FREE RULE)

2. A … B → X1 X2 ⋯ Xn (an ELLIPSIS RULE)

where A, B ∈ N and Xi ∈ N ∪ T.

* In contrast to the case of CFG, derivational and fixed-point semantics are essentially more difficult to define for XG, so the interpretation of XG given here is an analog of the rewriting semantics 1-6 for CFG.

2-2 definition: rewriting semantics for XG. Let α, β and γ be bracketed sequences of nonterminal and terminal symbols, that is, α, β, γ ∈ (N ∪ T ∪ {⟨, ⟩})*, and let a rule of type 1, 2 be in P; then the following hold, respectively:

1. αAβ ⇒ α X1X2⋯Xn β

2. αAβBγ ⇒ α X1X2⋯Xn ⟨β⟩ γ

provided that the brackets in β are BALANCED. Now G recognizes a string w if there is a bracketing w̄ of w such that S ⇒* w̄.

2-3 notation. I will use the notation of 1-4, with the following additions: ū, v̄, w̄ are (not necessarily balanced) bracketings of terminal strings u, v and w; ᾱ, β̄, γ̄ are bracketings of sequences of nonterminal and terminal symbols.
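The two rewriting steps, including the balancedness side condition, can be made concrete in a few lines. The following is a minimal sketch (my illustration, in Python), in which a sentential form is a list of symbols and '<', '>' are the brackets:

def balanced(seq):
    depth = 0
    for s in seq:
        if s == '<': depth += 1
        if s == '>': depth -= 1
        if depth < 0: return False
    return depth == 0

def apply_cf(form, i, a, rhs):
    # type 1, A -> X1..Xn: rewrite the A at position i
    assert form[i] == a
    return form[:i] + rhs + form[i + 1:]

def apply_ellipsis(form, i, j, a, b, rhs):
    # type 2, A ... B -> X1..Xn: rewrite A at i, erase B at j, bracketing
    # the material in between -- legal only if that material is balanced
    assert form[i] == a and form[j] == b and i < j
    if not balanced(form[i + 1:j]):
        return None
    return form[:i] + rhs + ['<'] + form[i + 1:j] + ['>'] + form[j + 1:]

# the final step of derivation (2.3) below:
form = ['Cmarker', 'N0', 'VT', 'Ntrace']
print(apply_ellipsis(form, 0, 3, 'Cmarker', 'Ntrace', ['that']))
# ['that', '<', 'N0', 'VT', '>']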

2-4 example: English relative clauses. The grammar in figure 2.1, taken from [Per81] and slightly modified to fit the category naming conventions of this book, derives simple English sentences with relative clauses like

(2.2) The mouse that_i the cat chased ε_i squeaks

The traditional analysis of this sentence is that the object of chased has been moved, or EXTRAPOSED, leftward. The co-indexed relative pronoun that_i and the empty element ε_i are called FILLER and TRACE, respectively.

1 In the XG described in [Per81], the rules of the second type are of the more general form A1 … A2 … An → X1X2⋯Xn. It is not hard to prove directly that the 'bilinear' version I discuss here is weakly equivalent to Pereira's definition; however, the result is for free here once I have proved that the bilinear version describes all r.e. languages.

[1] V0 → N0 VI
[2] VI → VT N0
[3] N0 → N0 Crel
[4] N0 → Ntrace
[5] Crel → Cmarker V0
[6] Cmarker … Ntrace → that
[7] VI → squeaks
[8] VT → chased
[9] VT → likes
[10] N0 → the cat
[11] N0 → the mouse
[12] N0 → fish

Figure 2.1: Simple grammar for English with relative clauses.

The relative clause that the cat chased is derived as follows:

(2.3) Crel ⇒ Cmarker V0             by rule [5]
           ⇒ Cmarker N0 VI          by rule [1]
           ⇒ Cmarker N0 VT N0       by rule [2]
           ⇒ Cmarker N0 VT Ntrace   by rule [4]
           ⇒ that ⟨N0 VT⟩           by rule [6]
           ⇒* that ⟨the cat chased⟩

The derivation graph in figure 2.2 shows more intuitively how the ellipsis rules establish a link between a relative marker and its trace.

*

A similar technique can be used to describe the crossed dependency structure of Dutch, except that instead of a single noun phrase, a whole series, consisting of all the objects in the verb phrase, is extraposed. The LFG analysis (figure 1.8 on page 35) mentioned in the introduction is an example of a noun phrase extraposition model.

Figure 2.2: Derivation of an English relative clause

[1] Csub → dat NC V0
[2] V0 → Ntrace VI
[3] VI → VT Ntrace
[4] VI → VR V0
[5] NC … Ntrace → N0 NC
[6] NC → ε

Figure 2.3: XG for Dutch subordinate clauses.

Figure 2.4: Derivation of a Dutch verb phrase.2

2-5 example: Dutch crossed dependencies. Figure 2.3 shows an example grammar that gives a description of Dutch cross-serial subordinate clauses, as in sentence (2.4).

(2.4) dat [NC Frank_i Julia_j koffie_k] [V0 ε_i zag ε_j drinken ε_k]
("that Frank saw Julia drink coffee")

The structure this grammar assigns to the verbal clause (V0) is identical to that of the context-free grammar for English (figure 1.1 on page 15). However, at subordinate clause level (Csub), it generates a NOUN CLUSTER (NC) that in turn will generate the objects whose traces appear in the actual verb clause V0. The nominal cluster splits off a noun phrase whenever there is a trace to be eliminated to its right.

2 A GB theorist may find the XG analysis unconventional, because to model the verb raising construction, the objects in the verb phrase are raised and moved leftward rather than raising the verbs and moving those rightward. However, as is obvious to a more naive observer of this figure, it is the verbs zag and drinken that are actually lifted up above the rest of the sentence.

The verb clause derives a string similar to the context-free case, but instead of actual noun phrases, it generates traces only:

(2.5) V0 ⇒* Ntrace zag Ntrace drinken Ntrace

Now, the verb clause is embedded into a Csub as follows:

(2.6) Csub ⇒ dat NC V0                                  by [1]
           ⇒* dat NC Ntrace zag Ntrace drinken Ntrace   by (2.5)
           ⇒ dat N0 NC zag Ntrace drinken Ntrace        by [5]
           ⇒ dat N0 N0 NC ⟨zag⟩ drinken Ntrace          by [5]
           ⇒ dat N0 N0 N0 NC ⟨⟨zag⟩ drinken⟩            by [5]
           ⇒ dat N0 N0 N0 ⟨⟨zag⟩ drinken⟩               by [6]
           ⇒* dat Frank Julia koffie ⟨⟨zag⟩ drinken⟩

The corresponding derivation graph is shown in figure 2.4. The derivation graph of this example shows intuitively how the balancedness constraint in definition 2-2 works in practice. A sequence is between brackets in the textual derivation whenever it is enclosed in a CYCLE in the derivation graph. Such cycles may be contained in each other, but the lines of one cycle may not cross those of another cycle. So in terms of derivation graphs, the balancedness constraint in the interpretation of an ellipsis rule ensures that the graphs have a top-down, left-to-right ordered, planar representation.3

The balancedness constraint thus enforces that the fillers are co-indexed with the right traces; only one such order is correct in Dutch. An attempt to swap the dependencies between the N0s and the traces as in (2.7), linking the object of a transitive verb to the first noun phrase in the NC, and the subject to the second, will fail because the balancedness constraint means that one is then at a dead end in the derivation. The corresponding tentative derivation graph would have crossing lines between the two object fillers N0 and their traces Ntrace.

(2.7) Csub ⇒* NC Ntrace VT Ntrace
           ⇒ N0 NC ⟨Ntrace VT⟩
           ⇒ ??

3 Note that in this planar representation, the lines that are stippled in the figures do not count—these are not part of the derivation graph but added for reasons of aesthetics.

2.2 Dutch sentence order and loose balancedness

It is in the spirit of the ideas expressed in the Prologue (hint B) and chapter 1 to try to extend the XG account of verb clauses to cover placement of the inflected verb in Dutch, using the same leftward extraposition mechanism. Look, for example, at the interrogative sentence (2.8).

(2.8) Zag Frank Julia koffie drinken? “Did Frank see Julia drink coffee?”

Here zag can be thought of as moving leftward out of the verb clause, crossing over the noun cluster, which itself may be linked to traces right of the verb trace. This cannot be done without violating the balancedness constraint, because the finite verb (the chain VR—zag in figure 2.4) is enclosed in a cycle. Sentence (2.9) adds another complication: wh-MOVEMENT (fronting of interrogative pronouns) and TOPICALIZATION (fronting of arbitrary phrases) allow a noun phrase to be moved leftward over the extraposed finite verb.

(2.9) Wie zag jij koffie drinken? “Who did you see drink coffee?”

[1] Csub → dat NC V0
[2] Cwh-decl → Ntopic Vtopic NC V0
[3] Cques → Vtopic NC V0
[4] V0 → Ntrace VI
[5] VI → VT Ntrace
[6] VI → VR V0
[7] V+fin → Vtrace-+fin
[8] Ntopic … Ntrace → N0
[9] Vtopic … Vtrace-+fin → V+fin
[10] Vtopic … Ntrace → Ntrace Vtopic
[11] NC … Vtrace-+fin → Vtrace-+fin NC
[12] NC … Ntrace → N0 NC
[13] NC … Ntrace → Ntrace NC
[14] NC → ε

Figure 2.5: Successive-cyclic XG for examples (2.8) and (2.9).

The technique used in the grammar of figure 2.5 exploits, in a naive fashion, the idea of successive-cyclic movement known from Government-Binding theory. One speaks of successive-cyclic movement if a phrase moves over a large distance in more than one step; in the current example this is done by allowing the optional verb traces to "step over" the brackets (rule schemas [10] and [11]) introduced in the derivation to prevent the lines connecting nominal filler–trace pairs from crossing. Similarly, an N-trace can be eliminated in the NC either by generating a real noun phrase (rule [12]) or by generating a new trace (rule [13]). At most one N-trace, and at most one verb trace, are filled at the top level (rules [1]–[3]). Any of the arguments in the verb phrase, including the subject, is allowed to be fronted; the initial position of the subject in a normal declarative sentence is also considered a form of topicalization.

Note that this grammar contains a number of RULE SCHEMATA that actually refer to a set of rules. For every verb type, say T for transitive, there will be nonterminals VT+fin and VT−fin, and when VT appears on the RHS of a rule without further specification (rule [5]), this rule is actually a schema that contains a rule for each of VT+fin and VT−fin.

Figure 2.6: XG derivation of Wie zag Frank koffie drinken?

* Another, perhaps better, solution to the problem of crossing threads of movement is to modify the semantics of XG so as to allow lines belonging to different filler–trace pairs to cross, rather than devising complex rule systems to bend the laws of the grammar framework. Hence the following revised rewriting semantics for XG.

2-6 definition: loose rewriting semantics for XG. Let α, β, γ ∈ (N ∪ T ∪ {⟨A, ⟩A | A ∈ N})* be sequences of nonterminal and terminal symbols augmented with LABELLED BRACKETS, and let a rule of type 1, 2 be in P; then the following hold, respectively:

1. αAβ ⇒ α X1X2⋯Xn β

2. αAβBγ ⇒ α X1X2⋯Xn ⟨B β ⟩B γ

provided that β is balanced w.r.t. each of the labelled bracket pairs ⟨A, ⟩A individually. That is, ⟨A ⟨B ⟩A ⟩B is a correct bracketing.
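The loose side condition can be sketched as follows (my illustration, in Python): brackets are (kind, label) pairs, and balancedness is checked per label, so that pairs with different labels may cross:

def loosely_balanced(seq):
    for lab in {label for (_, label) in seq}:
        depth = 0
        for kind, label in seq:
            if label != lab:
                continue
            depth += 1 if kind == '<' else -1
            if depth < 0:
                return False
        if depth != 0:
            return False
    return True

# <A <B >A >B is fine loosely, though it is not a balanced
# bracketing in the strict sense of definition 2-2
print(loosely_balanced([('<', 'A'), ('<', 'B'), ('>', 'A'), ('>', 'B')]))   # True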

* The loose interpretation has two advantages. In the current context it is valuable because it eliminates the need for the successive-cyclic movement that complicated the grammar in figure 2.5. In chapter 6 it will turn out to be the key to obtaining computationally tractable classes of extraposition grammars.

Under loose interpretation, rules [10] and [11] disappear altogether; in the corresponding derivation graph, the chain VR—Vtrace-R—VR—zag is not interconnected with the NC chain, but is allowed to cross it. The noun traces, however, are not allowed to cross each other, therefore rule [13] is to remain. A full example is deferred to the next section, where a new type of XG rule is introduced.

2.3 An island rule

One of the virtues of the XG formalism mentioned in [Per81] is, at first glance, lost under the loose interpretation: the capacity to elegantly describe EXTRAPOSITION ISLANDS by introducing extra brackets.

Figure 2.7 shows how the grammar for English relative clauses of section 2.1 derives sentences such as (2.10) that violate the COMPLEX-NP CONSTRAINT, which forbids a filler from outside a relative clause to be co-indexed with a trace inside it. These relative clauses, enclosed in square brackets in (2.10), are called ISLANDS to extraposition.

(2.10) The mouse that_i the cat [that_j ε_j chased ε_i] likes fish squeaks

The derivation of (2.10) in figure 2.7 clearly violates the complex-NP constraint. Pereira solves this by replacing the Crel production [5] by the rules [5′] and [5′′] shown in figure 2.8. This makes it impossible to derive the incorrect sentence (2.10), because the brackets introduced by Cmarker—Ntrace and Beginisland—Endisland are not balanced. The derivation in figure 2.9 shows how the pairs Beginisland—Endisland enclose relative clauses in cycles that cannot be crossed and hence block extraposition.

*

Figure 2.7: Derivation of a sentence violating the complex-NP constraint.

Replace rule [5] in figure 2.1 with

[5′] Crel → Beginisland Cmarker V0 Endisland
[5′′] Beginisland … Endisland → ε

Figure 2.8: Bracketing construction to enforce the complex-NP constraint.

Figure 2.9: Implementation of the complex NP-constraint using multiple brackets

If brackets introduced by different filler–trace pairs are allowed to be unbalanced, as in the loose interpretation proposed in the previous section, Pereira's island model will not work. Another revision of the interpretation could solve this issue—one could introduce classes of brackets that need to be matched against each other, i.e. the brackets introduced by Cmarker—Ntrace and Beginisland—Endisland would be required to be properly nested, but a verbal filler–trace pair would be allowed to cross. Instead of Pereira's method, I propose to add a construction to loosely interpreted XG that is dedicated to implementing island restrictions; this rule will preserve the promised tractability results of chapter 6.

2-7 definition: XG island rule. Extend XG with the following rule type:

3. A{B1, B2, …, Bn} → X1X2⋯Xn (an ISLAND RULE)

with the following semantics:

αAβ ⇒ α ⟨B1 ⟨B2 ⋯ ⟨Bn X1X2⋯Xn ⟩Bn ⋯ ⟩B2 ⟩B1 β

*

An island production allows a nonterminal A to be rewritten only if it then produces a string in which the brackets ⟨Bi and ⟩Bi are balanced; all extraposition of elements of type B1, …, Bn out of the 'island' will fail because the island is enclosed in brackets. Such brackets are added explicitly in Pereira's grammar (figures 2.8 and 2.9); now they are part of the definition of the island rule. Implementing the complex-NP constraint now amounts to changing the production for Crel, to say that all Ntrace within a relativized sentence must be matched within that sentence (figure 2.10).
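In the labelled-bracket representation sketched above, the island semantics amounts to wrapping the rewritten material; a minimal sketch (my illustration, in Python) of an island-rule application:

def apply_island(form, i, a, islands, rhs):
    # type 3, A{B1..Bn} -> X1..Xn: rewrite A at position i and enclose the
    # result in brackets labelled B1..Bn, so that no Bi filler-trace pair
    # can reach into it
    assert form[i] == a
    wrapped = list(rhs)
    for b in reversed(islands):
        wrapped = [('<', b)] + wrapped + [('>', b)]
    return form[:i] + wrapped + form[i + 1:]

# the island rule of figure 2.10 below:
print(apply_island(['Crel'], 0, 'Crel', ['Ntrace'], ['Cmarker', 'V0']))
# [('<', 'Ntrace'), 'Cmarker', 'V0', ('>', 'Ntrace')]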

Replace rule [5] in figure 2.1 with

[5′′′] Crel{Ntrace} → Cmarker V0

Figure 2.10: Enforcing the complex-NP constraint with an island rule.

* As said earlier, the Dutch grammar of figure 2.5 can be reformulated in XG under loose interpretation. The graph structures produced will be similar, but the successive-cyclic effect is no longer necessary, the graphs will have fewer nodes and the grammar fewer rules. Such a grammar is shown in figure 2.11; a derivation is shown in figure 2.12. The island rule eliminates the need to allow only finite verbs to form traces, by making the embedded verbal clause an island to verb traces.

Conclusions to chapter 2

The essential quality of the XG approach in the framework of this thesis is that it gives a model of extraposition without a hint of TRANSFORMATION of structures, such as in GB theory (chapter 9). There is one clearly defined sentential structure—be it more complex than a tree—which has elements of both deep structure and surface structure. In fact, the structure in figure 2.4 is rather similar to the LFG analysis (figure 1.8), except that the graph shape of the derivations makes the long-distance dependencies explicit (and thus, in a sense, eliminates them). To put XG in the meta-terminology of definition 1-22, an XG derivation graph is a deep structure that can be turned into a surface structure by eliminating the filler–trace arcs, and into a dependency structure by keeping the filler–trace arcs but removing the other arc that connects the filler to its 'structural parent'.

The only other appearance XG has made in the literature after Pereira's paper is, as far as I have been able to track down, in [Sta87], which talks about switch rules that look remarkably much like a simple form of head wrapping (see chapter 3). STABLER's approach in this article is symmetric, and thus gives equivalent models for both leftward and rightward extraposition.

This chapter used a considerable amount of material from Pereira's original paper [Per81] on XG. The descriptions of Dutch, the loose semantics and the island rule are original.

[1] Csub → dat NC V0
[2] Cwh-decl → Ntopic Vtopic NC V0
[3] Cques → Vtopic NC V0
[4] V0 → Ntrace VI
[5] VI → VT Ntrace
[6] VI → VR Visland
[7] Visland{Vtrace-x} → V0 (for every verb type x)
[8] Ntrace → Ntopic-trace
[9] Vx → Vtrace-x (for every verb type x)
[10] Ntopic … Ntopic-trace → N0
[11] Vtopic … Vtrace-x → Vx
[12] NC … Ntrace → N0 NC
[13] NC → ε

Figure 2.11: Loose/island XG for examples (2.8) and (2.9).

I will get back to XG in chapter 6 where I will prove that generic XG describe any recursively enumerable language, but that the notion of loose interpretation defined in this chapter yields tractable recognition. As was already shown here, the descriptive qualities do not necessarily decrease when different threads of extraposition are allowed to cross each other.

Figure 2.12: Loose/island XG derivation of sentence (2.9).

Chapter 3

Tuple-based extensions of context-free grammar

While extraposition grammars are a formalization of the cyclic dependencies that arise in the description of relative clauses, TUPLE-BASED GRAMMARS aim at avoiding structural long-distance dependency altogether, by splitting up the strings generated at the nodes in a derivation tree into two or more SURFACE CLUSTERS that can end up at different places in the full sentence. In the terminology of definition 1-22, the derivation graph of a tuple grammar is a deep structure and often also a dependency structure—this depends on how relative clauses are modelled, and will be discussed in more detail in part III. Although surface clusters can be recognized in the elements of the tuples of strings generated at the nodes in a derivation, these are not part of a distinct surface structure.

The simplest tuple-based grammar formalism is MODIFIED HEAD GRAMMAR (MHG), which has already been illustrated briefly in chapter 1 (page 35). It is a mathematically motivated simplification of the more linguistically oriented HEAD GRAMMAR of [Pol84] and generates the same class of languages as TREE ADJOINING GRAMMAR (TAG), LINEAR INDEXED GRAMMAR (LIG) and COMBINATORY CATEGORIAL GRAMMAR (CCG).1 Whereas a context-free grammar generates strings over an alphabet T, and nonterminal nodes combine the strings generated at daughter nodes through concatenation, MHG generates pairs of strings ⟨w1, w2⟩ ∈ T* × T*, and in addition to two concatenation operations

HC: head-complement(⟨u1, u2⟩, ⟨v1, v2⟩) = ⟨u1, u2 v1 v2⟩
CH: complement-head(⟨u1, u2⟩, ⟨v1, v2⟩) = ⟨u1 u2 v1, v2⟩

an MHG rule can wrap one pair of strings inside the other:

HW: head-wrap(⟨u1, u2⟩, ⟨v1, v2⟩) = ⟨v1 u1, u2 v2⟩

The pairs of strings are also called HEADED STRINGS, for by convention, the first terminal symbol in the second string is the lexical head of the construction.2

1 For definitions and the mutual weak equivalence of these four formalisms see [VSW94].
2 It is important to distinguish two interpretations of the word head here: a phrase is divided into two daughter phrases, one of which is the head daughter and the other the complement or dependent daughter. A lexical head is a single terminal symbol, and is the terminal string produced by the head daughter if this daughter node is a leaf, or the lexical head of the head daughter if it is nonterminal. So in the phrase [C the [H blue line with rounded endpoints]] the head daughter is the phrase blue line with rounded endpoints, and the lexical head is the noun line.

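The three yield operations are easily written out. The following minimal sketch (my illustration, not part of the thesis; Python) represents headed strings as pairs of word lists, and assembles the Dutch subordinate clause derived by the grammar of figure 3.1 below:

def head_complement(u, v):   # HC: <u1,u2>, <v1,v2> -> <u1, u2 v1 v2>
    return (u[0], u[1] + v[0] + v[1])

def complement_head(u, v):   # CH: <u1,u2>, <v1,v2> -> <u1 u2 v1, v2>
    return (u[0] + u[1] + v[0], v[1])

def head_wrap(u, v):         # HW: <u1,u2>, <v1,v2> -> <v1 u1, u2 v2>
    return (v[0] + u[0], u[1] + v[1])

w = lambda s: ([], [s])                              # lexical entry <e, s>
vi = complement_head(w("koffie"), w("drinken"))      # <koffie, drinken>
v0 = complement_head(w("Julia"), vi)                 # <Julia koffie, drinken>
vi = head_wrap(w("zag"), v0)                         # <Julia koffie, zag drinken>
v0 = complement_head(w("Frank"), vi)                 # <Frank Julia koffie, zag drinken>
cs = head_complement(w("dat"), v0)
print(" ".join(cs[0] + cs[1]))    # dat Frank Julia koffie zag drinken

Note how the wrapping step inserts zag between the noun cluster and the verb cluster without any long-distance machinery.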

[1] Csub → head-complement(C, V0)
[2] V0 → complement-head(N0, VI)
[3] VI → complement-head(N0, VT)
[4] VI → head-wrap(VR, V0)
[5] VT → ⟨ε, dronk⟩
[6] VT → ⟨ε, drinken⟩
[7] VR → ⟨ε, zag⟩
[8] N0 → ⟨ε, Frank⟩
[9] N0 → ⟨ε, Julia⟩
[10] N0 → ⟨ε, koffie⟩
[11] C → ⟨ε, dat⟩

Figure 3.1: MHG for Dutch subordinate clauses.

The grammars for English and German from chapter 1 can be given an MHG equivalent for Dutch that has the same underlying structure as the CFGs, except that unlike in the XG case, its left-right branching is not identical to the English or German case, but a rather odd mixture of complements left and right of the head. The grammar is shown in figure 3.1, and a derivation in figure 3.2. As in the extraposition grammars in the previous chapter, the MHG splits up the verbal clause V0 into a noun cluster and a verb cluster; but in this grammar, these clusters cannot be traced back in the derivation structure—rather, they can be found as the left and right components of the tuples generated by verbal projections. The previously fairly arbitrary choice of whether to dislocate the verbs to the right, or the nominal complements to the left, is now irrelevant because the MHG description is symmetric in this respect.

Modified head grammars have been generalized to work with tuples of arbitrary size; the resulting formalism, PARALLEL MULTIPLE CONTEXT-FREE GRAMMAR (PMCFG, [KNSK92]), is defined formally in section 3.1. Within PMCFG, several subclasses can be recognized, viz. MHG, LINEAR CONTEXT-FREE REWRITING SYSTEMS (LCFRS, [Wei88]) and LINEAR MCFG. Some formal properties of these subclasses are treated in sections 3.2, 3.3 and 3.4. I then introduce my own extension of the tuple-based grammars, LITERAL MOVEMENT GRAMMAR (LMG), and give a classification of all the formalisms presented in this chapter, along with some further formal properties.

3.1 Multiple context-free grammar

To formally define MHG, it is best to look at the general case of grammars that make use, instead of concatenation in a CFG production, of arbitrary functions manipulating TUPLES of terminal strings. Such grammars are called MULTIPLE CONTEXT-FREE GRAMMARS, and are discussed in [SMFK91] and [KNSK92].

3-1 definition. A PARALLEL3 MULTIPLE CONTEXT-FREE GRAMMAR (PMCFG) is a tuple (N, T, a, S, P) where N and T are disjoint sets of nonterminal and terminal symbols; S ∈ N is the start symbol; a : N → ℤ⁺ assigns an ARITY to each of the nonterminals, and P is a finite set of productions of the form

A → f(B1, …, Bm)

where m ≥ 0, A, B1, …, Bm ∈ N, and the YIELD FUNCTION f is a function over tuples of terminal strings, that is, f : (T*)^{a(B1)} × ⋯ × (T*)^{a(Bm)} → (T*)^{a(A)}, which can be defined symbolically as

f(⟨x^1_1, …, x^1_{a(B1)}⟩, …, ⟨x^m_1, …, x^m_{a(Bm)}⟩) = ⟨t_1, …, t_{a(A)}⟩

where the t_k are strings over terminal symbols and the variables x^i_j. When m = 0, the function f is a constant, and I will simply write

A → ⟨w_1, …, w_{a(A)}⟩

3 See 3-6 for why the general case is called parallel.

Csub ⟨ε, dat Frank Julia koffie zag drinken⟩
  C ⟨ε, dat⟩
  V0 ⟨Frank Julia koffie, zag drinken⟩
    N0 ⟨ε, Frank⟩
    VI ⟨Julia koffie, zag drinken⟩
      VR ⟨ε, zag⟩
      V0 ⟨Julia koffie, drinken⟩
        N0 ⟨ε, Julia⟩
        VI ⟨koffie, drinken⟩
          N0 ⟨ε, koffie⟩
          VT ⟨ε, drinken⟩

Figure 3.2: MHG derivation of a Dutch subordinate clause.

3-2 classification: yield functions. Let f be a yield function defined by

f(⟨x^1_1, …, x^1_{a1}⟩, …, ⟨x^m_1, …, x^m_{am}⟩) = ⟨t_1, …, t_a⟩

1. f is NONERASING if every x^i_j appears at least once in t_1 ⋯ t_a. The class of nonerasing yield functions is denoted by NE.

2. f is LINEAR if every x^i_j appears at most once in t_1 ⋯ t_a. The class of linear yield functions is denoted by LIN.

3. HGY = {head-complement, complement-head, head-wrap} is the set of HEAD GRAMMAR YIELD FUNCTIONS defined by

HC: head-complement(⟨u1, u2⟩, ⟨v1, v2⟩) = ⟨u1, u2 v1 v2⟩
CH: complement-head(⟨u1, u2⟩, ⟨v1, v2⟩) = ⟨u1 u2 v1, v2⟩
HW: head-wrap(⟨u1, u2⟩, ⟨v1, v2⟩) = ⟨v1 u1, u2 v2⟩

Note that HGY ⊆ LIN ∩ NE.

* A semantics for PMCFG is best given in a fashion similar to the derivational semantics 1-8 for CFG. A fixed-point semantics will be discussed for the more general case of LMG in chapter 5.

3-3 definition: derivational semantics for PMCFG. Let G = (N, T, a, S, P) be a PMCFG. Then the nonterminals of G recognize tuples of strings, as follows:

Base case  If P contains a production A → ⟨v_1, v_2, …, v_{a(A)}⟩, then A ⊢ ⟨v_1, v_2, …, v_{a(A)}⟩.

Inductive step  If P contains a production A → f(B_1, B_2, …, B_m), for each 1 ≤ k ≤ m, B_k ⊢ ⟨v^k_1, v^k_2, …, v^k_{a(B_k)}⟩, and

f(⟨v^1_1, …, v^1_{a(B_1)}⟩, …, ⟨v^m_1, …, v^m_{a(B_m)}⟩) = ⟨w_1, …, w_{a(A)}⟩

then A ⊢ ⟨w_1, …, w_{a(A)}⟩.

Now G is said to RECOGNIZE w if w = w_1 ⋯ w_{a(S)} and S ⊢ ⟨w_1, …, w_{a(S)}⟩.

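The derivational semantics is essentially a bottom-up evaluation of derivation trees; a minimal sketch (my illustration, in Python), where a tree is a pair of a yield function (or a constant tuple) and a list of subtrees:

def evaluate(tree):
    f, subtrees = tree
    if not subtrees:
        return f                           # base case: A -> <w1, ..., wa(A)>
    return f(*(evaluate(t) for t in subtrees))

# a non-linear ("parallel") yield function copying its argument, giving
# the copy language { ww | w in {a,b}* }
double = lambda x: (x[0] + x[0],)
add_a  = lambda x: ("a" + x[0],)
add_b  = lambda x: ("b" + x[0],)
tree = (double, [(add_a, [(add_b, [(("",), [])])])])
print(evaluate(tree))                      # ('abab',)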


3-4 classiÆcation: grammar types. Let G N T a S P be a PMCFG. Then F 1. If F is a class of yield functions, then G is an -PMCFG if all yield functions

in P are included in F . 3.1. Multiple context-free grammar 61

2. An HG Y -PMCFG is called a MODIFIED HEAD GRAMMAR (MHG). N

3. A LI -PMCFG is called a LINEAR (P)MCFG.

LINNE  4. A  -PMCFG is called a LINEAR CONTEXT-FREE REWRITING SYSTEM

(LCFRS).

  5. G is an n-PMCFG if for all A N, a A n.

* Non-erasingness is defined because it is psychologically-descriptively relevant, but the following proposition shows that it does not influence weak generative capacity.

3-5 proposition. For every (linear) n-PMCFL there is a weakly equivalent (linear) nonerasing n-PMCFL.

Proof. [SMFK91] Repeat the following step until the grammar is nonerasing: if a yield function discards a component of its k-th argument B_k, translate the productions for B_k to productions for a nonterminal B′_k that does not have this component.

3-6 remark: naming in the literature. Presumably for historical reasons, linear MCFG are simply called MCFG in the literature [SMFK91] [KNSK92]. I will specifically add the predicate 'linear' to stress that linear MCFG are a special case of PMCFG and not vice versa. The version which is also nonerasing is also called MCFG with the information-lossless property, but LCFRS is a more widespread name [Wei88].

There are several tightly related formulations of head grammar. In [Pol84], head grammar refers to an elementary grammar formalism that manipulates headed strings, where the head can be either left or right of the 'split point', but also to a larger, 'extended' framework which adds GPSG-style features and a linear precedence operator to basic HG. The formal equivalent, modified head grammar, was introduced in [Roa87]; its weak equivalence to Pollard's HG formalism is also discussed in [SMFK91]. Some features of Pollard's extended HG are discussed in chapter 10.

3.2 Modified head grammar

The introduction to this chapter already showed an MHG generating Dutch cross-serial subordinate clauses, which were argued in chapter 1 to be beyond the strong generative capacity of context-free grammar. The following example shows that MHG also have a greater weak generative capacity than CFG.

3-7 proposition. The class MHL of languages recognized by an MHG strictly subsumes CFL.

Proof. Clearly CFL is contained in MHL. To see the strict inclusion, consider the 3-counting language {a^n b^n c^n | n ≥ 0}, which cannot be generated by a context-free grammar; this is proved using the pumping lemma 1-14. However, the language is recognized by the MHG in figure 3.3.

[1] S → head-wrap(B, T)
[2] S → ⟨ε, ε⟩
[3] T → head-wrap(S, A)
[4] A → ⟨a, c⟩
[5] B → ⟨ε, b⟩

S ⟨aa, bbcc⟩
  B ⟨ε, b⟩
  T ⟨aa, bcc⟩
    S ⟨a, bc⟩
      B ⟨ε, b⟩
      T ⟨a, c⟩
        S ⟨ε, ε⟩
        A ⟨a, c⟩
    A ⟨a, c⟩

Figure 3.3: MHG for the 3-counting language and a derivation.

It would be desirable to give a description of Dutch in MHG along the lines of the XG examples in the previous chapter. This can be done to a certain extent, but rather clumsily: the grammar in figure 3.1 can be extended to generate declarative and verb-initial interrogative clauses by dividing the nonterminals for VI and V0 into three types each: subordinate, declarative and interrogative. Clearly, a more structural, less feature-driven approach would be preferable. It is formally proven in chapter 8 that when wh-questions and topicalization are added, a fragment is obtained that can be described by an MHG only when either (i) turning the derivations entirely upside-down or (ii) accepting an unbounded dependency domain. Therefore in this chapter such examples will be given in linear MCFG.

An objection that could be raised to making this step to tuples of an arity greater than two is that in head grammar, and to a smaller degree in MHG, the division of constituents into two components has a linguistically motivated foundation—it is not psychologically implausible that the human mind is capable of splitting an already 'derived' constituent up into two pieces, using the head of the constituent as a handle. Such a linguistically flavoured motivation is harder to obtain for the general case of tuple grammars, but one such motivation will be investigated in chapter 9. On the other hand, the motivation for head grammars given in Pollard's [Pol84] does not go further than the following statement: "many discontinuity phenomena can be described by [such] head wrapping operations". Pollard then proceeds to substantiate this claim.

Further in this chapter I will show that a fragment of partial verb phrase conjunctions as used in proposition 1-17 is beyond the weak generative capacity of linear MCFG and hence a magno fortiori beyond that of HG. A generalization of head grammars is discussed at length in chapter 10. In the Epilogue, section E.2, I will briefly touch on the relationship between HG and TAG in the context of describing mildly configurational languages.

3.3 Linear MCFG

Linear MCFG, appearing most often in its nonerasing form that is better known as LCFRS, traditionally play a rather formal role in the literature on mathematical linguistics. They are equivalent to multi-component tree adjoining grammars (MC-TAG, [Wei88]), and form the top of a hierarchy of classes of languages described by grammar formalisms of growing generative capacity [Wei92]. This hierarchy can be formulated in many different ways, as increasingly complex head-wrapping grammars, Dyck grammars or variants of CCG or tree adjoining formalisms. The classes in this hierarchy have favourable properties—they are full abstract families of languages (AFL, [Gin75]) and correspond to classes of string automata—but their formulation has, as far as I can see, no concrete appeal to any linguistic intuition. In the context of this thesis, it is more interesting to look at a differently cut hierarchy, which (i) extends beyond LCFRL to at least PTIME and (ii) whose member grammar classes are chosen more from a language-engineering point of view than from that of a mathematical linguist.

* Before I proceed to some formal properties, let's look at how linear MCFG can be used to describe Dutch sentences.

3-8 example: Dutch sentence types. If one aims to describe Dutch using a surface cluster model similar to MHG, then the following three sentential forms—verb final in (3.1a) and two different forms of verb second4 in (3.1b) and (3.1c)—are to be given the same structural description except for the top node of category C ∈ {Csub, Cdecl-wh, Cques}, and in such a way that the V0 directly under these clausal phrases C generates the same tuples of strings.

(3.1) a. dat Frank Julia koffie zag drinken
      "that Frank saw Julia drink coffee"
      b. Frank zag Julia koffie drinken
      "Frank saw Julia drink coffee"
      c. Zag Frank Julia koffie drinken?
      "Did Frank see Julia drink coffee?"

Note now that the only substring of length greater than 1 that appears unfragmented in all of these three sentences is Julia koffie. So the minimum number of clusters needed for V0 is 4. Figure 3.4 shows a linear and nonerasing 4-MCFG that describes (3.1a–c), and a derivation of (3.1b) is shown in figure 3.5. The grammar strictly refines the MHG from the previous section, in that it still divides the verb phrase into an NC and a VC; but it splits up both clusters further into two components. A verbal clause is now a four-tuple ⟨s, o, h, v⟩ consisting of a subject s, the rest of the noun cluster o, the head verb h, and the rest of the verb cluster v.

4 Conventionally, a verb in sentence-initial position is also regarded as a result of the 'verb second' construction—this is widely accepted, but rather theory-internal, GB terminology.

NC VC

z z

sub 0

   h i  h i 1 C f1 V where f1 s o h v dat so hv

decl-wh 0

   h i  h i 2 C f2 V where f2 s o h v shov

ques 0

   h i  h i 3 C f3 V where f3 s o h v hsov

0 0 I

   h i h i  h i 4 V f4 N V where f4 s o h v s o h v

I T 0

   h i h i  h i 5 V f5 V N where f5 v o o v

I R 0

   h i h i  h i 6 V f6 V V where f6 r s o h v so r hv

0

 h i j h i j h i 7 N Frank Julia koffie

I

 h i 8 V zwemmen

T

 h i 9 V drinken

R

 h i 10 V zag

Figure 3.4: Linear MCFG for full Dutch sentences. verb h, and the rest of the verb cluster v. An intransitive verb phrase VI is a verbal clause without a subject, so a 3-tuple. The correspondence of the variable names to the SVO/SOV/VSO (subject-verb-object, &c.) terminology, known from the literature in generative linguistics, helps reading the grammar. Note also that as in the XG examples in the previous chapter, but contrary to the MHG example, the left-right branching has been made identical to the context-free structure assigned to English in chapter 1. While the three yield functions of MHG

decl C

V0

0 I N V

R 0 V V

0 I N V

T 0 V N Figure 3.5: Derivation of a declarative sentence. 66 Tuple-based extensions of context-free grammar have some repercussion on the relationship between surface order and branching in the derivation tree, this relationship is entirely free in MCFG. The grammars can be thought of as two columns—the column left of where specifies the structure, i.e. immediate dominance, and the right column specifies the surface order. The same structure can be used to underlie languages with totally different word orders, just by modifying the functions in the right column; similarly, the order of the daughter nodes in a production can be swapped in every possible way, when the functions in the right column are modified accordingly. Hence the traditional SVO/SOV/VSO distinction that refers to the left-right order of branching in the structure is no longer applicable to the case of MCFG.

* One remaining problem in the example is that a wh-question fronting an object from

the VI rather than the subject is not generated by this grammar. To do this, one must

either construct a 5-MCFG or differentiate rules driven by a  topic feature.

Formal properties

The above example serves only as supporting a conjecture that MCFG provide, w.r.t. MHG, additional power of linguistic value. The following results are of a more formal nature. Counting and copy languages are often used as an indication of the structural capacities of a class of grammars—it is argued that there are parallels between such simple languages and the crossed dependencies in Dutch (see ref. to [Huy84] on page 26, or [Rad91] for more advanced examples).

3-9 example: counting and copy ability. (i) A class of grammars is said to be ABLE TO COUNT UP TO n if it contains a grammar

k k k

j g generating the language f a1a2 an k 0 , where a1 an are n different terminal symbols. The grammar in figure 3.6 shows how the class of linear n-MCFG

can count up to at least 2n.

  

1 S f A

  

2 A g A

 h i

3 A

h i  h i

where f  x1 x2 xn x1x2 xn

h i  h i g x1 x2 xn a1x1a2 a3x2a4 a2n1xna2n

Figure 3.6: n-MCFG counting up to 2n.

n

j g

(ii) An n-COPY LANGUAGE is the language f w w L where L is context-free. For

 such L there is a context-free grammar G N T S P . Let a positive integer n be

3.3. Linear MCFG 67

  

given. Construct an MCFG G N T a S P as follows: a A n for all A N;

for each of the productions R P,ifR is a terminal rule A w, then add to P :

n

z

h i A w w w

and if R is nonterminal:

A w0B1w1B2w2 wm1Bmwm

add the MCFG production

  A f B1 B2 Bm

where

1 1 1 m 1 m

   h

f v1 vn vn vn w0v1w1 wm1v1 wm

1 m

i w0vnw1 wm1vn wm

n

j g  Then G recognizes the language f w w L . Note that by putting n 1, it is now rigorously shown that every context-free grammar has a weakly equivalent MCFG.

* Now for the converse 3-12 of these examples, which makes use of a generalization of the pumping lemma 1-14 for context-free grammars.5

3-10 deÆnition: k-pumpability. A language L is UNIVERSALLY k-PUMPABLE if

j j

there are constants c0 and k such that for any w L with w c0, there are



strings u0 uk and v1 vk such that w u0v1u1v2u2 uk1vkuk, for each i:

j j 0 vi c0, at least one of the vi is not the empty string and for any p 0,

p p p

u0v1 u1v2 u2 uk1vk uk L.

3-11 theorem: pumping lemma for linear MCFL. Let L be recognized by a lin- ear n-MCFG. Then L is universally 2n-pumpable.

Proof. Two straightforward modifications need to be made to the proof of lemma 3.2 in [SMFK91], p200, which is itself a nontrivial elaboration of the proof of the context-free pumping lemma 1-14. The existential statement of pumpability needs to be replaced with a universal one ([SMFK91] actually proved this universal version; see in [Rad91], footnote 10 on page 286), and the argument about the size of the v k

needs to be refined.

5The property defined here is called universal pumpability because in chapter 8,definition 8-8, a version of pumpability is defined for which the number k of pumpable substrings is not fixed. 68 Tuple-based extensions of context-free grammar

3-12 corollaries. Let n be a positive integer. Then (i) linear n-MCFG can count exactly up to 2n, and (ii) n is the highest number m such that linear n-MCFG can

m

j g generate fw w L for each context-free language L.

* To conclude this section, I will give the analogue of the partial verb clause conjunctions in 1-17 for linear MCFG (c.q. MC-TAG), briefly mentioned but not worked out in detail by Manaster-Ramer in [MR87]. It is done for German, because the argument for Dutch suffers from the possibility of dropping embedded subjects of raising verbs, which does not seem to be allowed in standard German (see the footnote in chapter 1 on page 27).

3-13 deÆnition. Let two finite terminal alphabets T1 and T2 be given. A HOMOMOR-

      PHISM from T1 to T2 is a map h : T1 T2 satisfying the constraints h uv h u h v .

In other words, a homomorphism is the unique extension of a function of single

terminal symbols, i.e., a map h : T1 T2 .

3-14 lemma: closure properties [SMFK91]. For each n, the class of (linear) n- PMCFL is a full abstract family of languages (AFL), i.e. closed under set union, concatenation, iteration, intersection with a regular set, arbitrary homomorphism and inverse homomorphism.6

3-15 proposition: linguistic limits of linear MCFG, wgc/German. There is no linear MCFG that generates the set of syntactically correct German sentences.

Proof. Extend the the proof of 1-17 to cover arbitrary numbers of conjuncts. Suppose that the set D of syntactically correct German sentences is a linear MCFL. Let the sets

N, T, and V, the regular language R and the homomorphism h be defined by (3.2)–(3.6),

f g

(3.2) N  Fred Dieter Ute Beate

f g

(3.3) T  kussen¨ umarmen einladen

f g (3.4) V  horen¨ helfen lassen

(3.5) R  Hat Jurgen¨ gesagt daß



 Antje Frank N Julia  TV sah und

alle ihr einen schonen¨ Geburtstag wunschten?¨

f

if a Hat Jurgen¨ gesagt daß

alle ihr einen schonen¨

g

Geburtstag wunschten¨

 

(3.6) ha

und if a is a comma or question mark

verfuhren¨ if a T

a otherwise

6The proof given in [SMFK91] seems to be convincing only for the case of linear MCFG. See page 80 at the end of this chapter for details. 3.3. Linear MCFG 69 then by lemma 3-14 the fragment L,defined by (3.7), must be described by a linear MCFL.

n

   f (3.7) L  h D R Antje Frank N Julia

n 

 j g verfuhren¨ V sah und n 1

In this example, one can’t simply say that for a sentence with k conjuncts, k  1 substrings need to be pumped, because one can always repeat a single conjunct. However, linear MCFG satisfy the pumping property of definition 3-10, which also states that the size of the pumped substrings has a fixed upper bound c0. If the fragment is a linear MCFL, then there is be a k such that it is k-pumpable. In sentence (3.8), the conjuncts are longer than c0 so they cannot be pumped as a whole; so new sentences

within the fragment can indeed be created only by pumping k  1 sequences: substrings of Utec0 and each of the verb sequences lassenc0 .

c0 c0 k  (3.8) Antje Frank Ute Julia verfuhren¨ lassen sah und

So the fragment is not k-pumpable, and it follows that German cannot be described by

a linear MCFG. 70 Tuple-based extensions of context-free grammar

3.4 Parallel MCFG

Generic PMCFG extends linear MCFG in providing a mechanism for REDUPLICA- TION. As such it can give accounts for a number of phenomena that involve count- ing. Linguistic examples of such phenomena are Chinese number names [Rad91], respectively-constructs, and Old Georgian genitive suffix stacking [MK96]; these are discussed in chapter 8.

3-16 proposition. Linear MCFL is strictly included in PMCFL.

n Proof. The PMCFG in figure 3.7 generates the language a2 , which is not k-pumpable

for any k.

   h i  h i

1 S f S where f x xx

 h i 2 S a

Figure 3.7: PMCFG generating a non-pumpable language.

Because no formulation of the pumping lemma is known for PMCFG, the argument in 3-15 that linear MCFG cannot generate German structure will fail for PMCFG. The following example even shows, be it in a very abstract an indirect way, that there is an PMCFG that generates the fragment L used in 3-15.

3-17 example: Partial verb clause conjunctions are a PMCFL. The grammar in figure 3.8 describes a fragment F that is slightly more simple than the fragment L used in 3-15.

n f (3.9) F  Antje Frank Ute Julia

n 

 j g verfuhren¨ lassen sah und n 0 The fragment can be described so easily by a PMCFG because it allows only one infinitive raising verb, lassen, so that the sequence lassenk can be generated once and then be reduplicated. But because PMCFG is closed under inverse homomorphism, we can map the simple fragment back to one with a richer selection of verbs. It is, however, not straightforward to write a concrete grammar generating F, and the construction of the proof that PMCFG is closed under h 1 does not seem to be of

much help (a direct proof is rarely given because laborious).

f g

Ute if a Fred Dieter Beate

f g

 

(3.10) ha lassen if a horen¨ helfen

a otherwise

1 n

  f (3.11) h F Antje Frank N Julia

n 

 j g verfuhren¨ V sah und n 1

 L 3.5. Literal movement grammar 71

3.5 Literal movement grammar

The notation of MCFG makes that these grammars are rather difficult to read. There is an amount of redundancy in the idea that the grammar productions refer to a func- tion, and this yield function is specified separately. LITERAL MOVEMENT GRAMMAR circumvents this problem by moving the variables over terminal strings straight into the production using a definite clause-style notation. The program in (3.13) is an example of how a CFG (3.12) is typically translated into the logic programming language Prolog.

(3.12) S NP VP

NP Julia

VP slept

(3.13) s(Z) :- np(X), vp(Y), append(X, Y, Z). np([julia]). vp([slept]).

The Prolog program says, briefly, that one can infer s(Z) if Z is a list that is the concatenation of two lists X and Y, and np(X) and vp(Y) are already known. Furthermore, np([julia]) and vp([slept]) are facts. So from the facts, s([julia, slept]) can be inferred. Square brackets are used to construct lists. The predicate “append” is a Prolog built-in. I will use a notation, illustrated in (3.14) and (3.15), for grammars that is half way between the conventions for writing down context-free grammars and Prolog—I don’t use the Prolog conventions for variables, and polish away the use of append by

allowing associative concatenation in predicate arguments.

    

(3.14) Sxy :- NP x VP y



NPJulia

 VPslept

The LMG in (3.16) is a one-to-one equivalent of a 2-MCFG for anbncn. While it

   h i  h i

1 S e C e n v c Antje Frank nc

   h i  h i

2 C f C f n v c n v v und c

   h i  h i

3 C g V g n v n v sah

   h i  h i

4 V h V h n v Ute n v lassen

 h i 5 V Julia verfuhren¨

Figure 3.8: PMCFG for partial verb clause conjunctions. 72 Tuple-based extensions of context-free grammar derives a string anbn just like the context-free grammar (3.15), a separate component

cn is generated in parallel.

  

(3.15) Aaxb :- A x



A

  

(3.16) Aaxb cy :- A x y

 A

The free definite clause style notation of LMG allows the grammar writer to do everything allowed in a PMCFG, and considerably more.

Formal deÆnition

Literal movement grammars and PMCFG are defined in a similar fashion, but in LMG, the yield functions are integrated into the productions and the arity function on nonterminals becomes redundant.

3-18 deÆnition. A (generic) LITERAL MOVEMENT GRAMMAR (LMG) is a tuple G 

 N T V S P where N T and V are mutually disjoint sets of nonterminal symbols,

terminal symbols and VARIABLE SYMBOLS respectively, S N, and P is a finite set of

CLAUSES

:- 1 2 m

where m 0 and each of 1 m is a PREDICATE



At1 tp

   where p 1, A N and ti T V . When m 0, the clause is TERMINAL, and the

symbol :- is usually left out.

 A predicate At1 tp is SIMPLE if t1 tp are single variable symbols.

*

An LMG clause is INSTANTIATED by substituting a string w T for each of the

     variables occurring in the clause. E.g. the rule Sxy :- NP x VP y is instantiated

to rules over terminals only such as

    

SJulia pinched Fred :- NP Julia VP pinched Fred

    

SJulia pinched Fred :- NP Julia pinched VP Fred

  3-19 deÆnition: derivational semantics. Let G  N T V S P be an LMG. Then G DERIVES instantiated predicates as in the following inductive definition: let wi,

i vj T ; 3.5. Literal movement grammar 73

Base case7 If



Aw1 wp

  is an instantiation of a terminal clause, then G A w1 wp .

Inductive step If, m 1 and

1 1 m m

     

A w w :- B v v Bm v v

1 p 1 1 p1 1 pm

is an instantiation of a (non-terminal) clause in P, and for each 1 k m, G

k k

  

Bk v vp  then G A w1 wp

1 k

  Now G is said to RECOGNIZE w if w  w1 wp and G S w1 wp .

3-20 remark. This definition of LMG does not assign an arity to a nonterminal symbol; this is to simplify the definitions, especially in section when looking at fixed point interpretations in chapters 4 and 5. In principle, a nonterminal can appear with different arities in the same LMG.

3-21 examples. Although it is a rather simple example (m  1), let’s check how the

LMG in (3.16) derives aabbcc:

   

G A by A

         

G A ab c by A axb cy :- A x y x y

          G A aabb cc by A axb cy :- A x y x ab y c

Figures 3.9–3.11 show LMG equivalents of an MHG, a linear MCFG and a PMCFG, respectively. The grammar in figure 3.9 is a bit easier to verify than the MHG itself, but the existence of the much simpler example (3.16) shows that the rules allowed in

MHG make it more cumbersome than a more liberal 2-component LMG format.

      

1 S y1x1 x2y2 :- B x1 x2 T y1 y2

  

2 S

      

3 T y1x1 x2y2 :- S x1 x2 A y1 y2

  

4 A a c

   5 B b

Figure 3.9: LMG equivalent for the MHG in figure 3.3.

7Note that a more elegant, but less elementary definition would use that fact that the inductive step

includes the base case if one puts m 0. 74 Tuple-based extensions of context-free grammar

sub 0

     1 C dat sohv :- V s o h v

decl-wh 0

     2 C shov :- V s o h v

ques 0

     3 C hsov :- V s o h v

0 0 I

       4 V s o h v :- N s V o h v

I T 0

       5 V o v :- V v N o

I R 0

       6 V so r hv :- V r V s o h v

0

   7 N Frank

0

   7  N Julia

0

   7 N koffie

I

   8 V zwemmen

T

   9 V drinken

R

   10 V zag

Figure 3.10: LMG equivalent of the linear MCFG in figure 3.4.

Formal properties and classiÆcation

From a series of formal results, it will turn out that LMG should be regarded as a framework for the definition of more restricted formalisms, rather than as a concrete tool for language description. The most important such result is that LMG in their general form generate any recursively enumerable language; preparations for this the- orem are made here, while the conclusion is drawn in chapter 5 after some complexity theory has been introduced. A remarkable subclass of LMG is that of the SIMPLE LMG, which is analogously defined here, and shown to describe precisely the class of tractably recognizable languages in chapter 5. The constructions used to prove the first two results already indicate that generic LMG is a dramatically powerful formalism.

3-22 proposition. The class LML of languages recognized by generic LMG is closed under arbitrary homomorphism.

Proof. Let L be an LML over T, and h be a homomorphism. Suppose L is recognized



by the LMG G N T V S P , and assume w.l.o.g. that the start symbol S appears

    

1 S xx :- S x

   2 S a

Figure 3.11: LMG equivalent for the PMCFG in figure 3.7.

3.5. Literal movement grammar 75

f g   

only with arity 1. Then construct LMG G N Σ H T h T V Σ P , such

that Σ H N and

(3.17) P  P

     

f Σ y :- S x H x y

 g

H

    j    g

f H ax wy :- H x y a T w h a

 Then G generates hL .

After seeing the Prolog construction, one may be tempted to think that the class of languages recognized by LMGs whose predicates all occur with arity 1 is the class of context-free grammars. On the contrary:

3-23 deÆnition. An n-LMG is an LMG in which no nonterminal symbol appears with an arity greater than n.

3-24 proposition. 1–LML  LML.

Proof. The idea is to replace the commas in LMG clauses by terminal symbols. Let

 f g

G N T V S P be an LMG, let V be finite, and order V as x1 x2 xn . Let the

 g f g  symbols ITS and c not be in N, T or V. Then let G N fITS T c V S P where

1 1 1

    f (3.18) P  A t c t c c t :- B s c s c c s

1 2 p 1 1 2 p1

m m m

 B s c s c c s

m 1 2 pm

     ITSx1 ITS x2 ITS xn

j P contains a production

1 1 1

   At t t :- B s s s

1 2 p 1 1 2 p1

m m m

 g B s s s

m 1 2 pm

    j g

f ITS ax :- ITS x a T

 g f ITS

Intuitively, the symbol c stands for “comma” and ITS stands for “is a word in T ”, in

  other words, for “does not contain the comma symbol”. Now G recognizes L G .

So let’s proceed to a classification of the PMCFG hierarchy, in terms of LMG. For a full classification of the different formalisms in their predicate LMG representations, some terminology needs to be introduced.8

8The terms bottom-up and top-down in definition 3-25 reflect the behaviour in the perspective of derivation trees, where the root (S-) node is the top of the derivation. Although this choice of terminology should certainly not be read as hinting at a relationship to parsing or procedural semantics, the analogy to the top-down/bottom-up terminology is useful, because other options like ‘leftward’ and ‘rightward’ give rise to confusion—left and right are reversed in MCFG notation. 76 Tuple-based extensions of context-free grammar

Formalism Increasing conditions on LMG form Generic LMG — Simple LMG Bottom-up nonerasing, non-combinatorial (Nonerasing) PMCFG Top-down linear, top-down nonerasing LCFRS Bottom-up linear MHG Pairs only, restricted operations CFG Singletons

Figure 3.12: Classification of tuple grammars in terms of LMG clause restrictions.

3-25 classiÆcation: clause types. Let R be an LMG clause:

1 1 m m

     At t :- B s s B s s 1 p 1 1 p1 m 1 pm

then

BU LI N R - ,orR is BOTTOM-UP LINEAR if no variable x appears more than

once in t1 tp.

TDLI N R - ,orR is TOP-DOWN LINEAR if no variable x appears more than 1 m once in s s .

1 pm

BU NE R - ,orR is BOTTOM-UP NONERASING if each variable x occurring in j

one of the sk also occurs in at least one of the ti.

TD NE R - ,orR is TOP-DOWN NONERASING if each variable x occurring in one j of the ti also appears in one of the sk.

j

NC R ,orR is NON-COMBINATORIAL if each of the sk consists of a single

variable, i.e. all predicates on the RHS are simple.

S  BU NEBU LI N NC

R is SIMPLE if it is in - - .

 F

3-26 classiÆcation: grammar types. An LMG N T S P is an -LMG if for

F F all R P, R . Furthermore, -LML will denote the class of languages recognized

by F -LMG.

LI N TD LI N

A BU - -LMG is called bottom-up linear, a - -LMG is top-down linear,

NE TD NE BU - -LMG is bottom-up nonerasing, a - -LMG is top-down nonerasing and

an S -LMG is a SIMPLE LMG.

3-27 proposition. A language L is described by a PMCFG if and only if it is described by a top-down linear, top-down nonerasing, non-combinatorial LMG. 3.5. Literal movement grammar 77

Proof. Suppose w.l.o.g. that each nonterminal A in a LMG appears with only one  arity aA (add new names for nonterminals that appear in more than one arity). Then there is a 1–1 correspondence between LMG rules and PMCFG rules: rule (3.19) is mapped to (3.20) and vice versa. The translation from PMCFG to LMG is trivial; the reverse translation is valid because of non-combinatoriality, top-down linearity and

non-erasingness (otherwise f would not have been a function).

 

(3.19) A f B1 Bm

E D E D

1 1 m m

  h i where f  x x x x t t 1 p1 1 pm 1 p

1 1 m m

     (3.20) At t :- B x x B x x

1 p 1 1 p1 m 1 pm

The classes MHG, linear MCFG, &c. can now all be classified in terms of clause type classification 3-25. Such a classification is shown in figure 3.12. The last clause type defined, that of simple LMG, will play an important roleˆ in this thesis; simple LMG will turn out to generate precisely the languages recognizable in polynomial time, and have some favourable linguistic properties as well. The key property of simple LMG is that, while its power is sufficiently restricted, it is capable of modelling sharing of strings generated in different branches of a

derivation. S 3-28 proposition: intersection. Let F be a class of LMG clauses including .

Then F -LML is closed under intersection.

  

Proof. Let G1 N1 T1 V1 S1 P1 and G2 N2 T2 V2 S2 P2 . Assume w.l.o.g.



that N1 N2 and neither contains the symbol S; moreover let S1 and S2 be used only

f g f g with arity 1. Let G be grammar G N1 N2 S T1 T2 V1 V2 S P1 P2 R ,

where R is the clause

    

(3.21) Sx :- S1 x S2 x

    

(“Sx can be derived if one can derive both S1 x and S2 x .”) Then G generates

  L  L G1 G2 .

3-29 proposition. PMCFL is strictly contained in S -LML

Proof. (i) Let L be described by a PMCFG. Then it is also described by a nonerasing PMCFG. Translate this grammar to an LMG G using proposition 3-27. The clauses in G are now bottom-up and top-down nonerasing, and top-down linear. This means that the clauses of G are not necessarily simple, because they may not be bottom-up linear; that is, a variable x may appear on the LHS of a clause more than

once. Look at such a clause; an example is (3.22).

     (3.22) Ax yy :- B x C y 78 Tuple-based extensions of context-free grammar

Contrary to PMCFG, simple LMG allows variables to appear on the right hand side of a clause more than once. So clause (3.22) can be replaced with the simple clause

(3.23) and the simple clause schema (3.24) to yield a weakly equivalent grammar.

      

(3.23) Ax yz :- B x C y Eq y z

  

(3.24) Eqax ay :- Eq x y for every a T

 Eq

Repeat this construction until the grammar contains no more non-simple rules. So

PMCFL is contained in S -LML.

(ii) PMCFL is closed under homomorphism (3-14) and membership for PMCFL S is decidable ([SMFK91] or chapter 5). Suppose that PMCFL = S -LML. Since - LML is closed under intersection, this would yield a class containing the context-free languages that is decidable and closed under intersection and homomorphism. This is 9 a contradiction.

3-30 example: counting and copy ability. Sharing can be used to mimick much

of the capacity of arbitrary linear MCFG and the reduplication of PMCFG to describe

S F simple formal languages. Let F be a class of LMG clauses including ; then 2- -

LML is closed under taking n copies for any n, because there is the clause

          (3.25) S1 x1x2 xn :- S2 x1 Eq x1 x2 Eq x2 x3 Eq xn1 xn

2n Similarly, 2-S -LMG can generate the PMCFL a . All n-counting languages are in

1-S -LML, because they are finite intersections of context-free languages. S

3-31 proposition. For each k 0, the class of k- -LML is a pre-AFL [GGH69],

  that is, closed under marked product L1aL2, marked iteration Lc , inverse homo- morphism and intersection with regular sets.

Proof. All except closure under inverse homomorphism are obvious. Unlike in the AFL case, closure under h 1 does not follow immediately from other closure results (in the case of multiple context-free grammars, this can be taken to follow from substitution by regular sets, cf. [Gin75], corollary on p. 76). It must be proved directly by finitely encoding indices into the homomorphic images of leftmost and rightmost

terminals of the argument terms. This is laborious, but straightforward.

3-32 example: simple LMG and partial verb clause conjunctions. The grammar in figure 3.13 generates, directly, a fragment of German slightly larger than L in proposition 3-15; larger because this grammar also generates conjunctions in embedded verbal clauses.

9The notion of decidability and the results used here will be discussed in part II. Conclusions 79

Conclusions to chapter 3

While head grammars were originally introduced to serve an explanatory-linguistic purpose, MCFG/LCFRS are not used in the literature for concrete structural de- scriptions; rather, once a head grammar had been constructed that generated Dutch subordinate clauses in head grammar, the job was considered done. One reason for this lack of interest is that there is no linguistic foundation for the ‘clustering’ analysis of surface structure except the one based on the notion of the position of the head introduced in [Pol84]. Another important reason is that an MCFG is not a nice format for writing grammars. This chapter has tried to fill in this gap, first by showing that linear and parallel MCFG can give accounts of important phenomena that cannot be treated in MHG; then by introducing the new formalism LMG that gives further refined descriptions of the running example of Manaster-Ramer’s partial verb clause conjunctions, and has a more readable form. The notation has not been seen before in the literature, but others have mentioned that they developed it independently (Ristad, p.c.). The step from multiple context-free grammars to literal movement grammars is straightforward, although the only similar step made in the literature is the introduction of iLFP in [Rou88]. The iLFP formalism however, discussed at length in chapter 5, is meant as an intermediate logical language connecting grammar formalisms, finite arithmetic and complexity theory. LMG can be viewed as the ultimately simplified, grammar-like representation of iLFP, although when I introduced LMG in [Gro95b] in the obsolete form not discussed in this thesis (but used in the Prologue, page 8), I was entirely unaware of

sub 0

     1 C daß ov :- V o v

0 0 0

       2 V o v1 und v2 :- V o v1 V o v2

0 0 I

       3 V so v :- N s V o v

I T 0

       4 V o v :- V v N o

I R 0

       5 V o vh :- V h V o v

0 T R

           7 N Frank 13 V kussen¨ 16 V sah

0 T R

           8 N Julia 14 V umarmen 17 V horte¨

0 T R

           9 N Ute 15 V einladen 18 V sehen

0 R

       10 N Antje 19 V horen¨

0 R

       11 N Dieter 20 V lassen

0

   12 N Fred

Figure 3.13: Simple LMG for partial verb clause conjunctions. 80 Tuple-based extensions of context-free grammar connections to either MCFG or LFP. The current notation, which is half way between that of MCFG and iLFP, was introduced in a series of subsequent papers in 1995 and 1996. All material in these papers, except descriptions of Dutch in [Gro95a], and the correspondence between production notation and the current definite clause notation (called CPG) in [Gro96] is covered in equivalent or higher depth in this book. Most of the results presented in this chapter are covered in the literature, but have not been summarized in a single text. New are the idea that MHG are not sufficient to treat the Dutch crossed dependencies embedded in full sentences (which is further worked out in chapter 8), the LMG formalism and the resulting syntactical classification of the PMCFG hierarchy, and the remarkable observation that the partial verb clause conjunctions, unexpectedly, turn out to be weakly generable by a PMCFG. Remarkable, because it seems intuitively clear that the partial verb clause con- junctions are not truly a case of reduplication; the underlying non-terminal structure may be, but the words themselves are certainly not reduplicated to form the conjuncts. It seems, then, that a strong generative capacity argument is needed to prove the in- adequacy of PMCFG w.r.t. this construction. This is discussed further in chapter 8. A note that I put here hesitatingly is that the proof in [SMFK91] that PMCFG is a substitution-closed AFL is not very explicit, which leads me to have some doubt as to the closure under substitution; the proof claims that closure under substitution simply follows from the definitions. This is clearly the case for linear MCFG, but a parallel

production

  hxi h i A f a where f xx

will produce a recursive step w ww. The standard proof of closure under substitution

j  g    f j

would produce fvv v w which is not the desired set w w uv u v

 g w . I have not found a way to fix this. This may be due to my own limited imagination, but if the proof cannot be substantiated, this would break the result that PMCFG can generate the German partial verb clause conjunctions. I have not found the authors and other experts in the field able or willing to comment on this issue, so it remains open, but I found it nevertheless worthwhile to include in the discussion. The scope of the examples in this chapter is highly limited. It is often thought that an argument based on co-ordination as the running example in this chapter does not give much insight in the nature of more basic properties of syntax, because it has a number of ‘mathematical’ or ‘logical’ aspects that may not be part of the primary linguistic interface. A larger overview of phenomena and generative capacity arguments is given in chapter 8. Chapter 5 will give a detailed treatment of the computational properties of simple LMG and another restricted version called bounded LMG.

An interesting open question is whether there is a number n such that n-S -LML include, for example, all MCFL, all PMCFL. It seems reasonable to assume that this is indeed a hard question; see the last paragraph on page 118. Part II

Computational Tractability

81

Chapter 4 Complexity and sentence indices

COMPUTATIONAL TRACTABILITY is an important motivation for several issues in the design of the formalisms of the previous chapters, and for the principle-based formu- lations still to come in part III. Tractability can be divided into (i) a formal requirement of polynomial complexity for algorithms that recognize a language or construct a structural representation of a sentence and (ii) efficient and straightforward implementation on practical real-world computer systems. The first three chapters of part II will go into the different aspects of formal tractability in connection with the formalisms from part I. Chapter 7 will look at aspects of practical implementation. For most of this thesis, a very simple approach to complexity is sufficient, and I will concentrate on developing good intuitions for making a “quick complexity diagnosis” rather than going through the formal material, which is of less immediate benefit to the every day (computational or generative) linguist. Nonetheless, to support these quick diagnoses, an amount of mathematical labour is unavoidable. This chapter is a basic introduction to the notions in complexity theory I will be using; the fast reader can proceed straight to example 4-11. Readers less experienced in complexity theory should read this chapter entirely, and may decide to skip large parts of sections 4.4 and 5.1.

83 84 Complexity and sentence indices

4.1 DeÆnitions

I will first introduce a number of concepts that will need some form of definition but will be used rather loosely in what is to follow. As is usual in formal theory of computation, when talking about complexity, interaction is disregarded.

4-1 deÆnition: device. In the following, a DECISION DEVICE M is an abstract ma- chine that can perform, at any time, the following series of actions:

1. Receive an INPUT STRING w in an INPUT LANGUAGE Lin. 2. Perform a computation that takes an integer number t,orinfinitely many units of TIME; in the latter case it is said that the device does not terminate on input w. If the machine terminates, it has used an integer number s of units of SPACE. 3. On termination of the computation, output accept or reject.

Furthermore, the steps taken in the computation, and consequently the time and space requirements of the computation, and the output string, are uniquely determined by the input string.

The input language is usually taken to be T for an INPUT ALPHABET T. The set of

input strings upon which computation terminates and outputs accept is the language

  L M recognized by the device. A decision device can be extended to a PROCEDURAL DEVICE when the contents of the storage at the end of the computation are taken to represent an OUTPUT STRING

encoded in some OUTPUT ALPHABET Lout. In the case of a pure decision device, the

g output language is faccept reject .

* The notion of a device is intended to capture that of a COMPUTER (a class) equipped with a PROGRAM (an algorithm). The notion of a problem is introduced below because in general one wants to talk about the difficulty of a task regardless of what type of computer system is performing it. The two models further specifying the notion of a computer I will use are BOUNDED ACTIVITY MACHINES,orTURING MACHINES, and RANDOM ACCESS MACHINES. Turing machines are popular in the study of large abstract classes of problems, whereas random access machines are closer to physical computers and as such give a simple model for more precise analysis of the difficulty of practical problems. The first concern in the design of computer algorithms is to know whether a problem can be solved on a certain class of devices. This is formalized in the following definitions.

4-2 deÆnitions: algorithm, problem. Devices can be divided into CLASSES in two typical ways. (i) by further specifying the nature of the computation (step 2); an ALGORITHM for machines of a class C is a specification of a device in that class. 4.1. DeÆnitions 85

(Infinite) storage INPUT

Finite control unit OUTPUT

Figure 4.1: The architecture of a decision device.

in out

(ii)APROBLEM is a function f : L L . A problem defines a class Cf of devices  such that every device in Cf terminates, and produces output fw on input w. Cf is

out

f g the class of devices that SOLVES f. A problem for L  accept reject is called a DECISION PROBLEM and specifies a sublanguage of Lin.

4-3 deÆnition: decidability and enumerability. A problem is DECIDABLE for a DC

class D of devices if f is not empty. A decision problem is ENUMERABLE for a D

class D if there is a device in that terminates with output accept on input w such  that fw accept, and does not terminate or outputs reject otherwise.

* The problems I will look at in this thesis are the following.

4-4 deÆnition: recognition, parsing.

A device solving the UNIVERSAL RECOGNITION PROBLEM for a grammatical formalism F F is a decision device which given a grammar G and a string w, outputs yes if G recognizes w, no otherwise.

A device solving the FIXED RECOGNITION PROBLEM for a language L is a decision

device which given a string w, outputs yes if w L, no otherwise.

A UNIVERSAL PARSER is a device that given a grammar G and a string w, outputs all structural analyses G assigns to w.

A FIXED PARSER for a grammar G is a device that given a string w, outputs all structural analyses G assigns to w. 86 Complexity and sentence indices

* If a problem is computable, the intrinsic difficulty of a problem, i.e. the maximum possible efficiency in solving a problem, will be assessed in terms of the time and space it takes to compute a solution. Time and space are functions of the input string. It is customary to try to express these only in terms of the length of the input string; furthermore, one is only interested in the rough growth behaviour of time and space consumption. This is usually done in LANDAU’s ORDER NOTATION.

4-5 deÆnition: order notation. Let f and g be functions from nonnegative integers

 

to nonnegative integers. Then f is said to be O g if there are constants c1 and c2 such

    that for every k c1, f k c2g k .

* The complexity statement (1.35) repeated here as (4.1) hence says that given a context-

free language L there are a device A and constants c1 c2 such that given a string w

3 of sufficient length n, device A decides whether w L in time t c1n and using no 2 1 more storage than c2n (measured in some elementary unit, say bytes).

(4.1) Fixed recognition for context-free languages can be performed

3 2

  O   in O n time and n space

The statement

3

  (4.2) Fixed recognition for context-free languages has O n time

2

 

complexity and O n space complexity adds to the previous that for any device A, there are strings w1 w2 and constants c1 c2 3 such that decision by A whether w1 L takes at least c1n time, and decision by A 2 whether w2 L requires at least c2n units of computer storage. In these statements (4.1) and (4.2) one further refinement is left out, that is relevant only in some cases: the statements do not limit the class of devices they are talking about. Of course one is generally talking about a class of devices that has similar characteristics to everyday computers, but the required characteristics vary upon what one is looking for; this will be explained in the next few sections. An important class of decision problems is PTIME or simply P that contains those problems that can be solved in POLYNOMIAL TIME on a class of machines representative

for real-world computers (BOUNDED ACTIVITY MACHINES,defined in section 4.3). That is, given a problem p PTIME, there are a device A and integer numbers c d such d that A solves p on any input string of length n in no more than cn time.

4-6 deÆnition. Let C be a class of devices (say that of random access machines or Turing machines, defined later).

1 The constant c1 in the definition of order notation is of minor importance, and can be set to the highest number for which g is zero; this is usually 1. 4.1. DeÆnitions 87

d

C O   PTIME-C is the class of problems solvable by a device in in n time for an integer degree d.

cn

C O   EXPTIME-C is the class of problems solvable by a device in in time 2 for

some c.

C O   LOGSPACE-C is the class of problems solvable by a device in in log n space. r.e. is the class of problems enumerable on a .

* When a problem is called (IN-)TRACTABLE, it is usually meant that it is (not) solvable in polynomial time; but use of the term tractable often implies the wish that the degree d of the bounding polynomial is low—under 10, say. Polynomial time complexity and tractability are an important motivation in the design decisions underlying the LMG formalism and the empirical study in part III. Exponential time will enter the discussion in section 5.4 on BOUNDED LMG and the cLFP calculus and chapter 6 on the complexity of extraposition grammar. Nondeterminism and alternation will be defined in section 4.4.2

2The reason NPTIME is not defined here, is that I will view nondeterministic and alternating Turing machines as separate classes of machines: NPTIME-TM = PTIME-NTM. 88 Complexity and sentence indices

4.2 Index LMG and the predicate descent algorithm

Before I introduce informal definitions of the machine types, it is good to have a closer look at the sort of problems and solutions that are to be computed. I will now show how one moves from clauses about strings to clauses about integer positions in strings, then how to translate those clauses into simple procedures in imperative programming style, which I will call PREDICATE DESCENT ALGORITHMS; with a note on formalization of these intuitive algorithms. The predicate descent approach is somewhat different than what is common in Computer Science, where much emphasis is laid on going through the input string from left to right. From a theoretic point of view, there is usually no difference between left and right, and predicate descent algorithms are thus closer to the original formulation of the grammar than conventional parsing or recognition algorithms, and easier to grasp. They can be more easily extended to procedures for extensions of context-free grammar, while preserving the general results on the complexity of context-free recognition and parsing.

Index LMG

For the purpose of designing these concrete recognition algorithms, it is useful to introduce a form of LMG that does not have concatenation, but manipulates INDICES in the input string.

4-7 deÆnition: index LMG. An INDEX LITERAL MOVEMENT GRAMMAR (iLMG)is



a tuple G N T V S P where N and V are disjoint finite sets of nonterminal symbols and variable symbols respectively, and T N is a set of terminal symbols. S N is

the start symbol, and P is a finite set of clauses

:- 1 2 m

where m 0 and each of 1 m is a simple predicate



Ax1 xp

 where p 1, A N and xi V. The :- symbol is left out symbol when m 0. The

nonterminal on the LHS of the clause (in ) must not be in T.

An ITEM is a predicate with integer arguments instead of variables:

 Ai1 ip

A clause R P is INSTANTIATED into an item by substituting for each variable an

integer between 0 and n.

An iLMG is interpreted as follows. Let an input string w  a1 an T be

  given. Then for each 1 i n the formula G w ai i 1 i is true. 4.2. Index LMG and the predicate descent algorithm 89

For each instantiated clause

:- 1 2 m

if all of G w 1 G w 2 G w m are true, then it can be concluded that

G w is true.

  The grammar G now recognizes a string w when G w S 0 n .

* In chapter 5, proposition 5-6 it will be proved that every simple LMG has a weakly equivalent iLMG. For now, it will suffice to illustrate how a CFG can be translated to an iLMG, taking the weak equivalence for granted.

4-8 example. The CFG of example 1-2 is equivalent to the iLMG in figure 4.2.

sub 0

       1 C i k :- C i j V j k

0 0 I

       2 V i k :- N i j V j k

I T 0

       3 V i k :- V i j N j k

I R 0

       4 V i k :- V i j V j k

I

     5 V i j :- swim i j

T

     6 V i j :- drank i j

T

     7 V i j :- drink i j

R

     8 V i j :- saw i j

R

     9 V i j :- see i j

R

     10 V i j :- hear i j

R

     11 V i j :- help i j

0

     12 N i j :- Frank i j

0

     13 N i j :- Julia i j

0

     14 N i j :- Fred i j

0

    

15 N i j :- coffee i j

     16 C i j :- that i j

Figure 4.2: iLMG for the English CFG on page 15. 90 Complexity and sentence indices

Predicate descent algorithms

The iLMG representation of a grammar is a good starting point for writing a straight- forward computer algorithm performing string recognition. The style of the MEMOING PREDICATE-DESCENT ALGORITHMS used in this thesis is largely due to LEERMAKERS [Lee93], but I replaced the left-to-right analysis by an undirectional approach that captures, in a general fashion, the usual informal complexity reasoning method based on counting the number of sentence indices that are involved in checking a clause of

the grammar.

 Given an iLMG G N T V S P (to stay with the case of the previous exam- ple, let’s say corresponding to a context-free grammar in Chomsky normal form), a recognition algorithm can be constructed as follows. Construct so-called MEMO TABLES. These are arrays in computer memory that

contain for each possible ITEM, that is an instantiated iLMG predicate , a cell that can contain the values true, false or unknown. All entries in the memo table are initialized to the value unknown.

There is a function for each nonterminal symbol, whose results, after being com-

 puted, are stored in the memo table. To prove an item Ai k the function for nonter- minal symbol A is called with i and k as arguments, which will try each clause for A, and then try all instantiations of a possible additional variable j appearing on the RHS of a clause. While a nonterminal is being checked, it is marked false in the memo table, so that indefinite recursion is avoided—after all, if there is a cyclic derivation of an item, then there is also an acyclic derivation. The function for the nonterminal V0 of the English CFG is displayed in figure 4.3.

sub

 Given a string w of length n, the algorithm is invoked as C 0 n . Because every combination of a function and two integer arguments (every item) is memoized after it is computed, each item needs to be computed only once.

2

j j  Complexity analysis. In this context-free case, there are in total O N n items, so this is the space complexity of the algorithm. Because the grammar has at most two

nonterminals on the RHS of its rules, each function has at most one loop ranging at

  most 0 to n. So each function call will consume O n time. The number of calls made is proportional to the number of items; therefore under the assumption that memoing

3

j j  is an elementary step, the time complexity of the algorithm is O N n . The general case. From an arbitrary iLMG, a recognition algorithm can be constructed analogously. If the arity of the predicate on the LHS of a clause R in the iLMG is pR, and the total number of variables used in the clause is rR, then there are

pR n items that trigger the clause. For each of these items, there are qR  rR pR loops

rR

O j j  ranging at most 0 n, leading to a time complexity of N n , where R is the

pR

j j  clause in the grammar with the highest value for rR. The total space used is O N n where R is the clause with the highest arity predicate on its LHS. In section 5.2, the values of p, q and r are calculated for a number of concrete formalisms and grammars in the LMG hierarchy mentioned in chapter 3. 4.2. Index LMG and the predicate descent algorithm 91

0

 V i k :

0

  if memo table entry for V i k unknown then return memoed value else

0



memo V i k as False loop j  i k

0 I

   if N i j True and V j k True then

0

 write V i k True into memo table return True

return False

Figure 4.3: Predicate descent procedure for a context-free nonterminal.

Random access machines

Although the illustration of the predicate descent algorithms is of an informal nature, it is not extremely laborious to formalize it; the bottleneck in the reasoning is the phrase

“under the assumption that memoing is an elementary step”, which is to say: a value

  at an arbitrary place in computer memory can be looked up in O 1 time, that is, in a fixed amount of time independent of how much storage is used. Provided that the total amount of computer memory used is within a computer’s RANDOM ACCESS MEMORY, this is indeed the case. This is formalized in the following definition; see [HS74] for a more detailed treatment.

4-9 deÆnition. A RANDOM ACCESS MACHINE (RAM) is a device whose computa- tional unit has the following structure: its storage consists of an unbounded array of

REGISTERS Ri indexed with an integer i 0, and each register contains, at any step in the computation, any nonnegative integer.

The input language of a RAM is INI where INI is the set of nonnegative integers.

In linguistic practice, one takes a terminal alphabet T and maps each terminal a T to an integer number. On receiving its input, the RAM writes the input string into its first n registers; the other registers are set to zero. Each step in the computation is the performance of an INSTRUCTION. The RAM

has a PROGRAM P consisting of a finite number n of instructions, labelled I1 In; 92 Complexity and sentence indices

the machine starts at instruction I1. The following instructions are allowed:

copy 0 to Rj add 1 to Rj

copy R to R j Ri R if-zero Ri then goto instruction j accept reject

In all except a successful if-zerogoto case, or an accept or reject instruction, after

performing the instruction, the computation will continue at the next instruction.

g The output language of a RAM is faccept reject . The computation ends at an accept or reject instruction. An M-RAM has the additional instructions

add Ri to Rj multiply Rj by Ri

A BOUNDED (M-)RAM isa(M-)RAM with a fixed number memtop; the contents of all registers, memory cells, and addresses during the computation are required to be less than memtop. Arithmetic operations are performed modulo memtop.

A bounded M-random access machine is sufficiently close to everyday computers3 to provide a good basis for reasoning about the practical value of an algorithm. The context-free recognition algorithm sketched here can be feasibly implemented on such a bounded machine, if memtop is set to an acceptable value (a few millions). Nevertheless, even then the algorithm will be able to recognize (not even to parse) strings up to a length of at most a few thousand terminals on a end-20th century state- of-the-art computer. For computer languages, algorithms exist that perform much better than that, but their complexity analysis is more informal and often depends on an average-case situation. Another model of computation used frequently is a TURING MACHINE, which does not have elementary random access; these will be discussed now.

3

  An everyday computer being one with a multiplication operation that takes O 1 time. Of course, when looking at a bounded RAM model, order functions become formally ill-defined and can only be used for informal reasoning. 4.3. Deterministic bounded activity machines 93

Infinite random access array of unbounded−valued registers R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 ...

INPUT

Control OUTPUT I0 I1 I2 I3 I4 I5 I6 Ilast... Finite program

Figure 4.4: The architecture of a random access machine.

4.3 Deterministic bounded activity machines

Where random access machines are a formal model close to computers in the real world, formal language theory prefers to talk about TURING MACHINES or BOUNDED ACTIVITY MACHINES as a model of general symbolic computation. The reader who is interested only in informal complexity reasoning may skip the discussions of Turing machine implementations, and concentrate on the predicate descent variants. The storage of a bounded activity machine consists of cells each containing one of finitely many symbols; the cells are ordered and cannot be accessed immediately; instead, the machine, like a tape recorder, has a HEAD which moves over to the cells it wants to read from or write to. The following definition is sketchy; a formal definition of Turing machines will be given in section 4.4.

4-10 deÆnition. A (deterministic) BOUNDED ACTIVITY MACHINE (BAM) is a decision

device with the following refinements: its storage consists of a finite number of TAPES

ti 1 i k of dimension di, each consisting of CELLS indexed by a di-tuple of integers. The device has a HEAD on each tape, which is, at each step in the computation, above one of the cells of the tape. The control unit is, at each step of the computation, in a STATE q. The finite set of possible states Q includes special states q0, qaccept and qreject At each step of the computation, the control unit decides, based uniquely on its state q and the contents of the cells under its heads, to take one or more actions of the following types:

write a symbol to one of the tapes

move one of the heads over 1 cell position in any direction 94 Complexity and sentence indices

INPUT Infinite multi−dimensional finite−cell tapes

«» «» «» «» «» «» «» «»

Finite state control OUTPUT

Figure 4.5: The architecture of a bounded activity machine.

go to a different state

The computation begins in the INITIAL STATE q0, and ends when the machine enters one of the states qaccept or qreject. ATURING MACHINE is a bounded activity machine whose tapes are all of dimension

di  1.

* It takes a BAM time to move through its storage; however, this time is no more than a polynomial function of the size of its input. Therefore PTIME-BAM is the same class of problems as PTIME-RAM [HS74]. For concrete problems with an exact

6

  polynomial time bound such as O n , this is not a trivial matter. An example illustrating the nontriviality is that EARLEY’s well known algorithm

3

  [Ear70] for context-free recognition in O n time on a RAM can be constructed on a BAM with a two-dimensional tape, but on a standard Turing machine one has not

4

  been able to achieve a better time complexity than O n . Henceforth I shall be interested, as most of the literature, in time complexity on machines with any number k of 1-dimensional tapes—i.e. on regular Turing machines. It is a known result that a Turing machine with an arbitrary number k of tapes can be simulated on one with only 2 tapes, where the time consumption grows logarithmically; for a polynomial-time algorithm, this means that the time will increase only by a constant factor. To illustrate deterministic Turing machines, I now present a sketch of a CKY recognizer for MHG (see section 3.2). The informal predicate descent algorithm will turn out to be constructible on a Turing machine preserving the time complexity of its RAM equivalent; this works because in this case one can find a suitable ordering of the 4.3. Deterministic bounded activity machines 95 items in the storage, as is done by YOUNGER for the COCKE-KASAMI-YOUNGER (CKY) algorithm [You67]. The algorithm works directly on a deterministic 1-dimensional,

6

  k-tape Turing machine, where k is at least 4, and has the O n time complexity that is currently the best known upper time bound to MHG recognition.

3

  The CKY algorithm for context-free grammars obtains an O n time complexity by looking at grammars in Chomsky normal form, and ordering the recognized “items” (the same ones as in the predicate descent algorithm) by their length as a substring of the input. This ordering prevents the Turing machine from excessively moving around, which would lead to a higher time complexity than a corresponding RAM algorithm.

4-11 example: Turing recognizer for MHG. An MHG derives pairs of strings;

in a derivation of an input string w, arguments to instantiated predicates are always

  

substrings of w, and a predicate Au v A l1 r1 l2 r2 consistently satisfies l1

r1 l2 r2.

 This means that items can alternatively be represented as t A f l r where t is the total length and f the length of the first component:

(4.3) l  l1

r  r2

  

t  r1 l1 r2 l2

 

f  r1 l1

   The head grammars are assumed to have terminal rules of the form Aa or A a

only, i.e. ones that do not produce pairs of strings with a total length of zero. Any

i

MHG can be normalized to such a grammar plus a possible rule S h .

 f g j

Order the nonterminal alphabet N A1 A2 AjN . The Turing recognizer



has three tapes, on each of which the items are layed out from left to right, t

            before t 1 ; t Ak before t Ak1 ; t Ak f before t Ak f 1 , &c. For each item, there is a single cell containing the value 0 (not recognized) or 1 (recognized). The total amount of tape space used is proportional to the number of items, i.e. n4; a number of additional tapes may be used for storing counters ranging from 0 to n. The algorithm proceeds in n PHASES; it checks the items of total length 1 first (these are terminal productions), then those of length 2, using the results of the items of length 1, the ones of length k using the results for the items of length shorter than k, and so forth until length n. The head of the first tape moves only to the right in single steps, and writes the results of combining items of shorter length read from the second and third tape. After each phase, the newly written contents of the first tape are copied to the second and the third tape (alternatively, one can think of the machine as having three heads on a single tape).

5

  In each phase, the machine runs through an O n loop for each nonterminal A and

each clause for A. To describe what is done in this loop, the three types of rules must

  be translated to the new representation. The wrapping rule A head-wrap H C is 96 Complexity and sentence indices

the most fun example; the corresponding TM loop is shown in figure 4.6.

     (4.4) Av1u1 u2v2 :- H u1 u2 C v1 v2

v u u v (4.5) A 1 1 2 2

lA lAA + f rAAA − t + f rA u u H 1 2

lH lHH + f rHHH − t + f rH v v C 1 2

lC lCC + f rCCC − t + f rC

      

(4.6) tH tC A fH fC lC rC :- tH H fH lH rH tC C fC lC rC

  g f Tape 1 (write) head is at tA A 0 0 0 .

4



move head 2 to tH  1; head 3 to tC tA 1[1n ] while tH tA, i.e. tC 0

3 move head 1 to fA  0[n n ]

while fA tA

2 3



move head 2 to fH  0; head 3 to fC fA [n n ] while fH fA, i.e. fC 0

3 2

 move heads 1, 3 to lA  lC 0[n n ]

3 2



head 2 to lH  n tA fC [n n ]

while lA  lC n tA, i.e. lH fC

4

 move heads 1, 3 to rA  rC n [n n]

4



head 2 to rH  n tC fC [n n]

   while rA  rC lA tA, i.e. rC lA tA tC fC 5 copy boolean and of values on tapes 2 and 3 to tape 1 [n 1] move head 1 to --rA;head2to--rH;head3to--rC move head 1 to ++lA; head 2 to ++lH; head 3 to ++lC move head 2 to ++fH; head 3 to --fC move head 1 to ++fA; move head 2 to ++tH; head 3 to --tC

4 3

  

copy values written on tape 1 to tapes 2 and 3 [1 n n ]

  Figure 4.6: Procedural description of actions for A head-wrap H C .

Such induction on the total length of the items can be carried out for any of the tuple- based formalisms that are bottom-up nonerasing and top-down linear (i.e. PMCFG, see definition 3-26). The length of a tuple of strings is then simply the sum of the lengths of the substrings; the equivalent of the Chomsky Normal Form reduction used in the construction for modified head grammar can be replaced with the one in [SMFK91]. 4.4. Alternation and LFP calculi 97

4.4 Alternation and LFP calculi

This section shows that the auxiliary grammar system iLMG generates precisely the languages recognizable in deterministic polynomial time, by establishing a correspon- dence between iLMG and ROUNDS’ logical system iLFP, which itself is related to logarithmic space-bounded ALTERNATING TURING MACHINES in [Rou88]. This construction serves to shed light on the historical background of what is done in the next chapter for simple LMG. I will first define ATM; then iLFP, and then sketch the link between iLMG and iLFP.

4-19 4-14 iLMG iLFP LOGSPACE-ATM 4-20 4-14

Figure 4.7: Results in this section.

Alternation

Bounded activity machines were defined only informally in the previous section. The following definition of alternating Turing machines also serves as a formal definition of the 1-dimensional variant of a BAM, i.e., a deterministic Turing machine. It is a slight generalization of the definition in [CKS81]: I make no distinction between read-write tapes and input tapes. Informally, a NONDETERMINISTIC Turing machine is a machine that can be used to try unboundedly many computations at the same time. An ALTERNATING Turing machine can, in addition to nondeterministically checking whether one of many con- tinuations of a computation succeeds, also be asked to check whether all of a number of continuations succeed; another word for this is PARALLELISM.

4-12 deÆnition. An ALTERNATING TURING MACHINE (ATM) is a tuple

 M k Q Σ q0 g , where k is the number of tapes;

Q is a finite set of STATES including accept and reject;

Σ is a finite TAPE ALPHABET consisting of an input alphabet T plus the symbols $ (end marker) and # (blank symbol);

k k k

    f g f  g 

Q Σ Q Σ # 1 0 1 is the NEXT MOVE RELATION;

g g : Q f designates each state UNIVERSAL or EXISTENTIAL. 98 Complexity and sentence indices

The next move relation should be understood as follows: when the machine’s state



is q1 the symbols under the tape heads are a1 ak , and delta contains an element

      

d q1 a1 ak q2 b1 bk m1 m2

   then the machine can, when gq ,ormust, when g q , make the following

step: the symbols b1 bk are written to the tapes, the machine moves the heads

  left, when mi  1, right, when mi 1, or not at all, when mi 0, and finally enters state q2. The machine may be thought of as executing any number of threads in parallel. A CONFIGURATION or INSTANTANEOUS DESCRIPTION of the machine is an element of the set

k k

 f g  CM  Q Σ # INI representing the machine’s state, the contents of the k tapes, and the positions of the

tape heads.

The machine has k tapes which each contain the input string w T in the initial

configuration (heads at position 0)

  

M w q0 w w 0 0

z z

k k The SUCCESSOR RELATION , derived immediately from ,defines how configurations

change when a move is made; given a configuration , there may be several SUCCESSOR

CONFIGURATIONS such that . Formally, when

1 1 1 k k k

   (4.7) q1 a1 ah1 an1 a1 ahk ank h1 hk

1 k

        (4.8) q1 ah1 ahk q2 b1 bk m1 mk

1 1 1 1 k1 k k k

 

 

(4.9) q2 a1 ah1 1b1ah1 1 an1 a ahk 1bkahk 1 ank

  h1  m1 hk mk

ACCEPTANCE by an ATM is defined as follows: Let be a configuration, whose state

is q.

If q is accept, M accepts ;

if q is reject, M rejects . If q is a universal state, M accepts if it accepts all successor configurations of

. If q is an existential state, M accepts if it accepts in one of the successor

configurations of .

  Now M accepts a string w if it accepts the initial configuration M w . 4.4. Alternation and LFP calculi 99

4-13 classiÆcation. (i)ANONDETERMINISTIC TURING MACHINE (NTM)isanATM without universal states. (ii)ADETERMINISTIC TURING MACHINE (DTM)isaNTM whose next move relation

is a function (this definition supersedes the informal 4-10).

 (iii) Tape number i of an ATM is READ-ONLY if in each d as above, ai bi.

* The following proposition is an important result. It would go too far to even repeat sketches of the proofs here; but I will prove a version of a subproblem in the next chapter: the case of iLMG and read-only ATM.

4-14 proposition. [CKS81]

LOGSPACE-ATM = PTIME-DTM  SPACEn -ATM = EXPTIME-DTM

Least Æxed point calculi

Grammars in iLMG can be regarded as “economic” versions of formulae in ROUNDS’ [Rou88] calculus iLFP which defines languages in terms of integer arithmetic, and describes exactly the languages that can be recognized in polynomial time. This is proved in [Rou88] using a correspondence to logspace-bounded ATM and applying

proposition 4-14.

f g

4-15 deÆnition: iLFP. Let Ivar  x1 x2 x3 be an indexed set of INDIVID-

UAL VARIABLES, and Pvar a set of PREDICATE VARIABLES (A B C ), and let each  predicate variable A be assigned an ARITY aA .

Formulae in the calculus iLFP are constructed inductively as follows:

   A 

1. If A Pvar, n a A and x1 xn Ivar, then x1 xn is a formula.

2. If and are formulae and x Ivar, then , , x (&c.) are formulae.

3. A TERM in iLFP is an individual variable, the symbol 0, the symbol $, or t  1

where t is a term.

If t1 and t2 are terms, then t1  t2 is and t1 t2 are formulae.



4. A RECURSIVE SCHEME Φ is a tuple Φ1 Φ2 where Φ1 is a finite subset of Pvar

and Φ2 is a function that assigns to each A Φ1 a formula A. The DEFINING

   CLAUSE for A is then A x1 xaA A. A recursive scheme can be

written down as a list of defining clauses.

If is a formula and Φ is a recursive scheme, then Φ is a formula.

4-16 example: translation of tuple grammars to iLFP. The context-free gram- mar of figure 1.1 is translated into iLFP as (4.10), entirely along the lines of its trans- lation to iLMG. It is now immediately obvious that a similar translation exists for any 100 Complexity and sentence indices

grammar that can be represented in iLMG. Moreover, such an iLFP translation does  not contain terms or (in)equality subformulae other than x  0 and y $.

sub 0

       (4.10) C i k j C i j V j k

0 0 I

     V i k j N i j V j k

I T 0

      V i k j V i j N j k

R 0

    

j V i j V j k

 swimi k

T

     V i k drank i k drink i k

R

    

V i k saw i k see i k

   heari k help i k

0

    

N i k Frank i k Julia i k

   Fredi k coffee i k

sub

      Ci k that i k C 0 $

sub sub

     The formula C 0 $ is an abbreviation for x y x 0 y $ C x y

* The recursive schemes in iLFP are as the clauses of an iLMG, or the productions of a CFG. The interpretation of an iLFP formula closely resembles that of a CFG using

the least fixed point interpretation of 1-9.AniLFP formula is interpreted w.r.t. to an

input string w  a1 a$ that is represented as a set of axioms: for each 1 i $,

 the predicate ai i 1 i is taken to be true.

4-17 deÆnition: interpretation of iLFP. A PREDICATE ASSIGNMENT is a set of

 

 

predicates A i1 iaA where i1 ia A are integer numbers between 0 and n.

Let w  a1 a$ be a sentence, where a1 a$ are in Pvar. The INITIAL

     ASSIGNMENT 0 0 w contains all and only the predicates ai i 1 i .

A VARIABLE ASSIGNMENT is a function from Ivar to integer numbers between 0

 and n. The INTERPRETATION  of an iLFP formula is a function from nonterminal

assignments to variable assignments. A formula is TRUE for w if its interpretation

  0 applied to the initial assignment is not empty.

The interpretation is defined inductively as follows:

    f j    g

1.  A x1 xn A x1 xn

    

2.  and so forth for the other logical connectives

   f j    g   3.  x i n : x i , where x i is the variable

assignment that agrees with except for the variable x, to which it assigns i.

4.4. Alternation and LFP calculi 101

    f j     g  

4.  t1 t2 t1 t2 where t is the integer number obtained

  by substituting x for every variable x occurring in t and substituting n for $.

Analogously for .

S

k

 j          f   5. Φ Φ where Φ A i i A

k0 1 a A

     g Φ1 and Φ2 A such that xk ik

* The interpretation of a recursive scheme here needs, in contrast to the previous example

of the fixed-point construction in 1-9, to be made continuous by explicitly joining in

4

 

with  Φ .

 It is now straightforward to check how formula (4.10) says that S0 $ should hold, where S is the smallest relation over two integer numbers satisfying the clauses between the square brackets.

Let’s look at the sentence that Julia drank coffee, and let’s denote the recursive 

scheme in (4.10) by Φ. Then n  $ 4, and

 f    

(4.11) 0 that 0 1 Julia 1 2

   g drank2 3 coffee 3 4

T 0 0

    f         g (4.12) 1 Φ 0 0 V 2 3 N 1 2 N 3 4 C 0 1

I

    f   g (4.13) 2 Φ 1 1 V 2 4

0

    f   g (4.14) 3 Φ 2 2 V 1 4

sub

    f   g

(4.15) 4 Φ 3 3 C 0 4

f    

 that 0 1 Julia 1 2

   drank2 3 coffee 3 4

T 0 0

       V 2 3 N 1 2 N 3 4 C 0 1

I 0 sub

     g V 1 3 V 0 3 C 0 4

4

k

   Φ 0

k0

sub

   So  Φ C x y 4 contains the assignment x 0 y 4, and formula (4.10) is true. The following proposition links iLFP and deterministic polynomial time; I will not give the proof here, but the constructions in section 5.1 will closely resemble those of Rounds.

4-18 proposition. [Rou88] The languages recognized by iLFP formulae are the same as those recognized by an ATM in log space.

*

4This is not the case in Rounds’ original formulation, which limits the use of the negation operator in such a way that atomic predicates can occur only positively. 102 Complexity and sentence indices

While iLFP is useful for showing rigorously that for a given linguistic formalism there must be a fixed recognition procedure with a polynomial time complexity, it doesn’t give a precise upper bound to this complexity. For example, Rounds speaks

18

  of an O n upper bound to recognition time of head grammars, which are known

6 12

  O   to be recognizable by more informally defined device in time O n , and n for

3

  context-free grammars as opposed to the generally accepted value O n . Rounds doesn’t go into great detail, and closer inspection of the proof of 4-14 shows that the results will probably get even a bit worse: the number of variables in the iLFP formula is not squared, but raised to the power 3 in the 1-tape simulation (see proposition 5-8).

Relating iLMG to iLFP

Although as said, the next chapter will carry out a version of Rounds’ proofs directly

for iLMG and S -LMG, and hence it will follow that iLMG and iLFP both describe PTIME, it is worth looking in a bit more detail at how iLMG and iLFP relate. Clearly, iLMG can be regarded as a sort of ‘undressed’ version of iLFP; the equivalence of the formalisms suggests that universal quantification, negation, successor, equality and inequality in iLFP are redundant. Proposition 4-20 shows that not only this is the case, but also the number of variables in a clause of a corresponding iLMG is not larger than that in the original iLFP formula.

4-19 proposition. For every iLMG, there is a weakly equivalent iLFP formula.

Proof. Straightforward. The clauses of the iLMG directly represent a recursive scheme. It is a standard exercise (done for CFG in chapter 1, proposition 1-10)to show that the fixed point semantics of the iLFP scheme is equivalent to the derivational

semantics of the iLMG.

4-20 proposition. For every iLFP formula, there is a weakly equivalent iLMG.

Proof. (outline) (i) collapse all embedded recursive schemes into one recursive

scheme; this can be done by standard variable renaming techniques. Assume moreover

  that the formula is of the form Φ S 0 $ , where the schema Φ does not contain the constants 0 and $. (This can be done by increasing the number of arguments to all

predicates defined in Φ by 2).

(ii) note that can be represented as ; and introduce new predicate variables for

subformulae of defining clauses until each defining clause is either in - - normal

  

form or of the form Ax1 xn B x1 xn . (iii) translate the negation free, normal form clauses into an iLMG, where , and

1 are replaced with predicates, the clauses for which are straightforward.

  (iv) for every A B clause left untranslated, add a nonterminal B , and replace all occurrences of A in the iLMG with B . Construct clauses for these negated nonterminals as follows: 4.4. Alternation and LFP calculi 103

for B T: there will be a number n of clauses for B:

     Bx1 :- C11 y11 C1m1 y1m1 .

.

    

Bxn :- Cn1 yn1 Cnmn ynmn

Introduce new nonterminals B1 Bn , and the following clauses:

    

B x1  :- B1 x1 Bn x1

   B1 x1 :- C11 y11 .

.

   B1 x1 :- C1m1 y1m1 .

.

   Bn xn :- Cn1 yn1 .

.

  

Bn xn :- Cnmn ynmn



for B T: Let B b be a terminal symbol. Then add the clauses

      

b x z :- c x y ITS y z [for all b c T]

     

b x z :- b x y ITP y2 z

  

b x y :- ITS y x



ITSx x

  

ITSx y :- ITP x y

    

ITPx z :- c x y ITS y z [for all c T]  (ITS and ITP are nonterminals that recognize arbitrary strings in T and T ,

respectively).

4-21 corollaries. The construction in the proof of proposition 4-20 does not increase the total number of variables appearing in the iLFP clause during the translation process to iLMG, provided that an iLFP formula is in the normal form described under (i)in the proof of proposition 4-20). This has the following consequences: 1. For iLFP, as for iLMG, the highest number of variables appearing in a clause of a formula determines the polynomial bound for the time complexity;

2. From item (iv) in the proof it follows that n-iLMG (and therefore n S LMG, see the following chapter) are closed under complement; in fact negation can be added at clause level, yielding a formal5 proof that grammars in the n-MCFG or simple n-LMG spectrum enriched with REJECT PRODUCTIONS ([Vis97] p52ff)

remain in n-S -LMG; and hence preserve their polynomial complexity bound.

5Visser has a “proof by algorithm” of the tractability of reject productions—this algorithm however covers parsing whereas my construction concerns recognition only. 104 Complexity and sentence indices

Conclusions to chapter 4

This chapter introduced a number of well-known concepts,partly in a notational system and from a point of view that clears the way for the approach of the next few chapters. It has also shed light on the roleˆ of sentence indices in informal complexity rea- soning, and how this leads to straightforward top-down parsing algorithms. The predicate-descent method proposed in section 4.2 is the most straightforward realiza- tion of such algorithms, and I haven’t seen this idea being formulated directly in the literature, which seems to have a focus on going through a string from its ‘beginning’ to its ‘end’—a model that becomes problematic when nonterminals are allowed to derive tuples of strings that may end up in different places in a derived sentence. An essential observation is that the undirectional approach does not break the standard

3 6

  O   time complexity results of O n for CFG and n for TAG, provided that the algorithms are formalized within the random access model of computation. Also new is the direct deterministic Turing machine implementation of a group of variants of YOUNGER’s algorithm. It seems to be adaptable to PMCFG without much effort, and the remarks by YOUNGER about universal recognition seem to hold too; certainly for MHG. I have not been able to find the time to investigate rigorously whether this indeed extends to a universal recognition algorithm for PMCFG that is polynomial in the length of the input; this seems to be in contradiction with claims made on the literature, but in section 5.3 in the next chapter, I will argue that this is not the case.

I have convinced myself that an extension of Younger’s algorithm to iLMG/S - LMG is considerably harder, because sharing on the RHS of LMG clauses destroys the possibility of induction on the size of the items. Moreover, the reduction result to an analogue of CNF does not seem to work for simple LMG in general. Although the iLMG formalism is fairly generally applicable for formalization parsing strategies, sometimes enriched versions of iLFP are more convenient, such as when I devise strategies for XG parsing in chapter 6. The reduction of iLFP to iLMG proved in the last section of the current chapter ensures that such rich descriptions preserve the possibility to use the predicate-descent complexity heuristics based on counting the number of variables in clauses. Chapter 5 Literal movement grammar

I promised in chapter 3 to show later that the general class of LMG generate all recur- sively enumerable languages, and that a restriction, SIMPLE LMG, has a recognition problem solvable in polynomial time. This is the subject of this chapter.

5-1 proposition. The class LML of languages recognized by generic LMG is pre- cisely the class of recursively enumerable languages.

Proof. It is a well-known result (see e.g. [Gin75] p. 125) that any r.e. language



L can be described as hL1 L2 where h is a homomorphism and the languages

L   L  L1  G1 and L2 G2 are context-free. Proposition 3-27 implies that each context-free grammar is an LMG; proposition 3-22 proved that LMG are closed under

homomorphism and 3-28 proves closure under intersection.

In the previous chapter I showed how integer sentence indices are the key to finding efficient—polynomial time—recognition algorithms. There, attention was limited to context-free grammar and head grammar, although I already defined iLMG as an intermediate language.

In this chapter, I will establish an additional proof cycle between S -LMG, iLMG and read-only alternating Turing machines, touching briefly on the correspondence be- tween this unorthodoxclass of read-only alternating Turing machines and deterministic polynomial time. I will also investigate how one may obtain tight bounds to the exponent c such

c

  that classes of simple LMG are recognizable in O n time, look at the problem of universal recognition and the discrepancy between Turing machine and random access models, and discuss another version of LMG that corresponds to deterministic exponential time.

105 106 Literal movement grammar

5.1 Complexity of simple LML

Chapter 3 introduced simple LMG, without arguing for the particular format of clauses that this restriction required. The constraint of noncombinatoriality naturally arises when one is thinking about a naive top-down implementation of a recognition algorithm

based on LMG. But the why precisely simplicity as stated in definition 3-26? The examples have already shown how LMG subsumes the chain CFG HG

LCFRS PMCFG of formalisms of increasing generative capacities. The fixed recognition problems for these formalisms are known to be decidable in polynomial time. I will now show that simple LMG is interesting from a formal point of view, be- cause it describes exactly the class PTIME of languages recognizable in deterministic polynomial time.

First, I will show that S -LMG can be mapped to iLMG; then, that iLMG can be

mapped to read-only ATM and finally, that read-only ATM can be modelled in S - LMG. The resulting chain of results is summarized schematically in figure 5.1. One

S -LMG PTIME-DTM

5-6 5-9 5-8

iLMG RO-ATM 5-7

Figure 5.1: Results in this section. link that is not made very explicit here or in the previous chapter— it was only briefly

sketched in proposition 4-20—is that S -LML indeed contains PTIME. However, it is well-known that read-only ATM amounts to the same thing as a logspace-bounded ATM; alternatively, the equivalent of 5-9 for logspace-bounded ATM can be found in [Gro98]. Section 5.2 will discuss obtaining tight bounds to the degree of polynomial- time algorithms for simple LMG.

Motivation for the simplicity constraint

Least fixed point semantics of rule based grammar are attractive in many ways; in the case of LMG, they will give a link to the integer sentence index model, as well as some insight into why the property of simplicity is a sensible restriction to the general form of LMG rules in order to get tractable recognition. iLMG as defined in the previous

chapter will serve as an intermediate representation.



5-2 deÆnition: lfp semantics for LMG. Let G N T V S P be an LMG. Let

A N be the set of nonterminal assignments: functions mapping a nonterminal to a set 5.1. Complexity of simple LML 107

of arbitrary tuples of strings over T. The set of productions P is then associated with an

  operator  G taking an assignment as an argument and producing a new assignment,

defined as follows: if is an assignment, and

1 1 m m

     

A w w :- B v v Bm v v 1 p 1 1 p1 1 pm

k k

    is an instantiation of a clause in P, and for each 1 k m, v v B ,

1 pk k

   

then w1 wp G A .

NA v t

Define the complete partial order  , the bottom element 0 and the join

 

as done before for context-free grammars; then  G is a continuous and monotonic

NA v

operator on  .

v    

(5.1) 1 2 A N 1 A 2 A :

 

(5.2) 0 A for all A N

G

   

(5.3)  X A A

X

 

The interpretation of a grammar is then the least fixed point of  G :

G k

   (5.4) IG G 0

k0

i.e. a function which takes a nonterminal and yields a set of tuples of strings. If the

  arity of start symbol S throughout the grammar is 1, then I G S will be the language

recognized by the LMG in the traditional sense; otherwise obtain the language by

  concatenating the tuples in I G S . * As before, it is easy to check that the rewriting semantics and the fixed point semantics are equivalent. The fixed point semantics is a useful tool for several purposes. It is an interpretation of grammars that is both mathematically elegant and more detailed than a rewrite semantics. More detailed, because it does not merely characterize the language generated by a single designated start symbol, but characterizes each of the nonterminals separately (i.e. the way the grammar classifies phrase types), and reflects the derivational behaviour of the grammar, without looking at single derivations in particular. Now remember that the aim is to find out how the LMG grammars can be restricted in such a way that recognition can be performed as an alternating search in logspace. For a given string of length n, in space log n one can encode a bounded set of numbers ranging from 0 to n (in binary encoding). This means that the arguments of an LMG predicate in a derivation have to be encoded each with a bounded set of numbers. This is precisely what iLMG,defined in the previous chapter, does. To make the step from LMG to iLMG smaller, introduce the following interpre- tation relative to a given input string.

108 Literal movement grammar

NA u

5-3 deÆnition: substring Ælter. Let and define the operator restricting

  u    

an assignment to a string as follows: w1 wn w A iff w1 wn

    u   u A , and w1 wn are substrings of w; furthermore put G w G w.

*

The question is now: how can I make sure that

G

k G k

u      u 

 G 0 w G w 0?  k0 k 0

Redefine the fixed point semantics as follows.



5-4 deÆnition: integer lfp semantics for LMG. Let w a0a1 an1 be a ter- A minal string of length n; then N w is the set of INTEGER NONTERMINAL ASSIGNMENTS

FOR THE INPUT w, mapping a nonterminal to a set of tuples of pairs of integers

 

between 0 and n. Then define  G w as follows: if is an integer assignment and

   

A a a a a :- B a 1 a 1 a 1 a 1 p l1 r1 lp r 1 l r l r

1 1 p1 p1

 Bm alm arm alm arm

1 1 pm pm

is an instantiation of a clause in P, and for each 1 k m,

D E

k k k k

    l r l r B 1 1 pk pk k

then

h i h i       l1 r1 lp rp G w A

* It is important to see that what is done here, is in the general case not the same as taking the string-based least fixed point semantics 5-2, and intersecting the sets of tuples with the domain of substrings of a given w,asindefinition 5-3. An instantiated clause

1 1 m m

     Aw w :- B v v B v v 1 p 1 1 p1 m 1 pm

i

such that w1 wp are substrings of w, but the vj are not, will be ignored in the integer

 least fixed point semantics: even though all wi are substrings and Aw1 wp is derived by the grammar, it cannot be ‘reached’ with a derivation in which only substrings of the input appear.

So this type of clause needs to be ruled out; that is if w1 wp are substrings of i the input, then so must the vj. This now is why simple LMG is defined by disallowing terms other than single variables on the right hand side of the clauses. In this way, each rule can be uniquely replaced by a clause that is talking about integer positions instead of strings, while allowing the sharing constructions that cannot be handled by multiple 5.1. Complexity of simple LML 109 context-free grammars (cf. example 3-32). Note that variables on the RHS that do not appear on the LHS must also be disallowed, because these could be instantiated with any string.1

5-5 deÆnition: simple LMG, repeated. An LMG is called SIMPLE if its clauses

R P are all of the form

1 1 m m

     At t :- B x x B x x 1 p 1 1 p1 m 1 pm

i where the variables xj on the RHS need not be disjoint, but each these variables appears precisely once in t1 tp.

* Finally then, proposition 5-6 is the translation from simple LMG to iLMG, from

which it follows that S -LML is recognizable in polynomial time.

5-6 proposition. For every S -LMG there is a weakly equivalent iLMG.

Proof. Let G be a simple LMG. First, construct a weakly equivalent simple LMG

    G N T V S P such that for each predicate A t1 tn on the LHS of a clause, ti, either

type 1. n  1 and t1 is a single terminal symbol, or

type 2. all ti are sequences of variables only.

 

Then G N T V S P is constructed as follows.



Let V contain two unique new variables l and r. A clause Aa in P of type 1 is

  

translated as Al r :- a l r in P .

For each clause R of type 2 in G , P contains a clause R . Let the LHS of R be

i i i

 

At1 tn , and ti y yp (all yj are disjoint by definition). Then V contains

1 i

i unique new variables y0,1 i n, and R is obtained by replacing the LHS of R

1 1 n n i

 by Ay0 yp y0 yp and replacing each variable yj in the RHS of R by the two ii i n variables y y . j1 j

It is now straightforward, given w, to prove inductively that G recognizes a term

over substrings v  al ar, if and only if G recognizes the same term with each

substring v encoded as a pair l 1 r. Conversely, G will recognize only predicates over substrings of w because G is simple.

The converse, going from iLMG to LMG is not trivial, because we cannot simply take the integer arguments up as pairs i j such that i j. Instead, integers must be represented as substrings of the input string w. Instead, I will repeat Rounds’ ATM construction for the simpler case of read-only ATM, which is no harder than reducing iLMG to simple LMG.

1The decision not to allow variables to appear on the LHS more than once is more arbitrary—it does not influence the weak generative capacity. 110 Literal movement grammar

Read-only ATM

This is a simple reconstruction of Rounds’ implementation of iLFP on a regular alternating Turing machine. Because iLMG does not have the arithmetic operations of iLFP, a much simplified model of an ATM that has k read-only input tapes, and no read-write tapes, is sufficient, where k is the largest number of variables appearing in a clause of the iLMG. This then yields a slightly tighter polynomial time bound, when I work out the read-only alternative to CHANDRA et al.’s THEOREM 3.3 in [CKS81].

5-7 proposition. Every iLML can be recognized by a read-only ATM.

  Proof. Let an iLMG G  N T V S P be given. Assume that each nonterminal appears with only one arity; in particular, the start symbol appears uniquely with arity 2. For each clause R, pR is the number of variables on its LHS and rR the total number of variables in the clause.

Let r be the largest number of variables used in a clause in P; assume that V 

g fx1 xr ; for each clause, assume that the variables used are x1 xrR , and the variables on the LHS are x1 xpR . The ATM M now has r tapes, all of which are read-only and contain the input string when the machine is started. M has states qA for each nonterminal, qR for each R clause, ql for each clause R and each tape/variable number l, and a number of auxiliary

substates of these. Assume the input string is w  a1 an. In the initial state, the machine moves the head of tape 2 to its end (position n), and enters state qS.

The states license the following actions:



State qA : enter state qR



RP A  where PA is the set of rules whose LHS is an A-predicate. For nonterminal clauses R:

R State q : move tapes p  1 r to initial position, and enter state q R  R R pR 1

R R

    State q l r : enter state q if not at end, move head l right

l R l1

R 2

 State q : copy tape head positions and enter state q

r1 B

 

B on RHS of R

   For terminal clauses Ai j :- a i j :

State qR: check that the terminal under head 1 is a, then check that if head 2 is moved leftward, it points at the same cell as head 1; if success enter qaccept else qreject.

2In the worst case, copying the positions of the heads to copy the variables may require one additional tape. This can be circumvented by multiplying out the machine’s states over all permutations of the tapes. 5.1. Complexity of simple LML 111

Now M follows precisely the derivational interpretation of the iLMG w.r.t. the input

string w.

5-8 proposition. A k-tape read-only ATM can be simulated on a 1-tape DTM in time cn3k , where n is the size of the input string.

Proof. Analogous to [CKS81], THEOREM 3.3. Lay out the instantaneous descriptions

k

j j of the ATM M on the tape of DTM N. There are t  QM n IDs. The execution paths of the ATM are guaranteed to terminate in t steps or to run forever. Initially, mark all accepting IDs accept, rejecting IDs reject, all other IDs unknown.

Then do t passes over the tape; for each cell, look up the values at the successor

  IDs (this requires O t time as these configurations may be anywhere on the tape), and compute their 3-valued boolean product.

2 3

  So the machine makes t passes of each O t steps. The circle is made complete by showing that a read-only alternating Turing machine can be simulated by a simple LMG. A similar, more involved proof translating a

logspace-bounded ATM in S -LMG is in [Gro98].

L 

5-9 proposition. For any read-only ATM M, there is an S -LMG G such that G

 

L M . Proof. Let the input string be w  a1 an. Let w.l.o.g. universal states not move any heads. The LMG has one nonterminal for each state of the ATM, plus a start

symbol S and two auxiliary nonterminals PlusOne and PlusTwo. An instantaneous



description q h1 hk is represented by the instantiated LMG predicate

 qw a1 ah1 w a1 ah2 w a1 ahk w The first copy of w is the ‘fuel’ for the operations PlusOne and PlusTwo, that will be

responsible for simulating movement of the tape heads. The start rule is

   Sx :- q0 x x x x Now reading a tape, moving left and moving right can be straightforwardly simulated. Every m-fold existential state will correspond to a nonterminal with m clauses which each have on their RHS k auxiliary predicates and one successor predicate. Each m-fold universal state corresponds to a nonterminal with 1 clause whose RHS contains m successor predicates.

A simple example, assuming just one tape and a single successor state is

    

(5.5) q1 m xc yz :- PlusTwo m x y q2 m y m

  

PlusTwoam ax ay :- PlusTwo m x y

   PlusTwoabm ab for each a b T

3Note that this is a factor t more than Rounds suggests. I don’t see how this can be improved in this 1-tape construction, although it can be improved using a reduction from multi-dimensional BAM to 2-tape 1-dimensional TM. 112 Literal movement grammar

State q1 checks that the symbol under the head is c, and if so, moves the head one symbol to the right and checks state q2. The auxiliary nonterminal PlusTwo generates the right move.; PlusOne would generate no move, and using x directly generates a

left move.

  Finally, there is one clause qaccept x1 x2k1 , and no clause for qreject.

We can now conclude that iLMG, S -LMG and RO-ATM all generate PTIME.

  L 

5-10 corollary. S -LML iLML RO-ATM PTIME-DTM.

L  Proof. The equivalence of S -LML, iLML and RO-ATM follows by the chain 5-6, 5-7 and 5-9. Equivalence between iLMG and PTIME was already proved in the

previous chapter. Alternatively, RO-ATM in PTIME by 5-8 and PTIME in S -LML

by the proof in [Gro98]. 5.2. Finding the degree of the polynomial 113

5.2 Finding the degree of the polynomial

Under the RAM model of computation, recognition can be performed by the iLMG predicate descent algorithm outlined in the previous chapter (example 4-8). The algorithm will store iLMG items in its memo table, which results in a space complexity

p

  of O n where p is the largest arity of an iLMG nonterminal. Each item needs to be computed only once, and the function for a clause R contains qR nested loops over

0 n where qR is the number of variables on the RHS of the clause that do not appear

r

O     on its LHS. So the total time complexity is n where r maxRP pR qR is the largest number of variables in a clause in the iLMG.

5-11 classiÆcation: analysis of simple LMG examples. In terms of S -LMG, pR is 2 times the number, say aR, of arguments in a predicate occurring in the grammar, and qR corresponds to the maximal difference between the total number of variables in a clause and the number of argument positions of the predicate on its LHS—each additional variable introduces an extra index that needs to be looped over. Because the time complexity can be calculated as the sum of the times spent for each single

clause, this means that the polynomial complexity bound for the grammar is r  

maxRP aR bR where bR is the total number of variables in the clause R. The space  complexity bound p is 2a where a maxRP aR. To see that this analysis also works for grammars like (3.16) which have both terminal symbols and variables on the LHS (which means they don’t satisfy the w.l.o.g. condition on the format of clauses in proposition 5-6), it should be noted that the terminals do not introduce any independent index values, because their length as a string is always 1. The bounds thus obtained for formalisms such as MHG whose recognition problems have been studied in the literature are tight; no better results are known. This is summarized for formalisms in chapter 3 and grammars in section 3.5 in figure 5.2. n The PMCFG grammar for a2 in figure 3.7 is left out of this analysis because it has reduplication which is not allowed in simple LMG. When it is translated to a simple LMG using an Eq predicate, the resulting time complexity would be n4; however, a smarter predicate descent approach which would check that the two occurrences are

Grammar a max. aR  bR p =2a r space time 2 3 bilinear CFG 1 1  2 2 3 n n 4 6 MHG 2 2  4 4 6 n n n n n 4 4 (3.16) (a b c ), p72 2 2  2 4 4 n n 8 8 figure 3.10 (Dutch), p74 4 3  5 8 8 n n 4 5 figure 3.13 (German), p79 2 2  3 4 5 n n

Figure 5.2: Tuple grammars and their complexity bounds. 114 Literal movement grammar

equal strings immediately gets a time bound of n3, because there are n2 items each

  taking O n time for splitting up the argument string and checking the equality. So use of reduplication accounts for a single extra factor n in all cases. This is precisely what is done in the prototype implementation described in section 7.2.

* The bounds obtained here can be formalized in terms of random access machines. In the Turing machine constructions of section 5.1 however, the polynomial bounds

2r

  I just calculated are tripled. This can be reduced to only O n by the following argument: the predicate descent algorithm does not need random memory access if it is executed on a Turing machine with an r-dimensional tape. It is a known result that an n-dimensional BAM can be reduced to a 2-tape 1-dimensional DTM, where the execution time gets squared. This is the best general result that I can see, but for MHG, this construction yields

12 6

  O   an O n algorithm as opposed to the n implementation illustrated in section 4.3. It remains remarkable, and unsolved, that while the Turing recognizer for MHG of section 4.3 can be extended to one for PMCFG that has the correct polynomial bound

6

  of O n , this seems to be impossible for the general case of simple LMG, because due to the possibility of sharing variables on the RHS, the induction on the total length of the string arguments fails.4 One the one hand a conclusion could be that a Turing model is highly idealized and very suitable for reasoning about large classes such as PTIME, but less suited for talking about bounded polynomial classes. On the other hand, the RAM model is an idealization too, because it assumes the presence of limitless amounts of memory: a space complexity of n4, assumed layed out as a 4-dimensional array in computer memory, cannot realistically be called ‘tractable’. A sentence of 10 words would already require 20.000 units of storage; a Pascal program of numerous kilobytes would not be machine parsable. In practice therefore, one often reverts to representing the storage through linked data structures to avoid reserving storage that will not be used. These multi-dimensional linked lists need to be traversed, and although this does not double or triple the polyno- mial bound as in the generic Turing implementation, it does increase time complexity by a factor n (linear storage) or log n (storage as binary trees).

4It would moreover be interesting to see if this descrepancy between formalisms like n-PMCFG whose underlying tree sets are local, and the intersection-closed simple n-LMG, also yield different results for other known methods of predicting tractability and polynomial bounds, such as [McA93]. It seems unclear to me what model of computation McAllester relies on, and whether this has consequences for his analysis. 5.3. Parsing and universal recognition 115

5.3 Parsing and universal recognition

Putting the discrepancy between the RAM and TM models aside, another thing to note is that so far the complexity study was limited to FIXED RECOGNITION. Without leaving the formal complexity perspective of this chapter, important questions to be raised are (i) what about UNIVERSAL RECOGNITION, that is, writing an algorithm that performs its task for any given grammar, rather than recognizing a single language and (ii) how about algorithms that return a representation of the possible derivations of a sentence rather than a crude yes or no? I will pay most attention here to the first question, leaving the second to chapter 7, which is of a more practical nature; formally, it would suffice to say that the memo table after execution of a predicate descent algorithm contains information that could be qualified as a parse forest. A parse forest, however, is an unformalized notion, and it is unclear whether it must contain explicit ‘links’ to daughter nodes, or whether these need to be inferred from the grammar. As to the first question, it has been argued in the literature that when size parameters of the grammar are unknown in advance, that is the length of rules, and the number of variables in the case of LMG type grammars, then universal recognition will take at least exponential or nondeterministic polynomial time. For the latter class, it is unknown whether there are problems that in PTIME-NTM but not in PTIME-DTM

(the P  NP problem). Examples of such claims are BARTON,BERWICK and RISTAD [BBR87] (NP-hardness for GPSG and LFG), and KAJI et al. [KNSK92] (EXP-POLY time hardness for PMCFG). The content of these universal recognition results must be assessed very carefully— what the authors suggest is often not what they are proving. When the NP-hardness of LFG and GPSG is discussed in [BBR87], great effort is made to stress that it is not a good argument to give an a priori bound to the number of features needed in any possible natural language grammar. For any known such bound, a language may be found that must be classified as a natural language, but needs more features to be strongly-adequately described. However, this does not imply that the individual size parameters of the grammar, that is the number of nonterminals, the dimensions of a feature space, and in the case of LMG, the different numbers of variables in clauses should be considered irrelevant in determining the complexity. And this is exactly what both [BBR87] and [KNSK92] do: they express the execution time as a function of the sum of the sizes of the grammar and the input. Moreover, they do not look at an arbitrary grammar and an arbitrary input, but given a sentence, generate a grammar, putting input and grammar together in one pigeon-hole. This leads to apparently contradictory results: both of the following two statements can be proven:

5-12 proposition. [KNSK92] The universal recognition problem for PMCFG is EXP-POLY time hard.

5-13 proposition. The universal recognition problem for PMCFG can be solved 116 Literal movement grammar

3r

  in time O cn , where r is as in the previous section, and c depends only on size parameters of the grammar.

Proof. Solve the problem immediately for iLMG. Do the translation of an iLMG straight into DTM instead of using RO-ATM as an intermediate stage. The DTM takes an iLMG G and a string w as its input. It computes the maximum size of the RHS of clauses, and the parameters p and r. It then proceeds to simulate a DTM performing fixed recognition for G. It lays out a machine-readable representation of the finite state control device onto an extra tape; this is done in time dependent on the size of G only. It calculates an elapse counter, and enters a state simulate and further follows the instructions it has written on the extra tape. This final stage runs in time 3r n multiplied by a factor dependent only on the grammar.

This is of course not to say that this ‘factor dependent on the size of the grammar’ is to be neglected—in fact it very much needs to be studied in detail, but if so, independent of the size and characteristics of the input. Under the RAM model, the exponent 3r is reduced to r. A similar argument holds for the case of GPSG which is less interesting at this point, but is relevant to chapter 7.AGPSG can be encoded in a CFG, and universal

3

  CFG recognition can be done in O n time. In this case it has been proven that the resulting ‘constant’ is of problematic proportions, but again, it does not depend on the size of the input. While Barton et al.’s result does not look at these size parameters individually, it is actually very worthwhile to know what quantitative properties of language are responsible for exponential growth of time complexity and which are not. 5.4. Complexity of bounded LML 117

5.4 Complexity of bounded LML

Apart from iLFP, [Rou88] also introduces a calculus called cLFP which talks not about integer indices but about strings and concatenation. This is of course very similar to LMG; Rounds however decides to put a bound on the instantiations given to variables ranging over terminal strings. The result is that cLFP describes precisely the class EXPTIME of languages recognizable in deterministic exponential time.

5-14 deÆnition: cLFP. A cLFP formula is as an iLFP formula as defined in 4-15, but replace the definition of a term by:

3. A TERM in cLFP is a terminal symbol, an individual variable, or t1t2 where t1 and t2 are terms.

If t1 and t2 are terms, then t1  t2 is a formula.

The semantics is modified as follows: a PREDICATE ASSIGNMENT is a set of predicates

   A w1 waA over terminal strings. There are no axioms; the initial assignment

0 is empty. The only interpretation rule that needs to be redefined is the one treating

quantification:

   f j j j    g

3.  x w w n : x w .

    f j      g  

4.  t1 t2 t1 t2 where t is the string obtained by

  substituting x for every variable x occurring in t.

5-15 proposition. [Rou88] cLFP generates precisely the class of languages recog- nizable in time cn on a DTM.

* Without going into details, the following equivalent bounded semantics can be given for LMG.

5-16 deÆnition: bounded LMG. A generic LMG generates a string w under the

BOUNDED INTERPRETATION if the notion of instantiation is modified as follows: an

 INSTANTIATED PREDICATE is of the form Aw1 wp where w1 wp are of length at most n, where n is the size of the input string. Now interpret the LMG as in definition 3-19, but use the new notion of instantiation and perform interpretation relative to the input string w.

5-17 proposition. Bounded LMG and cLFP recognize the same class of languages.

* This correspondencebetween bounded interpretations and EXPTIME will prove valu- able when I look at the complexity of classes of XG in chapter 6. 118 Literal movement grammar

Conclusions to chapter 5

The results on alternation and the complexity of simple LMG presented here are simple modifications to analogous work done for the calculi iLFP and alternating Turing machines in [Rou88] and [CKS81]. The added value of this chapter is the

format of iLMG and S -LMG. The notation of iLMG is more elementary, that of LMG is more grammar-like, and the range of operations allowed in iLMG is minimal w.r.t. the rich logical language iLFP. So I have proved that there is an amount of redundancy in the LFP calculi. On the other hand, the extra operations in iLFP make it a more versatile formalism. This will become clear when iLFP is used to characterize extraposition grammar in the next chapter. By looking at the RAM model of computation, the concrete polynomial bounds can be improved drastically. Under the Turing machine model, the bounds were improved only slightly: in [Rou88], there was a small constant in addition to the exponent 2r that has been removed in the constructions in this chapter; for MHG, this leads to

12 20

  O   an O n complexity instead of n . The relative gain of this improvement by a constant polynomial factor is smaller for more complex formalisms. It is regrettable that I didn’t get Younger’s deterministic Turing algorithm to work for iLMG in general; the only hint as to whether this should at all be expected solvable is the remark at the end of section 5.2 on the practical infeasibility of directly accessible memo tables. In chapter 7 and in the Epilogue I will get back on the issue of universal recognition.

Another unsolved question, that is more appropriate to mention here than at the end

 S of chapter 3 is whether n-S -LMG is strictly contained in n 1- -LMG. This seems to be nontrivial, because it would have remarkable consequences for bounded-degree

r

  subclasses of PTIME. A result using [CKS81] to prove the fact that O n -time-DTM is included in r-iLML seems feasible, but is not a strict equivalence because for the converse, I have only been able to get the correspondence under the RAM model. Chapter 6 Computational properties of extraposition grammar

This chapter will show first that XG in their original form from [Per81], i.e. under the semantics of definition 2-2, describe any recursively enumerable language, which implies that even though the extraposition construction in XG may be “clear and concise”, it cannot be considered minimal in expressive power. In section 6.1, I will show that when the number of brackets possibly introduced in a derivation is bounded, a recognition algorithm can be constructed that runs in exponential time. This is still not satisfactory if one insists that structural analysis should be tractable (hint E in the Prologue). In 6.2, the loose interpretation of definition 2-6, in combination with the island rule 2-7, is then shown to have fixed recognition problems in deterministic polynomial time. It has been shown in chapter 2 that XG under this interpretation are, from the perspective of linguistic structure, not less capable than with the original semantics. The recognition algorithms in this chapter are built on the material developed in the previous chapter; to be precise, iLFP (or equivalently, simple LMG) allows one to count brackets, cLFP (or equivalently, bounded LMG) allows one to keep track of sequences of brackets with different labels.

119 120 Computational properties of extraposition grammar

6.1 Non-decidability of generic XG

Thus far, there are no results in the literature on the computational properties of extraposition grammar. This section shows that the original formulation of XG runs into problems because in general, it is not decidable whether a sentence is recognized by an a priori given XG. Before giving this result, it is useful to make a series of simplifications to the semantics of XG.

Recall the rule types of definition 2-1:

1. A X1X2 Xn (a CONTEXT-FREE RULE)

2. A B X1X2 Xn (an ELLIPSIS RULE)

and their rewriting interpretation:

1. A X1X2 Xn

2. A B X1X2 Xn

6-1 proposition. The additional requirement in the interpretation of the ellipsis rule

that contains no nonterminal symbols, does not change the semantics of XG.

Proof. Let d be the derivation of a terminal string from the start symbol S. Look at a

step s in d where an ellipsis rule A B C is used; this step is A B C .

Suppose contains nonterminal symbols. Then for each of these nonterminals, there

will be a rule, applied after step s, to eliminate the nonterminal. Whatever such a

rule is, it cannot depend on and because is enclosed in brackets. Hence these

rules could also have been applied before step s. Repeat the observation to obtain a derivation in which contains no nonterminal symbols.

The modification in the semantics imposes a stronger restriction on the order of the application of rules in an XG rewriting sequence. Quite remarkably, this extra restriction seems to capture precisely what the brackets were used for in the original

interpretation: requiring in the interpretation of the ellipsis rule to be terminal seems to be the same as requiring the derivation graphs to have a planar representation as explained in section 2.1. Hence the following bracket-free XG semantics.

6-2 deÆnition. BRACKET-FREE REWRITING in an XG is defined over unbracketed

 

sequences N T of nonterminal and terminal symbols as in a context-free grammar. Let and be such sequences, let w be a terminal string and let a rule of

type 1, 2 be in P, then we have, respectively:

1. A X1X2 Xn

2. AwB X1X2 Xnw

6-3 proposition. Bracket-free derivation is equivalent to the original definition in 2-1. 6.1. Non-decidability of generic XG 121

Proof. One implication is obvious given proposition 6-1. As to the other: suppose there is a bracket-free derivation, then there is a derivation according to the original

definition (2-2). This is also simple: let d be a bracket-free derivation, then let the

bracketed derivation d apply the same sequence of rules. Then d is a correct deriva-

tion, because whenever it applies an ellipsis rule in a step A B C , must be a bracketing of a terminal string. So it only puts brackets around terminal strings;

that is no nonterminals appear inside matching brackets. Since is immediately next to the two nonterminals A and B, there cannot be any unmatched brackets in .

In three steps, I will now show that the class XL of languages recognized by an XG includes the r.e. languages, and hence that recognition is not decidable. First, I show that XL is closed under homomorphism. I then show how the bracketing mechanism of XG derivations can be used to mimic a context-free derivation. Finally, I construct a bilinear XG that will recognize the intersection of two context-free languages. I will use the bracket-free XG derivation of definition 6-2.

6-4 proposition. The languages recognized by XG are closed under homomor- phism.

Proof. Analogous to the one for context-free grammars [HU79]; replace terminals a

 in the XG rules by their images ha .

I now begin the construction that mimics the nonterminal rules of a context-free

grammar in Chomsky normal form (see theorem 1-14) using ellipsis rules.



6-5 deÆnition. Let G N T S P be a context-free grammar in CNF, that is, its

productions are of the form S , A a or A BC. Then define the set of

XG productions PX as follows: for every rule A BC in P let PX contain the rule B C A.

6-6 deÆnition: sound labeling. Let G be a context-free grammar. A G-SOUND

LABELING is a sequence A1w1A2w2 Anwn such that for each 1 i n, Ai wi.

R

6-7 lemma. Let be a G-sound labeling, R PX and . Then is a G-sound

labeling.

 

Proof. Let R be B C A; the general case is BuCv , and Auv ,

where and are G-sound labelings, so is either empty or begins in a nonterminal.

Because is a G-sound labeling, B u and C v.SoA uv, and is a G-sound

labeling.

6-8 deÆnition. Let G be a context-free grammar. A G-SOUND TERMINAL LABELING

of a string w  a1a2 an is a labeling A1a1A2a2 Anan such that for each 1 i n,

P contains the production Ai ai. 122 Computational properties of extraposition grammar

P

 6-9 proposition. Write to denote that can be rewritten to ’ in zero

or more steps using productions in P only. Let G be a context-free grammar and

w  a1a2 an a terminal string. Then A w if and only if there is a G-sound

PX

 terminal labeling of w such that Aw.

Proof. The if part follows from the lemma. Only if is proved by induction on the depth of the context free derivation tree; let R be the rule applied in its top node;

PX



Terminal case R  A a. Obvious: Aa Aa.



Nonterminal case R  A BC.IfA w then there are u and v such that w uv,

B u and C v. By i.h. there are terminal G-sound labelings and such

PX PX

  that Bu and Cv. Since PX contains the production B C A,

PX PX

  this means that BuCv Auv.

6-10 proposition. Let L1 and L2 be context-free languages. Then there is an XG X

  such that L X L1 L2.

Proof. Let L1 and L2 be recognized by the CNF context-free grammars G1 

   

N1 T S1 P1 and G2 N2 T S2 P2 , respectively, and assume w.l.o.g. that

  f g  N1 N2 . Then construct an extraposition grammar X N1 N2 Σ Φ T Σ P

as follows: 1. P contains Σ if the empty string is in both L1 and L2.

2. P contains the rule Σ ΦS1.

3. P includes the nonterminal rules of G1.

4. If P1 contains A a and P2 contains B a, then P contains the rule A Ba. 

5. P includes PX G2 .

6. P contains the rule Φ S2 .

Let d be any X-derivation of w. Then it is easy to see that there is a derivation d of w

that applies the rules of type 1 first, then rules of type 2, and so forth.

Let w  a1a2 an be nonempty. If d is a derivation of w, then d starts in the

 derivation S ΦS1 ΦB1a1B2a2 Bnan of a terminal G2-sound labeling using only rules of type 2, 3 and 4. From this initial segment of d one immediately has a G1-derivation of w. Because d will end in an application of rule 6, one must have

PX

 B1a1B2a2 Bnan S2w, which by proposition 6-9 is equivalent to saying that G2 derives w.

The reverse implication is now obvious.

6-11 corollary. The class XL of languages recognized by an extraposition grammar includes all recursively enumerable languages.

Proof. It has now been proved that XL is closed under homomorphism and contains

j g the class fL1 L2 L1 and L2 context-free ; it is a standard result (see e.g. [Gin75] p.

125) that any such class includes all r.e. languages. 6.2. Bounded and loose XG recognition 123

6.2 Bounded and loose XG recognition

To obtain the promised tractability results for classes of extraposition grammars, I will again develop an alternative, equivalent way of looking at XG derivation; now in terms

of a context-free grammar that generates bracketed strings.

6-12 definition: corresponding CFG. Let G = (N, T, S, P1 ∪ P2) be an extraposition grammar, where P1 contains context-free rules only, and P2 contains only ellipsis rules. Then the CORRESPONDING CONTEXT-FREE GRAMMAR C(G) is the context-free grammar (N, T ∪ {⟨B, ⟩B | B ∈ N}, S, P1 ∪ P3) where for every ellipsis rule A … B → γ in P2, P3 includes the two rules

(i) A → γ ⟨B

(ii) B → ⟩B

6-13 proposition. Let G be an extraposition grammar. Define the rewriting operation ⊸ over bracketed terminal strings as follows: for every B ∈ N,

u ⟨B v ⟩B w ⊸ u v w

(where v contains no brackets). Then G derives a string w iff C(G) derives a bracketed string w̄ such that w̄ ⊸* w.

Proof. The definition of ⊸ is merely a formal characterization of the idea that G derives w if and only if C(G) derives a balanced bracketing of w.

Clearly, if there is a G-derivation of w, then there is also a C(G)-derivation of a balanced bracketing w̄ (the bracketing is identical to that in the XG derivation according to definition 2-1).

As to the converse, if there is a derivation of a balanced bracketing w̄ in C(G), then it must be shown that the context-free derivation can be transformed such that for each pair of matching brackets, the rules R1: A → γ ⟨B and R2: B → ⟩B applied to produce these brackets follow each other immediately. This then corresponds to applying the single XG rule A … B → γ.

It is straightforward to see that this can indeed be done. Order the bracket pairs rightmost–innermost, which is to say that the opening brackets are ordered from right to left. Repeatedly take the next pair from this list, and write out the least number of rules needed to introduce the brackets. It must be shown that in doing so, no other new brackets need to be introduced. This cannot happen through a rule of type (ii), because such a rule has no nonterminals on its RHS, so its application can be postponed as long as we want. Rules of type (i) cannot be needed either, because applying one would introduce another opening bracket to the right of the opening bracket we want to introduce; but in a rightmost–innermost rewriting strategy, that bracket must already have been treated.
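Since ⊸ always cancels an innermost matching pair, checking whether a bracketed string w̄ reduces to w amounts to ordinary Dyck-style bracket matching. The following Python sketch assumes a single unlabelled bracket pair, as in the Dutch example below; the names are mine. It returns the bracket-free residue of w̄, or None if the bracketing is unbalanced:

    def unbracket(wbar):
        """Exhaustively apply  u < v > w  -o  u v w  (v bracket-free)."""
        stack, out = [], []      # out: symbols emitted since the last open '<'
        for sym in wbar:
            if sym == '<':
                stack.append(out)
                out = []
            elif sym == '>':
                if not stack:
                    return None              # unmatched closing bracket
                inner, out = out, stack.pop()
                out.extend(inner)            # the innermost pair vanishes
            else:
                out.append(sym)
        return None if stack else out

For the bracketed string of figure 6.1, tokenized as ['dat', 'Frank', 'Julia', 'koffie', '<', '<', '<', '>', 'zag', '>', 'drinken', '>'], the function returns the Dutch sentence with the brackets removed, so C(G) together with unbracket captures exactly the G-derivable strings.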

...dat Frank Julia koffie ⟨ ⟨ ⟨ ⟩ zag ⟩ drinken ⟩

Figure 6.1: Derivation in the corresponding context-free grammar.

6-14 example. In the example grammar for Dutch from section 2.1, there is only one ellipsis rule; therefore only one bracket symbol will be required. Rule [5] is replaced with

[5] NC → N0 NC ⟨

and the following rule is added:

[5′] Ntrace → ⟩

The context-free derivation corresponding to the derivation in figure 2.4 is shown in figure 6.1.

* Given this context-free model, it is considerably easier to design a simple recognition algorithm that is similar to known algorithms for context-free grammars and the tuple-based extensions illustrated in the previous chapters. The idea is straightforward: augment a left-recursion-proof memoing recursive descent algorithm for context-free grammars with a mechanism that ensures that the derived strings have a balanced bracketing. Note that the algorithm should be able to perform this task taking as its input the original string w, and not its bracketed version w̄. For the simple case that there is only one ellipsis rule, this can be done by counting the number of unmatched brackets for each recognized constituent. This case includes the grammar for Dutch subordinate clauses.

Because this section is interested primarily in computational complexity, I will give translations to LFP calculi rather than concrete algorithms. The step to such concrete algorithms is analogous to the step made for the examples in the previous chapters. To obtain grammars with favourable recognition properties, the following restriction must be made:

6-15 definition. Let G = (N, T, S, P) be an arbitrary extraposition grammar; G is LINEAR BOUNDED (under the loose interpretation) if for any bracketing w̄ of a string w with S ⇒* w̄ (under the loose interpretation), the number of bracket pairs in w̄ is not greater than the length of w.

* Instead of checking the linear-boundedness property of a grammar, one will usually check some sort of OVERT EXTRAPOSITION property, which says that applying an ellipsis rule A … B → γ necessarily immediately introduces one or more terminal symbols; i.e., fillers cannot be empty. Equivalent properties in the literature often also follow from lexicalization.

6-16 proposition. If an extraposition grammar G has only one ellipsis rule and is linear-bounded, then there is an algorithm that decides G-membership for a given string of length n in O(n^7) time.

Proof. Let the input string be w = a0a1 … an−1. Translate the grammar G to a recursive iLFP scheme Φ_G using the rules in figure 6.2.¹

Intuitively, it is straightforward to check that a formula A(i, j, l, r) is true whenever A recognizes a bracketing v̄ of v = ai … aj−1, where l is the number of unmatched closing brackets in v̄, and r is the number of unmatched opening brackets. Now the context-free grammar derives a balanced bracketing of a string w = a0a1 … an−1 if and only if the formula S(0, n, 0, 0) is true in the least fixed point of Φ_G under the assignment induced by w.

The only perhaps non-straightforward clause is that for A → BC; the two disjuncts correspond to the two situations (6.1) and (6.2).

(6.1) r1 ≤ l2: all unmatched opening brackets of B are matched by unmatched closing brackets of C; then l3 = l1 + (l2 − r1) and r3 = r2.

(6.2) r1 ≥ l2: all unmatched closing brackets of C are matched by unmatched opening brackets of B; then l3 = l1 and r3 = r2 + (r1 − l2).

In checking intuitively that the formula must be true, I overlooked that the iLFP semantics of the arithmetical operations is modulo-n arithmetic. Therefore this only holds if it is known in advance that the number of unmatched brackets on either side will never exceed n. This prerequisite is satisfied by the linear-boundedness property.

Finally, the scheme's largest rule contains 9 variables, so a naïve implementation of the iLFP scheme would give an O(n^9) algorithm. However, careful study of the rule shows that in each of the inner disjuncts, two variables are fixed given the other seven.

¹This scheme contains proper terms within atomic predicates, and the operators + and ≤. These can be translated to bare iLFP by adding clauses with at most 3 variables.

CFG rule        enhanced iLFP clause

A → a           A(i, j, 0, 0)  ←  a(i, j)
A → ε           A(i, i, 0, 0)  ←  true
B → ⟩B          B(i, i, 1, 0)  ←  true
A → B C         A(i, k, l3, r3)  ←  B(i, j, l1, r1) ∧ C(j, k, l2, r2)
                    ∧ ( (r1 ≤ l2 ∧ l3 = l1 + l2 − r1 ∧ r3 = r2)
                      ∨ (r1 ≥ l2 ∧ l3 = l1 ∧ r3 = r2 + r1 − l2) )
A → C ⟨B        A(i, j, l, r + 1)  ←  C(i, j, l, r)

Figure 6.2: Translation scheme for XG with one ellipsis rule.

It is now straightforward to make similar constructions for the general case and for the loose semantics.
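To make the translation scheme concrete, here is a naive bottom-up fixpoint computation of the items A(i, j, l, r) of figure 6.2 in Python. The rule encoding ('t', 'e', 'close', 'bin', 'open') is my own, the grammar is assumed to be pre-binarized with a single bracket pair, and the counts are capped at n, which is licensed by linear-boundedness. The thesis's actual route is via iLFP and memoized predicate descent; this sketch only illustrates the clauses, not the O(n^7) algorithm itself.

    def recognize(rules, start, w):
        """Close the item set A(i, j, l, r) under the clauses of figure 6.2;
        l and r count unmatched closing/opening brackets in the bracketing.
        rules: ('t',A,a) | ('e',A) | ('close',B) | ('bin',A,B,C) | ('open',A,C)."""
        n = len(w)
        items = set()
        for r in rules:                       # base cases
            if r[0] == 't':
                items |= {(r[1], i, i + 1, 0, 0) for i in range(n) if w[i] == r[2]}
            elif r[0] == 'e':
                items |= {(r[1], i, i, 0, 0) for i in range(n + 1)}
            elif r[0] == 'close':             # B -> '>' : one unmatched closer
                items |= {(r[1], i, i, 1, 0) for i in range(n + 1)}
        changed = True
        while changed:
            changed, fresh = False, set()
            for rule in rules:
                if rule[0] == 'bin':          # A -> B C, the two disjuncts
                    _, A, B, C = rule
                    for (X, i, j, l1, r1) in items:
                        if X != B:
                            continue
                        for (Y, j2, k, l2, r2) in items:
                            if Y != C or j2 != j:
                                continue
                            if r1 <= l2:
                                l3, r3 = l1 + l2 - r1, r2
                            else:
                                l3, r3 = l1, r2 + r1 - l2
                            if l3 <= n and r3 <= n:
                                fresh.add((A, i, k, l3, r3))
                elif rule[0] == 'open':       # A -> C '<' : one more opener
                    _, A, C = rule
                    for (Y, i, j, l, r) in items:
                        if Y == C and r < n:
                            fresh.add((A, i, j, l, r + 1))
            if not fresh <= items:
                items |= fresh
                changed = True
        return (start, 0, n, 0, 0) in items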

6-17 proposition. If G is an arbitrary linear-bounded XG, then membership for G can be decided in deterministic exponential time in terms of the length of the input.

Proof. (sketch) Instead of iLFP, use cLFP, and instead of keeping track of the number of brackets, look at arbitrary sequences of different brackets. Because of

linear-boundedness, the length of the sequences is bounded by the length of the input.

The loose semantics of definition 2-6 can be given a corresponding CFG equivalent along the same lines as in proposition 6-13. In this case, counting brackets is sufficient again, because brackets need to be matched only if they have the same label.

6-18 proposition. Let G be an extraposition grammar. Define the rewriting operation ⊸ℓ over bracketed terminal strings as follows: for every B ∈ N, if v does not contain any B-brackets ⟨B and ⟩B, then

u ⟨B v ⟩B w ⊸ℓ u v w

Then G loosely derives a string w iff C(G) derives a bracketed string w̄ such that w̄ ⊸ℓ* w.

Proof. Entirely analogous to 6-13.

The calculus of figure 6.2 can now be extended by introducing a pair of arguments (l, r) for each of the differently annotated bracket pairs ⟨B and ⟩B; it will ensure that the derived bracketed string is balanced only w.r.t. each individual pair of brackets. The complexity of the resulting algorithm becomes O(n^(3+4m)), where m is the number of bracket pairs.

6-19 proposition. If G is an arbitrary linear-bounded XG under loose derivation, then membership for G under loose derivation can be decided in deterministic poly- nomial time in terms of the length of the input.

* Finally, the iLFP translation can be extended by specifying the deduction rule for the island production in figure 6.3, which requires that the numbers of unmatched opening and closing brackets in C are both zero; the example assumes one ellipsis rule as in figure 6.2, but can be extended to take arguments for each bracket label.

Island rule     iLFP clause

A → ⟦C⟧         A(i, j, 0, 0)  ←  C(i, j, 0, 0)

Figure 6.3: Translation for the island rule.

Conclusions to chapter 6

This chapter concludes the presentation of XG started in chapter 2. No results were previously known on the computational properties of XG, and all material in this chapter is original.

Not discussed is how a concrete predicate descent algorithm can be constructed for XG under loose derivation. Although this is rather straightforward, and the complexity heuristics are preserved, such an algorithm allows practical implementations less directly than an algorithm for LMG. The most important reason is that the 'bracket' arguments to items must be guessed. In chapter 7, the cost of such guessing is cut down by trying only a limited selection of items based on properties of the grammar; it is unclear how this should be done for XG in a top-down parsing technique. It is not unlikely that a GLR-based algorithm could be constructed that takes account of numbers of unmatched brackets more efficiently.

Still left to investigate are the precise formal power of XG under the bounded and loose semantics, and its closure and pumping properties. Both loose and bounded XL are clearly closed under homomorphism, so they will not correspond to PTIME and EXPTIME in the way the LMG classes do. Since brackets are eliminated pairwise, some form of linearity is present that resembles the linearity in tuple grammars, which points more in the direction of pumpability. The brackets also indicate a connection to DYCK LANGUAGES (see e.g. [VSWJ87]), which may well provide an answer to this question.

Some of the intuitions given by XG's derivation graphs, and the computational consequences of using the different constructions investigated in chapter 2, will be used in chapter 10. XG will also come back in chapter 8, when scrambling in German is discussed.

Chapter 7

Trees and attributes, terms and forests

Until this point, the discussion has been in a very formal setting, and has concentrated on abstract properties of very mathematically oriented formalisms. This chapter is a pivot, in the sense that all following chapters will be of a more linguistic, descriptive nature. Because computational tractability is one of the desiderata of the Prologue and chapter 1, it first needs to be investigated if and how the formal tractability of the formalisms discussed so far survives when these formalisms are applied in real-world descriptive linguistic systems.

This chapter will be brief and casual, but knowledge of the subjects addressed here is necessary to understand the motivation of various choices made in the descriptive-linguistic part III of this thesis. I will split up the question of tractability in practice into two parts, one of which is twofold:

1. Can the recognition and parsing solutions proposed in the previous chapters be used in applications that can perform realistic tasks on medium-size modern computer systems?

2a. How can linguistic analysis be extended from a purely structural nature, to include morphological, and perhaps semi-semantic features?

2b. Do the phases 1 and 2a yield structures based on which any remaining (semantic) analysis can be carried out efficiently? In particular, if tractability implies pushing non-local ambiguity to further stages of processing, will those further stages not suffer an equivalent amount of problems?

In the first section of this chapter, I briefly introduce various methods of describing trees, sets of trees (forests), attributes, and combinations of those, used in various approaches to descriptive and computational linguistics; some examples are included of what types of attributes are frequently used. In section 7.2 I briefly discuss the predicate-descent based prototype LMG parser that I built in 1996, and discuss some of its performance properties. Section 7.3 looks at the parsing problem from the broader perspective of an elaboration on the notion of ambiguity and OFF-LINE PARSABILITY, including possible improvements using efficient context-free parsing techniques such as GLR [Rek92] [Tom86] to reduce the excessive memory consumption of a straightforwardly implemented predicate-descent algorithm. This multi-stage approach appears again in section 7.4, where methods of attribute evaluation over the ambiguous parse forests produced by a parser are discussed.


7.1 Annotated grammars

As there are different methods for the description of underlying structure, there are a number of different approaches to describing attributes over a structural backbone. Such attributes can be morphological properties of words and phrases, such as number, person and case; they can be selectional properties, such as the property of being animate required by verbs like to walk; or they can be representations of the meaning of a word or phrase. I will now briefly sketch two of the most common ways of describing linguistic structures with attributes; on a limited scale, I will start bringing in the influence of the desiderata of the Prologue, most notably computational tractability (E), full localization (F) and separation of structure and finite attributes (C).

Trees, terms and forests

Most of the work on computationally efficient treatment of attributes in the literature either assumes a realm of feature structures (discussed further below) or is based on a projective formalism, i.e., defines attributes over the trees of a context-free grammar. In this thesis, I have looked at two extensions of CFG. Extraposition grammar has a derivational structure that cannot be straightforwardly transformed to one that looks like the trees produced by a CFG under their derivational semantics; therefore I will not take XG into consideration in this chapter. Context-free grammar and the tuple-based formalisms in chapter 3, on the other hand, have similar derivation structures. These structures have so far only informally been associated with the derivational interpretations 1-8, 3-3 and 3-19, and indeed, this association is fairly trivial. The important points to stress are

1. A single proof that a grammar derives a string can be uniquely represented by a tree whose nodes bear a LABEL that identifies a grammar rule, or that is partially identified by a label just referring to a nonterminal or terminal symbol.

2. A string is thus associated with a set of trees, which is empty if the string is not recognized by the grammar, and otherwise contains one tree for each derivation of the string.

An abstract way of looking at such labelled trees is to think of them as TERMS, and of the labels as functions that construct terms from smaller terms.

7-1 definition: free terms. Let F be a set of FUNCTIONS. Then the set T(F) of FREE TERMS OVER F is a set of syntactic forms, satisfying:

1. Every element of F is in T(F);

2. If f ∈ F and t1, …, tn ∈ T(F) then f(t1, …, tn) ∈ T(F).

* The simple grammar in figure 7.1 will be used frequently in this chapter to illustrate attributed trees. The derivation tree of the man walks can be represented uniquely in a

r1  V0 → N0 VI
r2  V0 → V0 and V0
r3  VI → walks      {third person singular}
r4  VI → walk       {first or second person singular}
r5  VI → walk       {plural}
r6  N0 → you        {second person singular}
r7  N0 → the man    {third person singular}
r8  N0 → the men    {third person plural}

Figure 7.1: Labelled CFG with informal annotations

term syntax as V0(N0(the man), VI(walks)). Since there are two rules generating walk as an intransitive verb, however, the sentence the men walk has two derivations, which both have the tree V0(N0(the men), VI(walk)), but differ in the more precise term syntax with rule identifiers as labels: r1(r8, r4) and r1(r8, r5).

When a word or string occurs twice in a sentence, it is typically fruitful to distinguish between the two occurrences. Therefore, more complex representations may also include sentence indices.

   syntax with rule identifiers as labels: r1 r8 r4 and r1 r8 r5 . When a word or string occurs twice in a sentence, it is typically fruitful to distinguish between the two occurrences. Therefore, more complex representations may also include sentence indices. In chapter 4,aparser was defined to be a machine that, given a string, outputs all structural analyses of that sentence according to a given grammar. For grammar formalisms that output trees, a representation of all derivation trees is called a forest.

I will use the following definitions:

Figure 7.2: Packing and sharing.

7-2 definition: forests. A FOREST is a set of trees. A PACKED FOREST is an economical representation of such a set which makes use of PACKING NODES that can attach two or more subderivations to one parent node. A SHARED FOREST is a representation that can attach one subderivation to two or more parent nodes. This is shown schematically in figure 7.2. A representation of a forest is called MAXIMALLY PACKED or MAXIMALLY SHARED if it has no equivalent representation with a larger number of packing and sharing nodes, respectively.

Figure 7.3: An example of a packed and shared forest representation.

* Packed forests can be represented in a term syntax by adding a function pack to represent packing nodes; the two derivations of the sentence the men walk can be represented economically as

(7.1) r1(r8, pack(r4, r5))

Only one of the terms in this forest representation corresponds intuitively to a correct sentence—one of the analyses uses rule r4 to produce walk, which disagrees with the 3rd person plural of men. This will be formalized below.

A similar, but significantly more involved approach, with abstraction over variables in a sort of let construction, can be applied to model sharing in a term syntax—I will not explain it here because it will not be needed in this thesis. Some concrete systems manipulating terms, such as EPIC [WK96], have sharing built in explicitly, i.e. at a meta level.

One typically looks at forests that are maximally shared w.r.t. a representation including sentence indices; that is, identical substrings occurring at different places in the represented sentence are not to be shared. Cf. the following sentence, where the two occurrences of the same word walk will get different annotations.

(7.2) You walk and the men walk
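Reusing the Term class from the sketch above, unpacking a packed forest into its individual readings is a simple recursion in which every pack node multiplies out its alternatives; again a hedged illustration rather than the prototype's actual representation:

    from itertools import product

    def readings(t):
        """All derivation terms represented by a packed forest term;
        a 'pack' node contributes one reading per alternative."""
        if t.fun == 'pack':
            return [r for alt in t.args for r in readings(alt)]
        choices = product(*(readings(a) for a in t.args))
        return [Term(t.fun, tuple(c)) for c in choices]

    # forest (7.1) r1(r8, pack(r4, r5)) has exactly the readings d1 and d2
    forest = Term('r1', (Term('r8'), Term('pack', (Term('r4'), Term('r5')))))
    assert {str(r) for r in readings(forest)} == {str(d1), str(d2)}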

Annotated tuple grammars

Terms do not only serve well as a way of talking about derivation trees independently of the grammar formalism or parser that is used to produce them—they are also used to represent, possibly complex, attributes over the syntactic constructions of a language. I will use the notion of a term in this section to capture two different styles of annotating a grammar, feature structures and grammars over finite lattices, in a single framework. To do this smoothly, a method is required of distinguishing meaningful terms from terms that will not be used.

7-3 definition. Let S be a set of basic types or SORTS. Then the set T(S) of FUNCTIONAL TYPES OVER S is the smallest set containing

1. each sort A ∈ S; that is, S ⊆ T(S);

2. the type τ1 × ⋯ × τn → τ, where n > 0 is the ARITY of the type, and τ, τ1, …, τn ∈ T(S).

Let F be a set of functions and σ : F → 2^T(S) be a function assigning to each function in F a set of types, each of the same arity. Then the set T(F) of TYPED TERMS OVER S and the type assignment τ : T(F) → 2^T(S) are constructed inductively as follows:

1. If a function f has arity 0, then σ(f) contains only basic types. Such an f ∈ T(F) is called a CONSTANT, and τ(f) = σ(f).

2. If t1, …, tn ∈ T(F), τ1 ∈ τ(t1), …, τn ∈ τ(tn), and (τ1 × ⋯ × τn → τ0) ∈ σ(f), then the term t = f(t1, …, tn) is in T(F), and τ0 ∈ τ(t).

Let τ0 ∈ T(S) be a type; then Tτ0(F) is the set of terms t in T(F) that have the type τ0, that is, τ0 ∈ τ(t).

A set of typed terms is called a FIRST ORDER SIGNATURE if σ assigns unique types, that is, if σ(f) is a singleton for each f ∈ F.

* Both ways of constructing terms corresponding to single derivation trees of a CFG or LMG discussed above form a first order signature if the nonterminals of the grammar are taken as the set of basic types. If sets of terms are represented using the functions pack and share, these functions are polymorphic, i.e. they must be assigned more than one type; pack, for example, has the type A × A → A for every basic type, or nonterminal, A.

7-4 definition. An ANNOTATED LMG is a tuple (G, S, τ, F, C) whose first component G = (N, T, V, S, P) is an LMG, S is a set of basic sorts, τ : N → S assigns a basic type to each nonterminal of the grammar, and the operator C assigns a CONSTRAINT to each of the clauses, or productions, of the grammar: if P contains a clause R of the form

A(⋯) :- B1(⋯), …, Bm(⋯)

then

C(R) ⊆ Tτ(A)(F) × Tτ(B1)(F) × ⋯ × Tτ(Bm)(F)

7-5 definition. An annotated LMG (G, S, τ, F, C) is said to RECOGNIZE a string w if w is in the language recognized by G, and there is at least one derivation d of w and a way of associating each node in d with a term t such that for each node, the tuple consisting of the term t and the terms t1, …, tm associated to the daughter nodes is in the constraint relation C(R), where R is the rule applied at the node.

Grammars over a finite lattice

I will assume for now, and will substantiate in part III, that on the basis of a method like the LMG system it will indeed be possible to provide sufficiently general linguistic descriptions with a bounded dependency domain (definition 1-12 on page 23). While some attributes of syntactic structures, such as anaphoric linking and semantic interpretations, seem to need a recursive, hence infinite domain, it seems possible—without claiming that this would be trivial—for a first phase of syntactic processing¹ to maintain, at each node in the derivation, only a bounded set of attributes ranging over boundedly many values. Such grammars are captured in the following definition of a tuple grammar over a finite lattice (TGFL), which is the LMG equivalent of the ATTRIBUTE GRAMMARS OVER FINITE LATTICES (AGFL) developed at the University of Nijmegen as part of the GRAMMAR WORKBENCH [Kos91] [NK92].² Each node in a TGFL derivation is associated with a finite number of AFFIXES ranging over finitely many AFFIX VALUES; this is formalized in the following definition.

¹Glossing over languages with a complex morphological structure, the analysis of which will be left to a syntactic pre-processing stage—I assume a finite dictionary given at the beginning of a sentence processing stage, perhaps generated by such a pre-processing phase for that sentence only.

²A study on the complexity of a formalism similar to AGFL, called AGREEMENT GRAMMAR, is found in [BBR87]—but it is subject to the same objections as I raised in chapter 3. It is claimed that the universal recognition problem for agreement grammars is NP-complete, hence the best possible algorithm is probably exponential—in the size of the attribute domains. As in the other constructions in the book, the grammar and input in the NP-completeness proof are not taken to be independent.

7-6 definition. A TUPLE GRAMMAR OVER A FINITE LATTICE is an annotated LMG (G, S, τ, F, C) where G = (N, T, V, S, P) and

D = {Dperson, Dnumber}
Dperson = {1, 2, 3}
Dnumber = {sg, pl}

τ(V0) = EMPTY
fV0 = e : EMPTY
τ(VI) = τ(N0) = VAGR
fVI = fN0 = vagr : Dperson × Dnumber → VAGR

r1  V0 → N0 : vagr(p, n)  VI : vagr(p, n)
r2  V0 → V0 and V0
r3  VI : vagr(3, sg) → walks
r4  VI : vagr(1|2, sg) → walk
r5  VI : vagr(1|2|3, pl) → walk
r6  N0 : vagr(2, sg) → you
r7  N0 : vagr(3, sg) → the man
r8  N0 : vagr(3, pl) → the men

Figure 7.4: AGFL for agreement in simple English sentences

1. S contains, in addition to a type τ(A) for each nonterminal A ∈ N (these need not be different for two different nonterminals), a finite number of additional types called AFFIX DOMAINS. Let D be the set of affix domains.

2. For each nonterminal A ∈ N, the set of functions F contains a TUPLE CONSTRUCTOR fA of type

fA : D1 × ⋯ × Dn → τ(A)

where D1, …, Dn ∈ D are affix domains and n ≥ 0.

3. There are finitely many other functions in F; these are called AFFIX VALUES, are constant and have a type d ∈ D.

A constraint C(R) is allowed, for each subterm of the terms associated with the daughter nodes and the parent node, to

1. equate it to another subterm; and/or

2. restrict its value to a set of affix values.

*

Figure 7.4 shows an example of a TGFL: the formal equivalent of the annotated CFG on page 131. Since its backbone grammar is a CFG, it is actually a notational variant of an AGFL. Grammars over finite lattices can be straightforwardly extended to allow more complex terms, while keeping the set of types free of cycles, so that a node in a derivation can be associated with only a finite number of possible terms. An example where this is useful is subcategorization: every verb has the morphological properties verb form, number and person, and in the lexical information associated with a verb like to want such information appears twice: once for the verb itself, and once for its complement, which necessarily appears with infinitive morphology. This is a bounded-domain version of the well-known feature structures, which will be discussed after the following remark.

7-7 remark: coding. A key property of finitely annotated grammars, in a formal setting, is that they can be expanded to an equivalent grammar without attributes. This is done by first encoding the nonterminals in the associated terms, resulting in a grammar which has |N| = 1. Then, a new grammar is constructed whose set of nonterminals is the set of terms T(F). The productions of this grammar are those that satisfy the constraints of the original, annotated grammar. This is called GRAMMAR CODING.

From a more practical point of view, it has been argued that such a construction is not very valuable, because the resulting grammar has a size proportional at least to the third power of the size of the attribute domain T(F), which is prohibitive in practice. Moreover, extracting an annotated parse forest according to the original grammar is nontrivial. In other words, for the purpose of this chapter, formal complexity reasoning does not provide enough detail.
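As an illustration of how finite affix domains can be checked without any grammar coding, the following Python sketch evaluates the vagr constraints of figure 7.4 over the derivation terms built earlier in this section; the encoding of the rule annotations is hypothetical, and rule r2 (coordination) is glossed over:

    # affix assignments licensed by the annotated rules of figure 7.4
    VAGR = {'r3': {(3, 'sg')},
            'r4': {(1, 'sg'), (2, 'sg')},
            'r5': {(1, 'pl'), (2, 'pl'), (3, 'pl')},
            'r6': {(2, 'sg')},
            'r7': {(3, 'sg')},
            'r8': {(3, 'pl')}}

    def vagr(t):
        """Set of possible (person, number) values of a derivation term."""
        if t.fun == 'r1':                    # V0 -> N0 VI demands agreement
            return vagr(t.args[0]) & vagr(t.args[1])
        return VAGR.get(t.fun, set())

    def consistent(t):
        """A term is consistent if every agreement constraint is satisfiable."""
        ok = all(consistent(a) for a in t.args)
        if t.fun == 'r1':
            ok = ok and bool(vagr(t.args[0]) & vagr(t.args[1]))
        return ok

Applied to the readings of the packed forest (7.1), this filter rejects r1(r8, r4), whose daughters have disjoint vagr sets, and accepts r1(r8, r5), formalizing the earlier observation about the men walk.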

Feature structures

A popular method of associating linguistic structure with attributes is the use of FEATURE STRUCTURES and UNIFICATION. Feature formalisms are so powerful that it is theoretically possible to simulate any grammar by a feature grammar over a backbone that has only one nonterminal. This is essentially what is done in HPSG. I will instead concentrate on feature formalisms that are constructed on top of a tuple grammar, typically a CFG. This grammar is called the STRUCTURAL BACKBONE of the feature grammar. Feature structures are obtained by relaxing the definitions for TGFL to allow more arbitrary types.

7-8 definition. A FEATURE GRAMMAR is as a TGFL, but with the following modifications:

1. The elements of D are called FEATURE DOMAINS.

2. For each basic type τ0 ∈ S, the set of functions F contains finitely many constants c : τ0, and at most one non-constant function f of type

f : D1 × ⋯ × Dn → τ0

where D1, …, Dn are feature domains and n ≥ 1.

* The essential difference between this more general definition and a TGFL is that it allows structures of arbitrary depth. A grammatical phenomenon that can be argued to need arbitrarily deep structures is selection or subcategorization. Although the number of complements selected by known verbs is limited to 4, including the subject, it can be argued that a language that has verbs selecting for more complements must not a priori be excluded.

Feature structures are traditionally displayed as matrices in which the 'tuple constructors' are represented as square brackets, and the arguments of these constructors are labelled—these labels are indices to the argument positions of the constructor functions. An example of a verb with a long subcategorization list is to trade something to someone for something.

(7.3) trades =
    [ cat: verb
      head: [ form: finite
              tense: present
              agr: [1] [ person: 3
                         number: sg ] ]
      subject: [ head: [ cat: np
                         animate: true
                         agr: [1]
                         case: nom ] ]
      subcat: [ first: [ head: [ cat: np
                                 animate: false
                                 case: acc ] ]
                rest: [ first: [ cat: pp
                                 prep: to ]
                        rest: [ first: [ cat: pp
                                         prep: for ]
                                rest: nil ] ] ] ]

Although I will argue below that in the context of this thesis it is a rather obvious choice to prefer finite attribute grammars over feature structures, feature structures are discussed briefly here for two important reasons: (i) the typical way of describing long-distance dependencies (w.r.t. the surface order) in feature-driven formalisms needs to be compared with the methods proposed in this thesis, and (ii) feature grammars offer methods that enable a form of LEXICALIZATION that cannot be applied trivially without the full capacity of the structure sharing and unification present in feature-based systems.

7-9 example: slash features. Whereas the formalisms in part I of this thesis aim at modelling long-distance dependencies by relocating substrings of a generated sentence, feature grammars make use of their recursive nature to keep track of entire structural representations. Whenever a phrase appears in a 'dislocated' position, its entire feature structure is passed through the structures of its context through feature sharing, until it meets its deep-structural position.

As with subcategorization, it is furthermore often assumed that any number of phrases can be simultaneously extraposed from a given constituent.³ So every constituent bears a SLASH feature, which is typically a list of complete feature structures associated with extraposed subconstituents. Such a list makes the feature domain of a grammar strictly unbounded, and this has its consequences for computational tractability. For simplicity, however, let's assume that each constituent has a SLASH feature that stores the feature structure corresponding to one single extraposed constituent. Consider the topicalized sentence

(7.4) Whom_i did John see t_i?

The accusative form whom is used to stress that there are morphological properties which must be carried over the long-distance link between the topic and the verb see that selects for it. Since these may in principle be any property, a feature grammar stores all information about the phrase whom in the SLASH feature.

The rules responsible for head-complement selection can be reduced to a low number, since the lexical entries contain so much information. Such rules will ensure that only one of the daughter phrases has a nonempty SLASH feature, and if one daughter phrase has such a feature, will structure-share it with the SLASH feature of the parent. An example is the rule that generates the standard left-right head-complement structure of English:

(7.5) XP[ head: [1], subcat: [2], slash: [3] ]
        → X[ head: [1], subcat: [ first: [4], rest: [2] ], slash: [3] ]   NP [4][ slash: nil ]

And an alternative version of the same rule 'consumes' the SLASH feature and produces an empty complement, i.e., a trace:

(7.6) XP[ head: [1], subcat: [2], slash: [3] ]
        → X[ head: [1], subcat: [ first: [3], rest: [2] ], slash: nil ]

Finally, there is a rule which produces an NP filler at S level:

(7.7) S[ slash: nil ] → NP [1][ ]   S[ slash: [1] ]

³It is dubious whether this is really the case—see section 9.4.

7-10 example: lexical sharing. In feature grammars, structure sharing, i.e. identification of subterms, in lexical items can link the attribute values of different grammatical entities. An example is the following lexical entry for the verb to be:

An example is the following lexical entry of the verb to be:

cat: verb

form: finite

tense: present

person: 3

agr: 1

number: sg

head:

cat: np

gender: 2

(7.8) is subject:

head: agr: 1

case: nom

cat: np

gender: 2

first:

subcat: head: agr: 1

case: acc rest: nil

In French, a verb like is requires the subject and the complement to show agreement in both number and gender. Even though no GENDER value is explicitly specified in the lexical entry for is, the co-indexing of the GENDER and AGR features on the subject and first complement will enforce this agreement.
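A minimal sketch of the structure sharing at work here: representing feature structures as Python dictionaries, and co-indexed tags as one shared (aliased) object, makes a destructive unification a few lines long. This ignores typing, the occurs check and non-destructiveness, so it is only meant to show how fixing a value through one path becomes visible through the other:

    def unify(f, g):
        """Destructive unification of simple feature structures: dicts map
        feature names to atoms or nested dicts; a co-indexing tag is modelled
        by reusing one dict object at both paths."""
        if isinstance(f, dict) and isinstance(g, dict):
            for feat, val in g.items():
                f[feat] = unify(f[feat], val) if feat in f else val
            return f
        if f == g:
            return f
        raise ValueError(f'clash: {f} / {g}')

    # co-indexing as in (7.8): one shared 'agr' object (tag [1])
    agr = {}
    entry = {'subject': {'head': {'agr': agr}},
             'subcat':  {'first': {'head': {'agr': agr}}}}
    # unifying a plural subject into the entry makes the complement plural too
    unify(entry['subject']['head']['agr'], {'number': 'pl'})
    assert entry['subcat']['first']['head']['agr']['number'] == 'pl'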

* Practical introductions to feature grammars are [Shi86] and [Sel85]; an example of a full-fledged theory is [PS94], and a formalization is found in [Rou97].

Discussion

Since grammars over finite lattices can be extended straightforwardly to grammars over more complex, but still bounded, term structures, the essential difference between such finite grammars and feature formalisms is the interaction of structure sharing with the ability to attach to each node in a derivation an unboundedly deep, recursive structure. One of the conjectures of the Prologue is that when the structural backbone has a sufficient capacity, such an unbounded attribute domain is not necessary. This claim, and the conjecture that simple LMG is sufficient in this sense, are worked out in part III of this thesis.

Feature grammars describe, in their general form, any recursively enumerable language, even if all but one nonterminal of the underlying grammar generate the empty string. A special class of feature grammars, called OFF-LINE PARSABLE, is defined by PEREIRA and WARREN in [PW83]: these are feature grammars on a context-free backbone whose underlying CFG analyses are free of cycles, or in other words, are not infinitely ambiguous. This eliminates arbitrary calculations which produce no overt syntax, and recognition for these types of grammars indeed turns out to be decidable. However, they still require exponential time for parsing and recognition. Using a more complex backbone, such as an MCFG or simple LMG, yields a new, larger class of off-line parsable grammars and can hence extend the scope of decidable feature grammars. The step to eliminating unboundedly deep feature structures completely is then only an inch away.

It should be noted that AGFL as they occur in the literature, being based on a CFG with only a finite domain of space for additional administration, are intrinsically incapable of describing heavily non-projective phenomena such as cross-serial dependencies in Dutch. In some cases, this is arguably not a problem—see the discussion in section 9.4. However, unless the structural backbone underlying a finite-attribute grammar is increased to the strength of at least linear MCFG, the scope of such grammars is highly limited—contrary to what some papers on AGFL, such as [Kos91], claim. Nonetheless, the work done in evaluating AGFL attributes seems to be very valuable in the context of the highly localized formalisms studied in this thesis.

7.2 A prototype of an S-LMG parser

In the remaining sections of this chapter, I will discuss concrete approaches to parsing, translation and evaluation of attributes. This section presents a prototype implementation of an LMG parser attached to the term rewriting system EPIC [WK96]. The next sections discuss possible improvements that may make such a system applicable to more realistic, large-scale applications.

Short description

In 1996–97, I implemented the predicate descent algorithm for LMG, as discussed in chapters 4 and 5, in a prototype parser package written in C and Yacc/Lex, which has the following features:

- The parser package takes a grammar file and a sentence as its input, i.e. it is not a parser generator but rather implements a universal parser. The grammar file is in LMG predicate notation or CFG-style productions, and is allowed to contain attribute and translation specifications in the format of equations in the equational programming language EPIC.

- The grammar is a simple LMG, with the extension that multiple occurrence of a variable on the LHS is allowed (i.e., a PMCFG with reduplication can be specified in its canonical form).

- There is no separate lexical phase—the terminal symbols of the grammar are the 256 ASCII codes. White space occurrence can be controlled on a rule-by-rule basis.

- Output is by default a term in EPIC syntax representing all analyses of the sentence; this is an elementary term format where functions begin in a lower case letter and variables (which do not occur in concrete terms) in capital letters, as in the Prolog conventions.

- Options allow:

  - Reducing a production-style grammar (covering standard CFG notation) to LMG predicate notation.

  - Output in tree form rather than as an EPIC term.

  - Compiling the grammar into a signature for the generated EPIC terms plus equations (these are copied literally).

  - Unparsing or pretty-printing: input is a term representing a single tree and output is the full sentence according to the grammar.

Below, I will briefly discuss some of these features along the lines of a simple example: a toy translator for simple Dutch sentences into Spanish.

Syntax of LMG grammars and rewriting-based translation

Manipulation of terms, in the form of TERM REWRITING, is a technique that is often used in software engineering circles to model the semantics of programming language constructions. Whereas translation is a far-away target in Linguistics, it is everyday practice in Computer Science, and the formal study of such translation mechanisms or compilers often concentrates on source and target languages as sets of terms, and on the compiler as a program that transforms terms of one language into terms of another language using straightforwardly definable rewriting operations on these terms. In intermediate stages, generic terms are useful to define auxiliary values that are not part of either the source or the target language—these correspond to the attributes one can attach to the structures produced by a context-free grammar or an LMG. Such use of term syntax is possible in the LMG parser prototype but not discussed here.

The grammar files fed to the LMG prototype consist of four SECTIONS: start symbol, types, syntax and equations—similar divisions can be found in equational specification systems such as ASF+SDF [Kli93] and EPIC [WK96].

start symbol defines the start nonterminal of the grammar.

types defines functions in the term signature directly; most such functions are derived automatically from the grammar rules, but some extra-syntactic functions (such as pack and none, which are built-in, but must be included when referred to in the equations), translation predicates (trans in the example discussed here) or attribute functions (no example given) may need to be defined. The function declarations take the form

function-id: -> type-id

for constants, or

function-id: type-id # ... # type-id -> type-id

for proper functions. The type identifiers are ignored, but can be specified to indicate the intended use of the functions.

syntax consists of a sequence of grammar rules, each bearing a label (if a label is omitted, one is generated from the predicate on the LHS). This label is the function symbol used when the parser generates a forest in term representation (see pages 130ff). The underlying term signature is untyped, that is, only the arity of the function symbols plays a role. This is a feature of the EPIC system, and comes in handy when defining the pack operator used to generate ambiguous nodes. The grammar rules can be either in the familiar LMG notation, e.g.,

sn_sub: S("dat" s o v) :- NP(s), VP(o, v).

or equivalently, in a production format that eliminates the one component of the LMG predicates by using straightforward concatenation as in a CFG (an example of such a production-style grammar is given in the Prologue, grammar (2) on page 8):

sn_sub: S -> NP o VP(o).

The prototype translates rules of the second type to the first. Both these rules generate the following type declaration:

sn_sub: NP # VP -> S;

Optionally, each grammar rule may be immediately followed by one or more arbitrary equations.

equations contains a sequence of REWRITE RULES over the signature defined in the types and syntax sections. These are ignored by the parser, but an option of the prototype is to output an EPIC specification containing the signature and these equations, which can then be executed on the output of the parsing stage. In the example discussed below, this feature is used to define a simple translator. Another possible use is to define simple attributes such as singular/plural and a function disambiguate which removes incorrect trees in the forest output by the parser. The equations have the form

term = term

where a term is built up from the function symbols and from variables, which start with a capital letter—a good convention is to use the nonterminal symbols for variables if they correspond to a phrase of that type. The rewrite rules are interpreted by EPIC as licensing a term matching the pattern on the left hand side of the equation to be rewritten to the term on the right hand side, appropriately substituting terms for the variables used in the LHS.⁴

In the space available in this chapter, I cannot go much deeper into terms and term rewriting, and do justice to the more general algebraic or equational view underlying term rewriting as a software engineering tool. A general textbook on algebraic specification is [BHK89]. At the level of terms and term rewriting, the explanation here overlooks the possibility of function typing and the higher order or multi-level nature of operators such as pack and share—these features are discussed at length in VISSER's Ph.D. thesis [Vis97], written in the context of the same project as this book.

Figures 7.5 and 7.6 show an example of a grammar specification that can be read by the LMG prototype parser. It defines simple syntactic constructions in Dutch

⁴EPIC does this in a rightmost-innermost, most specific equation first, strategy.

(nonterminals ending in n) and Spanish (nonterminals ending in e). Most of the rules in the grammar are followed by one or more equations. They construct a function trans that defines a strictly compositional translation of the terms corresponding to Dutch phrases to terms corresponding to Spanish phrases. For example, on parsing the sentence

(7.9) Jan heeft een auto

the parser outputs the term

sn_decl(jan, vpn_tr(heeft, npn(een, auto))).

If the function trans is applied to this sentence and fed to EPIC, the following rewrite sequence takes place:

trans(sn_decl(jan, vpn_tr(heeft, npn(een, auto))))

=> se_decl(trans(jan), trans(vpn_tr(heeft, npn(een, auto))))

=> se_decl(trans(jan), vpe_tr(trans(heeft), trans(npn(een, auto))))

=> se_decl(trans(jan), vpe_tr(trans(heeft), npe(trans(een), trans(auto))))

=> ...

=> se_decl(trans(jan), vpe_tr(tiene, npe(un, coche)))

=> se_decl(jan, vpe_tr(tiene, npe(un, coche)))

The last rewriting step follows the rule trans(X) = X in the equations section, which is triggered because no translation is defined for jan. There is one exception to this don't-translate default rule—the word geen is not given a translation rule, but its translation is defined in context; the second rule in the rules section in figure 7.6, covering this case, is called a NON-COMPOSITIONAL translation rule. It matches not on a single clause, but on a combination of syntactic constructions: a transitive verb applied to a noun phrase whose determiner is geen results in a negated verb phrase. Hence the translation of sentence (7.10) is (7.11).

(7.10) Jan heeft geen auto

(7.11) Jan no tiene un coche

Such a translation rule, which depends on a construction stretching over more than a single parent–daughters group, corresponds to the general case of pattern matching, and is called a polynomial by Huysen [Hui98].

The first equation in the rules section deals with the case that the input sentence is ambiguous, in a rather naive fashion, by lifting the pack operator over trans. The interaction of this rule with the non-compositional rule following it is problematic: if the second argument is headed by pack, the non-compositional rule is not triggered, and the sentence is left untranslated. In this very simple case, this last equation for trans could be modified to also distribute the pack function over trans, but in a larger specification, this needs to happen for each argument and each rule, which results in a system that is (i) unmanageable for maintenance by hand and (ii), if generated automatically, would contain a number of rules exponential in the number of nodes in the translated constructions. A solution to such problems is discussed in section 7.4.

The following steps are needed to use the lmg and EPIC packages to translate with this specification:

- Call the lmg package to compile the specification trans.lmg into an EPIC source trans.ep containing the signature corresponding to the grammar productions, and the rewriting rules copied from the specification.

- Compile the resulting EPIC file to an executable file trans.

- Call lmg on the Dutch sentence.

- Feed the resulting term to the EPIC executable trans.

- Call lmg in unparse mode to convert the resulting term back to the corresponding Spanish sentence.

This procedure is automated using UNIX pipes and make files.
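The effect of the trans equations can be mimicked outside EPIC; the following Python sketch (reusing the Term class from the sketch in section 7.1; the rule table is mine and hypothetical) implements the compositional default, the non-compositional geen rule, and the pack lift. Note that it inherits exactly the limitation discussed above: a pack node appearing below the non-compositional pattern makes that pattern fail to match.

    def trans(t):
        """Innermost application of the toy translation rules of
        figures 7.5 and 7.6, as a plain-Python stand-in for EPIC."""
        if t.fun == 'pack':                  # lift pack over trans
            return Term('pack', tuple(trans(a) for a in t.args))
        # non-compositional rule: V + 'geen'-NP => negated Spanish VP
        if t.fun == 'vpn_tr' and t.args[1].fun == 'npn' \
                and t.args[1].args[0].fun == 'geen':
            inner = Term('vpe_tr', (trans(t.args[0]),
                                    Term('npe', (Term('un'),
                                                 trans(t.args[1].args[1])))))
            return Term('vpe_neg', (inner,))
        table = {'sn_decl': 'se_decl', 'vpn_tr': 'vpe_tr', 'npn': 'npe',
                 'heeft': 'tiene', 'een': 'un', 'auto': 'coche'}
        args = tuple(trans(a) for a in t.args)
        return Term(table.get(t.fun, t.fun), args)   # default: don't translate

    s = Term('sn_decl', (Term('jan'),
            Term('vpn_tr', (Term('heeft'),
                            Term('npn', (Term('geen'), Term('auto')))))))
    print(trans(s))   # se_decl(jan, vpe_neg(vpe_tr(tiene, npe(un, coche))))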

start symbol
    Sn

types
    pack: _ # _ -> _;
    none: -> _;
    trans: _ -> _;

syntax
    sn_sub:  Sn("dat" s o v) :- NPn(s), VPn(o, v).
    sn_decl: Sn(s v o) :- NPn(s), VPn(o, v).
    sn_ques: Sn(v s o) :- NPn(s), VPn(o, v).
    se_sub:  Se("que" s v) :- NPe(s), VPe(v).
    se_decl: Se(s v) :- NPe(s), VPe(v);
        trans(sn_sub(NP, VP)) = se_sub(trans(NP), trans(VP));
        trans(sn_decl(NP, VP)) = se_decl(trans(NP), trans(VP));
        trans(sn_ques(NP, VP)) = se_decl(trans(NP), trans(VP)).

    vpn_tr: VPn(o, v) :- VTn(v), NPn(o).
    vpe_tr: VPe(v o) :- VTe(v), NPe(o);
        trans(vpn_tr(VT, NP)) = vpe_tr(trans(VT), trans(NP)).

    %% negated VP in Spanish (target language) only %%
    vpe_neg: VPe("no" v) :- VPe(v).

    npn: NPn(d n) :- DPn(d), CNn(n).
    npe: NPe(d n) :- DPe(d), CNe(n);
        trans(npn(DP, CN)) = npe(trans(DP), trans(CN)).

Figure 7.5: Toy Dutch to Spanish translator specification (part 1)

    DPn("de"). DPn("een").
    DPe("el"). DPe("un");
        trans(de) = el; trans(een) = un.

    %% No immediate translation for 'geen' and names %%
    DPn("geen").

    NPn("Jan"). NPn("Marie"). NPn("Fred").

    CNn("auto"). CNe("coche");
        trans(auto) = coche.

    VTn("heeft"). VTe("tiene");
        trans(heeft) = tiene.

rules
    %% Lift 'pack' over 'trans' %%
    trans(pack(X, Y)) = pack(trans(X), trans(Y));

    %% Non-compositional pattern rule for negated NPs %%
    trans(vpn_tr(VT, npn(geen, CN))) =
        vpe_neg(vpe_tr(trans(VT), npe(un, trans(CN))));

    %% Default rule is: don't translate %%
    trans(X) = X;

Figure 7.6: Toy Dutch to Spanish translator specification (part 2)

7.3 Off-line parsability and multi-stage analysis

It was already mentioned in chapter 5 (page 114) that in practice, storing the items in arrays indexed by the integer sentence indices would too quickly consume amounts of memory that are beyond the capacity of any reasonably sized modern computer system. Therefore, a realistic parser must

- store only those items that it encounters in the search for a derivation;

- include machinery that further narrows down such 'predicted items'.

In this section, I will discuss the limited form of prediction used in the current LMG prototype, and an improvement in terms of two-phase parsing with a context-free backbone derived from the LMG.

Details of the current implementation

A simple calculation points out that simply encoding the memo tables into arrays of boolean values is a hopeless enterprise—assume an input of length 1000, and one already needs a megabit for each nonterminal of a context-free grammar; this gets dramatically worse for tuple grammars. Therefore the prototype stores only those items it encounters in a derivation. Storing only a limited set of items implies that looking up an item in the memo storage is no longer an atomic action. However, storing the items in binary tree structures, as the LMG prototype does, reduces the look-up time to O(p log n), where p is the number of indices—twice the number of arguments in a simple LMG.

  time to O p log n where p is the number of indices—twice the number of arguments in a simple LMG. I tested a straightforward implementation of a predicate descent algorithm, storing the memo tables as binary trees, on sample grammars of Dutch and the toy program- ming language pico, which is a highly simplified version of Pascal. The performance on Dutch was within reasonable bounds. Although the times for parsing increasingly

3

  large pico programs confirmed the expected O n log n complexity, the constants involved turned out to be still so large that a 25 line program was too large to be parsed within 32 megabytes of memo storage. I then made the following improvements:

- Not only are the complete items, corresponding to nonterminals plus arguments, stored; each partial assignment to the variables occurring in a rule, where the variables are traversed in a fixed order, is also stored, up to the point where such a partial assignment is known not to produce any complete variable instantiations for which a succeeding parse exists. This can be thought of as a form of on-the-fly bilinearization of the grammar. This optimization requires extra storage, but decreases the number of items for which a proof is attempted—on average, the space consumption turns out not to increase. The time complexity is improved by an order of magnitude: for example, only after this optimization is an O(n³) complexity obtained for all context-free grammars.

- Every component of each nonterminal, and every variable in each of the LMG rules, is associated with three tables ranging over terminal symbols—left-corner, right-corner and occur—and a flag lambda. These are instantiated when scanning the grammar (i.e. independently of the input) and contain boolean values indicating whether a given terminal symbol can occur at the left end, at the right end, or at arbitrary places in a component of a nonterminal, and whether this string can be empty. These tables are used in the parser to restrict the number of items evaluated and memoed to a limited subset: the idea is that a parse of a VP is not attempted if it starts with the word Jan, because it can be derived from the grammar that this situation does not occur. A problem of this approach is that looking at the leftmost and rightmost terminal symbols generated in a component of an item is not always effective. Moreover, the terminal symbols in the LMG prototype are the 256 ASCII values—the prototype is scannerless. With a sufficiently large dictionary, think of

  NP → almond | bee | cheddar | dog | elk | ⋯

  this means that terminal corner prediction will not cut down the checked item sets at all, unless longer sequences are examined; but this quickly results in tables that are too big to store.⁵
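The static tables can be computed by standard fixpoint iterations over the grammar. As an indication, here is a hedged Python sketch computing the lambda flag (nullability) and the left-corner sets for a CFG; right-corner and occur tables are computed analogously, and the names are mine:

    def analyze(cfg):
        """Compute per nonterminal the lambda flag (derives the empty string)
        and the left-corner terminal set, by fixpoint iteration.
        cfg: dict nonterminal -> list of right-hand sides (lists of symbols);
        symbols that are not keys of cfg are terminals."""
        lam = {A: False for A in cfg}
        changed = True
        while changed:                      # nullability fixpoint
            changed = False
            for A, rhss in cfg.items():
                if not lam[A] and any(all(lam.get(X, False) for X in rhs)
                                      for rhs in rhss):
                    lam[A] = changed = True
        first = {A: set() for A in cfg}
        changed = True
        while changed:                      # left-corner fixpoint
            changed = False
            for A, rhss in cfg.items():
                for rhs in rhss:
                    for X in rhs:
                        add = first[X] if X in cfg else {X}
                        if not add <= first[A]:
                            first[A] |= add
                            changed = True
                        if not lam.get(X, False):
                            break           # X cannot be skipped
        return lam, first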

The resulting system parsed a pico program of 84 lines in 6 seconds, but took approximately 15MB of memory. With each doubling of the input program size, the memory consumption grew five-fold and the time consumption ten-fold. Although this does correspond to the theoretical O(n³) complexity, in practice much better algorithms exist. A basic Dutch crossed-dependency grammar however, a simple LMG with 4-ary nonterminals, exceeded all expectations and was processed extremely efficiently.

Multi-stage parsing

Instead of the static terminal corner predictions described above, whose capacity for statically ruling out items is too limited, this section and the next will look at improvements of the performance of parsing (below) and attribute evaluation (section 7.4) using multiple stages, where the result of each stage is used in the next stage to limit the number of investigated possibilities. This does not influence the theoretical, or worst-case, complexity of these problems, but it has in practice often proven to be an effective method, and is altogether not psychologically implausible—it is a form of breaking a hardly manageable task into a number of chunks that are each manageable in practice.

There are various solutions to CFG parsing that have tackled the space and time consumption problems that occur in the LMG prototype. One of the most successful of these is LALR parsing, which is popular in Computer Science, but is limited to a form of context-free grammars that is deterministically parsable: one can proceed through the sentence from left to right, finding precisely one parse if and only if the input string is accepted by the grammar. A solution to this limitation is GLR parsing [Tom86] [Tom91] [Rek92], which is still highly space and time efficient, but can handle all context-free grammars, and outputs a packed and shared forest. GLR parsers perform essentially better than a predicate descent parser as outlined in the previous section.

I will now discuss how the time and space used in predicate descent parsing of simple LMG can be essentially improved by adding a pre-processing stage of parsing with a context-free grammar derived from the LMG. The idea is to look at each of the components of the tuples derived by a nonterminal independently, so as to exclude very improbable analyses—the occur checks in the LMG prototype already illustrated such an independent analysis in terms of terminal corner and occur tables, but an analysis based on a derived CFG is more sophisticated, and will cut down the number of items to be stored more substantially. This is a simple construction, and two examples are sufficient to show how it will work. Figure 7.7 on page 151 shows the context-free backbone for a grammar analogous to the grammars of chapter 3.

In general, simple LMG and linear MCFG are not off-line parsable when split up into a context-free phase and a filter; that is, the analyses w.r.t. a context-free backbone contain cycles. This is already manifested in very simple practical grammars. While the grammar in figure 7.7 is free of cycles, the more standard NP/VP grammar in figure 7.8 has a CF backbone that does generate cycles (VPo → VPd VPo → VPo).

At first sight it is a pure coincidence that the binary branching grammars in this thesis, with verb clauses rather than traditional VPs, are free from cycles in the context-free backbone—they were certainly not designed to have this property. On the other hand, they simply capture more linguistic knowledge and as such happen to provide the simple context-free pre-processing stage with more information—in the NP/VP grammar, a VP has a 'direct object' which is empty in the case of an intransitive VP. This is an inelegant feature that the grammar writer is punished for by cyclicity in the context-free backbone. When the grammar is extended with rules for modal verbs and (partial) extraposition verbs, further ambiguity is introduced—albeit finite and acyclic. Whichever way the facts are turned, cyclicity in the context-free backbone is likely to appear in more sophisticated grammars that allow more arbitrary components to be empty.⁶

Such non-off-line parsability is often considered a problem in implementations of feature grammars—indeed, the corresponding SLASH-feature based models of Dutch crossed dependencies are also not off-line parsable. However, if a parsing stage is strictly split into two phases, whose interface representation of forests has a construction to represent cycles, this need not at all present a real problem. Such a division is often not present in the Prolog-based DCG implementations in which off-line parsability is a strict requirement.

⁵It is tempting to think that this is a problem with scannerless parsing, but the problem is bound to occur in some other form once grammars get sufficiently sophisticated.

⁶An example is an analysis in which lexical entries for verbs consist of two components, one of which contains prefixes that can be separated from the verb, as appears frequently in German and Dutch—see the footnote on page 193.

LMG:

1  Csub("dat" s o h v) :- V0(s, o, h, v).
2  Cdecl-wh(s h o v) :- V0(s, o, h, v).
3  Cques(h s o v) :- V0(s, o, h, v).
4  V0(s, o, h, v) :- N0(s), VI(o, h, v).
5  VI(o, v, ε) :- VT(v), N0(o).
6  VI(s o, r, h v) :- VR(r), V0(s, o, h, v).
8  VI(ε, zwemmen, ε).

Backbone CFG:

1  Csub → dat V0s V0o V0h V0v
2  Cdecl → V0s V0h V0o V0v
3  Cques → V0h V0s V0o V0v
4  V0s → N0
4  V0o → VIo
4  V0h → VIh
4  V0v → VIv
5  VIo → N0
5  VIh → VT
5  VIv → ε
6  VIo → V0s V0o
6  VIh → VR
6  VIv → V0h V0v
8  VIo → ε
8  VIh → zwemmen
8  VIv → ε

Figure 7.7: Verb clause analysis and its acyclic context-free backbone.

LMG:

sub

       1 C dat sdohv :- NP s VP d o h v

decl-wh

       2 C shdov :- NP s VP d o h v

ques

       3 C hsdov :- NP s VP d o h v

I

     4 VP h :- V h

T

       5 VP d h :- V h NP d

R

         6 VP n do r hv :- V r NP n VP d o h v

Backbone CFG:

sub

 1 C dat NP VPd VPo VPh VPv

decl

 2 C NP VPh VPd VPo VPv

ques



3 C VPh NP VPd VPo VPv



4 VPd

 4  VPo

I

 

4 VPh V



4 VPv



5 VPd NP

 5  VPo

T



5 VPh V



5 VPv



6 VPd NP

 6 VPo VPd VPo

R



6 VPh V

 6  VPv VPh VPv

Figure 7.8: Traditional NP-VP analysis and its cyclic context-free backbone. 7.3. Off-line parsability and multi-stage analysis 153

ity is a strict requirement. This is partly due to the fact that in a Prolog setting, cyclicity

(and even left-recursion A Aw) often presents difficulties. The memoing algorithms proposed here, and also GLR parsing for context-free grammars, do not suffer from similar problems—this is partly because their formulation arose in the context of im- plementations in imperative programming languages whose pointer paradigm enables straightforward circumvention of problems with cyclicity without penalties in terms of algorithmical complexity.

Using the CF backbone for prediction

The problems, mentioned earlier, that arise when the ambiguous output of a parser is used for further processing, play an equally awkward roleˆ when a direct attempt is made at a form of merging the derivations for the different components of the same nonterminal in a context-free parse forest resulting of the backbone parsing phase into derivations according to the original LMG. The presence of cycles in the parse forests makes such an algorithm even harder to construct. Obviously, the decision to multiply out a parse forest into a list of parses, which is often taken at this point, is not sensible at such an early stage. However, in this particular case, there is a much simpler solution, which merely uses the result of a context-free parsing stage for prediction, much like the corner and occur tables are used in the prototype LMG parser of the previous section. The only information that is used by the second stage in this case is, for each CF predicate, whether it leads to a context-free parse or not. Hence the CF parser is only required to produce a table of items that have a successful parse, and can disregard cycles. The idea that if there is a cyclic parse, then there is also a non-cyclic parse, was already used in the predicate descent algorithm (section 4.2). This context-free item information is then used to predict useful items for the real LMG derivation. So the cycles need not even be represented in the interface between the CF component and the LMG parser. So far, I have glossed over the formal details of deriving a context-free backbone from an LMG. I will refrain from working this out in detail, because it is a simple construction. However, there is a number of details that needs attention. If the grammar is an MCFG, it can be translated into a CFG by taking one nonterminal for each component of each nonterminal. Since the output of the context-free parser is used for prediction in the LMG stage, it should contain at least all items that occur in a derivation of the sentence, but certainly not less; in other words, the information must be accurate, but is allowed to be optimistic. In the case of a PMCFG with a reduplicating rule, or an LMG, some

further investigation is necessary.

   Case 1 A reduplicating PMCFG production, type Axx :- B x . In this case, the

rule is replaced with A1 B1 B1. The latter rule produces all concatenations of two strings recognized by B1, whereas the original PMCFG rule recognizes only concatenations of two identical copies of strings recognized by B. Since this is over-general, this is not a problematic case.

154 Trees and attributes, terms and forests

     Case 2 A sharing LMG clause, type Ax :- B x C x . An optimistic solution to this case is simply to drop the second use of the variable x on the RHS, producing

the context-free rule A1 B1. While this seems an acceptable solution when

looking at this single rule, this rule may be the only one in a given derivation  that requires the item C w to be proved for a certain substring w of the input, in which case the context-free parser will output an item set that is incomplete, and

the second phase may incorrectly conclude that no parse is found. Therefore, such a clause must be translated to two context-free rules A1 B1 and A1 C1. In the general case, the LMG rules must be multiplied out into 2s rules where s is the number of times a variable is shared between two predicates on the RHS of the clause.

A preliminary conclusion is that two-stage LMG parsing is essentially more involved than two-stage PMCFG parsing, and when the former is implemented, sharing is relatively expensive. I have not tested an actual implementation. Therefore it is uncertain whether the approach sketched actually performs well. It should be noted that the performance of efficient context-free parsing techniques such as GLR is known to decrease when cyclicity is introduced; I am not sure what the implications are of the idea that these cycles need not be actually constructed, which is the case in the two-phase analysis proposed here that outputs a set of recognized items only, instead of a forest. Finally, it should be noted that the worst-case complexity is not improved by the n proposed optimizations. In the case of the PMCFG for a2 on page 70 for example, the result of the context-free pre-processing phase is useless. But in practical cases, the benefit of each excluded predicate implies a constant speed-up that is polynomial in the size of the input. In practical examples such as a slightly extended version of

2

  the simple grammar for Dutch, the number of checked items is even a factor O n smaller than in the case without context-free prediction. 7.4. Attribute evaluation 155

7.4 Attribute evaluation

In section 7.2, a few problems were already mentioned relating to automated translation over forest representations. The rewriting operations or translation functions must be allowed to distribute over packing nodes, but this results in a high number of matching rules, and possibly into unnecessary un-packing of forests if the translations obtained turn out afterwards to be identical. The latter problem is sometimes solved by re- packing, but none of the currently known solutions of this form avoid exponential complexity in terms of the size of the packed, shared forest. For some stages of processing performed on the output of a phase of syntactic analysis, such as translation of compound expressions as illustrated in section 7.2, un-packing forests seems to be a necessary step. In any case, it is obviously important to cut down the size of the forest as far as possible before doing any such form of structure-to-structure translation. This is where the finite attribute formalisms play an important role.ˆ For those stages of analysis that require only finite attributes, relations between which are defined by equations local to the productions of the underlying grammar, such as TGFL, better algorithms are known—however, their tractability depends strongly on one’s point of view. This problem has been recognized in the literature, and there is a number of proposals for assigning finite attributes after a context-free parsing stage. DEKKERS, NEDERHOF and SARBO [NS93] discuss a range of such solutions, and propose one superior solution that takes a form of two-stage processing similar to the one for LMG discussed in the previous section. They propose to evaluate the attributes in two phases, the first of which is very efficient, but optimistic, and substantially reduces the size of the forest. The second phase proposed is again optimistic, but it interleaved with user interaction with the final goal to yield a single remaining parse tree. The correctness of the algorithm depends on this user interaction and selection of one single tree. I will now discuss this proposal in a little more detail, give an example, point out some problems, and propose solutions—but refer to [NS93] for the algorithms themselves. Nederhof and Sarbo’s method proceeds as follows:

1. Parse the input sentence with the underlying context-free grammar.

2. A REGULAR NODE is a node that is not a packing or sharing node. Divide the resulting parse forest into CELLS which are maximal connected groups of regular nodes.7

3. (attribute evaluation phase A.) Associate with each regular node and each of the restricting affix equations a table defining a subset of the relevant affix domain. Initialize all these tables to contain all possible values.

7In [NS93] the packing and sharing nodes are inside the cells, but the choice to include them or not is a matter of presentation and may play a roleˆ in a concrete implementation. 156 Trees and attributes, terms and forests

Repeatedly perform the following actions, until there are no more changes to the stored affix sets:

(a) For each regular node in each cell, restrict the stored affix subsets by applying the restricting equations once. (b) For each packing node, restrict the values of the affixes of the parent node to the union of the values of the daughter node. (c) For each sharing node, restrict the values of the affixes of the daughter node to the union of the values of the parent nodes. (d) Remove cells containing a node with an empty affix subset—these are called dead wood and lead to a derivation that fails one of the constraint. If a packing node has only one daughter, or a sharing node has only one parent, eliminate it; the parent and daughter cells become a single cell. If a packing node has no daughters, or a sharing node has no parents, eliminate the remaining cell connected to it.

4. (attribute evaluation phase B.) Associate with each regular node a table storing all possible tuples of affix values for the restricting equations, and repeat the same procedure as in phase A, but when after termination the resulting forest contains more than one parse, present one of the packing or sharing nodes to the user and ask the user to choose one of the daughters. Remove the other daughters and apply phase B again. Do this until only a single tree is left.

This algorithm has a time complexity linear in the number of nodes in the packed/shared forest representation,8 exponential in the size of the largest domain of a single attribute (phase A) and in the size of the largest total affix domain (phase B). Nederhof and Sarbo propose various methods of storing the tuples of affix values in phase B so as to achieve a better performance in practical implementations, and their results are promising. The approach is divided into two phases for practical reasons. Phase B is the one with the highest complexity in terms of the sizes of the attribute domains, and in fact phase A could be left out, resulting in a procedure with roughly the same behaviour, and the same formal complexity properties. In practice however, many parses can be ruled out by looking at the affix values individually. The AGFL of figure 7.4 provides a good example: the context-free analysis of the (incorrect) sentence the man walk yields the parse forest at the top of figure 7.9. The figure shows how this forest is reduced to a single tree already in phase A,sophase B need not even be applied. In some cases however, the first phase does not spot impossible combinations,

such as with the grammar in figure 7.10. On the noun phrase der Hunde, phase A

j j  will restrict the agreement features to agrnom dat sg pl , but will not remove all

8Context-free parsing techniques which output packed and shared forest with the least possible order of nodes in terms of the length of the input sentence are known—an example is GLR parsing.

7.4. Attribute evaluation 157

 ambiguity. It is only in the second stage that the feature is reduced to agrdat pl and only a single tree is left. In the context of TGFL over arbitrary simple LMG, there are two problems with Nederhof and Sarbo’s approach,

1. When the algorithm is applied to simple LMG with sharing, or infinitely ambigu- 9 ous CFG with many -productions, there will be cases in which the algorithm will fail to recognize a forest that is empty, and will prompt the user to select one of the alternatives. This happens in the situation in figure 7.11. Only after the user has chosen one alternative, the algorithm will fail to produce a tree and discover that the forest is empty. 2. A syntactical analysis based on finite attributes may be followed by stages of

9Such cases are explicitly ruled out in [NS93], amounting essentially to a requirement of off-line parsability in the case of context-free grammars; for the LMG cases Nederhof and Sarbo’s conditions effectively forbid the use of sharing, reducing the applicability of the algorithm to PMCFG.

phase A, step (a) V0

N0 (3, sg)VI (3, sg) P the man VI (1 | 2, sg) VI (1 | 2 | 3, pl)

S

walk

phase A, step (b) V0

N0 (3, sg)VI (3, sg) P the man VI (ø, sg) VI (3, ø)

S

walk

phase A, step (d): the entire forest is eliminated

Figure 7.9: Phase A working on the incorrect sentence the man walk. 158 Trees and attributes, terms and forests

further disambiguation, e.g. based on anaphoric linking and/or semantic domain restrictions, which arguably cannot be carried out over finite domains. Both the efficiency and the correctness of the algorithm depend on the property that the result is a single parse tree: if the algorithm is applied without user interaction, and at a stage in phase B where the algorithm cannot further cut down the forest, there are two mutually dependent binary packing nodes left, then the algorithm has at that stage a forest containing 4 parses only 2 of which are correct.

The solution to both these problems is to apply algorithms in the spirit of Nederhof and Sarbo’s for as many possible stages of analysis, but as a final step, to eliminate packing nodes where necessary, and perform further stages of analysis on the remaining, hopefully very few, analyses. An objection to such an approach by the formal-minded reader may be that this

may be an average-case speed-up, but it does not improve the situation in the general

f g

D  Dgender Dnumber Dcase

f g

Dgender  m f n

f g

Dnumber  sg pl

f g Dcase  nom gen dat acc

0

  N  EMPTY

fV0  e : EMPTY

I 0

     

V N NAGR

   fVI  fN0 nagr : Dgender Dnumber Dcase NAGR

0 0 C

     r1 N D : nagr g n c N : nagr g n c

0

   r2  D : nagr m sg nom der

0

  j  r3 D : nagr m n sg gen des

0

  j  r4 D : nagr m n sg dat dem

0

   r5 D : nagr m sg gen den

0

  j  r6 D : nagr f sg nom acc die

0

  j  r7 D : nagr f sg gen dat der

0

  j  r8 D : nagr n sg nom acc das

0

  j j j  r9 D : nagr m f n pl nom acc die

0

  j j j  r10 D : nagr m f n pl gen dat der

C

  j j  r11 N : nagr m sg nom dat acc Hund

C

   r12 N : nagr m sg gen Hundes

C

   r13 N : nagr m sg dat Hunde

C

  j j j  r14 N : nagr m pl nom gen dat acc Hunde

Figure 7.10: AGFL for noun phrase agreement in German. 7.4. Attribute evaluation 159

A

C(1|2) P

C(1) C(2) B(1|2) D(2) D(1)

S

B(1|2)

S B(1|2) E

Figure 7.11: Annotated LMG forest for which Nederhof/Sarbo’s algorithm fails. case. This is true, but as in the case of context-free preprocessing for LMG parsing, every analysis eliminated from the forest in order n time reduces the input for the further, theoretically exponential stage, which means an exponential benefit,even though theoretically, the remaining problem is exponential. In some practical cases, different sets of finite attributes have no interaction at all. An example is the illustrated morphological analysis in terms of person and number versus simple lattice-based semantic domain information. In such cases, phase B can be executed on these different, independent sets of affixes separately and consecutively, so as to achieve a better time and space complexity in each of the phases. The different phases may also be equipped with algorithms whose parameters are tuned for efficiency on the particular size of the involved attribute domains, which are likely to be essentially different for e.g. morphological processing and semantic domain information. This observation may also have implications for some of the calculations made for the computational hardness of finite attribute systems such as GPSG in practical cases. 160 Trees and attributes, terms and forests

7.5 Using simple LMG for extended representations

The proposed solutions in this chapter have glossed over the possibility of using the power of simple LMG directly to model attachment of various attributes. This possibility is interesting from a formal perspective, but in practice, is rather limiting, because the complexity of such approaches is prohibitive. For the sake of completeness however, I will illustrate this option briefly. Another interesting facet of LMG is its capacity of capturing constraints over integer numbers bounded by the length of the input sentence. 7-11 example: LMG over lattices. If a simple LMG is translated to an iLMG as defined in proposition 5-6 on page 109, the indices of this iLMG need not be interpreted as indices in a string; they can also be taken to refer to the nodes in a string lattice, or equivalently, to the states in a finite automaton. Such an approach is often taken in speech analysis, in which the input string consists of an ambiguous sequence of morphemes. One problem to watch out for in this case is the use of sharing on the RHS of an LMG production. This will enable an LMG analysis over an input lattice to take different threads in the lattice for the shared part of the input. Further research will need to investigate how such possible problems can be circumvented, perhaps by requiring subderivations to show matching patterns at their leaves when sharing is used. Of course, such a solution may have consequences for the complexity of LMG parsing.

7-12 example: additional input and attributes. Without damaging the complex- ity results, an LMG may be taken to have a start symbol S that takes more than one argument, one of which is the input string, and where the other arguments are instan- tiated with constant strings or string lattices. Such a lattice can then model a domain of finite affixes as used in AGFL. Sharing can be used to model agreement on affix values.

7-13 example: integer components. A hybrid definition combining features of simple LMG and iLMG can allow a grammar to manipulate both strings and integer values bounded by the length of the input, while maintaining polynomial complexity of recognition. This has various applications, most notably in examples such as co-ordination and number series, illustrated in chapter 8. A problem with co-ordination not discussed yet

is that a grammar using sharing will recognize the incorrect Dutch sentence (7.12).10 (7.12) dat Anne Frank aankwam en zag. “That Anne Frank arrived and Anne saw Frank.” vs. *“That Anne saw and arrived Frank” Such a situation can be avoided if verb clause conjuncts have an integer argument that counts the number of object noun phrases. 10Manaster-Ramer [MR87] used Willem-Alexander for similar examples, but pointed out that the dash between Willem and Alexander is problematic. 7.5. Using simple LMG for extended representations 161

More complex examples, perhaps along the lines of the rules for counting oc- currences of categorial operators in co-ordinated (non-)constituents proposed in [CH], may be able to make sophisticated sharing analyses of co-ordination correct that would otherwise be able to conjoin elliptic phrases whose structures are not compatible. In more general terms, the cases of sharing marked ‘dangerous’ elsewhere in this thesis can be fixed with, or replaced with, sharing over integer values. This capacity to deal with integer values bounded by the size of the input string may also help treating those cases that cannot be dealt with using the finite attribute techniques proposed in this chapter. 162 Trees and attributes, terms and forests

Conclusions to chapter 7

This chapter is a plea for separation of the tasks of parsing and processing of attribute constraints into several dedicated phases. These phases need to be well argued for, before any attempts are made to develop an actual natural language system on these grounds. Part III of this thesis will provide additional argumentation for the divisions made here. This is not the current trend in linguistics, which is rather more to attempt to put the whole of linguistic structure, attributes, semantic representations, &c. into a single descriptive framework. There are, however, published arguments inspired by practically oriented research which argue for separation of tasks into separate dedicated phases of analysis.11 This is not to say that such an approach is void of problems—most notably, the interaction between various stages of analysis, which all feature highly similar disambiguation problems, is hard to implement. Once one stage of analysis is over, later stages can take profit from the attributes it has calculated, but not vice versa. Another problem of having several distinct, optimized stages of analysis is that they need to be argued for on a principled basis, because once such a stage is built in, it is there for its fixed purpose, and this is against the wisdom of re-usability. On the other hand, there clearly are distinct differences in the nature of ambiguity at different locations in the process of syntactic, binding and semantic analysis. The more complex features of semantic disambiguation seem, at the current level of knowledge, altogether impossible to treat without exponential expansion. Therefore, the more analyses can be eliminated using simple tools, the better one can cope in large-scale systems on the basis of contemporary methods of analysis—when it is born in mind that if a final stage of analysis is exponential in the size of the structure it operates on, then every previous stage that can reduce the size of this structure by as much as a constant factor, already implies a speed-up by an exponential factor.

11An example is [Sei93], which surprisingly is a discussion in the context of feature formalisms and unification, which usually aim at capturing as much as possible in a single representational system. Part III

Principles

163

Chapter 8 Abstract generative capacity

This chapter is the first of part III of this thesis, in which instead of looking at what, i.e. what string sets a formalism can theoretically generate, I investigate what one should, linguistically speaking, want to be able to describe, and in a limited an abstract sense, how one should be able to describe those phenomena. That is, the focus is moved from a theoretical, platonistic view on linguistic capacity, to an explanatory view—I demand that the linguistic analyses given by our formalisms shed light on the precise nature of what is analyzed. Section 8.1 picks up the thread of the chapters 1 and 3 by going deeper into weak generative capacity and introducing the notion of MILD CONTEXT-SENSITIVITY (MCS) as defined by Joshi in [Jos85], and puts the formalisms and complexity results of parts I and II in the context of the MCS discussion. In this section, I also give a brief list of other phenomena to argue about the structural capacities of grammatical for- malisms, than the partial verb clause conjunctions of Manaster-Ramer that dominated the discussion in chapter 3. Section 8.2 does the same, but in a less wide scope, for strong generative capacity. It looks at the consequences of demanding a bounded dependency domain (defined in chapter 1) for descriptions in MHG and PMCFG. The conclusion drawn for the case of MHG will be used as a motivation for some properties of the head grammar solutions proposed in chapter 10.

165 166 Abstract generative capacity

8.1 Weak generative capacity and the mildly context- sensitive

In chapters 1 and 3, I gave arguments that the underlying structure of natural languages is beyond the weak generative capacity of CFG and even beyond the capacity of the more general class of linear MCFG. In chapter 5 it was shown that simple LMG describe precisely the class of languages that can be recognized in polynomial time. In this section I make an attempt to put these facts together in a discussion concentrating on finding a class containing precisely those languages that are structurally no more complex than a “natural language”. In [Jos85], ARAVIND JOSHI proposed an informal outline of such a class: the mildly context-sensitive languages. I will give his definition here, then look at a series of examples and finally discuss other specifications of similar classes.

8-1 deÆnition. [Jos85] A MILDLY CONTEXT-SENSITIVE LANGUAGE (MCSL) is a lan- guage that has the following three properties

1. limited crossed dependencies 2. constant growth (definition 8-2) 3. polynomial parsing

* The class of mildly context-sensitive languages is often confused with the class of languages described by the TAG formalism; this is because the concept of mild context-sensitivity appears almost uniquely in the TAG literature, which often fails to mention stronger formalisms; it is more appropriate to say that MCS is most adequately approached by linear MCFG. I will here equate ‘polynomial parsing’ with the notion of having a polynomial recognition procedure. A better definition is not feasible as long as the definition of mild context-sensitivity is said to talk about languages rather than about grammat- ical formalisms, for a class of languages has no a priori given structure. ‘Limited crossed dependencies’ is also vague, and I will attempt to replace it here with formal equivalents. The essential element in the definition proposed by Joshi is the constant growth property, which is defined as follows:

8-2 deÆnition. A language L has the CONSTANT GROWTH PROPERTY if there is a

j j

constant c0 and a finite set of constants C such that for all w L where w c0 there

j  j  j is a w L such that jw w c for some c C.

* Constant growth is intended to roughly capture two properties of natural language: (i) it features recursion or unbounded embedding and (ii) such recursions appear, in some 8.1. Weak generative capacity and the mildly context-sensitive 167 sense, in finitely many different basic forms. Two stronger properties that emphasize approximately the same features are k-pumpability as defined in chapter 3, repeated here as 8-4, and semilinearity.

8-3 deÆnition. A language is called SEMILINEAR if it is letter equivalent1 to a context- free language, i.e., for any semilinear language M there is a context-free language L such that a word w is in M if and only if there is a permutation v of w in L.

8-4 deÆnition. A language L is UNIVERSALLY k-PUMPABLE if there are constants c0

j j

and k such that for any w L with w c0, there are strings u0 uk and v1 vk

 j j such that w u0v1u1v2u2 uk1vkuk, for each i:0 vi c0, at least one of the vi

p p p

is not the empty string and for any p 0, u0v1 u1v2 u2 uk1vk uk L.

* Both k-pumpability and semilinearity imply constant growth; there is no direct re- lationship between semilinearity and pumpability. Every linear MCFL is obviously semilinear, because a letter equivalent CFG can be constructed straightforwardly by eliminating the component boundaries. Not every semilinear language is a linear MCFL;OWEN RAMBOW [Ram94] presents a formalism that has a greater weak gen- erative capacity than linear MCFG (equivalently, Rambow refers to MC-TAG), but still generates semilinear languages. I will now proceed by giving a series of examples of languages, both simple formal and from natural languages, from the literature, and the properties these languages satisfy.

8-5 example. The unbounded partial conjunctions of proposition 3-15 were shown not to be universally k-pumpable for any k. Along the lines of the same argument, it is easily seen that as an isolated fragment, they also do not have the constant growth

property: let c be the largest number in C, then take c conjuncts each of length c  1. As a consequence, they are also not semilinear.

* The unbounded conjunctions are sometimes argued to be of a not immediately syntactic nature; as anaphoric binding, one can argue that co-ordination has a semantic, or even mathematical quality that is perhaps not wise to include when looking at the capacity of linguistic structure—for example, the human language faculty may be able to tackle these restrictions by a form of multi-stage processing as discussed in chapter 7. The following two examples show sets that are not pumpable, not semilinear, but do satisfy constant growth. The first example is clearly beyond the intended capacity of natural languages, opinions on the second example vary. A preliminary conclusion, drawn often in the literature, is that constant growth is a too weak restriction for what it is intended to capture.

1Note that letter here should be interpreted as terminal symbol, i.e. usually an entire word when thinking in terms of natural language grammars. 168 Abstract generative capacity

n n 8-6 example. The languages a2 and b a2 both clearly qualify as unnatural. The first language does not have the constant growth property, but the second language does. Neither of the languages is semilinear, and neither of them is universally k-pumpable. The second language is not k-pumpable, because the statement of k- n pumpability includes the possibility to pump down, and the if string ba2 can be pumped up, then the pumped string must be b; pumping down removes the b and results in a string outside the language.

* Various attempts have been made to improve the definition of constant growth, by increasing its detail, but for each of these, further counter-examples were found (see [Rad91]).

8-7 example. [Rad91] In Mandarin Chinese, the highest number unit expressed in a single word is zhao, which is 1012. Higher numbers are expressed by series of zhao, just like in English one can think of a thousand thousand as denoting one million. Let NC be the set of well-formed Chinese number names. The fragment of numbers

expressed using zhao and the word wu (five) has the following form:





  (8.1) J  NC wu zhao

k1 k2 kn

 f

wu zhao wu zhao wu zhao

g j k1 k2 kn 0 The set J is shown not to be k-pumpable for any k in [Rad91]. This set will, modulo a finite number of exceptions, be equal to the intersection of Chinese with the same regular set. Hence Chinese is not a linear MCFL. However, the fragment J can be generated straightforwardly by a PMCFG or simple LMG.

* In [MK96], KRACHT and MICHAELIS show that the fragment J is also non-semilinear. However, unlike the proof that J is not a linear MCFL, this does not imply that the full Chinese language is not semilinear, unless one claims that properties like semilinearity of natural languages are preserved under taking certain well-defined subclasses, as J is a well-defined subclass of NC. The same holds for the argument using Dutch or German unbounded partial verbal clause conjunctions. Michaelis and Kracht do show the non-semilinearity of a full natural language, but using a rather complex construction which, as far as I can see, is not trivially applied to the Chinese number names and the verb clause conjunctions; this language is OLD GEORGIAN, and the example has the additional advantage that it is completely free of co-ordination and mathematical properties. Old Georgian features a form of genitive suffix stacking that is highly similar to the Chinese numbers: there are noun phrases of the form

k1

(8.2) N1-nom N2-gen-nom N3-gen-gen-nom Nk-gen -nom

 “N1 of N2 of of Nk” 8.1. Weak generative capacity and the mildly context-sensitive 169

where gen and nom contain suffix elements and Ni are noun stems. From the discussion so far, the following conclusions can be drawn: on the one hand, the constant growth property is too weak to exclude sufficiently a number of artificial languages that one does not want to permit (but which are recognizable in polynomial time), so one would rather want to replace it with a property like semilinearity or universal k-pumpability. These however suffer from two problems: (i) they characterize classes of languages that are not closed under taking well-motivated subclasses, where I would like to define ‘well-motivated’ formally by such operations as intersection with regular sets and applying homomorphisms; and (ii) they are slightly too strong, where the best counterexample is probably Old Georgian.

In the literature, it has been argued that what is missing is the capacity to describe

  j g forms of reduplication, in the form of languages of the form f whw w L where L is context-free or regular. In fact, MANASTER-RAMER has even proposed that while reduplication is an essential construction in natural language, the “mirror

R 2

  g image” construction f whw is not, and a suitable class of languages is obtained by closing the regular languages under forms of reduplication rather than embedding. This seems unlikely, because mirroring is the canonical form of the German verbal clause. While weakly speaking, reduplication plus applications of homomorphisms and inverse homomorphisms seems to be able to generate crossed dependencies in Dutch, I have doubts as to whether this leads to strongly correct systems (in contrast, probably, to cases such as old Georgian and the Chinese number names). Hence, a class like PMCFL seems to be appropriate weakly speaking, but not strongly. This puts the search for a suitably defined class of mildly context-sensitive languages in a slightly dubious light—what is really searched for is a class of structural descriptions, but then one runs into the lack of an all-comprising structural meta theory. To conclude this discussion, I propose to define an alternative class of mildly context-sensitive languages by the following definitions, the most important of which is finite pumpability, which is as universal k-pumpability, without specifying the fixed integer number k in advance. The lack of a proper reflection of underlying structure is in these definitions reflected in the requirement that the class is closed under the AFL (abstract family of languages) operations.

8-8 deÆnition: Ænite pumpability. A language L is FINITELY PUMPABLE if there is

j j

a constant c0 such that for any w L with w c0, there are a finite number k and

 strings u0 uk and v1 vk such that w u0v1u1v2u2 uk1vkuk, and for each

p p p

j j i,1 vi c0, and for any p 0, u0v1 u1v2 u2 uk1vk uk L.

* Finite pumpability excludes quadratically and exponentially growing languages, while admitting the conjoined partial verb clauses and Radzinski’s Chinese number names.

2By convention, wR denotes the mirror image of w, i.e., the string obtained by reversing the terminal symbols in the string w. 170 Abstract generative capacity

It puts the boundary on language growth between the number names and the strict- monotone growing language (8.3) which approaches quadratic growth. 3

k1 k

g (8.3) f abaab ba ba

The following tentative definitions are two ways of further formalizing (i) the notion of ‘limited crossed dependencies’, including a limited form of co-ordination that excludes the ‘square’ pattern I argued needed to be excluded,and (ii) the fact that the pumpability restriction needs to be preserved under taking subclasses and homomorphic projections.

8-9 deÆnition. O-MCSL (optimistic mild context-sensitivity) is the largest class of languages that

1. is a substitution-closed AFL, i.e. closed under union, concatenation, iteration, intersection with a regular set, and substitution.

2. contains only languages that are finitely pumpable

3. is included in PTIME-DTM

* Note that this class exists, and subsumes at least linear MCFL, but not all of PMCFL. One might raise the objection that this definition is indeed ‘optimistic’, and might generate languages that we have not envisaged. On the other hand, it needs to be proved that it does not collapse into MCFL. Therefore I give the following, again tentative, parametrizable alternative:

8-10 deÆnition. (tentative) A class C of languages is said to GENERATE A CLASS OF

M C P-MCSL (pessimistic mild context-sensitivity) M if the closure of under the AFL operations contains only languages that are finitely pumpable and is included in PTIME-DTM.

*

Candidate members for the class C are formal equivalents of linguistic construc- tions like embedding and reduplication, encoded in such a way that dependencies are sufficiently stressed. At least the fragment J, and the classes of homomorphic 2-reduplications and homomorphic mirror images over regular languages should be included. Finally, there is a construction for which it has argued to be unlikely that it can at all be analyzed by a polynomial time algorithm.

3I have slight doubts as to whether such a language should actually not be allowed to count as not excessively growing, on psychological grounds. It seems clear to me that a human should in general not be able to process a fragment that consists of k2 letters a, but in the same way as we can distinguish whether a number sequence is correctly decreasing, we should be able to check that a series is increasing or decreasing by one occurrence of thousand or zhao at a time. 8.1. Weak generative capacity and the mildly context-sensitive 171

8-11 example: scrambling. While the standard order in the German verb clause is embedded, in some cases German shows different orders of the objects; this phe- nomenon is known as SCRAMBLING—the example (8.4) is taken from [RJ94].

(8.4) a. daß der Detektiv niemandem den Mann des Verbrechens zu uberf¨ uhren¨ verspricht

b. daß [des Verbrechens]k der Detektiv [den Mann]j niemandem [ j k zu uberf¨ uhren]¨ verspricht

“that the detective has promised no-one to indict the man of the crime”

If it is assumed that all objects in the German clause can in principle occur in any order, then scrambling is shown to be beyond the capacity of MC-TAG in [Ram94]. It is currently unknown whether scrambling in this form can be tractably processed at all. Similar phenomena occur in the Dutch verb cluster, where the order of verbs (as opposed to the objects) can in some cases be reversed. Even the simple LMG formalism does not seem to provide any methods that can be immediately recognized as solving such problems, because it still splits up constituents into boundedly many clusters. In the case of scrambling however, it must be noted that examples from real German texts are obviously all of limited size, and the fact that complements can be scrambled is licensed by the overt morphology of the different cases. I have not come across any examples that shared two or more objects in one of the four German cases that could be swapped, while even simple examples as (8.4b) above are already at the border of acceptability. Let’s assume for now that it is empirically provable that when a German verbal clause contains two object NPs in the same case, e.g., both accusative, then the deeper embedded NP must appear after the higher NP. In this case it is easy to construct a loosely interpreted XG along the lines of the examples for Dutch cross-serial clauses (sections 2.2 and 6.2) with one bracket type for each of the four German cases, that generates precisely the admissible scrambled sentences. So when this restriction is valid, German scrambling can be analyzed in polynomial time! 172 Abstract generative capacity

8.2 Strong generative capacity and the dependency domain

It is left open in the discussion in the previous section exactly what combination of sharing or reduplication with a subset of multi-component rewriting is sufficient for known natural languages. An important question in the chapters to follow is, supposing for example that simple LMG is a sensible point of departure to build a principled system based on axioms that further characterize properties of natural language, how many components or clusters would be required for the nonterminals in an underlying LMG model. Another question which has not really been solved is whether reduplication is really adequate, in a strong or explanatory sense, to describe the examples in the previous section.

Dutch long-distance topicalization and MHG

The following argument shows that if modified head grammar, which splits up con- stituents into two surface clusters, is to generate a fragment of Dutch sentences with a bounded dependency domain, the resulting structural analyses are unorthodox. In- stead of then marking HG as ‘incapable’, I will use this observation in chapter 10 as a motivation for inverting some traditional dominance relations in relative clauses. The fragment under investigation in this section is the wh-interrogative sentence (8.5), which can be thought of as correspondingstructurally to a relative clause, without the requirement that either the extraposed pronoun and the finite verb are boundedly far apart in the structural representation—this is consistent with the idea expressed earlier that localizing selectional dependencies is to be preferred over localizing elements that

are ‘close’ in a surface-structural representation.  k (8.5) Wat zag Frank Juliak 1 zien drinken? what saw see drink

I will now examine systematically how an MHG could generate such a small fragment, which, if they are to generate Dutch in the strong sense, must be the case by the classificatory capacity requirement one can reasonably make in the case of an sgc test. Suppose there is an MHG of the fragment. Then a sufficiently long member of the

fragment must be pumpable, that is, its derivation has a recursive step (8.6).



(8.6) Aw1 w2



Γ Au v ∆

     where w1  t1 u v and w2 t2 u v can be thought of as terms where u and v are variables. By linearity and non-erasingness, both u and v are projected into w1w2 8.2. Strong generative capacity and the dependency domain 173 precisely once; moreover since the first component in a MHG invariantly appears in the sentence before the second component, the residue of u must precede that of v in w1w2.

The rest of w1 and w2 must consist of an equal number m of the terminals Julia and zien. Since all occurrences of Julia are to precede all occurrences of zien in the sentence, w1 can contain only Julia and w2 only zien. Now, suppose the analysis has a bounded dependency domain. Then the Julia’s and the zien’s generated must be separated by boundedly many dependency arcs; that is, the following possibilities remain:

m m

 1. w1  Julia u w2 zien v

m m

 2. w1  u Julia w2 v zien

m1 m2 m1 m2

 3. w1  Julia u Julia w2 zien v zien Now suppose that drinken is produced at the bottom of the derivation. Then because of locality, so must wat. But drinken must appear right of all occurrences of zien, and wat must appear left of all occurrences of Julia. This leads to a contradiction. The same holds for wat. Therefore, if the fragment is to be produced by an MHG with a bounded depen- dency domain, the deepest embedded verb drinken and its complement wat must be generated at the top level of the derivation! This may be counter to one’s intuition, but is not surprising, since it is directly linked to the topic wat of the sentence. One is now left with two possibilities.

1. zag and Frank are produced below the recursive step; this implies that the whole

derivation is “turned up side down” with respect to traditional analyses:

i

h zag Frank Julia

i h zag Frank Julia Julia zien . .

k1 k

i h zag Frank Julia zien

k1 k

i h wat zag Frank Julia zien drinken

2. zag and Frank are produced above the recursive step. This results in a derivation that has both the deepest embedded verb and the main verb at the top of the

derivation, and whose bottom is in the middle of the verb embedding.4

i

h Julia

z

i

h Julia Julia Julia zien zien

z . .

2k1 2k

i h Julia zien

2k1 2k

i h wat zag Frank Julia zien drinken

4The braces indicate which occurrences of zien and Julia are connected through immediate dependency. 174 Abstract generative capacity

with an extra step to produce verb embeddings of odd depth—this is redundancy in the grammar.

The first of these remaining options is investigated in chapter 10.

Reduplication and partial verb clause conjunctions

I have expressed, several times, doubts as to whether in the strong sense, reduplication can be the mechanism responsible for generating the parallelism in the partial verb clause conjunctions in the examples in chapter 3. Furthermore, the proof that PMCFG generate this set weakly, is based on the proof in [SMFK91] which, as said in the conclusion to chapter 3, lacks detail. A PMCFG that generates the conjunctive phrases was argued to exist on the basis of a grammar generating a structural backbone, and then applying an inverse homomorphism to produce the actual set. The first objection one might raise is that the set of noun phrases in Dutch is not a regular set, so a homomorphism will not do, and one must apply a substitution. If Kasami et al.’s proof is correct, this objection does not present a problem. The details of argumentation on PMCFG are dazzling, because pumping can be applied, but because of nonlinearity, the pumping lemma does not take a concrete, concatenative form. However, it seems that a strong generative capacity analysis will show that a PMCFG that describes Dutch sentences necessarily has a number of components for nonterminals that is proportional to a function that is at least exponential in the number of lexical entries for verbs (or noun phrases, if the fragment is limited to single-word lexical NPs). It needs to produce all possible orders of the finitely many lexical entries for raising verbs. The argument then naturally leads to a contradiction when requiring a bounded dependency domain. It now depends strictly on what one takes to be the “underlying structural basis of natural language” to establish whether one thinks of reduplication as being what is really happening in the case of partial verb clause conjunctions. If it is, then surely this would establish the necessary existence of a further step of syntactic processing after analyzing the underlying structure of a conjunctive sentence. Conclusions 175

Conclusions to chapter 8

There are two conclusions to this section that strongly influence the approaches in the rest of part III. The first is, that in a principle-based approach to grammar specification, a form of the mild context-sensitivity definition can be taken as a core set of axioms. I suggest to consider a number of the aims expressed in the Prologue and the first chapter of this thesis also as axioms, which then leads to the idea that simple LMG is a reasonable generic framework which could be further restricted through principles and parameters so as to finally result in a grammar with a reasonable amount of explanatory quality, and for which some computational tasks are tractable. This is done in chapter 10. The loose XG-account of scrambling is new, but works only for a restricted case, which is hard to evaluate empirically because of the difficulty of scrambling example sentences. What was found here is not dissimilar to results of analyses done by OWEN RAMBOW ([Ram94] and p.c.). From the existence of a loose XG analysis, it follows

that S -LMG can also weakly generate scrambling, but a strongly correct analysis is unlikely. If the closure result of PMCFL under inverse homomorphism is indeed found to be true, then for all examples except scrambling, it may be sensible to add that strongly, simple LMG is a good point of departure, but weakly, the languages under investigation should be in PMCFL. Adding reduplication to simple LMG, which is a conservative extension from a weak perspective, may further increase its relevant strong capacity. Another, minor, conclusion following from the second section of this chapter has had influence on the proposed method of describing relative clause attachment in chapter 10. 176 Abstract generative capacity Chapter 9 Government-Binding theory

What is a systematic description of a language? Clearly, this is not just a list of the correct sentences of the language, because such a list does not have any explanatory qualities. A rule-based grammar as I have shown examples of in parts I and II is better, but suffers from the same problem: such grammars seem to describe a language fairly accurately, but some sets of rules show similarities which are not explained, and the question remains why the rules look like they do. A principle-based grammatical framework tries to formulate general properties of language from which the linguistic structures, or the rules that describe these structures, can be derived. An example of a generalization missed in a rule-based grammar is that in English, all verbs tend to immediately precede their objects. Although every rule in the grammar constructing verb phrases is consistent with this observation, the rule itself is not stated. Among many other theories, such as GPSG [GKPS85], GOVERNMENT-BINDING (GB) THEORY captures such word order generalizations, by giving rules only for phrase containment, and specifying word order constraints separately. Note also that phrase containment is subject to less variation among different languages, whereas whether complements precede or follow their heads depends strongly on the language under investigation. Therefore such a separation also helps dividing the components of a grammatical description into universal and language-particular properties, and as a side effect helps understanding multi-lingual analysis and (automated) translation.

* In its presentation in parts I and II, literal movement grammar and its way of describing surface order lacks such a profound principled background. It was introduced without much motivation and its properties were investigated in a theoretical fashion that looked primarily at the resulting languages, and even this was done mostly on the basis of weak generative capacity. This chapter on GB theory, and the next, about FREE HEAD GRAMMAR, will make the move from the practice of directly writing the grammar clauses, to an axiomatic system where the precise form of the grammar is not given but rather follows from a set of constraints, divided into PRINCIPLES of language in general and PARAMETERS restricting the grammar to a particular language or a fragment of it. Such axiomatized systems aim at giving a set of properties that language must satisfy, and at providing a concise motivation for the exact statement of these properties. Both GB theory and forms of head grammar (which, with its close relative GPSG,was superseded by HPSG [PS94]) are such explanatory systems, and in particular, they offer explanations for what exactly is the roleˆ of movement in linguistic structure—

177 178 Government-Binding theory which is exactly what part III of this thesis is after. This chapter represents the conservative approach to finding a principled basis for the tuple-based description of movement. It will introduce the reader to some concepts of the principled, explanatory approach in general and of GB theory in particular. Benefits of such a conservative approach are that (i) it is good to have a way of assessing the value of the grammar formalisms from part I and what is done in the principle-based tuple grammar system FHG in chapter 10 and (ii) GB provides, among many theories, the most satisfactory, and the farthest elaborated empirical analyses, so it is worthwhile to investigate how results of generative Linguistics can be reformulated in principled accounts based on literal movement. The account given here is a very sketchy, simplified version of a framework in Government-Binding style, and in order to allow for a short presentation, it makes use of non-standard terminology. A good textbook is [Hae91] (on which this chapter is loosely based). 9.1. Phrase structure and X syntax 179

9.1 Phrase structure and X syntax

GB is a phrase-structure based, strictly hierarchical, projective theory. That is, one of its most basic principles is that a sentence has a SURFACE STRUCTURE that can be drawn as a tree diagram spelling out the sentence from left to right at its leaves, or equivalently, by means of LABELLED BRACKETS. A complete labelled bracketing as in (9.1) fully specifies a tree structure; however, in practice, one often looks at partially

specified trees by omitting brackets (9.2).

  

(9.1) S NP Frank AUX will

          

VP V poor NP Julia NP Det a N cup PP P of NP N tea

    (9.2) S NP Frank AUX will VP poor Julia a cup of tea

Furthermore, categories and phrases play rolesˆ similar to those sketched in section 1.3 of the introduction.

9-1 principle: surface structure. Every sentence has a projective surface struc- ture.

9-2 principle: categories. Words belong to SYNTACTIC CATEGORIES like nouns and verbs. There are TERMINAL CATEGORIES X (those of words) and corresponding PHRASAL CATEGORIES X ,XP.

* The notion of selection mentioned briefly in the introduction, is called subcategoriza- tion in GB theory, and is restricted to terminal categories, i.e., to words in the lexicon. Subcategorization on GB consists of ARGUMENT STRUCTURE as well as a further re-

finement, -theory of THEMATIC STRUCTURE, which aims to explain why a word selects certain complements by assigning the complements rolesˆ such as actor, patient. I will gloss over thematic structure here because it can be considered optional in this and the next chapter. Each lexical entry for a word contains a subcategorization frame, that is a list of category specifications the word selects for.

9-3 principle: Lexicon and selection. There is a LEXICON specifying the (termi- nal) categories and properties of words. A lexical entry specifies a SUBCATEGORIZATION FRAME; The subcategorization frame specifies a finite number of ARGUMENTS, their category, and in the case of an NP arguments, optionally specifies a case assignment.

*

An important component of GB also often referred to by other frameworks is X - theory, which says that there may be levels between a phrasal category XP and the lexical equivalent X. In general, one assumes a single intermediate level X (“x-bar”). These levels are then characterized by three rules that specify phrase containment, but no order: 180 Government-Binding theory

9-4 principle: X projection. Phrase containment satisfies the following domi-

nance restrictions, for any basic categories X, Y, Z, S.

XP SP; X

X X ;YP

X X; ZP1 ZP2

XistheHEAD of XP; SP is the SPECIFIER and ZPi are the ARGUMENTS of the head X.

*

The first rule of the X schema says that a phrasal category consists of an obligatory SPECIFIER and an intermediate level. At the intermediate level, other phrasal categories are optionally adjoined (such as adjectives or adverbs). The third rule says that an intermediate level consists of the terminal head and its complements as defined by its subcategorization information in the lexicon. In all these rules, the order is not specified, but the specifier, adjuncts and complements either precede or follow the X-projection, so operations like wrapping as discussed earlier are not considered. Precisely what phrasal categories are allowed as specifiers and adjuncts, as well as whether these precede or follow, is considered a language-dependent parameter of the grammar. In English for example, the specifier precedes the head, and the complements follow it.

For some languages, the X view is problematic, because their word order is too free to fit in the model. It is often claimed that those languages are non-configurational, that is, they do not have a hierarchical structure. In the next chapter, I will discuss Latin, and show that when the X structure as posited by GB is less tightly connected to surface order, hierarchy and free word order do not have to be contradicting. Finally, important principles describe how the presence of words is licensed by the presence selecting or case-marking heads. Subcategorization and the lowest X -rule interact as follows: briefly speaking, everything (say a noun phrase) that occurs in a sentence must be there because something else, say a verb, selects for it; and everything can be selected at most once.

9-5 principle: theta criterion. Every maximal projection in an argument or spec- ifier position must be subcategorized for by precisely one lexical head.

* The presence of NPs is further restricted; each noun phrase must have a case, and this case is assigned by its environment.

9-6 principle: case Ælter. Every NP must be assigned case. 9.2. Movement and adjunction 181

9.2 Movement and adjunction

Government-binding theory chooses to assume a rigid relation between structure and word order, and describes word orders that do not conform directly to its projectivity constraints through MOVEMENT. An example is the following. A sentence contains an INFLECTIONAL PHRASE IP. The specifier of this IP is the subject of the sentence, and the head has the category I or INFL (for inflection). INFL is sometimes an auxiliary such as in (9.3), but in the majority of cases, it is just the verbal tense/agreement marker

(9.4a).

      

(9.3) IP NP Frank I will VP V poor NP Julia NP a coffee

       (9.4) a.  IP NP Frank I -ed VP V poor NP Julia NP a coffee (Frank poored Julia a coffee)

In the second sentence, the past tense marker -ed is attached to the verb poor. This is traditionally done by LOWERING the tense marker to the V node, resulting in (9.4b), or

alternatively by RAISING the verb to INFL. In both cases, one gets a node of the form

       V V I , or alternatively I V I , that does not correspond to the X principle 9-4;

this is called adjunction.

          (9.4) b.  IP NP Frank I VP V V poor I -ed NP Julia NP a coffee (Frank poored Julia a coffee)

9-7 principle: adjunction. The following non-X rules are allowed under certain conditions:

XP XP; YP (phrasal adjunction)

X X; Y (head-to-head adjunction)

Another example of movement in GB is the following refinement: the IP is contained in a COMPLEMENTIZER PHRASE CP, whose head is of category C or COMP as in (9.5), butinayes-no question such as (9.6), the head will of the IP is raised to the position

of the C.

  

(9.5) CP C that IP Frank will poor Julia a coffee

        

(9.6) a.  CP C y/n IP NP Frank I will VP V poor NP Julia NP a coffee

         b.  CP C Willy/n IP NP Frank I VP V poor NP Julia NP a coffee

Hence GB theory arrives at more than one structure for each linguistic expression. The structures (9.4a) and (9.6a) are called the DEEP STRUCTURE or D-structure, and the analyses (9.4b) and (9.6b) are called the SURFACE STRUCTURE or S-structure. The difference between these structures is minimal; the movement operations that transform D-structure to S-structure are said to be STRUCTURE-PRESERVING, and there are very strict rules of what is allowed to move to what positions. 182 Government-Binding theory

So on the one hand these structures are almost identical, but on the other this notion of transformation has consequences for (i) the readability of grammatical explanations and (ii) their computational tractability. Therefore it is worth looking at whether the literal movement from previous chapters can be used to obtain a version of the GB theory that has only one of the two structures. This is done in section 9.4. The complete scheme of the structural representations recognized in GB is drawn in figure 9.1; the actual situation is that one structure is derived from the other through a number of successive movement operations, so in a computational model, several more structures may be constructed. Also, GB posits a form of derivation of the structures that corresponds to the way the human brain constructs a linguistic expression—hence the term generative linguistics. In this point of view it is essentially harder to look at the opposite construction in a simple and elegant way—i.e. the perspective of the listener or reader, who has to analyze an expression that is received as spoken or written text without any given structure: GB parsing is a difficult and relatively underdeveloped field of research.

D−structure

movement

S−structure

movement movement

PF LF (phonetic form) (logical form)

Figure 9.1: Structural representations in GB.

Chomsky et al. have recently developed successors of the GB theory (the MIN- IMALIST PROGRAM, [Cho96]) that take away some of these problems, but are still abundantly dependent on ‘movement’ in the sense of transformation of structures. One of the differences is that structures are not moved but rather copied. Also, the scheme of structural representations has been simplified, and a more direct path is now suggested to connect PF and LF. 9.3. Examples 183

9.3 Examples

The leading example in this thesis, a simple selection of sentences with so-called RAISING VERBS1 in English, German and Dutch, is a good illustration of a fragment of GB grammar that shows (i) the way universal principles and language-specific parameters are used and (ii) how abundant use is made of movement constructions. To make this a smooth story, I shall look, initially, only at two types of verb; transitives and raising verbs. The latter already occurred in previous parts of this thesis, but the term was not explained. In this simple account, the verbs can be given

the following SUBCATEGORIZATION FRAMES2

i      

(9.7) h transitive verb : 1 NP 2 NP assign acc

i       (9.8) h raising verb : 1 NP 2 IP assign acc to subj

For clarity, and as done before, in tree diagrams I will give these verbs the categories VT and VR, respectively. In GB, it would be sufficient to say that they have the category V, and the subcategorization frames given above. As said, these subcategorization frames license certain deep structures, without specifying a particular order at the bottom of the verbal projections. To specify such an order, one can give LINEAR PRECEDENCE specifications. For English, one can give the very general parameter ELP, with an amazing amount of accuracy.

9-8 parameter: ELP. Complements in English follow their heads.

* Hence, the familiar English structure in figure 9.2 is licensed. Note that the subject plays a special role,ˆ since it is subcategorized for by the verb, but does not appear in the bottom of the verbal projection, but under the IP. This is because it is not assigned its case by the verb, but by INFL, and case can, roughly, only be assigned ‘downward’ in the tree structure. The German case is similar; WGLP swaps the argument order as suggested in the introduction; a resulting tree3 is in figure 9.3.

9-9 parameter: WGLP. Complements in West-Germanic languages precede their heads, with the exception of the complements of C.

1The verbs called VR in this thesis are, syntactically speaking, only raising verbs in an analysis of Dutch, and should not be confused with English raising verbs, such as to seem, in the GB literature. On the other hand, they are called raising in Pollard’s HG program ([Pol84], chapter 10) too, motivated by semantic properties. 2 These frames are not presented as they usually are in GB theories, because I do not include a form of  theory. 3This analysis is known as the base-generated analysis of the German verb phrase; other analyses contain a verb cluster like the analysis of Dutch sketched below. Note that the IP might here (at the current level of detail) just as well have been collapsed completely into the VP, making the subject VP-internal and base-generated in the right position for subordinate clauses. 184 Government-Binding theory

* In the German and English cases, the S-structure for relative clauses is nearly identical to the deep structure, be it that the verbal inflectional suffixes are lowered to the verbs. This is as far as underlying structure goes, without the need of transformations to get to the surface order. A frequently occurring phenomenon is rightward EXTRAPOSI- TION, discussed here for the Germanic VP. Sentence (9.9) disobeys the constraint that

complements in German precede the verb.

    (9.9) daß Frank Julia i verbot Kaffee zu trinken i

The verb verbot is called an EXTRAPOSITION VERB, because the embedded VP Kaffee

zu trinken here is assumed to be obligatorily raised to a position right of the IP, by

   phrasal adjunction  IP IP VP , creating a position that did not exist in the D-structure. This breaks the structure-preservingness suggested in the previous section, be it in a way that is usually ‘approved’ by GB theorists. * The analysis of Dutch is, as before, more complex than that of German and English;

IP

NP I’

I VP

V’

VR IP

NP I’

I VP

VI

VT NP

Frank−ed see Julia − drink coffee see−ed = saw

Figure 9.2: D-structure for English 9.3. Examples 185 here, head-to-head adjunction is required to form even the basic structure of the verb phrase. Conventionally, an SOV structure like that of German is assumed; however, the order of the verbs in Dutch is reversed w.r.t. the German analysis. This is most often described by a process called VERB RAISING [Eve75]. So, the Dutch deep structure is like that of German, but the S-structure is as in figure 9.4.4 Apart from equivalents of the German extraposition verbs (9.11), there are so-called PARTIAL EXTRAPOSITION VERBS [dBR89] that optionally allow only part of the embedded VP to be phrasally right-adjoined to IP (9.12). For comparison, the standard cross-serial raising case is

repeated here in simplified schematic form as (9.10).

    (9.10) dat Frank Julia koffie i zag drinken i

4An interesting alternative seems to me, but apparently not to the literature on Dutch sentential com- plementation, to assume an SVO order and adjoin the objects to the right of [Spec, IP]. This corresponds more closely to the analysis found elsewhere in this thesis, and has a number of advantages, namely that the movement it involves is less anomalous, because it resembles movement to [Spec, CP].

CP

C’

C IP

NP I’

VP I

V’

IP VR

NP I’

VP I

V’

NP VT

daß Frank Julia Kaffee trink −en seh −te seh−te = sah

Figure 9.3: D-structure for German

186 Government-Binding theory

 

(9.11) a. dat Frank Julia i verbood koffie te drinken i

   

b. dat Frank Julia koffie i verbood te drinken i

 

(9.12) a. dat Frank i probeert Julia koffietegeven i

   

b. dat Frank Julia i probeert koffietegeven i

    c. dat Frank Julia koffie i probeert te geven i

Partial extraposition is one of many problematic cases that lead to various highly involved proposals for the description of Dutch.

CP

C’

C IP

I’ NP VP I

V’

IP VR

NP I’ VR VT

VP I VR I VT I

V’

T NP V

dat FrankJulia koffie zie −de drink−en zie−de = zag

Figure 9.4: S-structure for Dutch 9.4. Replacing S-structure with literal movement 187

9.4 Replacing S-structure with literal movement

I now sketch how literal movement may be used to implement a theory or grammatical fragment within the GB program in a way that is as conservative as possible—i.e., respecting the most possible concepts of GB theory. The most important motivation for such a sketch is that grammars with a bounded feature domain (section 7.1) and literal movement are known to guarantee, at least from a formal perspective, tractable computer analysis. Of course, implementation, weak and strong generative capacity of GB descriptions have been studied in the literature. Most such approaches rely on the concept of GRAMMAR CODING, which is, informally speaking, to code the relevant features of the grammatical categories into the set of nonterminals of an underlying grammar, as explained in remark 7-7 on page 136. In particular, KRACHT [Kra95] and ROGERS have independently developed formal elaborations from which it can be concluded that in the absence of head-to-head movement (see below), variants of GB can be coded into context-free grammars. The relevance of what is done in this section is twofold: (i) grammar coding is a notion that is, initially, of formal significance only, because the resulting coded grammars have nonterminal spaces of intractable sizes; (ii) it seems that restricted forms of head-to-head movement are harmless; while the literal movement characterization of this section is less general and lacks detail, it provides hints as to how one might capture some of these acceptable exceptions. A principle of LMG descriptions independent of the motivating linguistic frame- work, an axiom so to speak, is that the terminal yield of a constituent is generated at its deep-structural position. The LMG clusters provide a way to carry this yield, which may be any tuple of terminal strings, to its landing site or sites. Along with this terminal string however, a number of features of the constituent may need to be carried along to be matched against constraints on the landing site. The expected benefitof using literal movement in this way is that the number of features to be ‘carried’ in this way will be less5 than in the more traditional approach to grammar coding (of GB or unification-based theories), in which a constituent is generated at its final landing site, and all its features are carried to, or even structure shared with, its deep-structural position. The following types of movement are of essential relevance; assume a parametrized set of GB principles T that describes a fragment of a language, the types of movement in which are limited to the following;

1. Head movement, mostly subject-auxiliary inversion: the auxiliary is moved from ItoC. 2. Phrasal movement, mostly NP-movement; (a) wh-movement and relatives Complete NP trees are moved to [Spec, CP].

5An important question to tackle next is how to automatically deduce from a grammar which features do not need to be carried over long distances. There are numerous techniques for such dependency analyses, and I will not discuss those here. 188 Government-Binding theory

(b) seems-raising NPs are moved from [Spec; IP] to [Spec, IP]. (c) Other categories, e.g. PP (disregarded here). 3. Head-to-head adjunction (a) Lowering of inflection The inflectional marker is lowered to V by head-to- head adjunction. (b) Verb raising in Dutch

The first two types of movement, are strictly structure-preserving: the landing sites are positions in S-structure that are already present in D-structure. The LMG backbone for T will have some form of link between nonterminals and category or feature specifications, presumably at the level of basic categories plus their bar level, i.e. NP, V ,C,&c. Formally, take a number of basic categories, and for

0 1 2

  each such category X, let N contain X  X, X X and X XP. The terminals will represent what is in the lexicon. For any possible clause R, the principles of the grammar either license it, and yield a refinement in the form of a feature specification, or refute it. The clause set P of the LMG is then the set of licensed clauses. One can go into different levels of detail of this licensing operation, but I will here be especially interested in the following: how many SURFACE CLUSTERS does each nonterminal get,6 what are these used for, &c. I am also interested in clusters of finite attributes with little interaction (called feature buckets here) as suggested on page 159—these will often be associated with the surface clusters.

1. subject-auxiliary inversion. This is a relatively simple case; this type of movement is highly local. A technique similar to head-wrapping as in MHG is sufficient here. So in the basic setup, each nonterminal Xi has at least three clusters L, H and R, reading out the YIELD of the (traditional) S-structure as L–H–R, and where H is the (lexical) head X. There will be two types of clause that embed IP into CP; a

‘normal’ clause7

     C  c lir :- C c IP l i r

and one that is responsible for inversion:

      C i lr :- C IP l i r

Selection of which clauses apply in given contexts then depends on feature configura- tions derived from other principles this account does not go into.

2. NP-movement. For this to be modelled in an LMG we must argue that each node in D-structure can be “passed by at most one thread of wh-movement”, defined

6Note that the arity of a nonterminal symbol is not part of the definition of an LMG—but by convention, LMGs will often use a nonterminal only with a fixed number of arguments. 7 The appearance of  in the RHS of this clause is not beyond the capacity of simple LMG; it can be

circumvented by replacing each occurrence of  with a variable x and adding a nonterminal Empty with a

 single production Empty 9.4. Replacing S-structure with literal movement 189 formally as follows: a landing site (filler) and its trace form a CHAIN. This chain is said to PASS THROUGH a node if the node is on the shortest path between two nodes in the chain. Now, what needs to be required is that through each node in the structure (D or S, at this point equivalent), are passing BOUNDEDLY MANY chains. This seems to be guaranteed, for the indicated types of movement, by variants of the SUBJACENCY PRINCIPLE:8

9-10 principle: subjacency. Movement cannot cross more than one node of cate- gory IP or NP.

Given this principle, a node n in a tree structure, and a type of landing site such as [Spec, IP], the size of the LANDING DOMAIN of nodes of this type in the tree structure that n could possibly be moved to, is bounded by a constant.9 It is now an easy step to provide some characterization for such landing sites. For example, the nodes of type [Spec, IP] are divided into those that are immediately above n, or those that have one ancestor labelled IP in between. Each of these finitely characterized classes shall in general need to be assigned a surface cluster. Along with this surface cluster, one shall need to reserve a “feature bucket” that can contain the relevant features (e.g. if an NP is moved, the features relevant for an NP at its landing site, except those employed for purposes of movement). An argument against the claim that the number of movement chains ‘passing through’ a given node is bounded often presents counterexamples such as (9.13) [PS94]. Here, Sandy [Spec, CP] and Felix [Spec, IP] are classified differently; this

also applies to other examples I have found in the literature.

(9.13) Sandyi Felixj was hard for Kim to give j to i

3. Head-to-head adjunction. This third type is anomalous: the original head- to-head adjunction version of verbal suffix lowering is not structure preserving: it generates extra structure under V. There are two solutions to this. The first takes account of the fact that this lexical ‘complex verb’ is often taken to be sub-structural. An LMG analysis will then have to generate the affix at the V position, but will include the features corresponding to the affix only in the head feature bucket, and not locally. The status of the feature bucket is marked as +lowered. The second solution bans lowering altogether and looks at solutions (of which there are several in the literature) that look at versions of GB that consider raising only. The phenomenon of verb raising in Dutch can also be given a non-structural analysis, by raising verbs one embedded VP projection at a time, and concatenating them to the right of the head of the parent VP, generating a NON-LEXICAL head cluster.

8The subjacency principle is subject to variation across languages; but the variants that differ from the one for English (e.g. Italian, where IP becomes CP), do not seem to break the conclusions drawn here. 9Strictly speaking, additional principles need to be stated to guarantee that this works completely, such as: within each domain of a bounding node IP or NP, movement always occurs to the highest node of a given type. 190 Government-Binding theory

Conclusions to chapter 9

This chapter served, in the first place, to introduce principle-based grammar. A frame- work like GB cannot be compared to the work on LMG and tuple-based formalisms presented in parts I and II, because the latter do not provide fundamental linguistic ex- planations. But as GB is a principled framework based on a notion of phrase structure that stems from the study of context-free grammars, a theory can be developed that is built on a tuple-based structural backbone. This is done in chapter 10. An alternative line that was investigated briefly in section 9.4,istouseLMG as a platform for implementing GB theory, with the particular property that S-structure vanishes, to leave D-structure as the only structural representation. Such a theory is more conservative than a direct formalization, as in chapter 10,of the intuitions given by tuple grammars,but has the benefit that the results of ‘traditional’ research in generative linguistics can be transferred fairly easily to a framework based on literal movement, with the additional benefit of, at least theoretically, tractable im- plementations. Of course, these tractable implementations still have to be constructed, and remain to be shown efficient in practice. Moreover, there is a vast amount of details whose interaction with the proposed simplification of abandoning S-structure needs to be investigated. To conclude, one further remark is important to make. In this chapter I have looked at S and D structure only, and the standard approach to talking about LF and PF does not fit into the account based on literal movement. But as LF is derived only when S-structure and D-structure are given, perhaps this does not have to be considered a problem—one could think of that phase as post-processing, one could decide to do it altogether in a different way (more standard PTQ style). Along the lines of reasoning from chapter 7—successive reduction of the number of analyses—one could say that the S/D-analysis filters out ‘crashed’ derivations, so in the end it is more economical too. The same argument “for the time being, we will consider this an extra-syntactical part of linguistic analysis” can be applied to BINDING THEORY, that also hasn’t been discussed here. Chapter 10 Free head grammar

The previous chapter gave a rather conservative approach to finding a principled background for tuple grammars, by looking at how tuple-driven surface generation can be used to support tractable implementations of GB theory. However, it was one of the points of departure mentioned in the Prologue (conjectures H and J)to avoid any design implications of an emphasis on the description of English, which is present in GB as in many other theories with broad empirical coverage like HPSG and LFG—in particular the important roleˆ played by a projective, left-to-right ordered surface structure. In this chapter, I take the more progressive approach of investigating the possibility to provide a direct foundation for the use of tuple grammars. The investigation in this chapter is based on Pollard’s HEAD GRAMMAR (HG), which can be thought of as claiming that the position of the head in a phrase is a psychologically well-chosen mark that could enable humans to split up the “strings” generated in their minds. The FREE HEAD GRAMMARS presented here take the liberating effect of Pollard’s generalization from concatenation to operations on headed strings a bit further, by allowing (i) slightly more complex, but still head-oriented yield formation and (ii) multi-rooted tree structures to achieve a greater flexibility in choosing key dependency relations. While doing this, I aim at obtaining a theory that is perhaps not as broad in coverage as GB theory, but has the potential to be elaborated to the level of detail of a modern syntactic framework such as HPSG [PS94], and concentrate in particular on the ability to parametrize very general properties of word order. While this chapter makes use of HG to provide a principled background for LMG, it also attacks some problems of head grammar. The full version of Pollard’s head grammar lacks a claim of tractability, due to the presence of unboundedly deeply embedded SLASH features. Furthermore, while head grammar is capable of local- izing crossed dependencies in Dutch, it does not try to localize other long distance phenomena (such as nominal gap–filler dependencies). Hence another extension is proposed, to achieve complete localization of relative clause dependencies: an optional relaxation of the notion of hierarchy, motivated by the result proved in chapter 8 that if an MHG is capable of describing Dutch relative clauses, it will necessarily describe these with a structure that looks “turned inside out” with respect to traditional analyses. This attaching of the wh-relative clause not at its “top” but at the wh-word it quantifies over is further motivated by the structures we saw in examples of XG in chapter 6, which also had this link explicitly represented in their graph-structured derivations. Both extensions to HG proposed remain suitable

for an implementation based on the tractable S -LMG formalism.

191 192 Free head grammar

10.1 Basic setup for a HG of English

HEAD GRAMMAR (HG) is a principled grammar framework that is tightly related to GPSG, but makes use of head wrapping operations to enhance the device that generates surface strings. After Pollard’s [Pol84], research on head grammar has focused on the simplified version MHG that appeared frequently in parts I and II of this thesis, and research on which is limited to formal language-theoretical results. Although MHG and HG are weakly equivalent, some HG analyses have no strong equivalent in MHG. To get started, I will define HG in its original formulation by Pollard, but glossing over features that are not essential to this chapter, and adding some of my own. To be precise, I make the following restricting modifications:

a simplified, strictly extensional semantic component

a slightly different view on Pollard’s CONTROL TYPES.

a bounded feature space (no SLASH features)

no ID/LP rules [GKPS85]

The additional features are defined separately and yield a version which I call FREE head grammar (FHG), that has slightly more powerful yield formation operations.1 Every HG is an FHG. The key point of this chapter is to show how an unbounded feature space and the ID/LP rules can be replaced with a combination of more free yield formation, and later, with a more free attitude towards hierarchy and phrase containment.

The principles of free head grammar

A (free) head grammar is, formally, a simple LMG whose clauses satisfy a long list of properties. As in GB, some of these properties are UNIVERSAL PRINCIPLES, others are language dependent PARAMETERS.

10-1 principle: features. The set of nonterminals N of an FHG is a finite feature

space whose elements are (partially) characterized by a list of feature specifications in

 the style of GPSG [GKPS85], such as AUX NUM PL PERS 3RD . Partial feature specifications are used only informally, and specify sets of nonterminals.2

* The following two principles express BINARY BRANCHING, and how the surface clus- tering is limited to what one can build by associating with each node in a structural

1Such a free construction is also used in GERTJAN VAN NOORD’s dissertation [vN93], but differs on a number of parameters, such as the following: the verb second operation in Van Noord’s analysis is not order preserving (parameter 10-4). 2Alternatively, from a more implementational point of view, only part of the feature space may be encoded into the nonterminals and the rest encoded into finite annotations as proposed in chapter 7. 10.1. Basic setup for a HG of English 193 representation, in a systematic manner, a string divided into a LEFT CONTEXT, its HEAD and a RIGHT CONTEXT. Furthermore, the only phrase containment relation is head–dependent.

10-2 principle: tree principle. A sentence is associated with a tree structure (its DERIVATION) in which each node is either terminal, or the parent of two other nodes, one of which is the head, and one of which is the complement or dependent.

10-3 principle: surface clusters. Nonterminals in an FHG occur only 3-ary; all

clauses must be simple (definition 3-26), and of the form

    

At1 h1 t2 :- H l1 h1 r1 D l2 h2 r2 (head-complement clause) where li hi and ri are variables, and t1 t2 are arbitrary terms

over these variables; or of the form (lexical clause)

 Au v w where u, v and w are strings.3 H is the HEAD DAUGHTER of A, and D is the DEPENDENT. Consequently, FHG derivation trees are right-branching.4

In the non-lexical case, the head daughter H has a feature RULE, whose value is the

 tuple t1 t2 , uniquely determines the applied LMG clause.

* The following are important parameters to be “switched on and off” ad lib. to vary the strictness of the underlying clause system to levels between CFG and simple LMG.

10-4 parameter: order preserving. A rule is ORDER PRESERVING when if t1 t2 are as in principle 10-3, then in the term t1h1t2, the left-to-right order of l1, h1, r1 and l2, h2, r2 is preserved; that is, no occurrence of r1 precedes one of h1, no occurrence of h1 precedes one of l1, &c.

10-5 parameter: linearity and nonerasingness. The terminology of definition 3-26, except simplicity, which is required by definition, can be required parametrically of FHG.

10-6 parameter: head-right adjacency. A rule is HEAD-RIGHT ADJACENT IN ITS HEAD if t2  r1t2. A rule is HEAD-RIGHT ADJACENT IN ITS DEPENDENT if either t1 or t2 contains the subterm h2r2. 3This is a terminal production in the structural sense, but allows the clusters generated under a head to be more than a single terminal symbol. The reason for allowing u and w to be nonempty, i.e. allowing a lexical item to be split up, is to allow e.g. for Dutch verbs like aanbieden, whose prepositional prefix is separated

from the verb under movement, to be described (u = aan; v = bieden). The lexical entry for aanbieden as

  to offer something has w  ; the entry meaning to offer to do something will have w te, equivalent to the English infinitive marker to. See (10.18) for an illustration. 4Note that right-branching in the structure poses no restrictions to the surface order. 194 Free head grammar

*

10-7 deÆnition. An FHG is a HEAD GRAMMAR (HG) if it is linear, non-erasing, order preserving and head-right adjacent in its head. A head grammar is MODIFIED (that is it corresponds to an MHG) if it is also head-right adjacent in its complement. The following six clause types then remain for HG;5 a modified HG will only use

HC, CH and HWL.6

    

HC Al1 h1 r1l2h2r2 :- H l1 h1 r1 D l2 h2 r2 (head-complement)

    

CWR Al1 h1 l2h2r2r1 :- H l1 h1 r1 D l2 h2 r2 (compl. wrapped right)

    

HWL Al2l1 h1 r1h2r2 :- H l1 h1 r1 D l2 h2 r2 (head wrapped left)

    

HWR Al2h2l1 h1 r1r2 :- H l1 h1 r1 D l2 h2 r2 (head wrapped right)

    

CWL Al1l2h2r2 h1 r1 :- H l1 h1 r1 D l2 h2 r2 (compl. wrapped left)

     CH Al2h2r2l1 h1 r1 :- H l1 h1 r1 D l2 h2 r2 (complement-head)

10-8 principle: head feature convention. The feature specifiers are divided into two groups: HEAD FEATURES and NON-HEAD FEATURES. As a rule, all features, except CON, RULE and COMP are head features. Parametrically, a limited set of additional features may be allowed as non-head features.

10-9 principle: selection. A finite domain of CONTROL TYPES is constructed on the following principles:

1. A type has an ORDER 2. There is a fixed finite set of order 0 types, including at least D (determiner), N

(noun) and V (verb).

  3. A type of order i  1 is a tuple t1 t2 where t1 is a type of order i and t2 is a type of order less than or equal to i. 4. The order of a type is not to exceed 3.

The categories A, H and D from definition 10-3 satisfy the following: there are control

types t1 and t2 such that one of the following schemes is satisfied:

   

(10.1) H  CON t1 t2 COMP

 

A  CON t2

 

D  CON t1

 

(10.2) H  CON t COMP

 

A  CON t COMP

     D  CON t t MODIFIER

5Pollard’s original formulation speaks of twelve forms, but I have chosen to consider only those rules where the first item on the RHS is the head daughter.

6

     The identifiers on the left are abbreviations for the tuples 1t t2 ; e.g. CWR l1 l2h2r2r1 .

10.1. Basic setup for a HG of English 195

where are feature specifications, and and contain only non-head features.

* Although I will not be concerned with semantics, a very strict correspondence between semantics and surface order generalizations plays a roleˆ in the control type design of head grammar, so a brief definition of semantic interpretation is in place.7

10-10 principle: interpretation. Each control type tis associated with a SEMANTIC

  TYPE  t where

1. A type is constructed as usual from the basic types e and t, and the function

operator . Type expressions are right-associative, that is when we write



T1 T2 T3, we mean T1 T2 T3).

      2. If  t1 T1 and t2 T2, then t1 t2 T1 T2. 3. The semantic type for a basic category does not need to be a basic type.

Each lexical clause specifies an interpretation in the form of a -expression extended with boolean operators , , &c. as usual in an (extensional) PTQ setting [Jan86]. Parametrically, these expressions can be required to be extensional. For a head-complement clause, the semantic interpretation of parent node A is obtained by applying the meaning of the head H to that of the complement C.

* This concludes the core principles of head grammar as I will present it here. The list in figure 10.1 summarizes the control types I will be concerned with, and their associated semantic types. What is called N0 elsewhere in this thesis is represented here by the type D0. There is no distinction between the levels V0 and C. A raising verb, unlike in other parts of this thesis, selects for an intransitive phrase to build a transitive phrase.

Parameters for English

A simple fragment of English is now specified in a few more statements, saying which yield formation operators are to be used. In principle, complements in English follow their heads, but the heading phrase may contain some ‘right context’; e.g. the transitive phrase convince to leave should wrap in its object to the right of its head convince,but left of to leave. An exception to this default operation is the subject, which precedes the intransitive phrase that selects for it, except when a sentence shows subject-auxiliary inversion. So

7Definition 10-10 is stricter than Pollard’s interpretation principle, which distinguishes two semantics operations in addition to functional application: functional composition and equi. Pollard makes use of these different classes to capture word order generalizations, but as he points out, there are anomalous cases. These get worse in a language like Dutch, therefore I chose to encode these operations lexically. Since there will be practically no semantics in this chapter, this definition serves merely as an indication. 196 Free head grammar

control type name and semantic type C

N  N common noun

e t 0

D  D determiner phrase (noun phrase)

 e t t

N C 0



D N D determiner

   e t e t t 0

V  V saturated verbal clause t

I 0 0



V D V intransitive verb or phrase

  e t t t

T 0 I



V D V transitive verb or phrase

     e t t e t t t

A I I



V V V auxiliary or modal verb (subject to subject raising)

      e t t t e t t t

R I T



V V V raising verb (subject to object raising)

  

e t t t

     e t t e t t t

Figure 10.1: Basic control types

I conclude with the following, extremely simplified, parameter for English, and refer to [Pol84] for a more sophisticated analysis of English verb constructions. A further elaboration concerned with relative clauses is included in this chapter.

10-11 parameter: English yield formation. Subjects in English precede their heading phrase. All other complements in English follow their heads, but precede the

right context of the heading phrase.

   (10.3) N RULE CH CWR

I

     (10.4) N CON V INV N RULE CH

An example derivation tree is shown in figure 10.2; note that instead of displaying tuples, a more lean notation is used that separates the strings generated at each node

into three clusters using dots (). This is further reflected in the following: whereas for the formal derivational interpretation of LMG I would have written

0

  (10.5) V Frank saw Julia drink coffee

I prefer here to think of the HG triples as strings with possible insertion points left and

right of their head. The result is a relation more like its context-free counterpart.

0

(10.6) V Frank saw Julia drink coffee 10.2. Sentential structure of Dutch 197

10.2 Sentential structure of Dutch

The price of the psycho-linguistic motivation of head grammar, even in the free form defined here, is that it substantially restricts the effect of dividing the yield of a nonterminal into clusters with respect to what is allowed in arbitrary LMG.Soitis now left to show that what can be described using this limited capacity is sufficient. As a first step, I will now use head-driven clustering to generate the surface order of basic Dutch sentential structure. Let the examples in this thesis suffice to show that although [Pol84] contains an appendix with a small, isolated fragment of Dutch subordinate clauses, it requires much more to make this adequacy really plausible. This section will treat a bit more than the previous chapter on GB, but for the time I will disregard (i) adjunction (such as adverbial modification) and (ii) long surface distance nominal dependencies (topicalization, wh-dependencies and relative clauses); these will get attention in the last section of this chapter. To get started, let’s reformulate the minimal English fragment of the previous section, but for subordinate clauses only.

Basic raising and extraposition order

In Dutch, a verb taking an embedded verb complement is wrapped in left of the com- plement’s head. In other verbal cases, all complements precede their heads (canonical

order is SOV). Determiner phrases correspond to the English order.8

   (10.7) N RULE CH CWR HWL

8The CWR rule is used for determiners because in section 10.4, a relative clause is attached to the determiner and moved into its right context.

Frank·saw·Julia drink coffee V0

·saw·Julia drink coffeeVI D0 ·Frank·

·saw·drink coffee VT D0 ·Julia·

R I ·saw· V V ·drink·coffee

·drink· VT D0 ·coffee·

N 0 ·· D N ·coffee·

Figure 10.2: English FHG analysis. 198 Free head grammar

A R

      (10.8) CON V V STYPE SUB RAISING RULE HWL

I T

     (10.9) CON V V STYPE SUB RULE CH

N

   (10.10) CON D RULE CWR

I T A R

    (10.11) CON V V V V COMP STYPE SUB This set describes basic cross-serial verb order without the extraposition verbs from the previous chapter, in a way closely resembling that of the MHG in 3.2, except that the head is now a separate cluster. The following are examples of underlying LMG clauses licensed by parameters (10.7–10.11).

R

  (10.12) V +RAISING PERS 3RD NUM SG zag

T R I

      (10.13) V STYPE SUB l2l1 h1 r1h2r2 :- V l1 h1 r1 V l2 h2 r2 The following is an example of a derivation using these clauses.

I

 (10.14) V STYPE SUB zwemmen

T

 (10.15) V STYPE SUB helpen zwemmen by (10.8)

I

 (10.16) V STYPE SUB Julia helpen zwemmen by (10.9)

T

 (10.17) V STYPE SUB Julia zag helpen zwemmen by (10.8)

“saw help Julia swim”

Now I proceed to extraposition as discussed in 9.4. Two features, RAISING and EXTR, indicate whether a verb allows for the raising or extaposition constructions, or for both. The strict extraposition verb verbieden is an example of a “complex” lexical entry, i.e. one that does not just have a single word in the head cluster, but has, in its right context, the infinitive marker te.

R

  (10.18) V EXTR RAISING PERS 3RD NUM SG verbood te “disallowed to”

Applying the raising rule (10.13) would yield the incorrect sentence (10.19a); the correct version is (10.19b). To model this, an FHG rule EXTR, that is not one of the

six wrapping and concatenation rules, is convenient.9

 

(10.19) a. dat Frank VI Fred Julia verbood te helpen zwemmen

  b. dat Frank VI Fred verbood Julia te helpen zwemmen

“ that Frank did not allow Fred to help Julia swim”

     EXTR Al1 h1 l2r1h2r2 :- H l1 h1 r1 C l2 h2 r2 (extraposition)

A R

      (10.20) CON V V STYPE SUB EXTR RULE EXTR

9Convenient, but not necessary, for the infinitival marker te can also be modelled as a verbal category, which is done by Pollard for the analysis of English; it does seem to make less sense for west-Germanic languages; personally I find it stylistically infelicitous to assign categories to small things such as markers, especially there where surface clustering provides a more “lean” model, in which a verb like teach gets a

lexical entry in which it reads teach to. 10.2. Sentential structure of Dutch 199

However, this rule is reasonably well-behaved, since it is order preserving and head- right adjacent in its complement. A place where at first sight, order preservingness seems to be difficult to maintain, is declarative and interrogative sentential forms in Dutch. To model verb second, there are two options; the easiest, but least principled is to forget order-preservingness, and add rules that form a V0 from a VI by moving the head verb to initial or second position (this is what is done in [vN93]). An alternative option that respects order preservingness however seems more defendable on a principled basis, and is supported by the head feature convention, which claims that the STYPE feature is present on all the verbal projections anyway. The main verb is prefixed already at the rule that constructs the VI from the higher VT/A/R. Not only are the rules used order-preserving, but they happen to even fall within the standard six-pack of strict HG rules.

T A R

      (10.21) CON V V V STYPE DECL QUES RULE HC

I

   (10.22) CON V STYPE DECL RULE CH

I

   (10.23) CON V STYPE QUES RULE CWR The following simple examples are sufficient to check how these rules work.

T

  (10.24) V PERS 3RD NUM SG dronk “drank”

I

 (10.25) V STYPE SUB koffie dronk

0

 (10.26) V STYPE SUB ( dat) Frank koffie dronk

I

 (10.27) V STYPE DECL/QUES dronk koffie

0

 (10.28) V STYPE QUES dronk Frank koffie (?)

0

 (10.29) V STYPE DECL Frank dronk koffie

This concludes a fairly indisputable basic analysis. Now let’s look at how the analysis may be extended to a further level of sophistication. This is no longer straightforward.

Extended coverage and problems

The phenomena of partial extraposition and inverted verb cluster ordering often pose problems to a syntactical analysis because they force “flexible” application of the principles. The same happens in the current FHG analysis. Recall the examples of partial extraposition from the previous chapter:

(10.30) a. dat Frank probeerde Julia koffietegeven

b. dat Frank Julia probeerde koffietegeven

c. dat Frank Julia koffie probeerde te geven

“that Frank tried to give Julia coffee” 200 Free head grammar

Although (10.30b) already raises some question marks for some Dutch speakers, it is generally accepted as correct. The problem is that proberen (te) seems logically to be of type VA that takes an intransitive phrase as its complement, to yield a new

intransitive phrase. But such a complement VI would be split up for example as Julia koffie laten drinken; so there is no way to let the verb probeert end up between Julia and koffie. The only solution that seems to work is to give

T T

 proberen a second control type V V . But there may be Dutch speakers who approve of higher “types” for proberen: in sentence (10.31), hun voer te leren uitdelen would already be ditransitive.

(10.31) ??dat Frank Julia de dieren probeerde hun voer te leren uitdelen

“that Frank tried to teach Julia to give the animals their food”

As to inversion of the order in verbal clusters, this seems to be unproblematic in the absence of ADVERBIAL MODIFICATION, which is postponed to section 10.4. For verbs that allow inversion, one can simply replace the rule HWL that wraps in

the head verb to the left of the embedded verb, to HWR that wraps it in to the right; cf.

(10.32) a. Julia koffie wilde zien drinken

b Julia koffie drinken wilde zien c. Julia koffie drinken zien wilde “wanted to see Mary drink coffee”

In the presence of modification, these phenomena lead to problems that are not straight- forward to solve; in the following sentences, the modifier between square brackets must intervene a string that is fully contained in the left cluster of the modified constituent, and can hence according to the principles not be interrupted. Sentence (10.33b.) pro- vides evidence that a type raising like proposed for partial extraposition verbs will not solve the issue here either. A restriction to what types of verb can be sequenced “in the wrong order” is infeasible too, cf. (10.33c.). A possible solution, but also with its

problems, may be to allow non-lexical strings to be produced in the head cluster.

(10.33) a. dat Patijn de burgers [voor 16 februari] opgeroepen wilde hebben te stemmen

“that Patijn wanted to have [called the citizens to vote] before the 16th of

February”

b. dat Patijn het comite´ de burgers [niet] verleiden liet tegen te stemmen

“that Patijn did not let the committee seduce the citizens into voting against”

c. dat Frank Fred Julia [niet] zwemmen helpen zag

“that Frank did not see Fred help Mary swim” 10.3. Sentential structure of Latin 201

10.3 Sentential structure of Latin

LATIN is known as a language with a highly free word order; in GB theory, one often talks of a NON-CONFIGURATIONAL language, because attempts to cast a language like Latin into an X framework seem to fail, and one then reverts to a theory that does not posit an X schema. But with strict surface order, one then also throws out GB’s notion of hierarchy. Other arguments for the freeness of Latin word order proceed along similar lines—incapacity of currently known descriptive mechanisms. Nonetheless, even a language like Latin has a number of indisputable, very global constraints on word order, such as the property that words of the main clause cannot be inserted between those of a subordinate clause; furthermore, word order has been shown to depend strongly on pragmatic factors. In a completely non-configurational account of Latin, it is difficult to see how such constraints and factors should be expressed. Finally, a non-configurational analysis does not seem to lend itself well for automatic processing. A substantial overview, and arguments that Latin is not a free word order language, can be found in PINKSTER [Pin90], chapter 9. This section will show that in the free form of head grammar of this chapter, large amounts of Latin surface order can be described while maintaining a deep structure that is equivalent to that of languages like English and Dutch. Because FHG extends the projective capacities of an X schema only very mildly, this seems to indicate that there is an unexpectedly large benefit in relaxing the constraint of projectivity. I will look, very globally, at what relationships occur if one assigns headed strings to the nodes in a Latin dependency structure. This has the following implications:

1. For a node in the dependency structure, this means that the words in the de- pendency cluster it dominates must together consist of a head and two contin- uous substrings of the input. Since the dependency structure of Latin is fairly indisputable,10 this property can be empirically checked.

2. In an analysis such as in FHG, where a head takes its dependents consecu- tively like in X syntax, it must be studied in which orders a head can take its complements and optional dependents.

To facilitate this analysis, I will take a step back and leave out the formal designations in terms of features, and merely look at how dependency and surface cluster triples generated by head operations interact.

Tendency versus rule

In contrast to modern West-Germanic languages, Latin has very few strict rules, but there are some. One of these is the restriction on subordinate clauses mentioned above; this seems to be a strictly context-free constraint. Another is that in a prepositional

10Perhaps with the exception of the NP/DP distinction. 202 Free head grammar phrase, when spelled out in a sentence, at least one element of the PP must immediately follow the preposition. Here are some data.

(10.34) a. magna cum laude b. cum magna laude

c. magna laude cum

d. has miseras, , ad inferras e. ??per multum arenosos rivos

The most general constraint, allowing the poetic example (10.34d), could be stated as

(10.35) The rule combining a preposition with its complement noun phrase, either puts the head noun in the the right context, or insists otherwise that the right context is nonempty. Every rule attaching a PP to a larger phrase is head-right adjacent in its complement (the PP). This is the first illustration of an important difference between Latin and modern languages: the examples in which the head noun appears before the preposition, are judged more or less ‘difficult’ according to the status of the word immediately following the preposition. Whereas modern languages have evolved to strictly forbid difficult cases, Latin is of a more lenient nature. An underlying structural mechanism would preferably be one that generates all possible structures, including very poetic ones, but allows for this level of difficulty to be expressed formally in terms of which head-driven operations have been applied. Other similar examples, discussed in more detail in [Pin90], where Latin has a strong tendency for a certain order, while that choice is compulsory in modern languages, are (i) the position of causal subordinate clauses (tendency to follow the verb) versus causal adverbial modifiers (free); (ii) adverbs have a tendency to appear immediately next to the verb they modify (iii) some adjectives whose semantic contents is determiner-like have different word order properties. This concludes the concrete discussion of these tendencies themselves; henceforth I will be interested solely in the most general possible word orders; these can obviously not be specified in the way they were for English and Dutch: giving a single rule label for certain classes of words. Rather, more general properties of these rules, such as order preservingness and head-right adjacency need to be varied over different types of construction.11 In this way, one gets a picture of the properties of one of the most free forms of word order, and thus one gets close to investigating universal minimal principles of human surface generation.

A brief empirical analysis

As before, the analysis I give here will concentrate on verbal complementation. I will give a number of examples, (10.36) taken from [Mel88] and the others from

11It is not unthinkable that a deeper analysis than given here will conclude that this can, due to the graduality of good and bad word order, only be done in a stochastic sense. 10.3. Sentential structure of Latin 203

[Pin90], which are all used as examples “of the worst12 kind”, and then look at how the full verb phrases in the sentences can be obtained by successively adding the SATELLITE PHRASES, that is either complements or non-obligatory dependents, using the operations allowed in an FHG clause. It will turn out that a degree of freedom is required in the order in which satellites are bound to the head phrase.

(10.36) silvestrem tenui musam meditaris avena

rural thin music play reed−pipe F−ACC−SG F−ABL−SG F−ACC−SG 2−SG F−ABL−SG ‘‘...(Tityrus, you...) are playing rural tunes on a thin reed−pipe’’

(Vergil, ??) [Mel88]

Sentence (10.36) can be constructed either by first wrapping tenui avena around meditaris, and then wrapping in silvestrem musam . These opera- tions would be order preserving. If however preference is given to taking obliga-

tory complements first (assuming that meditaris selects for an object), a non-order-

preserving operation is needed: first, silvestrem musam is wrapped around

meditaris, to form silvestrem meditaris musam, and then while wrapping in tenui avena , the right context musam is moved leftward over meditaris.

(10.37) valui poenam fortis in ipse meam

‘be’ punishment strong to self my 1−SG F−ACC−SG NOM−SG M−NOM−SG F−ACC−SG

‘‘I have been brave, to my own disadvantage’’ (Ov. Am. 1.7.26) Sentence (10.37) shows the opposite effect. If the PP poenam in meam is wrapped in first, it is merged into two disjoint clusters, while it is broken up into three clusters

in the sentence, i.e., there is no successful derivation. If ipse and fortis are wrapped in first to form ipse valui fortis, there is a successful derivation, but one that is not order preserving.

(10.38) grandia per multos tenuantur flumina rivos

great through many be reduced rivers brooks N−NOM−PL +ACC M−ACC−PL N−NOM−PL M−ACC−PL ‘‘Great streams are chanelled into many brooks’’ (Ov. Rem. 445) 12It should be born in mind, however, that it is also Pinkster’s objective to show that Latin word order is not free. 204 Free head grammar

Example (10.38) is head-constructible only if multos and rivos in the PP are split; i.e.

the PP is derived as multos per rivos or rivos per multos. The rule binding the PP to the verb phrase will insist on putting either left or right context of the preposition immediately right of the preposition. This sentence is another example of a construction that is inherently non-order-preserving.

(10.39) quem tibi candidi primo restituent vere Favonii ... beatum

who you dazzling first return spring Zephirs happy M−ACC−SG DAT−SG N−ABL−SG N−ABL−SG M−NOM−PL M−ACC−SG ‘‘whom the dazzling west winds will give back to you in a happy state...... at the beginning of the spring’’ (Hor. C. 3.7.1−3)

The very symmetrical, multiply embedding sentence (10.37) gives another hint: the components of the subject phrase candidi Favonii appears in the middle of the left and right context of the head verb restituent. This implies that while the dependents can be attached in almost any order, the subject can not be attached last, as is conventionally done in analyses of modern configurational languages.

Conclusions

The type of analysis given here, i.e. a look at (the most excessive cases of) what is possible in word order formation, is hardly ever found in the literature. The majority of other studies of word order in Latin seems to investigate what the prosodial, rhetorical, &c, effect of order variation is, without having a notion of how to talk about word or phrase order, if phrases can be highly discontinuous.13 One of the aims of the study in this chapter is to provide a concise description of the general situation in such a way that such motivations for certain word order preferences can be formulated at a higher level of formal adequacy. The analysis I gave is insubstantial, but it does seem to indicate that free head-driven surface generation is capable of generating a large amount of Latin word orders, and that there is a trade-off between restricting the order in which complements and optional dependents are bound to a head, and the strictness of the principles specifying which yield formation operations are allowed. It is on purpose that I have looked at a small selection of very free word orders—more prosaic Latin sentences will fit into the FHG operations much more easily, and can lead one to pursue preliminary conclusions that obstruct an investigation of the more difficult cases. Of course, investigating simpler sentences based on the minimal framework thus obtained is nonetheless clearly the next step to take. A preliminary conclusion is that there is a preference for attaching adverbial modifiers after the complements, including the subject. This is supported both by the

13A similar remark is made in [Pin90]. 10.3. Sentential structure of Latin 205 examples and the idea that adverbial phrases (and likewise for optional dependents such as the ablative phrases in the examples) have a tendency to appear close to the verb they modify. The only exception is example (10.36) in which final attachment of the ablative phrase tenui avena is in conflict with order preservingness. This is not surprising, since the occurrence of tenui so early in the phrase seems to mark it with an exceptional amount of stress. Another idea that needs further investigation is whether head-right adjacency w.r.t. the complement, i.e. the property of complement daughters to be split up only into two strings, is perhaps a property of higher weight than order preservingness. Other analyses of Latin [Ros67] complement order have been made in terms of the scrambling process that has been argued to apply to the German verb phrase. The remarks in example 8-11 (page 171) hold equally well for Latin: it is worthwhile to investigate the possibility that the freedom of word order is restricted in such a way that mildly configurational formalisms such as loosely interpreted XG and tuple grammars can produce all the allowed orders (while giving strongly adequate analyses). This does seem to be supported by the fact that tuple grammars can not describe unrestricted scrambling, but I have not found any examples of Latin constructions that can’t be dealt with using the 3-tuple based operations of FHG! 206 Free head grammar

10.4 Localization of relative clause dependencies

The Latin sentence (10.40) that was already mentioned in the introduction introduces no difficult bindings. The phrase quas Oriens habuit puellis is wrapped around praelata, the resulting phrase right-concatenated to altera.

(10.40) ...altera quas Oriens habuit praelata puellis

the other        that       the East  had     preferred   girls
F-NOM-SG         F-ACC-PL                     F-NOM-SG    F-ABL-PL
"...the other a frequent winner of the Miss Middle-East contest" (Ov. Met. IV 56)
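The two yield formation steps just mentioned are easy to make concrete. The following Python sketch (an illustration only, not part of the formal development; the function names are mine) shows the wrapping and right-concatenation over token lists:

    # Illustration of the derivation of (10.40): a discontinuous phrase
    # (left, right) is wrapped around an inner phrase, and the result is
    # right-concatenated to another phrase.

    def wrap(outer, inner):
        left, right = outer
        return left + inner + right

    def concat(x, y):
        return x + y

    # quas Oriens habuit ... puellis wraps around praelata,
    # and the result is right-concatenated to altera:
    rel = (["quas", "Oriens", "habuit"], ["puellis"])
    np = concat(["altera"], wrap(rel, ["praelata"]))
    print(" ".join(np))   # altera quas Oriens habuit praelata puellis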

It does, as said in the introduction, introduce a so-called DOUBLE DEPENDENCY: the word quas depends on both puellis and habuit. This observation, together with the strong generative capacity construction for MHG in section 8.2, naturally leads to the idea of implementing a tree structure in HG that allows for multiple heads. While Latin provides good examples because of its rich morphology, I will concentrate here on English examples, because the noun/determiner distinction in Latin is somewhat cumbersome. The English analysis will not have multi-headedness, but a more general form of MULTIPLE DOMINANCE; the determiner in the matrix clause will be the head of a D0, and the complement of a relative pronoun, at the same time. Consider the sentence

(10.41) Frank saw the man (that/whom) Julia loved

The accusative morphology of whom shows that the object dependency is present here too, and since case is a head feature, I propose an analysis in which the relative that/whom is the head of the object in the embedded clause. For this purpose, I introduce a new control type DD for relative pronouns (without a semantic type for the time being). The proposed analysis of sentence (10.41) is shown in figure 10.3a. As usual, vertical lines denote head relations; while drawing the trees in this section it was unavoidable to put some complement arrows left of the head. The grammatical principles put forward in section 10.1 need to be changed to allow for these types of analysis; I will not go into formal details here, since the proposal is of a highly tentative nature; rather, I will attempt to sketch why it is feasible, and what its benefits could be. Of central importance is the following relaxed notion of how dependency relations correspond to derivation trees.

10-12 principle: revised tree principle. The underlying structure of a sentence is always a tree; however, the tree need not be binarily branching: a node may be a dependent of a daughter node, or the head of its parent.

Figure 10.3: Multi-dominance relative clause analysis (a) and the underlying LMG structure (b).

*

The underlying tree structure for sentence (10.41) is shown in part b of figure 10.3. There is an analogy between these intuitive and underlying structures and the derived and derivation structures of a tree adjoining grammar—see the Epilogue, section E.2.

In such a version of FHG, where trees can be "turned upside down", an additional feature PARENT is needed which selects the parent node. In the standard case, one has [PARENT matrix]; the two additional cases are [PARENT head] and [PARENT complement]. The yield formation principle 10-3 is rephrased to differentiate between the different values of the PARENT feature; it remains the same for [PARENT matrix]; nonterminals that have [PARENT head] or [PARENT complement] have 4 surface clusters as opposed to 3, because they act as the context to their parent node, and thus the parent node needs to be wrapped into such a node, instead of the other way around.

10-13 principle: revised yield formation. Let order preservingness (10-4) and head-right adjacency in complement be given. Let A have [PARENT p] and [RULE (t1, t2)]. Then:

- If p = matrix, then the underlying grammar contains, as in 10-3, the clause

      A(t1 h1, t2) :- H(l1, h1, r1), C(l2, h2, r2)

- If p = head and t1 = s1 l1 s2, t2 = s3 r1 s4, then

      A(p1 s1, s2 p2, p3 s3, s4 p4) :- M(p1, p2, p3, p4), C(l2, h2, r2)

  where A is the head of M; or, in case A is not a head or dependent, but a TERMINAL ROOT:

      A(s1, s2, s3, s4) :- C(l2, h2, r2)

- If p = complement, the rule applied is undefined (it depends on parameters such as head-right adjacency).

The situation [PARENT complement] is highly complex, and intentionally left open; in the case of a relative clause, features will mark the special nature of the pronoun that, which causes the complement (noun) to be wrapped in between the determiner (the) and the relative clause, with the complement relative pronoun that on its left.
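Operationally, the [PARENT head] clause interleaves the four clusters of the dependent with those of its parent. The following Python fragment (my own hypothetical rendering of the clause above, not code from a real implementation) shows the wrapping step over tuples of token lists:

    # The PARENT = head clause of 10-13: a 4-cluster dependent
    # A(s1, s2, s3, s4) supplies the context for its parent
    # M(p1, p2, p3, p4), yielding A(p1 s1, s2 p2, p3 s3, s4 p4).

    def wrap_into_parent(child, parent):
        s1, s2, s3, s4 = child
        p1, p2, p3, p4 = parent
        return (p1 + s1, s2 + p2, p3 + s3, s4 + p4)

    child = (["s1"], ["s2"], ["s3"], ["s4"])
    parent = (["p1"], ["p2"], ["p3"], ["p4"])
    print(wrap_into_parent(child, parent))
    # (['p1', 's1'], ['s2', 'p2'], ['p3', 's3'], ['s4', 'p4'])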

The semantic interpretation principle is modified accordingly; the interpretation of a phrase with [PARENT head], for example, is a lambda expression that abstracts over an expression of the type of the missing head.

*

Thus, relative clause dependencies can be localized, banning completely the notion of structural long-distance dependency. A similar mechanism can be used for adjunction: an adjunct is a second head to the phrase it is adjoined to; this situation is sketched in figure 10.4. This would eliminate optional dependents, simplifying the feature system and eliminating alternative (10.2) of the selection principle. There is, however, a difference between the cases of relative clauses and adjunction. A relative clause analysis introduces a situation in which a node is dominated by two nodes, whereas adjunction can occur unboundedly many times. This makes it hard to guarantee that the underlying LMG remains finite. I expect that such a model of adjunction can overcome some of the problems in describing the difficult Dutch sentences at the end of section 10.2.

Figure 10.4: Adjunction.

Scope readings

The tree inversion paradigm is good for more than modelling adjunction and relative clause attachment. Figure 10.5 shows two possible LMG backbones for the FHG analysis of sentence (10.42), which native speakers of English judge to have two scope readings: one in which a single doctor interviews every patient, and one in which there may be several doctors, but every patient is interviewed by one doctor.14

(10.42) A doctor will interview every patient

The different scope readings are obtained from the two different structural analyses by use of the following rule: the semantic interpretation of a node is obtained bottom-up in the underlying structure, and at each individual node, a normal form is computed w.r.t. the familiar β- and η-reductions and the following reduction rule:

    (Lift)  ∃x (P (λy. φ))  ⇒  P (λy. ∃x φ)    when P is free

I.e., in an existential quantification over a term containing a predicate applied to a lambda expression, the predicate and the lambda expression are lifted over the existential quantifier.

14 For ease of presentation, will interview has been contracted to a single VT in the figure; modelling auxiliaries properly is not a relevant issue here and requires event semantics.

The interpretation for the D0 is derived in the traditional extensional fashion as follows:

(10.43)  σ(N0 patient) = λx. patient(x)

(10.44)  σ(DN every) = λN1 λN2. ∀x (N1(x) → N2(x))

(10.45)  σ(D0 every patient) = λN2. ∀x (patient(x) → N2(x))

(10.46)  σ(N0 doctor) = λx. doctor(x)

(10.47)  σ(DN a) = λN1 λN2. ∃x (N1(x) ∧ N2(x))

(10.48)  σ(D0 a doctor) = λN2. ∃x (doctor(x) ∧ N2(x))

The transitive verb has the high type announced in figure 10.1:

(10.49)  σ(VT will interview) = λD2 λD1. D1(λx. D2(λy. interview(x, y)))

The derivation for the uninverted structure is unsurprising:

(10.50)  σ(VI will interview every patient) = λD. D(λx. ∀y (patient(y) → interview(x, y)))

(10.51)  σ(V0 a doctor will interview every patient) = ∃x (doctor(x) ∧ ∀y (patient(y) → interview(x, y)))

but in the inverted structure, the Lift rule is triggered:

(10.52)  σ(V0 [PARENT head] a doctor) = λD. D(λN. ∃x (doctor(x) ∧ N(x)))

(10.53)  σ(VI [PARENT complement] a doctor will interview)
         = λD. ∃x (doctor(x) ∧ D(λy. interview(x, y)))
         ⇒ (Lift)  λD. D(λy. ∃x (doctor(x) ∧ interview(x, y)))

(10.54)  σ(D0 [PARENT none] a doctor will interview every patient) = ∀y (patient(y) → ∃x (doctor(x) ∧ interview(x, y)))
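That the two structural analyses really yield different truth conditions can be checked in a small finite model. The following Python sketch (an illustration of (10.51) versus (10.54), not part of the FHG machinery; the model is my own) uses two doctors who each interview one of two patients, so that the readings come apart:

    doctors = {"d1", "d2"}
    patients = {"p1", "p2"}
    interviews = {("d1", "p1"), ("d2", "p2")}
    interview = lambda x, y: (x, y) in interviews

    # (10.48) a doctor: λN.∃x(doctor(x) ∧ N(x))
    a_doctor = lambda N: any(N(x) for x in doctors)
    # (10.45) every patient: λN.∀y(patient(y) → N(y))
    every_patient = lambda N: all(N(y) for y in patients)

    # Uninverted structure (10.51): the existential takes wide scope.
    reading1 = a_doctor(lambda x: every_patient(lambda y: interview(x, y)))
    # Inverted structure after Lift (10.54): the universal takes wide scope.
    reading2 = every_patient(lambda y: a_doctor(lambda x: interview(x, y)))

    print(reading1, reading2)   # False True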

Figure 10.5: Scope readings.

Conclusions to chapter 10

This chapter discussed three major issues: (i) it showed that a very restricted form of tuple grammar, in which the cluster divisions are well-motivated, turns out to have a very reasonable capacity for describing complex word order phenomena, including those in languages that are often considered to have a totally free word order; (ii) this method allows for a very general statement of word order phenomena such as order preservingness and head-right adjacency, which is especially useful in describing almost-free word order phenomena; and (iii) it proposed a very tentative method to transform traditional structures in such a way that the main dependencies in relative clause attachment are localized. The last item should not be considered as inherent to FHG; in the Epilogue, the use of alternative methods to achieve this complete localization is discussed. Several of the ideas brought together here have appeared in the work of other authors: order preservingness is a key notion in the sequence union based grammars of REAPE [Rea90]; the free form of head grammar splitting up constituents into three parts also appears in [vN93]. I did not find the intermediate characterization of head-right adjacency, which yields levels between the ultimately free form and the stricter form originally proposed by Pollard, in the literature.

In the spirit of the polynomial bound calculation done in section 5.2, something can be said about the expected complexity of LMG grammars derived from sets of FHG principles and parameters.

Disregarding the tree inversion of section 10.4, there are 3 clusters and 6 variables, resulting in a first estimate of O(n^9). However, two of these variables generate a head that is always terminal, so if the implementation used indeed recognizes these as a single symbol, then two indices are dependent, so the space complexity would be O(n^5) and the time complexity O(n^7). In the presence of inverted dominance, this will increase to at least O(n^10). But of course these are theoretical estimates, and a two-stage analysis based on a backbone CFG may cut down the average case by a polynomial factor.

This chapter has glossed over the fact that, although its last section claims to achieve 'complete localization' in the form of a universally bounded dependency domain, there are still some phenomena for which no satisfactory solution has been proposed.

An important example is parasitic gapping. Another frequently cited example, of the type "Sandy_i Felix_j was hard for Kim to give _j to _i" mentioned in the previous chapter, is a typical case that is treated in HG-style frameworks at a semantic level: it is said that there is no syntactic dependency between give and Felix, but rather that hard to give to Sandy is an adjectival phrase.

Appendix

Reference diagrams

[The reference diagrams give two overview charts and a table of abbreviations: the charts map the formalisms treated (XG in its standard, loose and island interpretations; the tuple grammars MHG, MCFG, PMCFG and LMG; FHG; Lambek CG and the tuple variant LT; TAG and MC-TAG) against the perspectives of descriptive practice, tractability and principles, with pointers to the relevant chapters; the table lists the abbreviations used in this book with the pages on which they are introduced.]

Epilogue

Parallel and hybrid developments

The central subject of this thesis can be summarized as the investigation of grammar formalisms that slightly relax the principle of projectivity that is inherent to many approaches in linguistics. The three parts of this book have in particular shed light on one such family of formalisms, tuple grammars, contained in the generic family of literal movement grammar, from several perspectives inspired by the observations A-J made in the Prologue. The underlying motivations can be summarized as putting stress on two major issues that are important in the design of concrete language engineering systems: computational tractability and explanatory quality. In the manner and order of presentation chosen, a number of connections, both between the different perspectives and between the investigated approaches and other frameworks in the literature, were not given attention. The figure on page 214 is a summary of what has been treated where; the fields marked E have been left open until now and will be discussed in this Epilogue. To fill up the gaps in this landscape, I will first discuss briefly, in sections E.1 and E.2, what this thesis might have looked like if it had concentrated on CATEGORIAL GRAMMAR (CG) or TREE ADJOINING GRAMMAR (TAG), instead of taking the traditional context-free phrase structure as a point of departure. Both TAG and CG are approaches that feature an emphasis on LEXICALIZATION that this thesis has not devoted much attention to, but which, as I intend to show, has nonetheless been implicitly present. The final section E.3 is a general discussion summarizing the perspectives of part II, part III and this Epilogue, then taking them together to draw a number of conclusions about the design of language engineering systems and highlight issues that need further investigation.


E.1 Categorial grammar

CATEGORIAL GRAMMAR (CG) is the favourite framework of many logicians in attempts to "find natural language syntax a home" in the domains of mathematical logic. Classical CG, introduced by BAR-HILLEL and AJDUKIEWICZ [BH53], is strongly equivalent to CFG; the well-known extension in the form of a calculus L by LAMBEK [Lam58] still generates the context-free languages, but has a form of structural completeness that allows the calculus to classify partial constituents. This correlation with CFG and the additional benefits of Lambek's system make it interesting to investigate CG as an alternative basis for the developments investigated in this thesis. Tuple-driven analysis in CG appears in the literature only in a very limited way (cf. extraction and the W marker in [Mor94], p. 106ff). A less logically oriented, more rule-based version of CG called COMBINATORY categorial grammar (CCG), mentioned elsewhere in this thesis for its computational properties and connections to TAG and HG, is beyond the scope of this Epilogue. The discussion here is divided into the same three parts as the chapters of this thesis.

Formalisms

A standard CATEGORIAL GRAMMAR (CG) consists of a number of BASIC TYPES, a DISTINGUISHED TYPE s, which takes the function of a start symbol, and a LEXICON.

COMPLEX TYPES are formed inductively from the basic types: a basic type a is a type; if x and y are types, then so are x\y and y/x. The lexicon now associates with each element of a terminal alphabet T one or more types. From the types assigned to terminal symbols, types are now assigned to strings of terminal symbols as follows:

1. If a terminal symbol a is associated in the lexicon with type x, then a : x;

2. If u : x and v : x\y, then uv : y;

3. If u : y/x and v : x, then uv : y (where u and v are terminal strings).
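These two elimination rules already give a complete recognition procedure (anticipating the next sentence: a string w is derived iff w : s). The following Python sketch (my own, purely illustrative) runs them CKY-style over all binary bracketings, with the small lexicon of figure E.1:

    from itertools import product

    # Types are basic (a string such as 'n', 's') or tuples:
    # ('\\', x, y) stands for x\y, and ('/', y, x) stands for y/x.

    def combine(t1, t2):
        out = set()
        if isinstance(t2, tuple) and t2[0] == '\\' and t2[1] == t1:
            out.add(t2[2])        # rule 2: u : x, v : x\y  =>  uv : y
        if isinstance(t1, tuple) and t1[0] == '/' and t1[2] == t2:
            out.add(t1[1])        # rule 3: u : y/x, v : x  =>  uv : y
        return out

    def recognizes(words, lexicon, goal='s'):
        n = len(words)
        table = {}
        for i, w in enumerate(words):
            table[i, i + 1] = set(lexicon[w])
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                k = i + width
                table[i, k] = set()
                for j in range(i + 1, k):
                    for t1, t2 in product(table[i, j], table[j, k]):
                        table[i, k] |= combine(t1, t2)
        return goal in table[0, n]

    IV = ('\\', 'n', 's')         # n\s
    TV = ('/', IV, 'n')           # (n\s)/n
    lexicon = {'Frank': {'n'}, 'Julia': {'n'}, 'Fred': {'n'},
               'slept': {IV}, 'admires': {TV}, 'hates': {TV}}

    print(recognizes('Frank admires Julia'.split(), lexicon))   # True
    print(recognizes('Frank slept Julia'.split(), lexicon))     # False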

The grammar now derives a string w if w : s can be derived. An example CG is displayed in figure E.1. It is straightforward to see that each standard CG has a strongly equivalent CFG—since the lexicon is finite, only finitely many types play a rôle, and a context-free grammar can be constructed that has these types as its nonterminal symbols. The mechanism of type formation in fact closely resembles the way the control types work in the FHG system in chapter 10. In this formulation, CG is just a linguistically well-motivated way of organizing a context-free grammar—in particular, CG has the same clean interface between syntax and λ-semantics as FHG, where heads are functions and complements are arguments. Since Bar-Hillel's publication, CG has been extended with various rules allowing types to derive new types in non-standard fashions, such as the following lifting rule:

distinguished type: s

Frank : n        Julia : n        Fred : n
slept : n\s
admires : (n\s)/n
hates : (n\s)/n

             admires : (n\s)/n    Julia : n
Frank : n    admires Julia : n\s
             Frank admires Julia : s

Figure E.1: Small categorial lexicon and a derivation.

(10)  x\y  ⇒  (z\x)\(z\y)

and following that, interest in CG gradually diminished, until a while later LAMBEK introduced the calculus L, which adds introduction rules for the slash operators, in addition to Bar-Hillel's two rules that only eliminate the slashes. In Lambek's calculus, all the proposed extended rules can be derived.

Lambek’s system is as a standard CG, but for the grammar to recognize a string

w  a1 an,aSEQUENT Γ s must be proved in the calculus L in figure E.2, where Γ  t1 tn is a sequence of type such that type ti is associated with terminal ai. Unlike classical CG, Lambek system is not strongly equivalent to CFG, because it turns out that due to the introduction rules, Bar-Hillel and Lambek categorial grammars with the same lexicon do not always generate the same string set. It was proved long after Lambek’s 1958 paper in [Pen] that Lambek’s strongly more powerful calculus nonetheless generates precisely the context-free languages.

axioms:  x ⇒ x

(\L)  from  Γ ⇒ x  and  ∆1, y, ∆2 ⇒ z,  infer  ∆1, Γ, x\y, ∆2 ⇒ z
(\R)  from  x, Γ ⇒ y,  infer  Γ ⇒ x\y
(/L)  from  Γ ⇒ x  and  ∆1, y, ∆2 ⇒ z,  infer  ∆1, y/x, Γ, ∆2 ⇒ z
(/R)  from  Γ, x ⇒ y,  infer  Γ ⇒ y/x

Figure E.2: Lambek's calculus L.

Lambek derivations take some practice to read, because the sequent notation introduces a notion of hypothetical reasoning—the following corresponds to the classical CG derivation in figure E.1.

(11)  from the axioms  n ⇒ n  and  s ⇒ s,  (\L) gives  n, n\s ⇒ s;  together with  n ⇒ n,  (/L) then gives  n, (n\s)/n, n ⇒ s

An introduction rule can now be used to derive a new type for a partial constituent consisting of a subject and a transitive verb, that forms a sentence when combined with an object to its right:

(12)  from  n, (n\s)/n, n ⇒ s,  (/R) gives  n, (n\s)/n ⇒ s/n

If the word and is now given the following lexical entries,

 

(13)  and : (s\s)/s
      and : ((n\s)\(n\s))/(n\s)
      and : ((s/n)\(s/n))/(s/n)

the Lambek calculus will recognize, apart from conjoined sentences and intransitive verb phrases, the partial conjunction (14), which is missed by Bar-Hillel's formulation, using 3 applications of the Cut rule in (15), which can be added to the calculus without increasing the set of derived sequents.

(14) Frank admires and Fred hates Julia

(15)  from  Γ ⇒ x  and  ∆1, x, ∆2 ⇒ y,  infer (Cut)  ∆1, Γ, ∆2 ⇒ y

This ability to deal with non-constituent co-ordination is one of the more important features of Lambek grammar. Various approaches to phenomena such as the Dutch crossed dependency structure have also been proposed, but often suffer from overgeneration—many stronger systems turn out to describe only the regular languages—and detailed linguistic analysis is either disregarded or leads to systems that get excessively complex and lose their elegant logical formulation [Moo88] [Mor94] [BvN95]. One approach, that of multi-modal categorial grammar, has been suggested to me as providing analogues of the tuple paradigm. The relationship is not straightforward, because multi-modal CGs tend to identify scopes within a traditional left-to-right tree analysis in which the binding of some elements may be postponed, hence achieving long-distance dependencies. But this is exactly what tuple grammars were designed not to do. Nonetheless, an in-depth analysis of the relationship between multi-modal CG and REAPE's WORD ORDER DOMAINS, whose connection to the tuple grammars of this thesis suggests itself more clearly but is not entirely trivial, can be found in [Ver96].


axioms:  α : x ⇒ α : x

structural rules: cut, contraction, permutation, weakening

(f L)  from  Γ ⇒ α : x  and  ∆, f(φ, α) : y ⇒ γ : z,  infer  ∆, φ : x →f y, Γ ⇒ γ : z

(g R)  from  Γ, α : x ⇒ g(φ, α) : y,  infer  Γ ⇒ φ : x →g y

where α, γ, φ are tuples over terminal strings and f, g ∈ F.

Figure E.3: Lambek-tuple calculus LT.

The close resemblance between the control types of an FHG (chapter 10) and the types assigned to lexical entries in a categorial grammar makes it worthwhile to investigate whether CG and the tuple paradigm can be combined in a systematic fashion. A simple tuple-based operation appears in [Mor94], but it is given in the form of an extra operator in addition to \ and / and needs explicit extra rules. The resulting systems, which often introduce a wealth of such extra operators, lose the elegant compactness of the original L calculus. One key to a solution is, as is already done in [Mor94], to abandon the implicit rule of associative concatenation in the sequents of the calculus, and move to a system of labelled deduction; but an essential further step is to let all operators collapse into a single functional arrow operator annotated with a yield function that operates on tuples of strings.

Let the class F be a set of yield functions, and let a set of basic types and a distinguished type be given. Then complex types are either basic types or of the form x →f y, where f ∈ F and x and y are types. Let the lexicon assign sets of tuples of strings to a finite subset of the types, and let the rules of inference be as in the calculus LT in figure E.3. A prudent choice for F may be the linear, nonerasing subset of the FHG operators, resulting in a form of "free Lambek head grammar". If F consists only of the left and right concatenation operators over 1-tuples, and the lexicon assigns types only to single terminal symbols, the resulting grammar system corresponds to the Lambek calculus. If the concatenation operators are given the same labels as in chapter 10, then as an advantage of the LT notation, the confusion1 about the directionality of the categorial backward slash between different authors is eliminated:

the forward slash becomes →HC, and the backward slash →CH, and in both cases the argument appears left of the arrow, the result to its right. The type expressions can be declared right-associative by the well-known convention for writing down functional type expressions (cf. chapter 10). The following derivation step is the LT equivalent of (12), but in an analysis that models the Dutch sentential order as in the FHG examples of chapter 10:

1 Some authors write y\x where I would write x\y—that is, there are different opinions as to whether the numerator should always be 'above' or always be 'left' of a slash denoting categorial division.

(16)  from  Frank : n,  bewondert : n →CH (n →CH s),  Julia : n  ⇒  Frank Julia bewondert : s,
      (R) gives  Frank : n,  bewondert : n →CH (n →CH s)  ⇒  Frank bewondert : n →CWL s

The LT grammar system can be used to give a model of non-projective surface order phenomena in the spirit of the solutions proposed in this thesis, with the additional benefit of an account of non-constituent co-ordination. Of course, connectives like and still need to be explicitly assigned polymorphic types, since they cannot be derived within the calculus [Moo88]. This is more difficult in a tuple-based system as finding the entries for multi-cluster constituents may not be straightforward. I did not find the time to investigate the details of a system that allows f and g in the LT rules to be arbitrary relations rather than functions, which would lead to systems weakly equivalent to brands of LMG that are closed under intersection.

Computational tractability

While Lambek grammars weakly generate the context-free languages, a transformation of a Lambek grammar to a CFG results in exponential blow-up. The complexity of universal recognition based directly on the L calculus is unknown—up to now, nondeterministic polynomial algorithms are the best known. These results will carry over straightforwardly to the LT system, with the exception perhaps of the weak equivalence proof between sensibly chosen classes of tuple grammars and the corresponding classes of LT systems. Problematic may also be the fact that in the current formulation, the lexicon can assign the empty string to types—but this is easily revised. An important practical problem of Lambek CG is spurious ambiguity: while ordinary CG assigns only single tree structures to a sentence, Lambek grammars derive each possible binary branching analysis. These must ideally be contracted to a single analysis, but this is not a simple task. While this is a valid objection from the perspective of an adherent of CFG, it should be noted that an FHG analysis of Latin as proposed in chapter 10 suffers from the same problem: the surface order in Latin is so free that a sentence with a fairly straightforward word order will get various analyses, from which preferably only the simplest, non-wrapping analysis should be extracted. There is one notable difference between these two types of spurious ambiguity—in the FHG case, the different equivalent derivations will differ only on the RULE feature but are identical otherwise, and can hence be merged easily. The move from a standard tuple-based framework such as FHG to the LT system will have to accept degradation of computational tractability properties and spurious ambiguity in exchange for the ability to describe and study non-constituent co-ordination in combination with the tuple-driven descriptive methods of chapters 9 and 10. It should finally be noted that unlimited polymorphic types for a word like and further deteriorate the computational properties—but as before, there is no essential difference between L and LT here.

Principles

There is a large resemblance between the way FHG in the previous chapter abstracted over the application of concrete LMG productions, and the way a categorial grammar shifts all information to the lexicon while retaining only a small number of explicit rules. The LT system takes this shift a bit further, by abstracting away over the difference between the operators, moving those to the lexicon too. Principled approaches to non-constituent co-ordination have been investigated elaborately in the context of Lambek grammars. The use of LMG sharing for modelling co-ordination phenomena becomes redundant as soon as categorial type lifting is used. It is to be expected that the rôle of sharing is thus reduced considerably. I have not been able to find out whether the remaining phenomena that are beyond the reach of linear MCFG, such as Old Georgian suffix stacking and Chinese number names (cf. chapter 8), can be dealt with in a Lambek tuple grammar without use of terminal sharing or reduplication.

Conclusion

Large parts of the work done in this thesis could have been laid out in a categorial setting, and this is likely to lead to interesting benefits. However, it is unlikely that I would have found an analogue for part II on computational complexity and the possibility to add lexical sharing 'for free': categorial grammar is designed to perform well as a logical system and is thus fairly automatically too heavy to be used in practice in its pure form: it is probably NP-complete. Furthermore, the low cost of the sharing construction is considerably less interesting for a grammar system like Lambek's that is not known to be, and likely is not, processable in polynomial time, and which has mechanisms that reduce the need for sharing.

E.2 Tree adjoining and head grammar

TREE ADJOINING GRAMMAR (TAG) is a formalism that manipulates surface trees through the operations of substitution and adjunction. It has already been mentioned briefly at various places in this thesis, and is weakly equivalent to MHG. A progression similar to that from CFG to linear MCFG has been made for TAG—the result is the MULTI-COMPONENT (MC-TAG) formalism, which is weakly equivalent to linear MCFG. Slight extensions of the standard form of TAG are popular in fields of research similar to that reported on in this thesis. I will now give a brief definition of TAG, and show some parallels between what was done in sections 8.2 and 10.4 and the approach TAGs take in describing Dutch crossed dependencies and localizing relative clauses. This is a good point of departure for an investigation of what the results in this thesis, insofar as they have not already been obtained in the literature, would look like in a TAG context. Finally, I propose ways of embedding tuple-based surface generation and lexical sharing into a TAG framework, using the parallel between TAG and MHG that is often used to construct simple recognition procedures for TAG.

Definition and examples

In a context-free grammar, surface structure and derivational and deep structure are taken to be identical. This is known to raise various problems (see chapter 1), which were solved in this thesis by dividing the yield of constituents into tuples; tree adjoining grammars solve these problems by assigning to a sentence two structures: a derivation tree and a derived tree. According to the definitions in section 1.3, the derivation tree is a dependency structure and the derived tree a traditional surface structure.2

E-1 definitions. A TREE ADJOINING GRAMMAR (TAG) is a tuple (N, T, S, E^I, E^A) where N, T and S are as for context-free grammars, and

An INITIAL TREE is a tree whose root node is labelled with the start symbol S and whose leaves are labelled with terminal strings only.

An AUXILIARY TREE is a tree with a root node labelled with a nonterminal symbol A, one leaf, the FOOT NODE, also labelled A, and all other leaves labelled with a terminal string.

E = E^I ∪ E^A is a set of ELEMENTARY TREES, divided into a set of initial trees E^I and a set of auxiliary trees E^A.

A TAG derives tree structures by the following operation of ADJUNCTION. Let t be an initial or auxiliary tree with an internal node labelled A, and let s be an auxiliary tree whose root is A:

2 To keep this discussion brief, I have omitted in definition E-1 the substitution operation and the OA/NA constraints that are usually added in concrete TAG examples.

(17)  t: a tree with root S in which an internal node labelled A dominates a subtree with yield w, so that the yield of t is u1 w u2;  s: an auxiliary tree with root A and foot node A, with yield v1 A v2.

then the result of adjoining s into t at the A node is

(18)  the tree obtained by replacing the subtree of t at d by s and re-attaching the original subtree at the foot node of s, so that the yield becomes u1 v1 w v2 u2,

where d is the address of the node labelled A in t. Now, inductively, a DERIVED INITIAL TREE is an initial tree that is either elementary or obtained by adjoining a derived auxiliary tree into a derived initial tree; and a DERIVED AUXILIARY TREE is either elementary or obtained by adjoining a derived auxiliary tree into another derived auxiliary tree. A TAG now derives a sentence w if there is a derived initial tree whose yield is w. This tree is called the DERIVED TREE for w. The term constructing this tree from elementary trees using the adjunction operator is called the DERIVATION TREE for w.

E-2 example: English complementation. Adjunction is typically used in a TAG to describe adverbial and adjectival modification, but also for verbal complementation. In a simple TAG without verbal complementation, the initial tree into which trees are adjoined is always the matrix clause. The following are examples of elementary trees in a simple TAG.3

(19)  [Elementary trees: an initial tree i1 with root V0 spanning Julia drank coffee, with addresses d1 at its VI node and d2 at the N0 node over coffee; an auxiliary tree a1 adding the adverb yesterday at the verbal level; and an auxiliary tree a2 adding the adjective black (an A0) at the N0 level.]

The sentence Julia drank black coffee yesterday is now derived by adjoining a2 at d2 and a1 at d1 in i1. The complementation verb think is now entered in the TAG not in an initial tree, but through adjunction.

(20)  [a3: an auxiliary tree with root V0 spanning Frank thinks (an N0 Frank and a VI whose VS daughter is thinks), with a V0 foot node at address d0.]

The sentence Frank thinks Julia drank coffee is now derived by adjoining a3 into i1 at its root, address d0. Note that the matrix verb frame Frank thinks is, in the derivation tree, daughter rather than parent. This has some analogy to the inverted dominance relations suggested in section 10.4, and a similar mechanism is used in TAG to completely localize the dependencies involved in relative clause attachment.

E-3 example: crossed dependencies. The very same mechanism makes TAGs generate Dutch crossed dependencies easily. However, the following example set of elementary trees for Dutch reveals a problem that was implicitly present in the English example: the initial tree corresponds to the inner verb clause, but the root of this initial tree nonetheless represents the root of the derived sentence. This corresponds to the "weird" case in the strong generative capacity investigation for MHG in section 8.2, in which the lowest node of the derivation contains material from the top level of the sentence.

3 In practice, one adds the notion of substitution and replaces the concrete subjects and objects under the N0 in the initial trees with the substitution marker.

(21)  [Elementary trees for dat Frank Julia koffie zag drinken: an initial tree i1 for the inner clause dat Julia koffie drinken-i, with NA constraints on its C0 and root V0 and an OA constraint at the adjunction site d2, and an auxiliary tree a1 with root and foot V0 contributing Frank and the verb-raising verb zag-j.]

If the Dutch set is to be extended with declarative and interrogative sentences, there must be additional initial trees for Julia koffie drinken to yield each of these situations. But in parallel with this, the auxiliary tree must also come in various versions to get initial and second position of the finite verb.

*

The latter example highlights a more general problem in TAG. The elementary trees are, on the one hand, supposed to model minimal linguistic structures, for example a verb with its complement nodes. In this way, all the nodes that are in a dependency relation stem from one lexical item, a lexical item being an elementary tree. But as more complex phenomena are treated, such as relative clause attachment and Dutch crossed dependencies, the elementary trees become complex structures. It is common practice in practical TAG grammars to have large macro schemes that generate elementary trees. These schemes quickly become so intricate that they become part of the grammar system. It is unclear to me whether this is a desirable feature, and the Dutch example seems to highlight this particularly well.

Tags and tuples

It seems reasonable to assume that, even though the mechanisms employed to obtain localization of relative clause attachment and crossed dependencies in TAG are essentially the same, the analysis is acceptable in the first case, but becomes a burden in the second. It seems, at least, evident that the tuple-based method of generating crossed dependencies is less involved—and it has the benefit that word order generalizations can be stated easily in terms of order preservingness and head-right adjacency, cf. chapter 10. On the other hand, the tuple-based (9.4) and tree-inversion (10.4) methods of topicalization and wh-extraction are more involved and no more defendable than the TAG analysis. Therefore it is interesting to investigate to what extent both methods can be combined. This would yield a system that generates word order in verb phrases through tuples, and has an elegant way of turning around the dominance relation in the derived tree while maintaining derived trees in the traditional shape, replacing the tree inversion paradigm proposed in section 10.4.

In a traditional TAG, the elementary and derived trees correspond to context-free derivation trees. It is easy to make the step to TAGs over LMG or FHG derivation trees. It seems to follow straightforwardly from the way TAGs can be translated to head grammars that such grammars have tractable recognition, as is illustrated in the following sketches.

E-4 sketch: TAG recognition and MHG [WVSJ86]. This is a simplified sketch of the procedure outlined by Weir, Vijay-Shanker and Joshi. Assume that the elementary trees in the TAG have binary branching and no adjunction constraints (NA, OA). The structure of the elementary trees of a TAG can be described straightforwardly using MHG productions. There is a nonterminal Xd for each node d of each elementary tree t. For each internal node d with daughters d1 and d2, there is a production

    Xd -> head-complement(Xd1, Xd2)

if the tree t is initial, or t is auxiliary and the foot node of t is in the subtree rooted by d1; or

    Xd -> complement-head(Xd1, Xd2)

when the foot node of t is in the subtree rooted by d2. With these productions, the nonterminals Xd for root nodes d derive the yield of the initial and auxiliary trees, split up, in the auxiliary case, at the point of the foot node. Now, to model adjunction, for each internal node d labelled with a nonterminal A and each auxiliary tree with root r also labelled A, there is an additional production

    Xd -> head-wrap(Xd, Xr)

which has the effect of wrapping the yield of the auxiliary tree around that of the subtree under d. The start symbol of the MHG is rewritten to the root nodes of all the initial trees. Now the MHG recognizes the same strings as the original TAG.

*

This idea of considering a TAG as a phrase structure grammar that derives strings with "holes" in them, i.e., head grammars, can be extended to adjoining systems over tuple derivations. In this case however, these "holes" are best represented directly, rather than implicitly as the split points in an MHG.

E-5 sketch: HG-TAG. Let a set of initial and auxiliary trees be given, whose nodes are consistently annotated with MHG productions. From these trees, an iLMG is derived which encodes each node in each elementary tree as a predicate over 8 indices:

    Xd(l1, r1, l2, r2, i1, j1, i2, j2)

representing the yield of a normal MHG predicate split up into a left component a_l1 ... a_r1 and a right component a_l2 ... a_r2 with two holes, stretching over a_i1 ... a_j1 and a_i2 ... a_j2. These holes can both be in either component, and correspond to the two components of the foot node. It is now straightforward to write down the productions that make up the elementary trees and the productions that define adjunction.
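The essence of the E-4 construction fits in a few lines of code. The following Python sketch (an illustration, not the construction of [WVSJ86] itself; the function names are mine) represents a yield as a pair of token lists split at the foot node, and replays example (20):

    def head_complement(x, y):
        # Xd -> head-complement(Xd1, Xd2): the split point lies in Xd1.
        return (x[0], x[1] + y[0] + y[1])

    def complement_head(x, y):
        # Xd -> complement-head(Xd1, Xd2): the split point lies in Xd2.
        return (x[0] + x[1] + y[0], y[1])

    def head_wrap(x, y):
        # Xd -> head-wrap(Xd, Xr): wrap the auxiliary yield y around x.
        return (y[0] + x[0], x[1] + y[1])

    # Example (20): adjoining a3 (Frank thinks) at the root of i1; for
    # initial trees the right component stays empty.
    i1 = (["Julia", "drank", "coffee"], [])
    a3 = (["Frank", "thinks"], [])
    print(" ".join(head_wrap(i1, a3)[0]))   # Frank thinks Julia drank coffee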

*

It appears that this can be extended without many problems to linear MCFG, and possibly also to arbitrary simple LMG trees, which would lead to an extension of TAG that describes precisely the polynomial-time recognizable languages, like simple LMG. It is also interesting to investigate whether an MC-TAG over 1-S-LMG trees, i.e. context-free trees with sharing, will generate PTIME. Another extension of TAG, called D-tree grammar, has been proposed in [WVSR95], which provides a solution for the fact that adjunction is used in TAG both for modification and verbal complementation, and this yields derivation structures that capture almost, but not completely, the right dependency structures. It is expected that it will be possible to carry out constructions similar to the ones proposed here for DTG.

E.3 Concluding discussion

The TUPLE GRAMMARS of chapter 3 form the basis for the bulk of the work done in parts II and III. The essential qualities of the LMG formalism, which extends the previously known formalisms LCFRS and PMCFG, are (i) a notation which makes them a more accessible tool for concrete use in writing simple grammars or in building substantial grammatical frameworks, and (ii) the ability to model sharing of substrings of the input between different parts of a derivation, in a restricted version that describes precisely the languages recognizable in polynomial time.

Precisely this ability to describe sharing is also a problem of simple LMG. At a formal level, it is a problem because generic Turing machine implementations of recognizers have an essentially worse time complexity than equivalents for non-sharing formalisms such as PMCFG (chapters 4 and 5). In more practical approaches, based on real-world, random access models of computation, this problem does not occur, but here the sharing is a source of possibly undesired parallelism in phases of post-structural analysis that assign finite attributes to LMG derivations. A related problem is that Nederhof/Sarbo's algorithm for calculating finite attributes runs into a problematic case with some sharing LMGs. An additional problem is that sharing complicates the application of LMG to input in the form of a lattice or finite automaton, as is often done in speech analysis; this problem is analogous to the possible misinterpretations made by a sharing analysis of co-ordination: sentences like dat Anne Frank binnenkwam en groette (that Anne Frank came in and [Anne] said hello to [Frank]) are found correct. In the lattice equivalent, different choices for unclearly pronounced morphemes may be selected where shared material is used in different parts of the same derivation. Altogether, sharing is a property of LMG that is to be used, if at all, with extreme caution. From weak-generative and computational points of view, however, it has also been shown that simple LMG clearly has its merits (chapters 5 and 8).

*

EXTRAPOSITION GRAMMAR was introduced as a candidate for providing solutions for the surface order phenomena under investigation, and in chapter 6, some gaps in the available knowledge on XG were filled in, showing that unless some modifications were made, XG was intractable. Let me summarize briefly why the thread on XG abruptly ends here, while the tuple grammars proceed to play a major rôle in part III.

At the level of attribute evaluation, there is a problem with XG—the tractable methods of analysis based on loose XG proposed in chapter 6 yield tree structures in which the trace-filler connections are not explicitly connected. It is, at this point, unclear to me if and how these connections could be re-established in a forest representation. It seems clear to me that, to be interesting at all, an annotated XG would have to be allowed to make use of affix constraints between trace and filler positions. The current knowledge on dealing with annotation of ambiguous tree representations is still highly limited—it is hard to imagine what new problems will be introduced if one should attempt to define attribute phases over ambiguous graph representations that are to be the output of a generic XG parser. This encourages an optimistic interpretation of 'tractable syntax' as formalisms that achieve a bounded dependency domain and produce tree-shaped structures.

Both XG and the proposed modified versions also seem to lend themselves less well to a principled, explanatory approach to the desired classification of 'mild projectivity', and XG did not return as a major topic in part III. Of course this is not meant to say that a study in the spirit of chapters 9 (GB) and 10 (FHG) should a priori be excluded as not promising. STABLER [Sta87] has looked at XG and GB principles and let these influence the design of extensions to XG that seem to be aimed more at application in simple DCG-based grammar systems than at being a basis for explanatory approaches to word order. Nonetheless, loosely interpreted XG was used in section 8.2 to argue that scrambling in German can be processed in polynomial time. It is an entertaining observation that I have not found an equivalent simple LMG description.

*

PART II (chapters 6–7) already put itself in a larger context, when briefly discussing approaches in the literature that are in collision with the claims and results I presented. These discrepancies are concentrated around the familiar debate whether, on the one hand, tractability claims such as the ones made in this thesis are convincing, and on the other hand, whether the evidence for intractability presented in various other literature is conclusive. A representative series of arguments for the intractability of natural language can be found in works by RISTAD, BARTON and BERWICK [BBR87] [Ris90]. The general, justified, worry behind these arguments is that as soon as formally minded, 'platonic' grammar systems are studied as a tool for concrete description, performance parameters deemed unimportant in the optimistic formal literature become a bottle-neck. Most of the case studies in [BBR87] and [Ris90], however, themselves suffer, as said before, from 'formal optimism': they do not investigate the size of the input and various grammar measures independently. Many references to intractability of natural languages proceed along these lines: (i) a doubt or series of doubts is expressed, which is insufficiently formalized; (ii) a series of formal results is presented to substantiate the doubt, but what is proved is not conscientiously brought in connection with the doubts.

Strictly speaking, therefore, the conclusions drawn can often at best be read separately, as unsubstantiated claims, and the formal results as mathematics lacking a motivating context. In the case of simple LMG recognition, we even have proofs of both tractability and intractability living happily together (section 5.3). Another delicate matter, but one less important vis-à-vis the tractability claims in this thesis, is whether it is justified to draw conclusions about natural language in general from a study of certain grammar formalisms. In his PhD thesis [Ris90], Ristad puts stress on the importance of a so-called direct complexity analysis, which takes indisputable knowledge about language as a basis for an (in)tractability proof, and does not rely on a particular language model. However, as an instance of such an analysis, Ristad (page 42) proposes to take a phenomenon (agreement and lexical ambiguity); to show that, independent of whether one attempts to describe it using a unification based formalism or a transformational theory, this leads to an intractable task; and to conclude that hence the task in general is intractable. So a direct analysis is one that does not depend on one syntactic theory, but on two theories. Evidently, if the complexity of a phenomenon is to be investigated independently of the grammatical framework, this type of analysis is also to be avoided. In particular, Ristad seems to have to rewrite his chapter 3 after reading chapters 9 and 10 of this thesis—in fact the chapter needs to be rewritten every time a new approach is sketched that seems to tackle the problem he studies. The following arguments in the literature do not suffer from the problems mentioned above.

1. From a perspective of weak generative capacity and fixed recognition, adding finite attributes to a grammar system equivalent to or strictly stronger than CFG is free, because the attributes can be encoded into the set of nonterminal symbols. It is well known that in the worst case, the resulting expanded grammar whose nonterminals form an attribute space has at least exponentially more rules than the original grammar, taking the grammar beyond the reach of any real-world computer system. Direct processing has been proved in general not to be able to improve on this situation: recognition of finitely annotated context-free grammars is EXP-POLY time hard in the size of the grammar. Moreover, there are convincing arguments that the attribute spaces that are typically used in natural language grammars often trigger this worst case ([BBR87] appendix B).

2. Binding phenomena or anaphora ([Ris90] chapter 4) are problematic, and have in this thesis simply been put away as ‘post-syntactic’. This is in line with the idea of dividing language analysis in a series of dedicated small steps.

3. The one truly direct complexity analysis in Ristad’s thesis is in chapter 2, structure of phonological knowledge. If I put anaphora away as post-syntactic, phonological and morphological analysis has been put aside as pre-syntactic, which I find a more worrying lack on my side, because it considerably weak- ens the desired conclusion of tractability of the syntactical analysis of written language. A number of aspects that make up arguments for intractability of complex morphology that do refer to known grammatical theories, such as reduplication in [BBR87] chapter 5, are treated by the formalisms in this thesis (and in fact already by parallel MCFG).

Many systems used for concrete grammar writing today, however, are based on theoretically intractable frameworks, with varying degrees of success. Many such systems are indeed slow and cover only very limited fields or fragments of language, but many also show good benchmarks on representative concrete examples. In the case of Nederhof and Sarbo's algorithm, discussed in chapter 7, it can be concisely assessed, based on representative example grammars, how much benefit multi-stage analysis can give.

The examples seem to indicate that while the overall time and space complexity of such multi-stage approaches remains of the same order, the time and space benefits grow at the same rate as the complexity function of the algorithm. So such approaches are not mere engineering optimizations but deserve to be used as concrete data in the discussion on tractability of natural language processing. A pressing question is whether it is not the case that whatever is added to the growing amount of lexical information in natural language systems, this information can always be divided into sets of attributes or feature paths that do not interact. An example is subcategorization and finite selectional-semantic properties versus morphology. If this is the case, such groups can be processed in successive stages of attribute evaluation, rather than simultaneously. If such chunks of information are found to have a concrete bound, predictions can be made about the concrete feasibility of natural language tasks, since the exponential size parameters will then have been fixed. This line of reasoning is feasible only under a strict regime of finite attributes, and is many times more complicated in a realm of generic feature structures and unification—this has been an important motivation for my decision to put stress on finite attribute analysis.

*

CHAPTER 8 on generative capacity is a bridge between the mathematically oriented parts I and II, and part III and this Epilogue, which have a very broad perspective. The construction in section 8.2 done for MHG is reflected twice later, in the tree inversion paradigm proposed in section 10.4, and in E.2, where it is mentioned that in TAG, relative clauses are also described by virtue of the fact that in TAG's derivation structures, the verbal clause subsumption relation is reversed. The construction in 8.2 can be read as proving that in TAG, like in MHG, this reversal is a consequence of the desire to localize relative clauses. In combination with a description of languages like Dutch and German that have verb second, this implies that lexical frames for infinitive verbs must necessarily be multiplied out over a rather large set of combinations of finite verb positions and topicalization types. In E.2, it is proposed that it is therefore a worthwhile idea to hybridize TAG and HG.

*

This idea of hybridization is present throughout this thesis, and is a result of my attempt to achieve a broad coverage while keeping the terminology, meta-theoretical notions, and notation consistent. This has resulted in a book that consists for large parts of the presentation of known formalisms and results, but adds small new results at various places, and, in particular, is at the end able to draw parallels that were difficult to make had all material presented here not been put together under a single cover. The proposal of hybrid systems in the previous two sections, dealing with CG and TAG, is similar to the conclusions drawn in chapters 7 and 10: after careful assessment of each formalism for its particular qualities, and the possible interfaces between different worlds, a good approach may be to take a component from each of a small number of systems, and build something that applies each of these components for what it is good at—so as to obtain a system that is good at everything.
A general example is the capability of many paradigms to describe crossed dependencies in Dutch: in TAG, for example, showing this capacity in the literature is done as a formal exercise, but in practical systems, the exercise no longer satisfies 'grammar quality' standards, as the way the minimal trees and the operation of adjunction are used is nonstandard. Nonetheless, TAGs provide a well-motivated model for adjunction and localization of relative clause attachment. A similar story holds for the capacity of a Lambek-style calculus to deal with non-constituent co-ordination, and again the same goes for the stages of context-free prediction and true LMG parsing, and for various levels of specification of finite morphological attributes or semantic domain information. This is in line with the original objective of the project in which I carried out the research for this thesis: to investigate various approaches to trans-context-sensitivity and aim at forms of "cross-fertilization". A possible point of objection is that the rôle reserved for methods from a Computer Science or Software Engineering background is at best present implicitly—however, it is, as such, present very strongly, because the software engineering tools I looked at in the first years of my research led strongly to the desire for a syntactic formalism that eliminated completely those ambiguities that are introduced by methods used to describe phenomena of surface discontinuity in analyses based on a context-free grammar.

*

A FURTHER EXAMPLE OF HYBRIDIZATION is the design of a complete system based on FHG. It seems clear that the FHG paradigm as set out in chapter 10 is very well suited to tackle the basic word order of languages like Latin and Dutch on a principled, explanatory basis. However, for the description of the remaining long-distance dependencies, and for co-ordination, various options are open. It has now been shown that FHG can be successfully merged with techniques borrowed from TAG and Lambek categorial grammar. One might equally well prefer to use additional surface clusters, i.e., larger tuples, as proposed in chapter 9 on GB theory. When efficiency of processing is at stake, the Lambek versions are an unlikely candidate. It should also be noted that a surface cluster with a finite SLASH feature, as proposed in the GB chapter, may lead to a loss of efficiency, since this feature information needs to be maintained in a forest representation, and this is a computationally expensive doubling of the size of the feature space (see chapter 7). While the tree-inversion paradigm proposed in section 10.4 is not worked out in enough detail here to predict whether and how a practical or descriptive framework may use it precisely, the way TAGs generate relative clauses is ready to use, and a practical prototype system that recognizes the suggested HG-TAG formalism from E.2 can be constructed quickly, without many problems.

* THE TREE INVERSION PARADIGM may not be worked out in enough detail here, but it does seem to have one consequence, on a slightly smaller scale, that cannot be reproduced with the tree adjoining solution. When relative clauses and other long-distance dependencies are not taken into consideration, ENGLISH can be described accurately in a formalism consisting of a bilinear context-free backbone with a finite feature space; think of GPSG [GKPS85]. If the tree inversion paradigm is added to such a grammar, it will be able to describe a variety of long-distance dependencies, and the underlying non-inverting grammar can, in contrast to the case of FHG where the result is a grammar over 4-tuples, be shown to be again context-free (this is due to the bilinearity). This leads to respectable evidence for the claim that the core syntax of English (as opposed to Dutch, German and Latin) is strongly context-free.

* Underlying derivational and representational formalisms, their computational tractability, and the possibility of giving accurate, explanatorily satisfactory grammatical descriptions using these formalisms are the three major ingredients of an EXECUTABLE DESCRIPTIVE LANGUAGE SYSTEM. By investigating various options for each of these topics I hope to have opened new channels for the design of natural language systems that, following immediately from their theoretical underpinnings, have sufficient explanatory capacity and are computationally tractable.

Summary in Dutch

Oppervlakte zonder Structuur (Surface without Structure)

Word order and computational tractability in natural-language analysis

This book is the result of doctoral research within the NWO project incremental parser generation and context-sensitive disambiguation, whose objective is to build a bridge between knowledge of, and techniques for, the description of languages in Software Engineering and in Linguistics. In both fields, languages have so-called non-context-free elements, but the ways in which these elements are approached in Computer Science and in Linguistics are rather different. Linguistics could profit from knowledge in Computer Science about rapid prototyping and incremental parsing: parsing while the designer is still working on the grammar. Computer Science could profit from the knowledge of linguists about working with ambiguity, a phenomenon that in Computer Science is often eliminated in ad-hoc ways to guarantee fast processing, even though these elimination methods noticeably harm the maintainability of grammars. This book describes the part of the project that looks at applications in the linguistic field. Whereas the other part [Vis97], carried out by EELCO VISSER, studied a technique that is common in linguistics, namely parsing on the basis of a context-free grammar followed by a phase of disambiguation, the part described here makes an almost opposite movement: by using, instead of the usual context-free grammars, much stronger formalisms at the basis, the difficulty of the phases that follow the structural analysis (the primary parsing of the sentence) is essentially reduced. A CONTEXT-FREE GRAMMAR (CFG) is a system of rewrite rules that indicate how a larger phrase can be built up from a series of smaller phrases, namely by literally placing the smaller phrases one after another. Such an uninterrupted part of a sentence that forms a logical group of words is called a CONSTITUENT in linguistics. An example of a CFG is (22) below.


(22) S  → NP VP
     VP → VT NP
     VP → VI
     NP → Frank
     NP → Julia
     VT → kuste
     VI → sliep

The first rule of this grammar says that an S (sentence) is obtained by putting an NP (noun phrase) and a VP (verb phrase) one after the other. The second rule states that such a VP is in turn obtained by concatenating a VT (transitive verb) and an NP. The third rule says that a VP may also consist of a single intransitive verb. The last four rules are the lexicon of the grammar. This simple grammar describes only four sentences: Frank sliep, Julia sliep, Frank kuste Julia and Julia kuste Frank. Context-free grammars can build phrases only by placing smaller phrases next to each other. The use of CFG at the basis of large linguistic systems therefore depends on the assumption that elements of a sentence that are in some way 'related' (a linguist then calls them dependent) also stand directly next to each other in the complete sentence. This is called a CONSTITUENT-BASED ANALYSIS, or the property of PROJECTIVITY. There is a number of empirical examples that make the projectivity of language in general implausible; one of the most important of these is the phenomenon of crossed dependencies in Dutch. The following picture (23) is an example; the thick lines indicate which words are 'related' in the sense intended above.

(23) ...dat Frank Julia koffie zag drinken
     ...that Frank saw Julia drink coffee

This sentence is not projective, because the verb phrase koffie drinken does not occur as an uninterrupted unit in the full sentence: the word zag stands between koffie and drinken.

* In the literature, countless methods can be found to describe such non-projectivity; the most usual is to assume that elements that originally stood next to each other have moved to the place where they actually occur in the sentence. The LITERAL MOVEMENT GRAMMARS (LMG) that run as a connecting thread through this thesis generalize a class of grammatical formalisms that describe non-projectivity in comparable ways, but all precisely without appealing to this idea of movement. It may seem curious that the word movement occurs in the very name of such a formalism; this reflects the observation underlying LMG that it suffices to move not whole linguistic structures, but merely 'literal' strings of words.

The idea behind LMG is simple, and is close to the way context-free grammars are viewed in logic programming languages such as Prolog. The Prolog translation of the context-free grammar (22) is (24); in LMG notation this becomes (25).

(24) s(Z) :- np(X), vp(Y), append(X, Y, Z).
     vp(Z) :- vt(X), np(Y), append(X, Y, Z).
     vp(Z) :- vi(Z).
     np([frank]).  np([julia]).  vt([kuste]).  vi([sliep]).


(25) S(x y)  :- NP(x), VP(y).
     VP(x y) :- VT(x), NP(y).
     VP(x)   :- VI(x).
     NP(Frank).
     NP(Julia).
     VT(kuste).
     VI(sliep).
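For concreteness, the clauses of (24) can be run as they stand; the following brief session is a sketch, assuming a standard Prolog interpreter (for instance SWI-Prolog) in which the usual list predicate append/3 is available:

    ?- s([frank, kuste, julia]).     % Frank kuste Julia
    true.

    ?- s([julia, sliep]).            % Julia sliep
    true.

    ?- s([kuste, frank]).            % not one of the four sentences
    false.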

A simple progression on this notation makes it possible for a type of constituent such as S or VP to consist not of one, but of several strings of words, which then end up at different places in the sentence. The grammar behind (26) is context-free and describes the 'language' consisting of strings of n times the letter a, followed by exactly n times the letter b, for any number n. The grammar in (27), which closely resembles the previous one, describes strings to which n occurrences of the letter c have additionally been appended. This language is known not to be describable by context-free grammars. (The symbol ε stands for the empty string.)


(26) A(a x b) :- A(x).
     A(ε).

(27) A(a x b, c y) :- A(x, y).
     A(ε, ε).
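A minimal Prolog sketch of (27), in the style of (24), might look as follows; the top-level predicate s/1 that concatenates the two components, and the ordering of the goals (chosen so that recognition of a given string terminates), are assumptions of this sketch rather than part of (27):

    % Sketch of (27): a_pair(X, Y) holds iff X = a^n b^n and Y = c^n.
    a_pair([], []).                  % A(epsilon, epsilon).
    a_pair([a|T], [c|Y]) :-          % A(a x b, c y) :- A(x, y).
        append(X, [b], T),           % strip the final b from the first component
        a_pair(X, Y).

    % The full string is the concatenation of the two components.
    s(Z) :- append(X, Y, Z), a_pair(X, Y).

    % ?- s([a, a, b, b, c, c]).
    % true.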

In a comparable way, the grammar behind (28) describes a small fragment of Dutch containing sentence (23). The grammar does this by letting a verb phrase (VP) consist not of one string of words, but of two groups, of which the first consists of a sequence of noun phrases and the second only of verbs.



(28) S(dat n m v)  :- NP(n), VP(m, v).
     VP(n m, v w)  :- VR(v), NP(n), VP(m, w).
     VP(n, v)      :- VT(v), NP(n).
     NP(Frank).
     NP(Julia).
     NP(koffie).
     VR(zag).
     VT(drinken).
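In the same vein, grammar (28) can be hand-translated into Prolog; the predicate names and the goal ordering below (again chosen so that recognizing a given sentence terminates) are choices of this sketch:

    % Sketch of (28): vp(N, V) holds for a pair (noun phrases N, verbs V).
    s([dat|Rest]) :-                  % S(dat n m v) :- NP(n), VP(m, v).
        append(N, MV, Rest), np(N),
        append(M, V, MV), vp(M, V).

    vp(N, V) :-                       % VP(n, v) :- VT(v), NP(n).
        np(N), vt(V).
    vp(NM, VW) :-                     % VP(n m, v w) :- VR(v), NP(n), VP(m, w).
        append(N, M, NM), np(N),
        append(V, W, VW), vr(V),
        vp(M, W).

    np([frank]).  np([julia]).  np([koffie]).
    vr([zag]).    vt([drinken]).

    % ?- s([dat, frank, julia, koffie, zag, drinken]).
    % true.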

The four LMG grammars shown above are equivalent to grammar formalisms known from the literature, namely MULTIPLE CONTEXT-FREE GRAMMARS, also called LINEAR CONTEXT-FREE REWRITING SYSTEMS. The added value of LMG turns out to lie in its more readable notation, which makes these grammatical formalisms suitable not only for talking about language in a theoretical sense, but also for actually writing concrete grammars. Moreover, the following observation is crucial: a grammar rule of the form

(29) A(x) :- B(x), C(x).

must be interpreted as: a string of words x is an A precisely when x is both a B and a C. This is called INTERSECTION of languages, and this property is not shared by the traditional formalisms from the literature.
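As a sketch of what the intersection property buys, the following Prolog program, again in the style of (24), combines two context-free predicates as in rule (29) to recognize the non-context-free language of n a's followed by n b's followed by n c's; all predicate names here are illustrative and not taken from the thesis:

    % anbncs(Z): Z has the form a^n b^n c^*   (context-free)
    anbn([]).
    anbn([a|T]) :- append(X, [b], T), anbn(X).
    cs([]).
    cs([c|X]) :- cs(X).
    anbncs(Z) :- append(X, Y, Z), anbn(X), cs(Y).

    % asbncn(Z): Z has the form a^* b^n c^n   (context-free)
    as([]).
    as([a|X]) :- as(X).
    bncn([]).
    bncn([b|T]) :- append(X, [c], T), bncn(X).
    asbncn(Z) :- append(X, Y, Z), as(X), bncn(Y).

    % Rule (29): the intersection is exactly a^n b^n c^n.
    abc(Z) :- anbncs(Z), asbncn(Z).

    % ?- abc([a, a, b, b, c, c]).
    % true.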

Global overview of the thesis

The Prologue largely corresponds to this summary, but in addition gives an overview of when the various parts of this thesis came into being and which people contributed essentially to them. In passing, the Prologue formulates a list of 9 points of departure that provide a large part of the motivation for the studies carried out in this book. Chapter 1 is a detailed introduction to notions such as context-free grammar and projectivity. The rest of the book is split into three parts, which are briefly summarized below.

I. Formal structure

Part one looks in a highly mathematical, formal way at two groups of formalisms that tackle phenomena of non-projectivity, namely EXTRAPOSITION GRAMMAR (XG) and TUPLE GRAMMARS, the class to which the LMG described above belongs. The emphasis lies on classifying these mostly previously studied formalisms and their properties in a coherent account. A distinguishing feature of this part is the attention paid to the possibility of describing complete Dutch and German sentences, a property that is often investigated in a very limited way in the literature, because one looks only at subordinate clauses such as (23), whereas declarative and interrogative forms such as (30) and (31) introduce even more movement or non-projectivity, and thus seem to impose further requirements before one can say that the grammar formalisms used describe them in a responsible way.

(30) Jan zag Marie koffie drinken
     Jan saw Marie drink coffee

(31) Zag Jan Marie koffie drinken?
     Did Jan see Marie drink coffee?

II. Computational tractability

The second part concerns the practical usability of the formalisms from part I on computers. The book concentrates here on the task of parsing an arbitrary sentence, given a grammar; the problem of generating sentences is not considered. First, variants of LMG (chapter 5) and XG (chapter 6) are proved to be efficiently executable on computers in a theoretical sense. To this end one studies very abstract models of computers (so-called RANDOM ACCESS and TURING machines), and proves that as the length of the input sentence grows, the time a parsing program needs to parse the sentence does not grow too drastically; this is called COMPLEXITY ANALYSIS. By drastic growth one means exponential growth, where making a sentence one word longer doubles the time the program takes; acceptable growth is, for example, when the running time of the parser is at most proportional to the square, or any other fixed power, of the length of the input.
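To make the difference concrete, a worked example with purely illustrative numbers: for a sentence of 30 words, a parser whose running time grows with the cube of the input length performs on the order of 30³ = 27 000 elementary steps, whereas a parser whose work doubles with every added word needs on the order of 2³⁰, i.e. more than 10⁹ steps, and one further word doubles even that.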

Finally, chapter 7 looks at more realistic implementations of such software, also taking into account that after a structural, syntactic analysis of a sentence, properties such as case and singular/plural often still have to be checked, and that in the end something must be done with the parsed sentence. Among other things, it is argued that efficiency in practice does lie close to the theoretical notion discussed earlier, but that factors and design considerations also play a part which theoreticians regard as meaningless.

III. Principles

In part three, a number of attempts are made to formalize the intuitions developed in parts I and II in a linguistically acceptable manner. This happens in three ways. Chapter 8 looks at macro-properties of languages, such as the conditions under which parts of a sentence can be repeated to form a new sentence (think of Jan zei dat Jan zei dat Jan zei dat het een warme dag was, 'Jan said that Jan said that Jan said that it was a warm day'), and at what, following from this analysis, the precise class of languages must be that corresponds in a structural sense to the notion 'natural language'. In chapters 9 and 10, so-called axiomatic systems are set up. The first is a very concise form of CHOMSKY's GOVERNMENT-BINDING THEORY, of which a variant is then formulated that can be expressed in LMG and can thereby be executed efficiently on computers. The second system, FREE HEAD GRAMMAR, is an attempt to arrive at an axiomatization that is oriented more directly towards word order, and that thereby gives an empirical underpinning of the way in which LMG views linguistic structure. This last chapter also treats in somewhat more detail the concrete description of fragments of Dutch and Latin.

* The Epilogue, finally, compares the formalisms treated in parts I, II and III with a number of obvious neighbouring fields of research, namely CATEGORIAL GRAMMAR and TREE ADJOINING GRAMMAR. The conclusions of the various parts and of the Epilogue are then brought together in a point-by-point résumé of the results obtained.

Curriculum Vitae

Annius Victor Groenink was born on 29 December 1971 in Twello. From 1975 to 1983 he received primary education at the Deventer Montessorischool. In the same Deventer, from 1983 to 1989, he attended the public school community Alexander Hegius, where in 1989 he obtained a Gymnasium-β diploma that, for lack of a so-called 'social' subject, was not also allowed to count as a Gymnasium-α diploma. In 1989 he enrolled in Mathematics at the Rijksuniversiteit Utrecht. After completing the propaedeutic examinations in Mathematics and Computer Science in 1990 and 1991 respectively, he spent the academic year 1991–1992 in Leeds, Yorkshire, where he completed part of the M.Sc. examinations. In the last three months of 1992 he did a three-month internship at PTT-Research in Leidschendam, where he worked on the semantic disambiguation phase of a prototype natural-language interface to database systems. In 1993 he obtained a doctoraal degree in Mathematics for the thesis Uism and short-sighted models, which studied facets of so-called strict finitism. Since 1990 he has been working, alongside his academic activities, on the development of a text editor package, which he has distributed since 1993 through a one-man company. The research that led to this dissertation was carried out between 1993 and 1997 at the Centre for Mathematics and Computer Science (CWI) in Amsterdam.

Bibliography

[BBR87] Edward G. Barton, Robert C. Berwick, and Eric Sven Ristad. Computational Complexity and Natural Language. MIT Press, 1987.

[BH53] Y. Bar-Hillel. A Quasi-Arithmetical Notation for Syntactic Description. Language, 29:47–58, 1953.

[BHK89] J. A. Bergstra, J. Heering, and P. Klint, editors. Algebraic Specification. ACM Press Frontier Series. The ACM Press in co-operation with Addison-Wesley, 1989.

[BKPZ82] Joan Bresnan, Ronald M. Kaplan, Stanley Peters, and Annie Zaenen. Cross-serial Dependencies in Dutch. Linguistic Inquiry, 13(4), 1982.

[BvN95] Gosse Bouma and Gertjan van Noord. A lexicalist account of the Dutch verbal complex. In Papers from the Fourth CLIN Meeting, Rijksuniversiteit Groningen, 1995.

[CH] Crit Cremers and Maarten Hijzelendoorn. Counting coordination categorially. Available electronically from xxx.lanl.gov as cmp-lg/9605011.

[Cho96] Noam Chomsky. The Minimalist Program. Cambridge, MA: MIT Press, 1996.

[CKS81] A. K. Chandra, D. C. Kozen, and L. J. Stockmeyer. Alternation. JACM, 28:114–133, 1981.

[dBR89] Hans den Besten and Jean Rutten. On Verb Raising, Extraposition and Free Word Order in Dutch. In Sentential complementation and the lexicon, pages 41–56. Foris, 1989.

[Ear70] Jay Earley. An Efficient Context-Free Parsing Algorithm. PhD thesis, Dept. of Computer Science, Carnegie-Mellon University, 1970.

[Eve75] A. Evers. The Transformational Cycle in Dutch and German. PhD thesis, Rijksuniversiteit Utrecht, 1975.

[GGH69] Seymour Ginsburg, Sheila Greibach, and John Hopcroft. Studies in Abstract Families of Languages. Providence, Rhode Island: Memoirs of the American Mathematical Society number 87, 1969.

[Gin75] Seymour Ginsburg. Formal Languages. North-Holland/Am. Elsevier, Amsterdam, Oxford/New York, 1975.


[GKPS85] Gerald Gazdar, Ewan Klein, Geoffrey K. Pullum, and Ivan A. Sag. Generalized Phrase Structure Grammar. Oxford: Basil Blackwell; Cambridge, Mass.: Harvard University Press, 1985.

[Gro95a] Annius V. Groenink. Formal Mechanisms for Left Extraposition in Dutch. In Proceedings of the 5th CLIN (Computational Linguistics In the Netherlands) meeting, November 1994, Enschede, November 1995.

[Gro95b] Annius V. Groenink. Literal Movement Grammars. In Proceedings of the 7th EACL Conference, University College, Dublin, March 1995.

[Gro96] Annius V. Groenink. A Unifying Framework for Concatenation-based Grammar Formalisms. In Proceedings of Accolade 1995, Amsterdam, 1996.

[Gro98] Annius V. Groenink. Mild Context-Sensitivity and Tuple-based Generalizations of Context-Free Grammar. To appear in the special MOL4 issue of Linguistics and Philosophy; available electronically from www.cwi.nl as report CS-R9634, 1998.

[Hae91] Liliane Haegeman. Introduction to government & binding theory. Basil Blackwell, 1991. Second edition, 1994.

[Hay64] David G. Hays. Dependency theory: a formalism and some observations. Language, 40(4):511–525, 1964.

[HS74] J. Hartmanis and J. Simon. On the power of multiplication in random access machines. In Proceedings of the 15th IEEE symposium on switching and automata theory, New Orleans, LA, pages 13–23, 1974.

[HU79] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Mass., 1979.

[Hud] Richard Hudson. Discontinuity. 1996, draft, available electronically by ftp from pitch.phon.ucl.ac.uk as /pub/Word-Grammar/disco-ps.zip.

[Hud90] Richard Hudson. English Word Grammar. Oxford: Blackwell, 1990.

[Hui98] Willem-Olaf Huijsen. Completeness of Compositional Translation. PhD thesis, Universiteit Utrecht, 1998. Forthcoming.

[Huy76] Riny Huybrechts. Overlapping dependencies in Dutch. Utrecht working papers in Linguistics, 1:24–65, 1976.

[Huy84] Riny Huybrechts. The weak inadequacy of context-free phrase structure grammars. In De Haan, Trommelen, and Zonneveld, editors, Van periferie naar kern. Foris, 1984.

[Jan86] Theo M. V. Janssen. Foundations and applications of Montague grammar. CWI Tract 28. Centrum voor Wiskunde en Informatica, Amsterdam, 1986.

[Jos85] Aravind Joshi. Tree Adjoining Grammars: How much Context-Sensitivity is Required to Provide Reasonable Structural Descriptions? In A. Zwicky, editor, Natural Language Parsing: Psychological, Computational and Theoretical Perspectives, pages 206–250. Cambridge University Press, 1985.

[Kli93] Paul Klint. A meta-environment for generating programming environments. ACM Transactions on Software Engineering Methodology, 2(2):176–201, 1993.

[KNSK92] Y. Kaji, R. Nakanishi, H. Seki, and T. Kasami. The Universal Recognition Problems for Parallel Multiple Context-Free Grammars and for Their Subclasses. IEICE, E75-D(4):499–508, 1992.

[Kos91] C. H. A. Koster. Affix Grammars for natural languages, pages 358–373. Springer lecture notes in Computer Science no. 545, 1991.

[Kra95] Marcus Kracht. Syntactic Codes and Grammar Refinement. Journal of Logic, Language, and Information, 4(4):359–380 (41–60), 1995.

[Lam58] J. Lambek. The Mathematics of Sentence Structure. American Mathematical Monthly, 65:154–169, 1958.

[Lee93] Rene´ Leermakers. The Functional Treatment of Parsing. Kluwer, The Netherlands, 1993.

[McA93] David A. McAllester. Automatic Recognition of Tractability in Inference Relations. JACM, 40(2), 1993.

[Mel88] Igor' Mel'čuk. Dependency syntax: theory and practice. SUNY series in Linguistics, Albany, NY, 1988.

[MK96] Jens Michaelis and Marcus Kracht. Semilinearity as a Syntactic Invariant. In Proceedings of the Logical Aspects of Computational Linguistics meeting, Nancy, 23–25 September, 1996. To appear in the Springer lecture notes series, 1996.

[Moo88] Michael Moortgat. Categorial Investigations. Logical and Linguistic Aspects of the Lambek Calculus. PhD thesis, Universiteit van Amsterdam, 1988.

[Mor94] Glyn V. Morrill. Type Logical Grammar. Dordrecht: Kluwer, 1994.

[MR87] Alexis Manaster-Ramer. Dutch as a formal language. Linguistics and Philosophy, 10:221–246, 1987.

[Nic78] Nichols. Double dependency? Proceedings of the Chicago Linguistic Society, 14:326–339, 1978.

[NK92] M. J. Nederhof and C. H. A. Koster. A grammar workbench for AGFLs. In J. Aarts, P. de Haan, and N. Oostdijk, editors, Proc. of the Thirteenth ICAME Conference, Nijmegen. Rodopi, 1992.

[NS93] M. J. Nederhof and J. J. Sarbo. Efficient Decoration of Parse Forests. In H. Trost, editor, Feature Formalisms and Linguistic Ambiguity, pages 53–78. Ellis Horwood, 1993. Short, earlier version in the proceedings of the workshop Coping with Linguistic Ambiguity in Typed Feature Formalisms at the 10th European Conference on Artificial Intelligence (ECAI), Vienna, Austria, August 1992.

[Pen] Mati Pentus. Lambek Grammars are Context-Free. Manuscript Dept. of Mathematics, Univ. of Moscow, 1992.

[Per81] Fernando C. N. Pereira. Extraposition Grammars. Computational Linguistics, 7(4):243–256, 1981.

[Pin90] Harm Pinkster. Latin syntax and semantics. London, New York: Routledge, 1990. Original title: Latijnse syntaxis en semantiek; Grüner, Amsterdam, 1984. Also Tübingen: Francke, 1988; Madrid: Ediciones Clasicas, 1995; Torino: Rosenberg & Sellier, 1991.

[Pol84] Carl J. Pollard. Generalized Phrase Structure Grammars, Head Grammars, and Natural Language. PhD thesis, Stanford University, 1984.

[PS94] Carl Pollard and Ivan A. Sag. Head-Driven Phrase Structure Grammar. Univ. of Chicago Press, 1994.

[PW83] Fernando C. N. Pereira and David H. D. Warren. Parsing as deduction. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics (ACL21), pages 137–144. Cambridge, MA: MIT Press, 1983.

[Rad91] Daniel Radzinski. Chinese Number-Names, Tree Adjoining Languages, and Mild Context-Sensitivity. Computational Linguistics, 17(3):277–299, 1991.

[Ram94] Owen Rambow. Formal and Computational Aspects of Natural Language Syntax. PhD thesis, University of Pennsylvania (Philadelphia), 1994.

[Rea90] Mike Reape. Getting Things in Order. In Proceedings of the Symposium on Discontinuous Constituency. Tilburg: ITK, 1990.

[Rek92] Jan Rekers. Parser Generation for Interactive Environments. PhD thesis, University of Amsterdam, 1992.

[Ris90] Eric Sven Ristad. Computational Structure of Human Language. PhD thesis, MIT dept. of Electrical Engineering, 1990. Available as MIT technical report AI-TR 1260.

[RJ94] Owen Rambow and Aravind Joshi. A Formal Look at Dependency Grammars and Phrase-Structure Grammars, with Special Consideration of Word-Order Phenomena. In Leo Wanner, editor, Current Issues in Meaning-Text Theory. Pinter, London, 1994.

[Roa87] K. Roach. Formal Properties of Head Grammars. In Alexis Manaster-Ramer, editor, Mathematics of Language, pages 293–347. Amsterdam: John Benjamins, 1987.

[Ros67] John R. Ross. Constraints on Variables in Syntax. PhD thesis, Harvard University, 1967. Reprinted by the Indiana Linguistics Club.

[Rou88] William C. Rounds. LFP: A Logic for Linguistic Descriptions and an Analysis of its Complexity. Computational Linguistics, 14(4):1–9, 1988.

[Rou97] William C. Rounds. Feature Logics. In Johan van Benthem and Alice ter Meulen, editors, Handbook of Logic and Language, chapter 8, pages 475–533. Elsevier Science, 1997.

[Sei93] Roland Seiffert. What Could a Good System for Practical NLP Look Like? In H. Trost, editor, Feature Formalisms and Linguistic Ambiguity, pages 187–204. Ellis Horwood, 1993.

[Sel85] Peter Sells. Lectures on Contemporary Syntactic Theories: an Introduction to Government-Binding theory, Generalized Phrase Structure Grammar, and Lexical-Functional Grammar. Stanford: CSLI lecture notes, 1985. Second edition: 1987.

[Shi86] Stuart M. Shieber. An Introduction to Unification-Based Approaches to Grammar. CSLI Lecture Notes no. 4, Stanford, 1986.

[SMFK91] Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. On multiple context-free grammars. Theoretical Computer Science, 88:191–229, 1991.

[Sta87] Edward P. Stabler. Restricting Logic Grammars with Government-Binding Theory. Computational Linguistics, 13(1):1–10, 1987.

[Tom86] Masaru Tomita. Efficient parsing for natural language: a fast algorithm for practical systems. Dordrecht: Kluwer, 1986.

[Tom91] Masaru Tomita, editor. Generalized LR Parsing. Dordrecht: Kluwer, 1991. Selected papers on GLR parsing, presented at the first international workshop on parsing technologies, Pittsburgh.

[Tur90] D. A. Turner. An Overview of Miranda. In Turner, editor, Research Topics in Functional Programming, pages 1–16. Reading, Massachusetts: Addison-Wesley, 1990.

[Ver96] Koen Versmissen. Grammatical Composition: Modes, Models, Modalities. PhD thesis, Universiteit Utrecht, 1996.

[Vis97] Eelco Visser. Syntax Definition for Language Prototyping. PhD thesis, Universiteit van Amsterdam, 1997.

[vN93] Gertjan van Noord. Reversibility in Natural Language. PhD thesis, Rijksuniversiteit Groningen, 1993.

[VSW94] K. Vijay-Shanker and D. J. Weir. The Equivalence of Four Extensions of Context-Free Grammar. Math. Systems Theory, 27:511–546, 1994.

[VSWJ87] K. Vijay-Shanker, David J. Weir, and Aravind K. Joshi. On the progression from context-free to tree adjoining languages. In Alexis Manaster-Ramer, editor, Mathematics of Language, pages 389–401. Amsterdam: John Benjamins, 1987.

[Wei88] David J. Weir. Characterizing Mildly Context-Sensitive Grammar Formalisms. PhD thesis, University of Pennsylvania, 1988.

[Wei92] David J. Weir. A geometric hierarchy beyond the context-free languages. Theoretical Computer Science, 104:235–261, 1992.

[WK96] H. R. Walters and J. F. Th. Kamperman. Epic 1.0 (unconditional), an equational programming language. Technical Report CS-R9604, CWI, January 1996. Available electronically as http://www.cwi.nl/epic/articles/epic10.ps.

[WVSJ86] David J. Weir, K. Vijay-Shanker, and A. K. Joshi. The Relationship Between Head Grammars and Tree Adjoining Grammars. In Proceedings of the 24th Annual Meeting of the ACL, 1986.

[WVSR95] David J. Weir, K. Vijay-Shanker, and Owen Rambow. D-tree grammars. In ACL-95, 1995.

[You67] Daniel H. Younger. Recognition and Parsing of Context-Free Languages in Time n³. Information and Control, 10:189–208, 1967.