Natural Language Parsing with Graded Constraints

Dissertation zur Erlangung des Doktorgrades am Fachbereich Informatik der Universität Hamburg

vorgelegt von Ingo Schröder aus Uelzen

Hamburg im Jahr 2002

Genehmigt vom Fachbereich Informatik der Universität Hamburg auf Antrag von

Betreuer: Prof. Dr.-Ing. Wolfgang Menzel Fachbereich Informatik Universität Hamburg

2. Gutachter: Prof. Dr. Christopher Habel Fachbereich Informatik Universität Hamburg

Externe Gutachterin: Prof. Mary P. Harper, Ph.D. School of Electrical and Computer Engineering Purdue University, IN, USA

Hamburg, den 15. April 2002

Dekan: Prof. Dr. Siegfried Stiehl

Dedicated to my mother Ursula Schröder

Abstract

Based on the working hypothesis that gradation is a central phenomenon in natural language, this thesis proposes the grammar formalism of weighted constraint dependency grammars (WCDG) as a unified framework for the treatment of gradation. Gradation can be found in different forms in natural language: as a preference for one outcome over another or as an extended notion of grammaticality which accepts not only ‘well-formed’ sentences but also deviant structures if no other analysis is feasible. It also occurs in inherently uncertain information such as that from automatic recognition of speech or blurred hand-writing. WCDGs not only offer a well-defined formalism to concisely express what the underlying constraints of natural language understanding are but additionally allow one to state which status a constraint has, be it a strict rule which describes what structures are possibly imaginable or be it a slight preference which only selects the preferred one among a set of ambiguous analyses for a sentence. The formalism of WCDGs is formally defined both syntactically by specifying the constraint language and semantically by providing a concise mapping from the WCDG parsing problem to the well-known class of constraint satisfaction optimization problems. The set of languages generated by WCDGs has been ranked between the mildly context-sensitive and the context-sensitive languages; the word problem has been identified as being NP-complete. WCDGs are shown to be appropriate for modeling a large variety of linguistic phenomena such as immediate dominance, agreement, valence, aspects of word order, projectivity as well as semantic aspects such as thematic roles. A spectrum of algorithms that solve the WCDG parsing problem can be devised from which an application can select the most appropriate method for the task at hand. Algorithms exist which are complete and sound; others approximate the correct solution but compare favorably with respect to temporal characteristics. A large set of experimental results confirms that WCDGs are well-suited for handling gradation in natural language:

• Graded constraints facilitate a robust mode of parsing so that even utterances with considerable deviations can be analyzed.

• The inherent robustness and the availability of characterizations of constraint conflicts make the WCDG parser a suitable candidate for a diagnosis component in applications for computer-assisted language learning.

• The WCDG approach is capable of trading runtime against solution quality.


• External graded assessments of linguistic structures such as those originating from a speech recognizer can be integrated into the grammatical arbitration mechanism.

• Constraint weights can be learned from annotated corpora, thus mitigating the grammar acquisition problem.

WCDGs have been found to offer solutions for a series of problems in natural language processing such as robustness, integration of preferences, information fusion and explanatory clarity. So far WCDGs have been studied in an academic context but realistic applications begin to appear on the horizon.

Zusammenfassung

Ausgehend von der Arbeitshypothese, daß Gradiertheit ein zentrales Phänomen natürlicher Sprache ist, schlägt diese Arbeit den Grammatikformalismus der gewichteten Constraint-Dependenzgrammatiken (weighted constraint dependency grammar, WCDG) als einen universellen Rahmen für die Behandlung von Gradiertheit vor.

Gradiertheit kommt in verschiedenen Formen in natürlicher Sprache vor: als Präferenz für eine Variante gegenüber einer anderen oder als erweiterter Grammatikalitätsbegriff, welcher nicht nur die ‘wohlgeformten’ Sätze akzeptiert, sondern auch abweichende Strukturen, wenn sonst keine Lösung gefunden werden kann. Gradiertheit begegnet uns auch als inhärent unsichere Information wie zum Beispiel beim automatischen Erkennen von Sprache oder undeutlichen Handschriften. Gewichtete Constraint-Dependenzgrammatiken stellen nicht nur einen präzisen Formalismus dar, in dem die dem Sprachverstehen zugrunde liegenden Bedingungen beschrieben werden können, sondern erlauben auch, den Status einer Bedingung auszudrücken: Sei es eine starre Regel, die beschreibt, was überhaupt als vorstellbar zu gelten hat, oder als schwache Präferenz, die aus einer Menge von ambigen Analysen die bevorzugte auswählt.

Der Formalismus der WCDG wird formal definiert, sowohl syntaktisch durch die Spezifikation der Constraint-Sprache als auch semantisch durch eine präzise Abbildung des WCDG-Parsingproblems auf die bekannten Constraint-Optimierungsprobleme. Die Menge der durch WCDG generierten Sprachen kann zwischen der der mild kontextsensitiven und der der kontextsensitiven eingeordnet werden. Das Wortproblem für WCDG ist als NP-vollständiges Problem erkannt worden.

Es wird gezeigt, daß WCDG für die Modellierung einer großen Menge von linguistischen Phänomenen geeignet ist: unmittelbare Dominanz, Kongruenz, Valenz, Aspekte von Wortstellung, Projektivität und semantische Aspekte wie thematische Rollen.

Ein Spektrum von Algorithmen zum Lösen des WCDG-Parsingproblems wird angegeben, aus denen eine Anwendung diejenige Methode auswählen kann, die am geeignetsten erscheint. Neben vollständigen und korrekten Algorithmen existieren auch solche, die eine Lösung nur approximativ berechnen, dafür aber ein besseres Zeitverhalten zeigen.

Eine Anzahl von experimentellen Ergebnissen belegt, daß sich die WCDG gut für die Behandlung von Gradiertheit in natürlicher Sprache eignen:

• Durch den Einsatz von gewichteten Constraints wird robustes Parsing ermöglicht, so daß sogar Äußerungen mit erheblichen Störungen analysiert werden können.


• Die inhärente Robustheit und die Verfügbarkeit von Informationen über Constraint-Verletzungen ermöglichen den Einsatz des WCDG-Parsers als Diagnosekomponente in Anwendungen für das computergestützte Erlernen von Fremdsprachen.

• Der Ansatz erlaubt eine Verringerung der Laufzeit zu Lasten der Lösungsqualität (und umgekehrt).

• Externe gewichtete Bewertungen von linguistischen Strukturen wie zum Beispiel diejenigen aus einem automatischen Spracherkenner können in das grammatische Bedingungssystem integriert werden.

• Die Constraint-Gewichte können anhand eines annotierten Korpus gelernt werden, so daß der Aufwand der Grammatikerstellung gemindert wird.

Es wurde festgestellt, daß die gewichteten Constraint-Dependenzgrammatiken Lösungen für eine Reihe von Problemen der automatischen Verarbeitung von natürlicher Sprache anzubieten haben: robustes Parsing, Integration präferenziellen Wissens, Informationsverschmelzung und Erklärungsfähigkeit. Bisher wurden WCDG im akademischen Rahmen untersucht, aber realistische Anwendungen zeichnen sich bereits am Horizont ab.

Acknowledgments

I am grateful to many people who were directly or indirectly involved in the preparation of this thesis. In particular, I want to thank

• Wolfgang Menzel, for providing a stimulating working environment and for innumerable discussions from which this thesis has greatly benefited;

• Kilian Foth, Horia F. Pop and Michael Schulz, my colleagues in the DAWAI project, for the cool times together, for endless discussions and for their work on the WCDG parsing system;

• Dietmar Fünning, Jochen Hagenström, Stefan Hamerich and Andrzey Walczak, my student workers in the DAWAI project, for spending hours and hours improving the WCDG parsing system;

• Mary Harper and the speech group at Purdue University, for letting me stay with them in 1999 and for several fruitful discussions about the nature of CDG;

• Johannes Heinecke, Jürgen Kunze and Andreas Nolda of the Humboldt Universität zu Berlin, for the good time while we were working together in the first phase of the DAWAI project and for linguistic advice;

• my colleagues at the natural language division of the computer science department of the University of Hamburg, for the professional and enjoyable working atmosphere;

• Daniel Herron and Soenke Ziesche, for proofreading earlier versions of this thesis;

• Ulrike Meerstein, for proofreading, continuous support and encouragement throughout the work on this thesis.

Part of the research reported in this thesis was conducted while the author was employed in the DAWAI project ‘Parsing of spoken language with limited resources’1 that was funded by the German Science Foundation ‘Deutsche Forschungsgemeinschaft (DFG)’ under grant no. Me 1472/1-1 and Me 1472/1-2.

1See http://nats-www.informatik.uni-hamburg.de/~dawai/ for further information.

Contents

Abstract v

German Abstract vii

Acknowledgments ix

Contents xi

List of Figures xvii

List of Tables xxi

1 Introduction 1 1.1 Gradation in natural language ...... 1 1.1.1 Grammaticality, acceptability and beyond ...... 3 1.1.2 Preference ...... 5 1.1.3 Combining different levels of analysis ...... 6 1.1.4 Uncertain information ...... 7 1.1.5 Summary ...... 8 1.2 Motivation of gradation in natural language processing ...... 8 1.3 Central claims ...... 9 1.4 Prior and related work ...... 10 1.5 Overview of this thesis ...... 12

2 Constraint Dependency Grammar 15 2.1 Overview ...... 15 2.2 Constraint satisfaction problems ...... 15


2.2.1 Definition ...... 15 2.2.2 Example ...... 17 2.2.3 Partial constraint satisfaction problems ...... 18 2.2.4 Constraint satisfaction optimization problems ...... 19 2.3 WCDG as a grammar formalism ...... 21 2.3.1 Word graph ...... 22 2.3.2 Lexicon ...... 22 2.3.3 Definition CDG ...... 24 2.3.4 Example ...... 26 2.4 Mapping from WCDG parsing problem to CSOP ...... 27 2.5 Constraints ...... 29 2.5.1 General structure ...... 31 2.5.2 Formulae ...... 33 2.5.3 Terms ...... 34 2.6 Restriction of constraint arity ...... 36 2.7 Comparison of WCDG to CDG as defined by Maruyama ...... 36 2.8 Complexity ...... 38 2.8.1 Relation of the constraint language to logic ...... 38 2.8.2 Complexity of the parsing problem ...... 39 2.8.3 Classification of the set of languages defined by WCDG ...... 40 2.9 Related work ...... 40 2.10 Summary ...... 42

3 Grammar Modeling 43 3.1 Overview ...... 43 3.2 Introduction ...... 44 3.3 WCDG grammar formalism revisited ...... 46 3.3.1 Linguistic structures and disambiguation ...... 46 3.3.2 Constraints ...... 49 3.3.3 Critical appraisal of WCDG constraints ...... 50 3.4 General techniques for selected natural language phenomena ...... 53 3.4.1 Valence ...... 53 Contents xiii

3.4.2 Feature percolation ...... 57 3.4.3 Word order ...... 59 3.4.4 Vorfeld realization ...... 59 3.4.5 Projectivity ...... 61 3.4.6 Thematic roles ...... 62 3.4.7 Reference resolution ...... 64 3.5 Modeling gradation using weighted constraints ...... 66 3.5.1 Well-formedness ...... 67 3.5.2 Preference constraints ...... 69 3.5.3 Spurious ambiguity ...... 70 3.5.4 Default subordination ...... 71 3.5.5 Approximation of higher order constraints ...... 73 3.5.6 Dynamic constraints ...... 74 3.5.7 Outvoting ...... 75 3.6 Related work ...... 77 3.7 Summary ...... 77

4 Algorithms 79 4.1 Overview ...... 79 4.2 Solving a constraint satisfaction problem ...... 79 4.3 Algorithms for (partial) constraint satisfaction problems and constraint optimization 83 4.3.1 Consistency techniques ...... 83 4.3.2 Retrospective techniques ...... 85 4.3.3 Mixed techniques ...... 86 4.3.4 Heuristic search ...... 86 4.4 Evaluation criteria for CSP solution methods ...... 88 4.4.1 Soundness, completeness and termination ...... 88 4.4.2 Complexity and efficiency ...... 89 4.4.3 Anytime property ...... 89 4.4.4 Incrementality and parallelism ...... 93 4.5 Classification of CSP algorithms ...... 93 4.5.1 Efficiency and anytime behavior ...... 95 xiv Contents

4.5.2 Incrementality and parallelism ...... 97 4.5.3 Summary ...... 98 4.6 Consistency-based parsing ...... 99 4.7 Heuristic consistency: Pruning ...... 101 4.8 Systematic tree search: Netsearch ...... 102 4.9 Heuristic search: Frobbing and Gls ...... 103 4.9.1 Min-conflicts with tabu list ...... 105 4.9.2 Guided local search ...... 106 4.10 Implementation ...... 107 4.11 Experiments ...... 108 4.11.1 Tree search ...... 108 4.11.2 Heuristic search with tabu list ...... 111 4.11.3 Heuristic search as guided local search ...... 111 4.11.4 Comparison ...... 113 4.11.5 Summary ...... 115 4.12 Incremental mode of operation ...... 115 4.13 Optimizations ...... 117 4.13.1 Sorting constraint nodes ...... 117 4.13.2 Agenda normalization ...... 119 4.13.3 Constraint indexing ...... 120 4.13.4 Further optimizations ...... 125 4.14 Related work ...... 126 4.15 Summary ...... 128

5 Experimental Results 129 5.1 Overview ...... 129 5.2 Experimental environment ...... 130 5.2.1 Stellingen grammar and corpus ...... 130 5.2.2 Criteria ...... 131 5.3 Robustness ...... 132 5.3.1 Motivation ...... 132 5.3.2 Means to achieve robustness ...... 134 Contents xv

5.3.3 Experiments ...... 134 5.3.4 Related work ...... 137 5.3.5 Summary ...... 139 5.4 Error diagnosis ...... 139 5.4.1 Motivation ...... 139 5.4.2 Prototype ...... 140 5.4.3 Related work ...... 142 5.4.4 Summary ...... 143 5.5 Anytime behavior ...... 144 5.5.1 Motivation ...... 144 5.5.2 Intrinsic anytime behavior ...... 144 5.5.3 Extrinsic anytime behavior ...... 148 5.5.4 Related work ...... 151 5.5.5 Summary ...... 153 5.6 Interfacing a WCDG with a speech recognizer ...... 154 5.6.1 Motivation ...... 154 5.6.2 Combination of acoustic scores and constraint weights ...... 154 5.6.3 Reduction of word graph density ...... 157 5.6.4 Impact of recognition uncertainty on parsing ...... 158 5.6.5 Evaluation with unknown sentences ...... 161 5.6.6 Related work ...... 164 5.6.7 Summary ...... 165 5.7 Integrating a part-of-speech tagger ...... 166 5.7.1 Motivation ...... 166 5.7.2 Experimental setup ...... 167 5.7.3 Interface ...... 168 5.7.4 Experimental results ...... 169 5.7.5 Related work ...... 175 5.7.6 Summary ...... 175 5.8 Adjusting constraint weights using corpus data ...... 175 5.8.1 Motivation ...... 175 5.8.2 Extraposition of relative clauses ...... 176 xvi Contents

5.8.3 Corpus data ...... 177 5.8.4 Constraints ...... 178 5.8.5 Related work ...... 181 5.8.6 Summary ...... 181 5.9 Learning constraint weights ...... 182 5.9.1 Motivation ...... 182 5.9.2 Optimization techniques for constraint weights ...... 182 5.9.3 Genetic operators ...... 183 5.9.4 Experiments ...... 184 5.9.5 Related Work ...... 188 5.9.6 Summary ...... 189

6 Conclusion 191 6.1 Summary ...... 191 6.2 Further research ...... 194 6.3 Final remarks ...... 195

Bibliography 197

List of Figures

1.1 Structure of this thesis ...... 12

2.1 A cryptarithmetic puzzle as a typical CSP problem ...... 17 2.2 A simple word graph ...... 23 2.3 Two lexical entries ...... 24 2.4 An example lexicon ...... 27 2.5 Alternative CDG structure candidates ...... 28 2.6 Mapping from a [W]CDG structure candidate to a CSP assignment ...... 30

3.1 Analyses in the frameworks of context-free and dependency grammars ...... 45 3.2 Totally underspecified initial space and the final outcome ...... 47 3.3 A syntactic dependency tree ...... 48 3.4 A semantic analysis ...... 48 3.5 A satisfying dependency edge ...... 50 3.6 A dependency tree for a German control verb construction ...... 51 3.7 Dependency analysis with correctly chosen lexical entries ...... 55 3.8 Dependency analysis for the main level and an auxiliary level ...... 57 3.9 Dependency analysis with an alternative presentation of the first auxiliary level . 58 3.10 Feature percolation in a dependency tree ...... 58 3.11 Topological fields in German sentences ...... 60 3.12 A non-projective dependency tree ...... 61 3.13 Proof of projectivity ...... 63 3.14 A non-projective dependency tree ...... 64 3.15 A syntactic dependency tree with two additional levels for thematic roles . . . . . 65 3.16 An analysis of a sentence with a relative clause including reference resolution . . 66


3.17 A lexical entry with preferences for the type of its second argument ...... 70 3.18 An analysis of an utterance with self-correction ...... 72 3.19 A lexical entry with penalties for missing arguments ...... 75

4.1 Two CSP models for the ¥ -queens problem ...... 80 4.2 Cryptarithmetic puzzle: SEND MORE MONEY ...... 81 4.3 Two Prolog programs for the example cryptarithmetic puzzle ...... 82 4.4 Reconstruction of three-dimensional information from line drawings ...... 84 4.5 Three agenda strategies for systematic tree search ...... 85 4.6 Example of min-conflicts run for 8-queens problem ...... 87 4.7 Anytime algorithms: time, quality, value, discount and performance profile . . . . 91 4.8 Two Gaussian distributions with different variances ...... 92 4.9 Probabilistic performance profiles ...... 92 4.10 A euclidic traveling sales person problem and some solution candidates ...... 94 4.11 Computational resources vs. problem size during depth-first search ...... 96 4.12 TSP: Quality of solutions for several methods ...... 97 4.13 Best round trip length vs. time for different methods ...... 98 4.14 Procedure Pruning in pseudo code ...... 101 4.15 A word graph with overlapping edges ...... 102 4.16 Systematic tree search procedure Netsearch in pseudo code ...... 103 4.17 Heuristic search for CDG parsing: Transforming one analysis into another one . . 105 4.18 Heuristic search in pseudo code ...... 106 4.19 Guided local search ...... 107 4.20 Graphical user interface XCDG ...... 108 4.21 Accuracy and efficiency depending on agenda size ...... 109 4.22 Accuracy and efficiency depending on the pre-set time limitation ...... 110 4.23 Different types of time and accuracy for Netsearch ...... 111 4.24 Different types of time and accuracy for Frobbing ...... 112 4.25 Different types of time and accuracy for Gls ...... 112 4.26 Comparison of soft, solution and termination time ...... 113 4.27 Comparison of accuracy ...... 114 4.28 Incremenal WCDG parsing: Processing time and accuracy vs. sentence length . . 116 List of Figures xix

4.29 Impact of various node ordering strategies for Netsearch ...... 118 4.30 Impact of various agenda normalizations for Netsearch ...... 120 4.31 Relative runtime and relative number of constraint evaluations for two types of constraint indexing ...... 124 4.32 Relative runtime and relative number of constraint evaluations for optimizations 127

5.1 Recall and precision ...... 132 5.2 Dependency structures for three levels of analysis ...... 135 5.3 Percentage of correct analyses depending on the number of syntactic errors and, additionally, parameterized by the kind of semantic and domain-specific support. 136 5.4 Percentage of correct analyses depending on the number of syntactic errors and on a semantic and domain-specific error count...... 137 5.5 Architecture of a CALL system using a WCDG grammar component ...... 141 5.6 Contextual embedding of a CALL exercise ...... 141 5.7 Accuracy of intermediate solutions for Netsearch ...... 145 5.8 Accuracy of intermediate solutions for Frobbing ...... 147 5.9 Individual performance profiles ...... 148 5.10 Systematic tree search procedure under external time pressure ...... 150 5.11 Heuristic search method with tabu list under external time pressure ...... 151 5.12 Classification of anytime algorithms ...... 152 5.13 Word graphs as an interface to a speech recognizer ...... 155 5.14 Reduction of word graph density in pseudo code ...... 158 5.15 Word graphs of different densities ...... 159 5.16 Parsing time and f-measure depending on the word graph density for sentence n001k011-1 ...... 162 5.17 Efficiency and accuracy of the WCDG parser using the ideal tagger ...... 171 5.18 Efficiency and accuracy of the WCDG parser using the single tag tagger . . . . . 172 5.19 Efficiency and accuracy of the WCDG parser using the multi tag tagger . . . . . 173 5.20 Direct comparison of four tagging modes ...... 174 5.21 Discontinuous phrase structure tree ...... 177 5.22 Frequency of extraposed and embedded relative clauses ...... 178 5.23 Accuracy of six classifiers for relative clause extraposition ...... 180 5.24 Illustration of different genetic operators ...... 183 xx List of Figures

5.25 F-measure, time and fitness for the experiment of improving manually assigned weights ...... 185 5.26 F-measure, time and fitness for the experiment of learning soft grammar weights from scratch ...... 186 5.27 F-measure, time and fitness of the best individuals taken from each generation of the second experiment evaluated on the test set ...... 187 List of Tables

2.1 Instances of the valued constraint satisfaction problem framework ...... 21 2.2 EBNF definition of the constraint syntax ...... 31

4.1 Performance comparison for naïve and interleaved tree search ...... 83 4.2 Summary of evaluation of algorithmic classes for CSP solving ...... 99 4.3 Problem classes based on number of constraint values ...... 109 4.4 Extension of the syntax of the constraint language ...... 121

5.1 Accuracy, recall and f-measure (for the syntactic level) for experiments with and without semantic levels ...... 137 5.2 Fraction of problems with soft solutions after a certain period of time ...... 145 5.3 Parsing f-measure and average computation time for experiments with varying degrees of integration of acoustic scores ...... 160 5.4 Word accuracy and recall for unknown utterances ...... 163 5.5 POS statistics for the Stellingen and the NEGRA corpus ...... 168 5.6 Tagging score function ...... 169 5.7 Accuracy on the test set ...... 181

Chapter 1

Introduction

This thesis proposes the grammar formalism of weighted constraint dependency grammars (WCDG) as a unified framework for the treatment of gradation in natural language. Gradation is first of all a phenomenon in natural language data: Not all sentences fit into the binary distinction of right vs. wrong; some are simply slightly better than others without the latter having to be completely rejected. Gradation also concerns structure, i.e., one reading of an utterance can be preferred over another one, be it by conscious reasoning or by perceiving only one of the possibilities in the first place. Gradation is both an intra-personal and an inter-personal phenomenon, i.e., one and the same speaker may argue about graded assessments, or preferences can be due to statistical distributions over a population of speakers. And finally, gradation is also found in classification tasks when the mapping between input and output is inherently uncertain, e.g., in speech recognition.

This introduction presents examples of gradation in natural language and motivates a framework for natural language processing that heavily relies on the concept of graded information. Subsequently the major claims of this work are presented and this thesis is related to prior work. Finally an outline of this thesis is given.

1.1 Gradation in natural language

Gradation is ubiquitous in natural language. Gradation refers to fine-grained assessments of natural language or to degrees of certainty in classification tasks. A graded judgment can come in different forms, e.g., as a ranking (This sentence variant is better than that one.), by analogy with a different scale, such as length of lines (This sentence is as good as the third line is long.), or as a numerical score on a pre-defined scale (This sentence gets a score of 3.4/a probability of .23.). Both utterances alone and linguistic structures, i.e., analyses of the underlying hierarchical composition of sentences, can be judged by using a graded scale.

Chomsky [1965] distinguishes between the competence system, i.e., an ideal speaker-hearer’s knowledge of his language, and the performance system, i.e., the actual use of the language including all processing difficulties. A sentence is considered grammatical if it belongs to the language of the grammar according to the linguistic knowledge, i.e., grammaticality is defined in terms of the competence system. Traditionally, theoretical linguistics deals with this aspect of language. Acceptability of a sentence additionally depends on the performance system of language, i.e., an utterance is acceptable if an actual native speaker-hearer judges it (spontaneously) as belonging to his language. Performance issues are investigated in, for instance, psycholinguistics and cognitive science. Since the performance system has to respect restrictions of the processing system, such as memory limitations, and since it should explain all (spontaneous) utterances and judgments of utterances, it must include a notion of fuzziness or probability. Hence gradation has always been accepted for this area.1 But even for the competence system, gradation has been acknowledged quite early:2

“... an adequate linguistic theory will have to recognize degrees of grammaticalness.” —— Chomsky [1975, p. 131]

It even has been identified as a central part of the competence system:

“Die sprachliche Kompetenz äußert sich u. a. in folgendem. Der kompetente Sprecher kann [...] (e) Grade der sprachlichen Abweichung unterscheiden, wie sie in zunehmendem Maße in den folgenden Beispielen vorliegt: [...]” —— Grewendorf, Hamm & Sternefeld [1987, pp. 32-33]

Contrary to intuition, grammatical sentences are not automatically acceptable [Sternefeld 1998]. Repeated self-embeddings (cf. Example E-1)3 are often unacceptable but nevertheless grammatical. Example E-2 is incomprehensible because it misleads the reader in a way similar to garden path sentences.

E-1 Das [der [dem Menschen wichtigen] Gesundheit] schädliche Rauchen...
The [the [the human important] health] harmful smoking...
‘Smoking being harmful to the health that is important to humans...’

E-2 Vorschriften[nom,dat], die[nom] Dich stören, solltest Du Dich widersetzen.
‘You should defy regulations which perturb you.’

While ungrammatical utterances are produced very frequently in spontaneous speech (and thus must be dealt with in a natural language analysis system), it is harder to find ungrammatical sentences that are systematically judged acceptable. Sternefeld [1998] discusses this case and gives Example E-3 on the next page which lacks the complementizer daß.

1 Another reason why graded information is more common for performance is that experimental data, such as reading time, is not binary per se so that a graded interpretation seems more natural.
2 Cited after Keller [2000b, p. 18]. Although published in 1975, a manuscript of ‘The Logical Structure of Linguistic Theory’ [Chomsky 1975] has been widely circulated since 1955.
3 Examples E-1, E-2 and E-3 on the next page are taken from Sternefeld [1998].

E-3 Ein Satz, den zu lesen Günther glaubt den Peter erfreuen wird, steht auf Seite 428.

While the competence vs. performance dichotomy attracts a lot of attention from a theoretical point of view, it is of only minor relevance to the enterprise of building practical systems for natural language analysis. Ultimately, it is irrelevant to such a system whether a particular preference is justified by the competence system or (only) by the performance system. Linguistic input should be processed in the best way possible. This argument holds in an even broader sense. Pragmatically, one can demand that even distorted sentences that would be judged as both ungrammatical and unacceptable should in fact be included in a system of graded assessment since even those sentences might contribute something to the analysis, possibly in the form of a partial analysis. For the purpose of this thesis, we therefore assume gradation not to be restricted to competence or performance but to include many more aspects where weighted information proves helpful. In this broad sense gradation can be found in quite a variety of different phenomena in natural language processing,4 as can be seen from the examples in the following subsections. A central characteristic of graded assessments is that they potentially conflict with each other. More than one preference might be triggered in a sentence variant, some of which favor one outcome of the analysis and others another one. This potential for conflicts is deeply rooted in our human language system, as will also be demonstrated by the examples. Deviation from impeccable sentences is not always undesirable. Speakers deliberately produce utterances that violate certain preferences in order to attract the attention of a hearer, to emphasize certain aspects or to signal the forthcoming end of their contribution to the dialogue. These communicative functions are often achieved by means of not fulfilling the expectations of the hearer. In such cases the conflict not only has to be arbitrated but the conflict itself contributes to the interpretation of the sentence.

1.1.1 Grammaticality, acceptability and beyond

The following examples5 show that language does not in general adhere to strict criteria. We simply note that all these examples actually do occur in natural language and conclude that a robust system for natural language analysis should deal with them in some way.

• The following examples are taken from Keller [2000b, pp. 51-52]. The selection of the auxiliary verb in German passive formation (haben/have or sein/be) depends (at least) on the classification of the verb with respect to a semantic hierarchy (change of location, change of state, continuation of state, existence of state, uncontrolled process, motional controlled process and non-motional controlled process), on telicity, on animacy of the subject and on dialect.

4 We only consider natural language analysis in this thesis. However, we feel that based on similar reasoning it makes sense to argue for a graded approach to natural language production as well.
5 We omitted the usual acceptability marks ‘*’ and ‘?’ from our own examples since they reflect the duality of grammaticality which is in opposition to our approach. However, when citing an example we also quote the acceptability marks as they appear in the source.

E-4 Die Frau ?ist/hat etwas getorkelt.
The woman is/has a-bit tottered.
‘The woman tottered a bit.’

E-5 Die Frau *ist/hat in der Wohnung getorkelt.
The woman is/has in the flat tottered.
‘The woman tottered in the flat.’

E-6 Die Frau ist/*hat in die Wohnung getorkelt.
The woman is/has into the flat tottered.
‘The woman tottered into the flat.’

Both variants in Example E-4 are acceptable, although hat seems to be the preferred option. The detelic variant in Example E-5 prefers hat even more while the telic alternative in Example E-6 highly prefers ist. However, different native speakers of German prefer different variants; only for some the dispreferred variant is acceptable.

• Genuine grammatical errors are most prominent among graded phenomena. Different deviations have to be weighted against each other, but most importantly they have to be analyzed in the first place. To do so one has to balance out a huge variety of errors including but not limited to spelling errors (Example E-7), punctuation mistakes (Example E-8),6 use of adjectives instead of adverbs (Example E-9), violated agreement (Example E-10), omission of words (Example E-11), neglect of auxiliary verb (Example E-12), use of comparatives for absolute adjectives (Example E-13) and many more.

E-7 The enviroment around the libary (accept the hi bildings) is exceptable.
E-8 This birthday card say’s everything...
E-9 (How are you?) Thanks, I’m good.
E-10 Physics are my favorite subject. The whereabouts of the stolen diamond are unknown.
E-11 Incomplete sentences not acceptable.
E-12 Where played you football?
E-13 Clara is more pregnant than last month.

• The acceptability of sentences with topicalized constituents depends on the syntactic structure as well as on the communicative function.

E-14 You’ve been telling me nice things!
E-15 Nice things you’ve been telling me!
E-16 Me you’ve been telling nice things!

• Metaphorical use of language often violates selectional restrictions but helps to concisely express semantic parallels.

6 Although good examples for real-world deviations from ideal language use, spelling or punctuation errors will not be dealt with in this work. However, incorporating these aspects into the proposed framework is feasible, e.g., by adding spelling variants of words with associated penalties proportional to their Levenshtein distance. We do feel that real-world applications can greatly benefit from such small improvements that are often considered scientifically uninteresting.

E-17 The car uses up all my money.
E-18 The car eats up all my money.
E-19 The car eats gasoline.
E-20 The car eats all my money.
E-21 The car died.

• The separated prefix of a German discontinuous verb usually follows all other components in a declarative sentence. On the other hand the attempt to keep (semantically) related parts closely together can lead to extraposed constituents.

E-22 Das hängt von Dingen, die ich nicht verantworte, ab.
Das hängt von Dingen ab, die ich nicht verantworte.
Das hängt ab von Dingen, die ich nicht verantworte.
That depends from things that I not account for
‘That depends on circumstances I do not account for.’

• Sometimes it is more understandable to use a different gender than the grammatical one in pronominal reference (Examples E-23 and E-24).7 Similarly a plural pronoun is used to refer to a singular antecedent in Example E-25.

E-23 Das Mädchen[neut] sucht nach seiner[neut] Mutter.
Das Mädchen[neut] sucht nach ihrer[fem] Mutter.
‘The girl looks for her mother.’

E-24 Alfo hielt die Zahnbürste[fem], als wäre sie[fem] eine Waffe.
Alfo hielt die Zahnbürste[fem], als wäre es[neut] eine Waffe.
‘Alfo held the toothbrush as if it were a weapon.’

E-25 Everyone_i has their_i own role to play.

1.1.2 Preference

Another class of graded assessment is preferences. Preferences simply rank sentences or readings of truly ambiguous sentences but do not distinguish between grammatical and ungrammatical in the traditional sense. Typically, there is no clear-cut division between grammatical constraints and preferences; the topicalization examples above can either be classified as grammatical conditions or as preferences.

• The subject is preferably placed before the object in German although the inverse word order is also perfectly grammatical.

E-26 Die Frau[nom,acc] sieht die Tochter[nom,acc].
The woman sees the daughter.

E-27 Die Frau[nom,acc] sieht der Mann[nom].
The woman sees the man.
‘The man sees the woman.’

E-28 (Wen sieht die Tochter?) (Whom does the daughter see?)
Die Frau[nom,acc] sieht die Tochter[nom,acc].
The woman sees the daughter.
‘The woman, the daughter sees.’

E-29 Osama Bin Laden und 21 weitere Anführer hätten die eingekesselten al-Qaida-Kämpfer ausliefern sollen, um frei abziehen zu dürfen.

7 Example E-24 has been adapted from Thiel [1987].

Example E-26 on the preceding page is syntactically ambiguous with regard to the subject and object functions. Nevertheless the reading with Frau as subject is preferred as long as no counter-evidence is available as in Example E-27 on the page before (morpho-syntax) and Example E-28 (information structure and optional prosody). Example E-29 is a similar, but real-world example8 with inverted subject-object word order; in this case world knowledge is needed for the correct analysis.

• Ceteris paribus, a referential pronoun is preferably resolved to the nearest antecedent. Of course, even very long distances between antecedent and pronoun are allowed and actually do occur in natural language data.

E-30 Peter hates his boss since he likes swimming.

E-31 Peter hates his boss and he likes swimming.

E-32 Peter hates his boss since he dislikes authorities.

In Example E-30 one might have a slight tendency to associate he with boss because the distance to Peter is longer and no further evidence enforces the longer reference resolution. As soon as additional evidence is available, however, the longer dependency is preferred, for instance, in the case of syntactically parallel constructions (Example E-31) or semantic hints (Example E-32). Again different kinds of graded evidence have to be weighted against each other.

1.1.3 Combining different levels of analysis

The human language system is often divided into different subsystems.9 Syntax and semantics are the most prominent ones. Not only do we find graded assessments on each of these levels but the relation between them can be graded as well. The relative ‘strength’ of a level depends on a variety of conditions and, of course, the lack of good evidence on other levels.

• Chomsky’s well-known Example E-33 [Chomsky 1957, p. 15] uses valid lexical items, is syntactically well-formed but contains semantic anomalies. Semantics should probably not play a prominent role during its analysis since almost no (meaningful) semantic evidence is available.

E-33 Colorless green ideas sleep furiously.

8 Taken from Spiegel Online 2001/12/13, http://www.spiegel.de/politik/ausland/0,1518,172454,00.html.
9 See, for instance, the arguments in Forster [1987], Marslen-Wilson & Tyler [1987], Fodor [1987] and Frazier [1987].

• Example E-34 is a more realistic example which is syntactically distorted but nevertheless understandable. The semantic hints suffice to construct a meaning. In Example E-35, selectional restrictions overrule word order preferences.

E-34 We cinema tonight going.
E-35 Steak eat lumberjack.

• The poem Jabberwocky by Carroll [1872] is an extreme case where most of the lexical items (except function words) are non-words and therefore almost no semantics is available. Nevertheless people are able (at least partially) to assign a syntactic structure and even assume a very vague and underspecified meaning.

E-36 ‘Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
—— Carroll [1872, Jabberwocky]

• Sometimes one level is needed to resolve ambiguities on a different level. The syntactic attachment of prepositional phrases, such as in Examples E-37 and E-38, is largely influenced by the verb’s and the noun’s degree of affinity to the prepositional phrase [Hindle & Rooth 1993].

E-37 Don’t start_V a discussion_N [with customers]_PP→N.

E-38 Please help_V the customer_N [with the decision]_PP→V.

1.1.4 Uncertain information

In the previous section we looked at the relative strength of different linguistic levels. This section considers a related issue, the integration of inherently uncertain information from (possibly extra-linguistic) sources outside of the analysis system.

• If we understand natural language analysis as syntactic parsing and semantic disambiguation, the recognition uncertainty stemming from a speech analyzer can be handled as graded assessment of word hypotheses. See Section 5.6 for more details.

• Prosodic information, e.g., word stress, is inherently uncertain. Nevertheless the two variants in Example E-39 and E-40 have different meanings.

E-39 Lassen Sie uns noch einen Termin ausmachen.
Let you us yet another/still an appointment schedule.
‘Let’s schedule yet another appointment.’

E-40 Lassen Sie uns noch einen Termin ausmachen.
Let you us yet another/still an appointment schedule.
‘We still have to make an appointment.’

Although these examples mark an extreme case since the difference in meaning is determined by word stress alone (and sentence context if available), the inclusion of graded prosody information can enrich the analysis process in other ways, e.g., by detecting likely phrase boundaries or by supporting an inverted subject-object word order (cf. Example E-28 on page 6).

• Probabilistic part-of-speech (POS) taggers can accurately determine the likely categories of words. Although a syntactic parser operating on word forms and using a lexicon can disambiguate POS tags independently of a POS tagger, it nevertheless makes sense to incorporate into the parsing system such a probabilistic preprocessor that annotates words with tag probabilities. See Section 5.7 for more details.

• Similar to the POS tagger case, it might be beneficial to use a chunker device which determines likely phrase boundaries in an utterance to help the syntactic parser to determine the syntactic structure more efficiently. Again this evidence should be integrated as graded information because of its uncertain nature.

1.1.5 Summary

The examples from the previous sections are all understandable (at least partially or in specific contexts) and they might occur in actual language data. In all cases some kind of graded information is involved that on the one hand prefers some (and hence also disprefers other) readings and on the other hand provides valuable information about inconsistencies. The causes of deviation from an ideal can be manifold. This thesis does not always make assumptions about the deeper origin of graded information. We simply note that gradation plays an important role in natural language and conclude that an analysis system can benefit from regarding gradation as a vital part of the parsing system.

1.2 Motivation of gradation in natural language processing

The previous section gave evidence for the validity of the working hypothesis that gradation can be found ubiquitously in natural language. This section is based on this working hypothesis (as is the rest of this thesis) and explicates why a grammatical framework for natural language analysis should not only accept the existence of graded phenomena but instead provide means to deal with gradation as a central concept of natural language understanding.

• If one accepts that gradation plays an important role in human natural language processing, it seems only consistent to model it explicitly also in an artificial system dealing with natural language. Note, however, that the human language processing mechanisms are not our primary concern although the proposed grammar system may be used as a modeling tool in psycholinguistics or cognitive science.

• Robustness is the ability to deal with unexpected and possibly erroneous input. Some (if not all) examples in the previous section can be classified as ungrammatical with respect to some grammar. However, if we are aiming at a system that can analyze natural language as flexibly as a human being can, we have to provide analyses for these examples, and a process assessing these analyses inevitably has to deal with graded information.

• Graded information can be inconsistent with natural language utterances, and robustness can only be achieved when such violations do not lead to parsing failures. On the other hand graded preferences are valid most of the time and can thus be exploited as valuable information to make the disambiguation task more accurate, more reliable and more efficient.

• Some classes of graded information rank competing analyses for a single utterance. Humans usually only perceive a single analysis, even for truly ambiguous sentences. We are also interested in a complete disambiguation, i.e., a single analysis instead of an enumeration of possible structures, because subsequent stages of processing cannot reasonably distinguish among multiple candidates. Hence, graded information must be consulted to make a well-founded decision. Note that graded information is used for two contrary purposes: Firstly, the coverage of the grammar is extended so that more analyses become possible where fewer can be found using a strict grammar. And secondly, the inclusion of preferences leads to fewer (i.e., completely disambiguated) results at the same time.

• If one wants to include every available piece of information that can help to disambiguate a natural language utterance, some arbitration mechanism must be provided that inevitably has to utilize graded information to break ties in a justified way. In contrast to all previous reasons for graded information, this last one of information fusion is mainly extra-linguistically motivated but nevertheless important in practical cases.

Hence it is our firm understanding that no artificial system that neglects the concept of gradation will ever converge to or even reach the outstanding human capabilities of natural language understanding. Traditional linguistics might not be wrong when it idealizes natural language and considers only the competence system of a speaker, but it definitely tackles a different problem.

1.3 Central claims

Starting from the assumption that gradation is one of the cornerstones of natural language processing, the grammar formalism of weighted constraint dependency grammars (WCDG), which is centered around the idea of graded evidence, is proposed and formally defined (cf. Chapter 2). Three main claims are put forward.

1. The proposed grammar formalism is suitable for modeling a non-trivial subset of natural language. See Chapter 3.

2. Algorithms can be devised that are capable of solving the parsing problem for a natural language utterance and for a grammar that has been formulated in the new formalism. Different solution methods exhibit different characteristics, and are therefore suitable for a number of applications with varying requirements. Implementations are feasible and computationally tractable in practical cases. See Chapter 4.

3. A number of attractive properties can be experimentally confirmed for the new grammar formalism. See Chapter 5.

• Given an appropriate grammar, sentences that would otherwise be declared ungrammatical or uncovered can now be analyzed. This robust behavior can be proven not only for genuine syntactic errors but also for other deviations, such as semantic anomalies. See Section 5.3.

• The formalism can not only parse deviant input but also allows the location of the faults and the identification of the type of error without further effort. Such diagnostic power is indispensable in applications for computer-aided natural language learning. See Section 5.4.

• The robustness against errors in language use can be carried over to robustness against processing restrictions, such as time restrictions. See Section 5.5.

• The formalism is flexible enough to integrate and exploit graded evidence originating from external sources. See Sections 5.6 and 5.7.

• The parameters of a grammar can be estimated from an annotated corpus either for local clusters of constraints or for a complete grammar. See Sections 5.8 and 5.9.

1.4 Prior and related work

Gradation in natural language has many different aspects, some of which have been studied in different research areas such as theoretical linguistics, psycholinguistics, computational linguistics, artificial intelligence, cognitive science and computer science. The notion of gradation does not play a major role in theoretical linguistics, in particular generative linguistics, although its importance has been recognized early [Chomsky 1964; Chomsky 1965]. Traditionally, linguists and especially syntacticians argue about the binary decision of whether a sentence is grammatical (competence), while deviation in actual language data is explained with generic references to performance aspects. Sells [1987] summarizes this attitude towards gradations in acceptability ratings:

“Fuzziness of intuitions is noise in the data to the syntactician, and one must simply find ways to deal with it.” —— Sells [1987, pp. 6-7]

Optimality theory [Prince & Smolensky 1993, OT] for the first time rests upon the concept of possibly conflicting, violable constraints. The constraints are assumed to be universally valid and only their specific strict orderings in various languages determine the differences between individual languages. Grammaticality is defined in terms of least constraint violation, i.e., a linguistic realization or structure is optimal (and thus grammatical) if it does not violate higher ranked constraints than any other structure. The view is usually a generative one, i.e., different realizations are ranked by constraints but the models do not aim at the structural disambiguation of a single utterance. The focus of optimality theory research is still on crisp grammaticality:

“[...] eine Struktur ist grammatisch, wenn sie besser als ihre Konkurrenten ist, und ungrammatisch sonst. Feinere Differenzierungen sind nicht vorgesehen.” —— Fanselow & Féry [forthcomingb, p. 13]

Only lately has gradation in grammaticality attracted more attention [Keller 2000b; Sternefeld 1998, for instance]. This shift has been influenced by new methods of collecting empirical judgment data. In particular the ubiquity of conflicts [Fanselow & Féry forthcomingb] between (and within) linguistic rule systems and linguistic data leads to the introduction of violability of constraints and preference rules. For instance, Keller [2000a] further elaborates on Uszkoreit [1987]’s system of weighted constraints for German word order and presents supportive data that was collected using a new experimental methodology [Keller 2001]. Graded grammaticality slowly finds its way into linguistic research [Fanselow & Féry forthcominga] although the phenomena are still often explained by means of performance models [Schlesewsky forthcoming]. In the meantime, a substantial portion of research in computational linguistics performed a shift away from devices that recognize the grammatical sentences of a language, towards natural language applications that deal with large amounts of actual linguistic data.10 Driven by the enormous increase of computational power and memory capacity in modern computers, the success stories have mainly taken place in the field of probabilistic grammar research. Different attempts to reconcile the two views [Klavans & Resnik 1996] have been made from varying perspectives but the dualism between Chomskian rationalism and Shannon’s empiricism [Pereira 2000], theory and application [Menzel 2000] or scientific understanding and approximation [Abney 1996b] still persists. The technological challenge of dealing with all kinds of language data has been tackled mainly by two means: shallow and partial parsing as well as statistical disambiguation. The former circumvents the tough problems by reducing the depth or the completeness of an analysis. Finite-state techniques [Abney 1990; Abney 1996a; Abney 1997; Karttunen et al. 1997; Karttunen, Gaál & Kempe 1997; Järvinen & Tapanainen 1997] are particularly successful in this area. Statistical approaches [Black et al. 1992; Jelinek 1990; Chelba & Jelinek 1998; Eisner 1996; Collins 1996; Collins 1997; Collins 1999; Charniak 1993; Carroll & Charniak 1992; Magerman 1994; Magerman 1995; Ratnaparkhi 1997a; Ratnaparkhi 1998; Bangalore & Joshi 1999; Joshi & Srinivas 1994] not only overcome the problem of too small a grammatical coverage and the lack of error tolerance because they always yield the ‘most plausible’ analysis as a result, but they also offer solutions to some of the intrinsic and difficult problems of strict manual grammars: The labor-intensive tasks of acquiring and porting grammars can be avoided as long as a suitable annotated corpus is available from which probabilistic grammars are usually automatically induced. They also facilitate a complete disambiguation by resolving even real ambiguities with preferences (encoded by probabilities). Two disadvantages can be identified for stochastic approaches (apart from the ubiquitous data sparseness problem). They often come as a closed system, i.e., it is not feasible to integrate external preferences into the probability models, in particular when the external ratings are incomplete, i.e., when they do not cover all cases of the probability model. The second drawback concerns the methodological status of the model.
In statistical models, the underlying principles are not explicitly represented but distributed over a large set of numerical values so that they are not accessible for inspection. The individual learned parameters themselves cannot be meaningfully interpreted.11 Furthermore the problem of language acquisition, i.e., the question of how a human can learn a language given the typical amount and quality of language data to which he is exposed, is rarely addressed. The holistic explanation of emergent events is shared by neural networks.

10 For early accounts of weighted grammars in artificial intelligence see, for instance, Thiel [1987].
11 Though not strictly belonging to the class of statistical approaches to natural language processing, rule induction systems are somewhat of an exception. Error-driven learning of transformation-based rules [Brill 1995; Brill 1996], for instance, yields relatively small sets of rules from annotated examples that are in fact interpretable. Nevertheless, they cannot explain different ratings for different readings.

The framework proposed in this thesis can in this context be seen as a step towards a grammatical framework which combines the advantages of stochastic models of language and of traditional rule-based grammar systems. We bring together the successful concept of graded assessment with the transparent explanatory power of a rule system. We allow the inclusion of information from heterogeneous knowledge sources, both within the linguistic system, for instance syntax and semantics, and from external sources, for instance acoustical preferences and domain knowledge. Finally, we also point to a way of how free parameters can be estimated from corpora.

[Figure 1.1: Structure of this thesis. Chapters 1 to 6 (Introduction, WCDG, Grammar Modeling, Algorithms, Results, Conclusions) in reading order, with optional paths that skip the formal definitions, the linguistic modeling or the algorithms.]

1.5 Overview of this thesis

This thesis is structured as follows.

• The introduction featured examples for gradation in natural language that serve as a motivation for a concept of gradation in natural language processing. It also put forward the main claims of the thesis.

• Chapter 2 introduces the formalism of weighted constraint dependency grammars. Formal definitions for both the formalism and the proposed constraint language are given.

• The subsequent Chapter 3 describes the general techniques that were used in a grammar for a non-trivial German corpus. It starts with non-graded grammar fragments and then extends them to graded conditions.

• A set of algorithms capable of solving the parsing problem for weighted constraint dependency grammars is introduced in Chapter 4. A discussion of their advantages and disadvantages as well as their applicability for particular tasks is included.

• A compilation of different experimental results is presented in Chapter 5. Some aspects of the examples from this introduction are revisited and discussed in the light of the proposed grammatical framework for gradation. The chapter contains extensive experimental data.

• Chapter 6 summarizes the thesis, highlights the achievements and describes possible further lines of research.

Although reading the thesis from cover to cover is the preferred sequence, Figure 1.1 outlines alternative possibilities. The technically uninterested reader can initially skip over the formal definitions in Chapter 2 and the algorithmic part in Chapter 4 and return to these sections later. Alternatively, a linguistically uninterested reader can at first omit the linguistic modeling in Chapter 3.

Chapter 2

Constraint Dependency Grammar

2.1 Overview

This chapter introduces the grammar formalism of weighted constraint dependency grammars (WCDG).

Since the parsing problem for WCDG can be viewed as a (partial) constraint satisfaction problem (CSP), this class of problems is introduced first. Sections 2.3 through 2.5, the main part of this chapter, contain the formal definitions for WCDGs along with illustrative examples. Section 2.7 summarizes differences between WCDGs and CDGs as defined by Maruyama [1990b]; Section 2.8 clarifies complexity issues. The subsequent section then reviews related work and finally a summary of the whole chapter is presented.

2.2 Constraint satisfaction problems

This section introduces constraint satisfaction problems (CSP), both intuitively and formally (as far as it is required for this thesis). Algorithms that are capable of solving these kinds of problems are described in Chapter 4.

2.2.1 Definition

Meseguer [1989] and Kumar [1992] feature introductory overview articles on CSP. However, the formal definitions used here occasionally depart from theirs in order to better fit in the framework of constraint dependency grammars. More recent introductions to CSP can be found in Miguel & Shen [2001b] and Miguel & Shen [2001a].

Definition 2.1 A constraint satisfaction problem (CSP) is represented by a triple  where

15

16 2. Constraint Dependency Grammar   !#" %$&$&$&'!)(+*

 is a finite set of constraint variables ;

" (

,- . %$&$&$& *

 is a finite set of finite domains corresponding to possible !/ values for the constraint variables  . The degree of a variable is the

number of possible values in its domain 0 .

123+4564879 :4;3@

 is a finite set of constraints where is the >

tuple of constrained variables and 042ABC <=ED9$&$&$FDGC is the set of assign- H ments to those variables that are allowed by constraint 1 . The number of

variables put in relation is the arity of constraint 1 . An alternative to ex- plicitly specifying a separate domain for each variable is to assume a single uniform domain for all variables and restrict the values of individual variables using unary constraints, i.e., constraints of arity one. The relative satisfia-

bility of a constraint 1IJK456487L is the relative frequency of permitted

value tuples:

P P

64

M

P P

1 O :4S@

CSP N for

> C <=QDR$&$&$:DC

The tightness of a constraint is the complement of the relative satisfiability. The higher the tightness of a constraint, the more constraining information is expressed by it.

The maximum degree of all variables is the degree of the CSP and the maximum arity of all

constraints is the arity of the CSP. A (partial) assignment of values to constraint variables

TVUW,XY Z C

(also called labeling of constraint variables) is a (partial) function "'[# <[#( with

T !) ¨O7\C T^]

N . A solution to a constraint satisfaction problem is a complete assignment that

satisfies all constraints:

_

] ]

= >

rT !) O%$&$&$&sT !) O'^7t64

N N

`badce f/cdgih5j

adclkm`onp e q q q e np g

>

=

T !u <=O%$&$&$&sT !) ?>vO N A tuple of value assignments N is said to be inconsistent or in conflict if it fails

to fulfill at least one constraint: w

= >

rT !) O%$&$&$&sT !) O'zy7t64

N N

`badcse f/cxgih%j

p p

g e q q q e n adcdkm`on

> =

The density of a problem is only defined for CSP with an arity of two, i.e., binary CSP, and

denotes the ratio between the number of actual binary constraints and number of variable pairs:

P P P

{

|:4%64s^7 #4S

CSP  where is the number of variables (as above)

¥~ ¥€‚.O N

There is a tendency that a CSP with a high density is more constrained than one with a low density although more parameters, especially the tightness of the constraints, influence the constrainedness. The tightness of a CSP is computed from the corresponding figure of the

constraints, usually the mean is taken. ¤¡ CSPs belong to the set of -complete problems [Hopcroft & Ullman 1979, cf. also Section 2.8]. 2.2 Constraint satisfaction problems 17

2.2.2 Example

Cryptarithmetic puzzles are typical examples of constraint satisfaction problems. An arithmetic equation is presented with the digits replaced by letters (cf. Figure 2.1a). The task is to assign a decimal digit to each (possibly repeatedly occurring) letter; each digit must not be used more than once.

S E N D S E N D 9 5 6 7 + M O R E + M O R E + 1 0 8 5 C3 C2 C1 0 1 1 M O N E Y M O N E Y 1 0 6 5 2

(a) (b) (c)

Figure 2.1: A cryptarithmetic puzzle as a typical CSP problem.

The direct mapping of letters to constraint variables is straight-forward. However, introducing additional constraint variables for the carry bits (C1, C2 and C3 in Figure 2.1b) facilitates a

more convenient constraint formulation. Following that idea, the set of constraint variables

„ˆ‰ ‹ŠŒŽGsB‘C’C“”* ‡•– ‹Ž)2—/I˜|*

is ƒ „¤†9‡ with and . The domain of 0ž¤

the variables corresponding to carry bits is š™›ˆ .œ ‹* and the set of decimal digits „ .œ )s—/s˜/'Ÿms /s¡/£¢|s¤ms¥/* for the rest. The degree of those variables from set is ten and of those

from set ‡ it is two; hence the degree of the CSP is ten. Specifying constraints by enumerating all allowed value tuples – as suggested by definition 2.1 – is sometimes inconvenient. Instead, constraints are often expressed by means of logical formulae, lists of forbidden value tuples, expressions in a specific constraint language or even code in a programming language. The only restriction that a constraint formulation must fulfill is that the constraint must be efficiently computable and that its meaning must be unambiguous.1 For the example, mathematical formulae seem to be the most useful for constraint formulation. The set of constraints for cryptarithmetic puzzles can be divided into two groups:

£ Uniqueness: No two variables must be assigned the same digit. This constraint is appli-

cable to all instances of cryptarithmetic puzzles.

_

T O«y¬T O

N<ª N<­

e §h5¨

¦

ku§ ¦ ©

Note that this constraint is actually not a single constraint but a constraint template.

„ ­

The meta-variables ª and can be bound to any of the variables from set , i.e., the

P P P P

„ €¤.OE @¡ „ ~

template stands for N constraints (half of which are redundant because the inequality relation is symmetric). The arity of these constraints is two because they constrain the values of two variables; hence they are also called binary constraints.

1Constraint evaluations are considered basic operations, i.e., a constraint which just states that a value must be such that it is part of a global solution is useless since evaluating that constraint is as hard as solving the overall problem. 18 2. Constraint Dependency Grammar

£ Arithmetic validity: These constraints are instance specific since each puzzle imposes different restrictions. In order to ease the exact formulation of the constraints we define two functions: modulo and integer division.

def

 €° O²~

­ N<ª¯®}­ ªF®}­|± ­ ª mod

def

 °

­ ª¯®}­/± ª div

For our example, the following constraints are appropriate.

T ŠO´³‚T µO¶³\T 2˜uO'O %œ  T ‘IO

N N N N

mod N

T ŠO´³‚T µO¶³\T 2˜uO'O %œ  T µO

N N N N

div N

%œ  T RO T ·O¸³\T ‘O¶³\T 2—uO'O

N N N N

N mod

T ·O¸³\T ‘O¶³\T 2—uO'O %œ  T 2˜uO

N N N N

N div

T RO´³\T ’IO¶³\T 6.O'O %œ  T ·O

N N N N

N mod

T RO¶³\T ’IO¶³\T 6.O'O %œ  T 2—uO

N N N N

N div

%œ  T “ŽO T O´³\T ·O'O

N N N

N mod

T O´³\T ·O'O %œ  T 6.O

N N N N div

These constraints have an arity between three and four; therefore the arity of the CSP is four.

Figure 2.1c on the preceding page shows a solution to the example problem. If the problem is defined as above, there are 24 additional solutions with zero assigned to variable M, which is usually forbidden by an additional unary constraint because the variable M stands for a leading digit.

2.2.3 Partial constraint satisfaction problems

Real-world requirements often do not allow one to formulate a problem as a CSP or solve it completely.

£ Real-world problems are sometimes over-constrained, i.e., there exists no solution that fulfills all constraints.

£ It is often desirable to use different kinds of constraints in a problem formulation. While some constraints are absolutely indispensable, others may simply encode properties that a solution preferably but not necessarily has. 2.2 Constraint satisfaction problems 19

£ Some problems are inherently intractable, i.e., they cannot be solved completely with current techniques and computational resources during the lifetime of the problem. Other problems could be solved if enough time were available, but external requirements put temporal limits on the process, e.g., a flight schedule restricts the time that is available for luggage sorting on an airport.

£ Some problems simply do not need an exact solution but one is willing to settle for a solution that is ‘good enough’.

In such situations one can try to find an assignment which at least partially fulfills the require- ments. Partial constraint satisfaction problems (PCSP) [Freuder & Wallace 1992] formalize the notion of partial satisfaction. The traditional (crisp) CSP can be seen as a special case of the partial constraint satisfaction problem class. PCSP can be divided [Tsang 1993] into minimal violation problems (MVP), where one wants to find a labeling such that a minimum number of constraints is violated, and the maximal utility problem (MUP), the solution of which assigns values to a maximum subset of the variables with no constraints violated. Both views are relevant in practice:

£ A typical instance of the MVP type is the problem of assigning a number of passengers with different needs and preferences to a number of available flights. While it might be tolerable if some individual wishes cannot be fulfilled, all passengers have to get a flight eventually.

£ The decision of which cargo to load on a specific airplane and which freight to leave for a later plane is a similar problem but belongs to the MUP type. There is a strict upper bound for the weight and possibly other strict constraints.

As the example suggests, the classification as MUP or MVP sometimes depends on the problem formulation. In the context of natural language parsing, a MVP corresponds to assigning a structure to the complete sentence but allows some constraints to be violated. Partial parsing, on the other hand, which uses only a subset of the input words for building a structure, more directly maps to a MUP. However, the decision of which words may be omitted from the solution cannot be properly controlled. Since partial parsing can also be handled reasonably in the context of MVPs,2 partial constraint satisfaction problems from here on are understood as minimal violation problems, i.e., as the problem of complete value assignment with possibly some constraints violated in some minimal way. Since PCSP are a special case of optimization in CSP, we postpone the formal definition until Section 2.2.4.

2.2.4 Constraint satisfaction optimization problems

In practice, even PCSP are often not flexible enough because they do not distinguish between more and less important constraints. Constraint satisfaction optimization problems (CSOP)

2See Section 3.5.4 for a discussion of how partial parsing can be achieved as a minimal violation problem. 20 2. Constraint Dependency Grammar incorporate a more fine-grained distinction and therefore overcome this deficiency. PCSP (at least the MVP) are a special case of CSOP where all constraints are of equal importance. Tsang [1993] calls this kind stochastic CSP and postulates a function that maps assignments to fine-grained assessments. However, associating the assessment with the constraints is equivalent to this approach and more convenient here.

Definition 2.2 A constraint satisfaction optimization problem (CSOP) is represented by a tu-

ple G8¹´Ž º«»·¼C½2 where

GI is a constraint satisfaction problem as in definition 2.1;

¹LU¯¾XY‰ ¹ is a function that assigns a valuation to each constraint; this is related to the ‘importance’ of the constraint;

 is a valuation set; 

º is a total order on ;

 º

» is the maximum element in with regard to ;

 º

¼ is the minimum element in with regard to ; º

½ is a commutative, associative and monotonic (with respect to ) function

D¿ XY 

½ Uu that aggregates elements from the valuation set.

 À TSO  T

Let N be the set of constraints in that are violated by an assignment .

P

À > >

 TSOS |#4%64sÁ7G :4

N N N

Let ¹+Â be a generalized assessment function for assignments.

Æ´ÇÉÈ/Ê Ë¶Ì

¹+Â TÃOB½ ¹ 1 O

4dIJÅ

N N ] The solution of a CSOP is an assignment T of values to constraint variables such that the

aggregated valuation of all violated constraints is minimal and lower than the top element »

of  .

]

T ÎÍÐÏ?Ñ ¹K TŒO

ÈVÒ N

Schiex, Fargier & Verfaillie [1995] provide an algebraic framework for different types of constraint- based optimization problems called valued constraint satisfaction problems (VCSP). Table 2.1 presents some instances of the VCSP framework. The classical VCSP corresponds to the traditional crisp CSP: Constraints are all of equal importance and are either completely satisfied or fail altogether. The other instances are relevant in the context of partial constraint satisfaction. Optimality theory [Prince & Smolensky 1993, cf. Section 1.4] uses the lexicographic VCSP instance [Fanselow & Féry forthcominga, Section 5] that suffers from a kind of the so- called ‘drowning effect’ [Schiex, Fargier & Verfaillie 1995] because one important constraint can overrule any number of weaker constraints (cf. Section 3.5.7). The probabilistic VCSP [Fargier & Lang 1993] understands the constraint score as the probability that the constraint actually

2.3 WCDG as a grammar formalism 21

Ô Õ Ö

Name Notation E Order Ó

Ó ×

Classical × -VCSP {true, false} false true false true

Ú ÛÁÜ Ý

Possibilistic max-VCSP Ø2Ù max

ØIÞ ÚßÞ à%ÛÁÜRá â ã

Lexicographic lex-VCSP Ù

Ø2Ù Ú ÛÁÜ Ý Û

Additive ä -VCSP

æ Ývçsèé Ú è²êìëèê”í/îsëèê”ïuî

Probabilistic å -VCSP 0 1

Ýuçèé ð Ý è ñ Multiplicative CDG-VCSP æ

Table 2.1: Instances of the valued constraint satisfaction problem framework. The information in this table has been compiled from de Givry & Verfaillie [1997], Schiex, Fargier & Verfaillie [1995] and Schiex [1992]. The instance in the last line is used in the WCDG paradigm. exists. The possibilistic VCSP [Schiex 1992] is to an even greater degree subject to the drowning effect than the lexicographic one and exclusively considers the maximum element among a set of violated constraints.

We propose the new multiplicative CDG-VCSP which is similar to the additive instance but

avoids the problems with infinite penalties which are more difficult to handle computationally. œ £ó Scores taken from the interval ò are assigned to constraints and the scores of violated con- straints are aggregated multiplicatively. An aggregated score of zero means complete failure while a score of one indicates an ideal solution. Although different instances are possible in general, we will assume this kind of CSOP from here on. Section 3.5.7 motivates this choice linguistically and discusses some of the consequences for a grammar writer. In particular, the aggregation function takes into account all constraint violations (not just the most severe one) and also allows several weaker constraints to overrule a stronger one.

It was suggested [de Givry & Verfaillie 1997] that by simplifying a more complex problem to a simpler one (e.g., by modifying the constraint graph or by using a different VCSP scheme) and solving the easier problem first, a lower bound for the original problem can be found. In conjunction with the easily obtainable upper bound, an anytime bounding of the optimum in an optimization problem can be achieved. However, no experimental results are available, neither from de Givry & Verfaillie [1997] nor for our special case.

2.3 WCDG as a grammar formalism

This section formally describes the framework on which the grammar formalism WCDG is based. The formalism is deliberately introduced separately from the parsing algorithms to make clear that the meaning is completely descriptive, i.e., the semantics is formally defined without any reference to operational issues. A grammar writer is therefore conceptually independent of the procedures that are used for parsing. However, a specific grammar style3 may have consequences

3Grammar style may differ in a number of aspects, e.g., number of postulated representational levels, number of function labels, integration of redundant constraining information, rate of the violable constraints, coverage etc. 22 2. Constraint Dependency Grammar on other characteristics than pure semantics, such as, for instance, efficiency; Chapter 4 discusses efficiency issues in greater detail. The upcoming sections will define the key components in the WCDG framework:

£ Word graphs which are used as the input to a parsing component and represent utterances.

£ Lexica which associate orthographic word forms with additional information.

£ Constraints which are used to express (grammatical) knowledge.

£ Constraint dependency grammars which glue the items together.

£ Structure candidates which represent possible analyses for particular word graphs.

2.3.1 Word graph

Word graphs account for the individual parsing problems. Usually, the grammar and the lexi- con are relatively static once the development has been finished while word graphs define new

problem instances.

M 

Definition 2.3 A word graph is represented by a set WG of edges of the form õô:ö#'÷I where

ö”7Gø ôù¤ö ÷

ô7Gø is the start point (in time) and is the end point (with ) of word form and M

7¢úû is a score associated with the confidence that the word form has actually been uttered between the start point and the end point. This score is usually an acoustic measure taken from a

speech recognizer (cf. Section 5.6). The minimum start point of all edges is the start point and the

P

M

ÎÍÐÏ?Ñ+ .ô õôKöm'÷ ;7

maximum end point of all edges is the end point of the word graph: ÍÐÏ?Ñ WG

P

M

ÍÐü‹ý Î͎ü‹ý¯ .ö õôKöm'÷I Q7 * rþv" sþ ÿ)%$&$&$&sþ (:

WG * and WG WG . A path is a sequence of edges with

M

õô ö '÷  

þ such that adjacent edges in the path are adjacent in the word graph:

_

ö ¶Îô| "

û

[# ¢¡#(

ô  ÍŽÏ Ñ ö (¢ ÍÐü‹ý A complete path is a path rþv" sþ ÿ@%$&$&$&sþ (m where WG and WG. We usually require all edges to be connected to both the start and end point of the word graph by a path.

Figure 2.2 on the next page exemplifies the idea of word graphs. Nine edges corresponding to nine word hypotheses are shown that can be combined to four complete paths. The scores are mentioned for completeness here; they will not become important before the integration of acoustic scores in Section 5.6.

2.3.2 Lexicon

A lexicon associates a word form (an orthographic form in our case) with additional morphosyn-

tactic, syntactic and semantic information.

£

 ¥¤F§¦6 Definition 2.4 A lexicon is represented by a quadruple „ where 2.3 WCDG as a grammar formalism 23

recognize speech

we 1 2 5 6 wreck a nice beach 3 4

word we wreck a nice recognize speech beach start 1 2 3 4 2 5 5 end 2 3 4 5 5 6 6 score ------

Figure 2.2: A simple word graph.

£ is a finite set of word symbols;

¦ is a finite set of attribute symbols; ž

 is a finite set of value symbols. Let AVM (attribute-value matrices) be the

XY 

set of all partial functions ¦ ; £

AVM ¨

¤¬U XY — ¤ is a function that maps words to a set of attribute-value

matrices. For convenience, we assume that the attribute word exists, i.e., ¨¦

word 7 , and furthermore

_ _



O² ÷

N word

Ç Ì

©

©

Ä

Ä

Note that an attribute-value matrix is a flat structure in this definition: Values are symbols, i.e., no recursion is allowed, nor is structure sharing. Therefore, we neither have a type-token distinction nor do we support structure sharing as in standard feature structures [Shieber 1986;

Carpenter 1992, for instance]. Figure 2.3a on the following page shows a simple lexical entry  as it is usually depicted at the top, and its formal meaning at the bottom. For convenience, we sometimes write lexical entries as if they contained embedded structures; however, this syntactic sugar can always be resolved to the canonical representation (cf. Figure 2.3b on the next page) as long as no cyclic structures are allowed. Every sequence of attributes is collapsed into a single attribute containing colons.4 Negation is not allowed in lexical entries since it would require either a closed world assumption for feature values or an extended feature unification mechanism [Carpenter 1992, for instance]. We nevertheless sometimes use negation in the examples (cf. Figure 2.4 on page 27) when the

meaning is clear, e.g.,  third-sg for value combinations of number and person except third person singular.

The function ¤ is necessary to allow for ambiguous word form like, for instance, run as a verb

¤ OI Á 2*  O  OI

N N or a noun, i.e., N run run run with run cat ‘VERB’ and run cat ‘NOUN’. However, we will sometimes simplify the notation and write a subscript that uniquely identifies the correct lexical reading.

4Even co-reference between substructures is feasible although the type-token distinction is lost when flattening the structures.

24 2. Constraint Dependency Grammar

 

 

  

 word: ‘bellen’ word: ‘bellen’

 

  

cat: ‘FINVERB’  cat: ‘FINVERB’ 

 number: ‘pl’   number: ‘pl’ person: ‘third’ morph: 

person: ‘third’ 

O  

 O  N

N word ‘bellen’ word ‘bellen’

 O 

 O  N

N cat ‘FINVERB’ cat ‘FINVERB’

 O 

 O  N

N number ‘pl’ morph:number ‘pl’

 O 

 O  N N person ‘third’ morph:person ‘third’

(a) Flat AVM (b) Structured AVM

Figure 2.3: Two lexical entries.

2.3.3 Definition CDG

In this section, we finally define weighted constraint dependency grammar in a formal way.

Definition 2.5 A weighted constraint dependency grammar (WCDG) is represented by a tuple

 sT8¹¯ WCDG › LEV LAB where

LEV is a finite set of level symbols (corresponding to role-ids in the definition of Maruyama [1990a]); LAB is a finite set of label symbols;

LAB

TJU XY — T is a function LEV that assigns a subset of the label symbols to each level symbol. Sometimes, we assume that the subsets are pairwise

disjoint

_

^y ¢ mY T O"!šT ¢ ?OS$#

N N

Ê

 Ä LEV

and partition the set of all labels

 % T O

LAB N

Ä LEV

so that one can tell the level given the label, e.g., that the subj belongs to the syntactic level.

 is a set of constraints;

¹ U² XY ò œ £ó ¹ is a function that assigns a penalty factor between zero and one to each constraint.

This definition departs from the original one by Maruyama [1990b] in two respects. First it assumes that labels are valid for certain levels only and – more importantly – provides a weighting function for the constraints which was first suggested by Menzel [1994]. 2.3 WCDG as a grammar formalism 25

A second main difference between CDG as Maruyama [1990b] and also Harper & Helzerman [1994] use it and our version of WCDG applies to the mapping from CDG parsing problems to constraint satisfaction problems and is discussed in detail in Section 2.4. Structure candidates are possible forests of dependency trees, one tree (maybe disconnected) for each level. They represent complete alternative structures that have to be considered when

parsing a sentence (see examples in Section 2.3.4).

£

¥¤¶§¦I6

Definition 2.6 A structure candidate of an input word graph WG given a lexicon „L›

 sTS8¹¯

and a weighted constraint dependency grammar WCDG  LEV LAB is represented by

§&Œ(')

a triple SC µõÁ such that  ›rþu" sþ ÿ)%$&$&$&sþ%(:

 is a complete path in word graph WG. The set of all edges

P

 *) }þ% œ,+.-/+¢¥Œ* 10 B2)·† ‹»*

in is and ) is the set of all edges plus

one otherwise unused symbol » ;

M

& &LU#3)XY ž & þ‹Oz74¤ ÷zO þI õôKöm'÷I  N is a function AVM such that N for all ,

i.e., & selects the lexical reading for the edges in the selected path;

' 'Uv3)zD XY D« 0

is a function LEV LAB ) which assigns structures to the edges

' þ) '§O8 5 sþs )

in the complete path on all levels. A function value N means

þ

that on level  the word contained in edge is subordinated under the word

þ£  contained in edge with label . One such mapping represents a dependency

edge in a dependency tree.

U D7¦« D D D D8¦z D VXY .œ ‹* The function 6 WG LEV LAB WG is then used as an

indicator of whether a dependency edge fulfills a constraint or not.

' þ. '§O5 sþs ‹ 1 &  U

dependency edge N satisfies under

6 þ §& þ O§l§9 rsþ §& þ O1 O;:

N N N U œ else

The following sets are defined for a dependency edge as the sets of satisfied and violated con-

straints, respectively.

P

6 þ% d§& þ% ¨O§l§ sþs u§& þs )O'O  .1«7G 6 þ d§& þ% ¨O§l§ sþ @§& þs )O1 OSΜ/*

N N N N N N

P

6)" þ% d§& þ% ¨O§l§ sþs u§& þs )O'O  .1«7G 6 þ d§& þ% ¨O§l§ sþ @§& þs )O1 OSµ‹*

N N N N N N

use multi-sets here.

< O  6 þ% d§& þ% ¨O§l§ sþs u§& þs }O'O

%

N N N

N SC

p

=>= =@? h

A LEV

O  6)" þ% d§& þ% ¨O§l§¢ rsþs u§& þs }O'O

%

N N N

N SC

p

=>= =@? h

A LEV

õÁ§&Œ(') ' þ '§O5¢ <sþs )

for SC and N 26 2. Constraint Dependency Grammar

In order to aggregate the weights of all violated constraints into a single number, one has to define an aggregation function as described in Section 2.2.4. As mentioned earlier, we use a simple

multiplicative aggregation scheme which results in the following aggregated scoring function:

Â

¹ O B ¹ 1 O N

SC N

Ç Ì CED

4dÄ SC

£

 ¥¤¶§¦6

Definition 2.7 A solution structure for an input word graph WG given a lexicon „

 sTS8¹¯

and a weighted constraint dependency grammar WCDG  LEV LAB is a structure can-

]

õÁ§&Œ(') didate SC  that, intuitively, is minimal with respect to constraint violations among all structure candidates. More formally, the solution structure is the solution candidate where

the aggregated scoring function is maximal:

]

Îü@F§GßÍÐü‹ý ¹+ O

SC N SC SC

2.3.4 Example

In this section, a complete example including a lexicon, a word graph and a weighted constraint dependency grammar is presented to illustrate the ideas behind the definitions. Assume that the lexicon from Figure 2.4 on the next page is given which provides a subset of lexical entries for the words occurring in Figure 2.2 on page 23. Two possible structure candidates for the word graph of Figure 2.2 on page 23 are depicted in

Figure 2.5 on page 28. Whenever unambiguous, we present the path  implicitly by a sequence of words; otherwise the complete edge information must be provided. The lexical selection

(function & ) is suggested by the part-of-speech symbols given below the word information; again, if this kind of presentation is not unambiguous, more distinguishing elements or even the full

lexicon entry must be provided. The structural information of function ' is depicted as trees: a

' þ §O5 sþ   

mapping N corresponds to an arrow on level labeled from a node belonging to þ£ þ% to a node belonging to . An arrow pointing to none of the nodes means a function value

containing the special symbol » . In Figure 2.5 on page 28, two levels are shown: SYNTAX and FUNCTOR-ARGUMENT.

The two left trees in Figure 2.5 on page 28 and the following formal notation are equivalent.

' þ  O   sþ 

N we SYNTAX subj recognize

' þ  O   »2

N recognize SYNTAX none

' þ  O   sþ 

N speech SYNTAX obj recognize

 O   sþ  ' þ

N we FUNCTOR-ARGUMENT agent recognize

' þ  O   »2

N recognize FUNCTOR-ARGUMENT none

' þ  O   sþ 

N speech FUNCTOR-ARGUMENT theme recognize

Note that there exist many more solution candidates than the two that we showed in Figure 2.5 on page 28.

2.4 Mapping from WCDG parsing problem to CSOP 27

I HI

H I

word: ‘we’ I word: ‘recognize’

I

I

I

I I

sem: ‘*speaker*’ I sem: ‘*recognize*’

HI

I

H

I

I

I

I

I I

L L K

cat: ‘PPRO’ cat: ‘VFIN’ K

I

I

I

I

L

L

H

IJ

IJ IJ

IJ

IJ

L

L L

number: ‘pl’ number: ‘sg’ L

L L

syn: syn: morph: NPO

L

K L

K

L

L

Q

M M

morph: person: ‘first’ M person: ‘third’

M

M

L

L

L L

L

L

L

L

K L L

case: ‘nom’ L L

tense: ‘present’ L

I

H

HI I

word: ‘wreck’ I word: ‘wreck’

I

I

I

I I

sem: ‘*damaged object*’ I sem: ‘*damage*’

HI

I

I

H

I

IJ

I

J I

I

L

K L

cat: ‘VFIN’ K

cat: ‘NOUN’ I

I

L

L

I

J

I

J

L

L L

syn: number: ‘sg’ number: ‘sg’ L

M

M

L

L O N

morph: O syn: morph:

L

K

L

L

K

L

Q Q N person: ‘third’

case: ‘ gen’ M

M L

L

L

L

L

L L

tense: ‘present’ L

HI HI

I J

word: ‘a’ I word: ‘nice’

I

I

J J

I sem: ‘*indefinite*’ sem: ‘*beautiful*’

H

M

L L

K K S

cat: ‘INDEF’ syn:: R cat: ‘ADJ’

L L

M M

syn: L

L S

morph: R number: ‘sg’

L

K

HI HI I

I word: ‘beach’ word: ‘speech’

I I

I I I

I sem: ‘*sandy waterfront*’ sem: ‘*spoken language*’

HI HI

I I

I I

J J

I I

J J

L L K

cat: ‘NOUN’ K cat: ‘NOUN’

L L

L L L

syn: number: ‘sg’ L syn: number: ‘sg’

M M

M M

L L O

morph: O morph:

L L

K K

L L

Q N Q

case: ‘ N gen’ case: ‘ gen’

L L

L L

Figure 2.4: An example lexicon.

The affinity of the verb wreck to the direct object beach in a neutral context is probably smaller than that of recognize to speech (especially in a thesis about natural language processing). Hence, one might argue that, all other things being equal, the solution candidate (a) in Figure 2.5 on the next page is superior to solution candidate (b). If no further solution candidates are considered this candidate (a) would then be the solution structure for the grammar and the word graph in the example. Of course, the exact determination of the solution depends on the constraints of the grammar which will be dealt with shortly.

2.4 Mapping from WCDG parsing problem to CSOP

It is the relational nature of dependency structures that makes it easy to map the parsing problem for WCDG to CSOP. In order to do so, one has to define exactly what the constraint variables and the domains are and how a solution candidate can be put in relation to a complete

assignment of constraint variables.

£

 ¥¤F§¦6 

Given a lexicon „‰ , a weighted constraint dependency grammar WCDG

P

£UT

M

 sT8¹¯ 3 |õô/ dö d'÷Á d r ,+V-W+¥Œ*  LEV LAB and a word graph , the set of constraint 28 2. Constraint Dependency Grammar

none SYNTAX obj none subj subj obj det adj

we recognize speech we wreck a nice beach PPRO VFIN NOUN PPRO VFIN INDEF ADJ NOUN

none none agent theme agent

none FUNCTOR−ARGUMENT patient none

(a) (b)

Figure 2.5: Two CDG structure candidates for the example word graph of Figure 2.2 on page 23.

rþ} dôK§ þ W7 ôt7X¤ ÷Q rO S7 variables consists of all triples with WG, N , LEV, i.e., for each lexical

reading of a word in an edge and each level an individual variable is constructed. All these

† [* § ZYR D

variables !Crþ. dô: have the set LAB WG as their domains.

0

P

  |rþ% dôv d§ þ% 7 ôv Œ7\¤ ÷Á ¨O§²7 *

WG N LEV

£UT †G ‹»*

WG 

0

P

  |5 sþ @ô) }  7 ô‹ I7\¤ ÷Á ¨O3] þ Ðy¬»Cô‹ 8 »^] þ 8 »·* sþ I7

LAB WG N

0

_Y  µ†G [·*

§&Œ(') A structure candidate ŠõÁ is then equivalent to a complete assignment of values from

the domains to constraint variables.

rþ. xô| d§r þ ”y7B & þ ¨O  y ô| [

Each variable with or N is assigned the dummy value . Such an assignment marks those elements of the word graph that are not part of the path in the structure candidate and those variables with a different lexical reading respectively. Recall that a structure candidate represents a completely disambiguated parsing problem, i.e., it does not contain any

alternatives, neither acoustical, nor lexical nor structural.

' þ '§O²5 »2 » þ@

If N the special symbol is used again to denote that the word in edge depends

5 »C»2

on no other word and is the root of the dependency tree. This is reflected as a value

§ for a constraint variable rþ} lô| d .

2.5 Constraints 29

rþ ô §r 5 sþ ô  ' þ §O›5 sþ  & þ OSÎô

N

All other variables get a value if and only if N , and

& þ }OÎô)

N .

[ U þ% Áy7t & þ ¨OzyÎô| acb

iff or N

T rþ ô §'OS ` 5 »C»2 U ' þ §lO5 »2

N

N iff

5 sþs vô) } U ' þ '§O5 sþ ‹ & þ }OSÎô‹

N iff N and

Furthermore we demand an assignment to be path consistent 5: The lexical readings must be chosen consistently, i.e., one and the same word graph edge must take the same lexical entry on

all levels:

_

]

ô| FÎô

`fõp¨e g'p

A A

d@e



p

`f e g(m e gihÉkm` e f e g g

A A

m m

p

d@e

 > >

and also in modifiee and modifier position:

_

]

ô‹ EÎô

`f e g e gihÉk` e f e g g

p p j j

A A

d@e



j

`f e g(m e gihÉk` e f e g g

A A

m m

j

d@e

 > >

Picking up the example from the previous sections, Figure 2.6 on the following page shows the mapping from a WCDG structure candidate to a CSP assignment. Note that we again abbreviated the notation when the meaning is unambiguous. Since the notations as a structure candidate or as an assignment are equivalent and the latter is more convenient, we will use constraint variables and values in the next Section 2.5 where we define constraints more formally and exhaustively.

2.5 Constraints

Constraints are the main means to express (linguistic) knowledge. In this section, we will describe how constraints can be used to encode grammatical requirements. We will make direct reference to the constraint language that was developed for the DAWAI project6 [Heinecke et al. 1998]. Table 2.2 gives the definition of the syntax of the constraint language in extended Backus-Naur form [Wirth 1986]; an informal but complete explanation of the whole input language can be found in the user manual [Schröder 1997d; Schröder 1997a; Foth, Schröder & Schulz 1998] of the parser [Menzel & Schröder 1998c]. In Section 2.2 constraints are defined as an enumeration of feasible value combinations taken from domains of the constraint variables. However, such an exhaustive listing is not possible for [W]CDG parsing for two reasons.

5Not to be confused with path consistency of CSP, see Section 4.3.1. 6See http://nats-www.informatik.uni-hamburg.de/~dawai/ for further information.

30 2. Constraint Dependency Grammar

' þ  OS sþ on T rþ   'O› sþ   N

N we SYN subj recog we wePPRO SYN subj recog recogV

' þ  O »2pn T rþ   'OS »·»2 N

N recog SYN none recog recogV SYN none

' þ  O› sþ on T rþ   'OS sþ   N

N speech SYN obj recog speech speechN SYN obj recog recogV

T rþ   'OS^[

N wreck wreckV SYN

T rþ   'OS^[

N wreck wreckN SYN

T rþ   'O²V[

N a aINDEF SYN

T rþ   'O^[

N nice niceADJ SYN

T rþ   'OV[

N beach beachN SYN

' þ  O sþ on T rþ   'O› sþ   N

N we F-ARG agent recog we wePPRO F-ARG agent recog recogV

' þ  OS »2on T rþ   'OS »·»2 N

N recog F-ARG none recog recogV F-ARG none

' þ  O sþ on T rþ   'O sþ   N

N speech F-ARG theme recog speech speechN F-ARG theme recog recogV

T rþ   'OS^[ N wreck wreckV F-ARG . .

Figure 2.6: Mapping from [W]CDG structure candidate to a CSP assignment. Various words and category symbols have been abbreviated.

£ Firstly, the domains and also the variables themselves are instance specific since they make reference to the input tokens and therefore depend on the utterance being parsed. A number of variables are constructed for each word in the input and the values represent dependency edges which certainly differ from sentence to sentence. No grammar writer can anticipate all possible inputs to the parser.

£ Secondly, it is essential for a natural language grammar to abstract from individual sen- tences and describe the general principles. Therefore, constraints must express character- istics of a whole class of configurations and not just those of a single sentence.

Therefore, we will use constraint templates with logical formulae (cf. example in Section 2.2.2) which are checked for every value (combination) when a sentence is analyzed.

A constraint (template) is satisfied by a specific value or value combination if and only if the constraint formula when applied to the value combination evaluates to true. The constraints are implicitly universally quantified, i.e., all constraints must be applied to all edges and to all edge combinations respectively, at least for the time being (cf. Section 4.13.3). Since constraints are

asymmetric in general, even the edge ordering is significant, i.e., not only the pair rþv dsþ } must

be tested with a binary constraint but also the inverse pair rþ usþ% ¨ . Of course, most constraints are specific to particular constructions so that the constraint formulae often take the form of an implication where the antecedent restricts the application to the special configurations.

The rest of this section defines the semantics of the constraint language.

2.5 Constraints 31 P

Attr U&Ub String Number

ConId U&Ub String

Constraint U&Ub ‘{’ VarList ‘}’ ‘:’ ConId ‘:’ [ [ Section ] ‘:’ Penalty ‘:’ ] Formula

Formula U&Ub Term Relation Term P Formula Junctor Formula

P Predicate ‘(’ TermList ‘)’ P

‘ q ’ Formula P ‘(’ Formula ‘)’ P ‘true’ P ‘false’

Function U&Ub String

P P P P

r ù r

Junctor U&Ub ‘&’ ‘ ’ ‘- ’ ‘ - ’

P P P

Operator U&Ub ‘+’ ‘–’ ‘*’ ‘/’ P

Path U&Ub Path ‘:’ Attr Attr P

Penalty U&Ub Number Term

Predicate U&Ub String

P P P P P

r ù r ù Relation U&Ub ‘=’ ‘ ’ ‘ ’ ‘ =’ ‘ =’ ‘!=’

Section U&Ub String

Term U&Ub Variable ‘ˆ’ Path P Variable ‘@’ Path P Variable ‘.label’ P Variable ‘.level’ P Term Operator Term P Function ‘(’ TermList ‘)’ P ‘[’ Term ‘]’ P String

P Number P

TermList U&Ub Term TermList ‘,’ Term

Variable U&Ub String P

VarList U&Ub Variable Variable ‘,’ VarList

Table 2.2: EBNF definition of the constraint syntax.

2.5.1 General structure

A constraint (element Constraint) consists of up to five parts (cf. Table 2.2): variables (ele- ment VarList), name (element ConId), class (element Section), penalty (element Penalty) and formula (element Formula).

The variables are placeholders which are replaced by actual dependency edges during evaluation. The name is an identifier and the class facilitates a partitioning of the constraint set for mod- ularization purposes. The penalty is the weight, of course. The formula is a logical expression with extended propositional logics syntax and determines whether the constraint is satisfied or not in a specific context. 32 2. Constraint Dependency Grammar

Constraints can be classified depending on the kind of penalty they carry. Hard constraints are those with a weight of zero; soft constraints have weights greater than zero. Soft ones can further be subdivided depending on whether their penalty is fixed by the grammar writer or sensitive to different contexts, i.e., it dynamically changes its numeric value in different sentence types. Section 3.5.6 motivates dynamic constraints linguistically and gives typical examples. They are used in Sections 5.6 and 5.7 to import external assessments into the constraint system and in Section 5.8 to model linguistic preferences involving distances between words and length of constituents.

Note that while the meaning of constraint weights can usually be best described as the impor- tance of the constraints, it also carries aspects of the degree of satisfaction in dynamic constraints. In different approaches, constraints may have more than a single weight associated with them, e.g., separate ones for importance, degree of satisfaction, relevance, level of formality etc. re- spectively [Miguel & Shen 2001a, Section 5.1.1].7 But since these weights have to be combined in some way eventually, we decided to stick to a single weight and let the grammar writer choose the appropriate value, possibly using dynamic constraints.

Constraint C-1 is a unary example constraint named SubjectVerbNoun belonging to the con- straint class Subject. Unary constraint always contain a single variable which is named X in the example. The penalty is fixed at 0.1.

C-1 {X} : SubjectVerbNoun : Subject : 0.1 : X.level=SYNTAX & X.label=subj -> X@cat=NOUN & X^cat=FINVERB;

The subject is a noun and depends on a finite verb.

During constraint evaluation every assignment, i.e., every combination of constraint variable and value, is bound to the syntactic component Variable in the signature of the constraint. Hence this component is a representative of a complete dependency edge with information about the dependent and the governing words including lexical information, the label and the level: X

8

nrþ% dô| x§tsƒ5 sþ @ô) } . Section 2.5.3 explains how these information bits are accessed.

Constraint C-2 is an example for a binary constraint since two variables X and Y are used.

C-2 {X, Y} : OneSubjectOnly : Subject : 0.0 : X.level=SYNTAX & Y.level=SYNTAX & X^id=Y^id & X.label=subj -> Y.label!=subj

A finite verb does not have more than one subject.

The score is fixed at 0.0, i.e., the constraint is a hard one. While every dependency edge has to be tested against a unary constraint, all combinations of two dependency edges must be checked with the binary constraints.

7See also Schröder [1997e].

p5y>z.p5yl{i|~} { y€ylt|

8  v

For dependency edges that lead to root the corresponding equivalence X u4vxw holds. We

j  assume from here on that w also subsumes the special symbol . 2.5 Constraints 33

2.5.2 Formulae

The formula is the central and most complicated component of a constraint. This section describes how formulae evaluate to true or false in the context of a binding, i.e., when applied to an edge or a tuple of edges.

Truth constants

The simplest formulae are the truth constants true and false; they are satisfied either always or never.

C-3 {X} : NonRoot : General : 0.1 : root(X^id) -> false; Nothing should be the root.

Negation

The negation is the only unary operator and is defined in the standard way: ~a is satisfied if and only if a is not.

C-4 {X} : NonVerbNonRoot : General : 0.1 : X.level=SYNTAX & X@cat!=FINVERB -> ~root(X^id); Something that is not a finite verb is not the root.

Boolean junctors

The Boolean operators conjunction, disjunction, implication and bi-implication are defined in the standard way; the following table lists the truth tables for satisfying (+) or violating (-) conditions.

a b a&b a|b a->b a<->b - - - - + + - + - + + - + - - + - - + + + + + +

Predicates

Predicates take terms (cf. Section 2.5.3) as arguments and yield a truth value.

n¿rþ@ dô| l§‚s 5 sþ @ô) } Six special purpose predicates are defined. We assume that X in all

examples. £

start(X@id) takes a word graph edge þ id as its only parameter and succeeds if and only

M

›õô ö '÷   ƒP-¥ ¬ô if the word graph starts with that edge: þ id id id id id and WG id

34 2. Constraint Dependency Grammar £

stop(X^id) takes a word graph edge þ id as its only parameter and succeeds if and only if

M

þ õô ö '÷   ƒtô Îö

the word graph ends with that edge: id id id id id and ª WG id £

root(X^id) takes a word graph edge or » as its only parameter and succeeds if and only

if the parameter equals » .

£ þ exists(X@foo:bar) takes an attribute „ (foo:bar) in a lexical entry (X@) as its pa-

rameter and succeeds if and only if the lexical entry contains a value at that attribute:

þ „¶Ozy

N undef.

£ subsumes(h, a, b) takes three strings as its parameters. The first is the name of an ontology9 and is satisfied if symbol a is a superclass of symbol b in the ontology.

£ print(a, b, ...) takes an arbitrary number of parameters, prints them and always suc- ceeds.

Predicates can be added easily to the system in order to extend the system depending on fu- ture demands. However, the calculation needed to determine the result of a predicate should be efficiently computable and independent of the sentence context as a whole; otherwise the considerations about the complexity (cf. Section 2.8.2 and Chapter 4) can be invalidated. For instance, a fictitious predicate solution(X@id) which is satisfied if the argument edge is part of the overall solution moves the complete solution process into a predicate and requires consid- erable computational effort.

Relations

Relations are special purpose predicates which are used with an inline syntax for the operator.

There are six relational predicates defined: ‘equal =’ and ‘not equal !=’ (defined on numbers, symbols and lexical ids) as well as ‘greater >’, ‘greater or equal >=’, ‘less <’ and ’less or equal <=’ (defined only for numbers). All these predicates behave as expected.

2.5.3 Terms

There are three types of terms: numbers, strings (also called symbols) and lexical ids.

Constants

The simplest terms in the constraint language are constant numbers and strings. The latter can be written with or without delimiting double quotes.

9See Foth et al. [1999] for details about how to use ontologies. 2.5 Constraints 35

Lexical information

Lexical information of the dependent and the governing word forms can be accessed using the

n rþ ô § s 5 sþ ô 

@ and ^ operators. If X then X@foo tries to look up the attribute foo

ô| ô# O ôm in , i.e., the resulting term is determined by N foo . If the attribute is not defined for , the behavior is undefined.10 Lexical information for the governor can analogously be retrieved with

the X^foo construct in which case ôv is the relevant attribute-value matrix. The lexicon only contains static information about word forms. However, there are a couple of special attributes that can be used to retrieve information that is lexicon-independent as the information is only induced specifically by the problem instance itself. The following table lists these attributes for the dependent edge.

Attribute Semantics

X@id identification for edge þ} , can be compared for (in)equality

M

ô þ ´õôKöm'÷ 

X@from start node of edge þ} , i.e., if

M

ö þ ¶õô:ö#'÷I  X@to end node of edge þ. , i.e., if X@pos positional index (used sometimes when analyzing word strings)

As in all cases, the information is also available for the governor: X^id, X^from, X^to and X^pos.

Label and level information

The dot operator (‘.’) is used to access the information about the label and the level. If again

n›rþ% dô| l§tsƒ5 sþs vô) .  

X then X.level and X.label stand for and respectively.

Arithmetic operations and functions

The four standard arithmetic operations (‘+’, ‘-’, ‘*’, ‘/’) are available for numbers; they behave as expected. Some special purpose functions are used to compute new terms from given ones.

£ min(a, b, ...) returns the smallest of the numbers a, b... .

£ max(a, b, ...) returns the largest of the numbers a, b ... .

£ abs(a) returns -a if a is negative and a itself otherwise.

£ distance(X@id, X^id) returns the ‘distance’ of two edges (corresponding to the ids X@id

and X^id) in the word graph under consideration and can be used to gather information

M

 n rþ ô §†s 5 sþ ô  þ  õô ö '÷ 

about linear ordering. If X and as well as

M

. ö% Q€ ôv þs E›õô‹ uö v'÷S @ the above functions returns .

10See Section 2.5.2 for details about the predicate exists which can be used safely to check whether a value for an attribute exists. 36 2. Constraint Dependency Grammar

£ acoustics(X@id) returns the adjusted acoustic score of the word hypothesis in the input word graph that corresponds to the given lexical id. Section 5.6 describes the integration of acoustic scores into the parsing process in detail.

£ pts(X@id) returns the assessment of an external POS tagger for the lexical item corre- sponding to the given lexical id. Section 5.7 describes the integration of an external POS tagger into the WCDG parsing system in detail.

2.6 Restriction of constraint arity

Recall that the arity of a CSP is the maximum number of variables in any of the constraints and that the degree of a CSP is the maximum size of any of the domains (cf. Section 2.2.1). A large part of constraint satisfaction research has exclusively considered – both theoretically and empirically – the binary CSP, i.e., a CSP with arity two. There are two reasons:

£ A relation between the arity and the degree of a CSP facilitates a transformation of

¥‡rV— arbitrary ¥ -ary CSP ( ) to binary CSP with enlarged degree [Nudel 1983, p. 138].

11

iˆ;*

The idea of a conversion is simple: A subset !) <= %$&$&$&'!) of the variables is replaced by !

a new variable with an extended domain consisting of the cross-product of the domains

 -C <=ID\$&$&$¸D C ˆ of the original variables: . Thus, there is a trade-off between the number and the complexity of variables: Either a large number of variables each represent an elementary property or a smaller set of variables encode more complex configurations.

£ Any solution algorithm for CSP has to check the compatibility of value combinations

whose variables are connected by a constraint [Mackworth & Freuder 1985, for instance].

ÿ

‘ ¥ O

Obviously, it suffices to check at most N pairs of values in binary CSP while potentially

‘ ¥Š‰%O

N triples have to be tested in ternary CSP. Some algorithms (cf. Section 4.6) also

arity

‘ ¥‹ ‹ O

employ data structures of size N . Therefore, a formulation of a problem as a binary CSP is preferable and in fact indispensable for some algorithms.

From here on, we will consider at most binary CSP and accordingly allow at most two place- holders for dependency edges in WCDG constraints. Section 3.3.3 discusses what this decision means for a grammar writer.

2.7 Comparison of WCDG to CDG as defined by Maruyama

This section summarizes the differences between WCDGs and the variants of CDG as originally defined by Maruyama [1990b] and as used by Harper et al. [1999a].

£ Constraints are weighted in WCDG while CDG exclusively uses hard constraints. There- fore WCDGs are a generalization of CDGs.

11Section 3.4.2 shows both that the problem is real and that the solution approach is relevant in WCDGs. 2.7 Comparison of WCDG to CDG as defined by Maruyama 37

£ WCDGs define which labels are appropriate for which level.12

£ WCDG constraints have a greater expressiveness than the constraints suggested by Maruyama [1990b] and Harper et al. [1999a]. Constraints in WCDGs can freely access any lexical in- formation of any involved modifier or modifiee. Lexical ambiguity is possible both in terms of different categories and of differing feature values (cf. Section 2.5.3). Maruyama [1990b] in contrast assumed a completely disambiguated sequence of categories as the input to the parser. No form of lexical ambiguity is allowed and the only accessible information about the word forms is the category. Harper et al. [1995, Section 3.1] deal with ambiguities regarding categories by duplicating constraint values for each possible part-of-tag. Lexical ambiguity besides that of categories is handled by means of delayed constraint application [Harper et al. 1995, Section 3.3]. Constraints that access the ambiguous lexical features are later applied to the remaining variables after unambiguous values have been created by duplication of constraint variables. Nevertheless the expressiveness of their constraints falls short of those of WCDGs since modifier and modifiee are not accessed symmetrically. Agreement between subject and finite verb, for instance, cannot be expressed by a unary constraint [Harper et al. 1995, p. 23] which is easily possible with a unary WCDG constraint. We decided to include the extra expressiveness since we feel that the constraints are already restricted enough (cf. Section 3.3.3). Obviously, the resulting CSP becomes more expensive to solve when allowing fully disambiguated governor nodes because the numbers of constraint variables and values increase. Only recently White et al. [1997] have also proposed a kind of generalized constraints, so-called aggregated constraints [White 2000, p. 38], which can at least partly access infor- mation about the governor of a dependency edge.

£ The formalism of WCDG provides dynamic constraints which do not have a static score but receive different weights depending on the context in which they are evaluated (cf. Sections 3.5.6, 5.6 and 5.7). Dynamic constraints are not necessarily soft constraints.

£ The WCDG constraint language is richer than that of Maruyama [1990b]. It not only includes more powerful access functions (see above) but additionally predicates, functions and simple arithmetic operators. This added-value is particularly useful when used in combination with information stored in the lexicon and with external ontologies [Foth et al. 1999].

£ A further difference does not directly concern the definition of the grammar formalism but the solution methods. Both Maruyama [1990b] and Harper et al. [1999a] deploy a single consistency-based method known as filtering (cf. Section 4.6) in their parser, while several algorithms including filtering have been implemented for WCDG (cf. Chapter 4). Experimental evaluations have shown that filtering is in fact not appropriate for WCDG parsing (cf. Section 4.6).

£ Again not being directly linked to the definitions, a further difference is connected to the use of [W]CDGs. While Maruyama [1990a, Section 3.4] and Harper et al. [1993] propose the inclusion of semantic evidence in the form of semantic constraints, they do not aim at separate semantic structures as in WCDG (cf. Section 3.4.6).

12Harper & Helzerman [1995, Section 3.2] use a lookup table for the same purpose. 38 2. Constraint Dependency Grammar

2.8 Complexity

Three different aspects of complexity are distinguished in this section:

1. Relation of the constraint language to logic

2. Complexity of the parsing problem when using [W]CDG

3. Classification of the set of languages that can be generated with [W]CDG with respect to the Chomsky hierarchy of formal languages [Hopcroft & Ullman 1979, chapter 9]

Each of these aspects will be discussed in the following subsections.

2.8.1 Relation of the constraint language to logic

The constraint language defined in Section 2.5 strikingly resembles the language of first-order predicate logic formulae. Function and predicate symbols as well as variables and junctors can be mapped directly. Constraint formulae are implicitly assumed to be universally quantified. The lack of existential quantifiers does not seem to be important since every first-order predicate formula can be converted into an equivalent (with respect to satisfiability) Skolem formula in prenex form [Gallier 1987, pp. 305 & 340] where all existential quantifiers have been replaced and all universal quantifiers precede the actual formula body [Schöning 1992a, p. 67]. A well-known theorem by Church [Schöning 1992a, for instance] proves that the satisfiability and the validity problems for first-order predicate logic are undecidable. The syntactical equivalence between CDG formulae and first-order predicate logic formulae sug- gests that the satisfiability and the validity problems are undecidable for [W]CDG formulae as well. But we only looked at the syntax of the formulae so far. The crucial point, how- ever, is the semantics behind the syntax. If ones defines a semantics for [W]CDG constraints equivalent to that of predicate logic formulae [Gallier 1987, p. 158], the questions of whether at least one (satisfiability) or whether all structures (validity) fulfill a particular constraint become undecidable.

However, such a general semantics is usually not assumed for [W]CDG.13 The space of possible

¢”>Ž^  structures Œ is highly limited in CDG. The domain is the set of labeled edges between words and, in particular, infinite domains are expressly not allowed, which would correspond to

an infinite input. The interpretation function Ž is determined by the fixed definitions of the few functions and predicates; no free variables are allowed. A formal proof that the satisfiability and validity problems can always be answered for all [W]CDG (given the restrictions listed here and the usual semantics defined in Section 2.5.2) is still missing mainly because such a proof is completely irrelevant to the problem of natural language parsing which only involves the application of the formulae to a finite set of pre-defined dependency edges.

13It would be if we were looking for a procedure that can always answer the following question in finite time: Given an arbitrary grammar, is there a sentence that can be parsed by that grammar. Obviously, this problem is at least semi-decidable: Enumerate all sentences in a kind of diagonal triangle, try to parse the sentence and answer ‘yes’ if successful. 2.8 Complexity 39

2.8.2 Complexity of the parsing problem

Computational complexity refers to the amount of time and memory that an algorithm needs to solve a problem in the worst case. A problem itself falls into a specific complexity class if it can be proven that no algorithm exists that both solves the problem and has a smaller complexity

class.

The parsing problem for [W]CDG is NP-complete. This can easily be shown by reducing the chromatic number problem [Hopcroft & Ullman 1979, p. 341] to a [W]CDG parsing problem. Given a graph G and an integer k, construct a [W]CDG with a single lexical entry for a dummy word, a single level, k distinct labels and the general initialization Constraint C-5.

C-5 {X} : 0.0 : root(X^id);

Assume that the vertices in G are numbered. For each edge in G that connects the vertices i and j, add a constraint like Constraint C-6:

C-6 {X, Y} : 0.0 : X@pos=i & Y@pos=j -> X.label != Y.label;

The parsing problem for an input sequence containing one dummy word per vertex of G is then equivalent to the chromatic number problem for G and k. If a parse can be found, the labels determine the colors of the corresponding vertices.
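To make the reduction tangible, the following Python sketch (an illustration added here, not part of the original constraint system; all names are invented) mirrors Constraint C-6 as a pairwise inequality between the labels of the words at positions i and j and checks by brute force that a parse exists exactly when the graph is k-colorable.

from itertools import product

def build_constraints(edges):
    # One binary constraint per graph edge, analogous to Constraint C-6:
    # the dummy words at positions i and j must not carry the same label.
    return [(i, j) for (i, j) in edges]

def has_parse(n_vertices, edges, k):
    # Brute-force analogue of parsing n dummy words with k labels:
    # a label assignment is a 'parse' iff no constraint is violated,
    # i.e. iff it is a proper k-coloring of the graph.
    constraints = build_constraints(edges)
    for labels in product(range(k), repeat=n_vertices):
        if all(labels[i] != labels[j] for (i, j) in constraints):
            return True
    return False

# A 3-cycle needs 3 colors: no parse with 2 labels, but a parse with 3.
triangle = [(0, 1), (1, 2), (0, 2)]
print(has_parse(3, triangle, 2))   # False
print(has_parse(3, triangle, 3))   # True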

Given that the parsing problem for [W]CDG is NP-complete, it can be assumed that every complete and sound algorithm that solves the parsing problem has exponential time complexity. Note that this is only a worst-case consideration; Chapter 4 deals with actual running times and approximate algorithms. The exponential complexity of CDG is not unexpected. It is presumed that, while a fully projective14 dependency grammar is weakly equivalent to a context-free grammar [Gaifman 1965; Rambow & Joshi 1994, p. 9] and the recognition problem for this restricted type of dependency grammars is therefore in O(n³), the parsing problem for general dependency grammars which allow discontinuous structures is probably exponential [Covington 1990, Section 5.3]. Neuhaus & Bröker [1997] prove NP-completeness for the recognition problem of a dependency grammar which separates the immediate dominance relation from the linear precedence relation (which is possible in [W]CDG); see also [Barton, Berwick & Ristad 1987, Chapter 7]. It should be noted that the polynomial complexity O(n³) of the recognition problem for context-free grammars [Hopcroft & Ullman 1979] is also fragile. As soon as the formalism is extended to handle agreement and lexical ambiguity, as in feature unification grammars, the recognition problem becomes NP-complete [Barton, Berwick & Ristad 1987, Chapter 3]. Furthermore, a context-free grammar may assign an exponential number of parses to an ambiguous input string so that the extraction and enumeration of structures can also take exponential time. Nevertheless, the word problem for CFGs is computable in polynomial time (while it is exponential for WCDG). Plaehn [2000, p. 10] presumes that his parser [Plaehn 1999] for discontinuous phrase structure grammars is also exponential in runtime.

14Section 3.4.5 defines projectivity.
15But note the following caveat: "[...], context-free generative power does not guarantee efficient parsability. Every ID/LP grammar technically generates a context-free language, but the potentially large size of the corresponding CFG means that we can't count on that fact to give us efficient parsing." (Barton, Berwick & Ristad [1987, p. 208])

2.8.3 Classification of the set of languages defined by WCDG

Maruyama [1990a, Section 4] shows that the weak generative capacity of CDG with at least two levels is strictly greater than that of context-free grammars. He specifies a CDG which generates the language a^n b^n c^n, which is not context-free. Furthermore, he gives general instructions on how to construct a CDG from a CFG in Greibach normal form which accepts the same language as the CFG.16

Harbusch [1997] extends the proof of Maruyama [1990a] from CFG to tree-adjoining grammars [Joshi, Levy & Takahashi 1975, TAG for short], showing that every language recognized by a TAG can also be generated by a CDG.17 Furthermore, she gives a CDG which generates a language (L-6) that cannot be generated by any TAG, proving that TAG ⊂ CDG. The set of languages recognized by CDG therefore contains at least the mildly context-sensitive languages.18

Since the word problem for CDG is NP-complete but the word problem for context-sensitive languages is PSPACE-complete [Schöning 1992b, p. 169], it follows that the set of languages generated by a CDG is properly contained in the set of context-sensitive languages.19

2.9 Related work

There exists a large spectrum of constraint-based formalisms [Shieber 1992] for natural language processing which in part vary considerably. Hess [1996] compares the notion of constraints as it is used in term [Robinson 1965] or feature unification [Kay 1984; Shieber 1986] formalisms and in constraint satisfaction problems (cf. Section 2.2). An alternative interpretation of constraint satisfaction in NLP is due to Blache [2000a]. The WCDG approach goes back to the constraint dependency grammars of Maruyama [1990b; 1990a].20 The original idea of CDG has been extended at Purdue University [Harper et al. 1994; Harper & Helzerman 1994; Harper & Helzerman 1995; Harper et al. 1995; Harper et al. 1999a] in the context of word graph filtering in speech applications. The enhancements

include an extension of the filtering algorithm to multiply-segmented constraint satisfaction problems which allows the analysis of word graphs and lexically ambiguous sentences [Zoltowski et al. 1992; Helzerman & Harper 1996; Harper et al. 1999b], a parallel implementation of the parsing algorithm [Helzerman & Harper 1992; Harper et al. 1995], an integration of semantic constraints [Harper et al. 1993], automatic grammar induction from corpora [White et al. 1997; White 2000; Harper et al. 2000] and various other optimizations [Harper & Helzerman 1995; Harper, Hockema & White 1999]. The use of constraint scores that allow the classification of certain constraints as violable was first suggested by Menzel [1994].

16White [1995] describes an implemented mono-directional conversion method from CFG to CDG.
17Harbusch [1997, p. 40] assumes that "... the runtime of the parser for CDG of arity 2 is O(n⁴)". Consequently, she motivates her conversion from TAGs (for which the fastest parsing algorithm has complexity O(n⁶)) to CDGs by conjecturing that parsers for TAGs with better runtime might become available. However, she is mistaken in that only the filtering step of the consistency-based algorithm for CDGs (cf. Section 4.6) has a complexity of O(n⁴), while the subsequent extraction step still has exponential complexity in the worst case (cf. Section 2.8.2). The filtering step generally only reduces the problem and does not solve it entirely, although there are cases where filtering suffices.
18Cf. Joshi, Vijay-Shanker & Weir [1991] or http://www.kornai.com/MatLing/mcs.html.
19We are grateful to Mary Harper (personal communication) for pointing us in this direction.
20The 'inventor' of CDG, Maruyama [1990a] of IBM Tokyo, first used the formalism in an interactive Japanese-to-English machine translation system for newspaper texts [Maruyama, Watanabe & Ogino 1990]. The main component (entered after morphological disambiguation) for the source language analysis is a syntactic parser that assigns a dependency structure to the sequence of morphological units, which is relatively easy for Japanese because a common theory demands that all dependency links strictly branch to the right. The parser maintains a consistent state of the analysis during the parsing process using constraint propagation algorithms [Maruyama 1990b, see also Section 4.3.1]. It not only considers grammatical constraints, but also allows the user to specify dependency links directly using a computer mouse in a graphical environment. It can be argued that the human operator dynamically adds constraints to the system by doing so. Due to the ease of interaction, Maruyama, Watanabe & Ogino [1990] claim that the system is suitable for non-expert users and give evidence thereof in terms of numbers from a comparative usability study.

A similar step from strictly eliminating constraints to graded constraints is proposed by Qu, Rosé & Eugenio [1996] in the context of plan-based discourse processing. High-level constraints operate on discourse structures and on background knowledge and assign penalties to certain inference chains. However, graded constraints are only used to resolve ambiguities that result from previous processing steps.

Duchier [1999a] proposes an alternative eliminative approach to constraint dependency parsing that is inspired by the original formulation of Maruyama [1990a] and based on constraint satisfaction techniques as well. A prototypical parser and a grammar fragment for German have been implemented using the constraint programming language Oz.21 For an account of German word order in this framework see also Section 3.4.4. A separate axiomatization is given for parsing of phrase structure trees with tree descriptions [Duchier 1999b; Duchier & Niehren 2000; Duchier & Thater 1999].

Blache [2000c] presents the formalism of property grammars, which introduces the notion of properties (instead of constraints). Properties are subdivided into local properties represented as feature structures (complex categories) in the lexicon and global properties which constrain co-occurring categories with respect to a number of aspects such as constituency, obligation, unicity, requirement, exclusion, linearity and dependency. The properties take effect directly at category level without the need to postulate a local configuration first [Blache 2000b], which is the reason why Blache [2000a] distinguishes active constraints (properties) from passive constraints which are relegated to a secondary role (as, for instance, in constraint grammar [Karlsson 1990], HPSG [Pollard & Sag 1987; Pollard & Sag 1994] and also [W]CDG). The notion of grammaticality is replaced by so-called characterizations, i.e., partitionings of the set of constraints into satisfied and violated ones. This concept is found in a similar way in WCDG and accounts for a robust system behavior. Blache & Morawietz [2000] hint at an implementation of property grammars.22

Constraint grammars [Karlsson 1990; Karlsson et al. 1995] and subsequent parsers [Tapanainen & Järvinen 1997; Järvinen & Tapanainen 1997] share with [W]CDG the concept of eliminating ungrammatical structures instead of constructing grammatical structures, but they employ finite-state techniques.

Link grammars [Sleator & Temperley 1991; Sleator & Temperley 1993] are lexicalized dependency grammars and consist of a set of words with specific linking requirements. A grammatical sentence has a projective and connected tree structure that satisfies all linking requirements. Lafferty, Sleator & Temperley [1992] describe a probabilistic version of link grammars.

21Duchier, Gardent & Niehren [1999] give a tutorial for natural language processing with Oz. Oz [Smolka 1995] is now maintained by the Mozart Consortium, http://www.mozart-oz.org/.
22According to Blache (personal communication), property grammars do not belong to the generative paradigm, but fall somewhere between generative and equation theories.

2.10 Summary

The grammar formalism of weighted constraint dependency grammars (WCDG) has been formally described. Abstract definitions of the expected input structures (word graph), the lexical knowledge base (lexicon), the grammatical knowledge base (WCDG) and feasible structural interpretations (structure candidate) of input sentences were given. A precise formulation of a mapping from WCDG parsing problems to constraint satisfaction optimization problems facilitates the transfer of insights, results and algorithms from the well-known area of constraint satisfaction problems (CSP). In particular, the result that the [W]CDG parsing problem belongs to the class of NP-complete problems is based on a reduction of the well-known chromatic number problem for graphs. The syntax and the semantics of the WCDG constraint language have been precisely defined, which is a prerequisite for the accurate formulation of optimality among structure candidates, which in turn is necessary for determining the solution structure of a parsing problem. We deliberately avoided any reference to parsing algorithms because the view of the parsing problem as a CSP emphasizes the declarative status of a WCDG, in particular the order independence of constraints and the lack of control parameters in the grammar. That does of course not imply that a WCDG only takes competence aspects into account; performance phenomena that often arise from processing difficulties are simply described declaratively, i.e., without any reference to procedural aspects of the processing model.23 Chapter 4 deals with algorithms for WCDG parsing.

23Compare to the discussion of competence and performance vs. declarativity of grammar formalisms in Bröker, Hahn & Schacht [1993, Section 1].

Chapter 3

Grammar Modeling

This chapter is based on an article [Schröder et al. 2000] published in the international journal ‘Traitement Automatique des Langues’ (Natural Language Processing)1.

3.1 Overview

The purpose of this chapter is three-fold. Firstly, it serves as a tutorial about grammar writing using the formalism of weighted constraint dependency grammars (WCDG). Although some details have to be omitted, we present those techniques that have proven themselves helpful during grammar development. Secondly, it describes the backbone of a grammar that we developed for the Stellingen corpus (cf. Section 5.2.1), a subset of the German Verbmobil [Wahlster 1993] corpus of spoken appointment-scheduling dialogues. The analyses had to be simplified for presentation purposes but the underlying ideas are preserved as described. The third and most important intention of this chapter is to answer the scientifically interesting question of whether the grammar formalism of WCDG is adequate for modeling a non-trivial part of the phenomena occurring in natural language, in this case in German. We will give example constraints and examine how serious certain restrictions of the formalism are and how they can be overcome by different approximation techniques. The main focus is on practical models for observed phenomena; it is neither the intention to develop a linguistic theory nor to describe a complete grammar in detail. After a general introduction to grammar modeling and a brief review of the dominant theories in Section 3.2, we start by briefly recapitulating the WCDG formalism; however, this time we will look at it from the point of view of a grammar writer and discuss consequences that arise from its definition for actual grammars (Section 3.3). In Section 3.4, we present constraint models for a series of language phenomena. For clarity of the presentation, only hard constraints are used in this part; it should be easily imaginable how selected crisp constraints can be made soft by assigning an appropriate weight larger than zero.
1See http://www.atala.org/tal/ for further information.


The subsequent Section 3.5 then describes more thoroughly how to use soft constraints for grammar development. A spectrum of application scenarios for soft constraints is given to demonstrate the usefulness of the concept. Again, only the basic principles can be described, while details like, for instance, the complete fine-tuned system of exact numerical weights have to be omitted. Section 3.6 discusses related work, and in Section 3.7 we summarize the possibilities but also the limitations of the approach and suggest further extensions of the grammar and the formalism.

3.2 Introduction

Grammar modeling is understood as the task of describing the underlying regularities of a natural language2 using a given well-defined formalism. Ideally, the description should be complete, accurate and unambiguous. However, the fuzzy (and generally unlimited) nature of language as well as external (sometimes computational) requirements often enforce a restriction to specific subclasses of language use. The difficulty for artificial components of processing natural language exhaustively, accurately and efficiently often leads to a limitation to a subset of 'real' language, either by using so-called controlled language3, i.e., explicitly preventing a user from entering 'forbidden' constructions or lexical items, or by simply rejecting uncovered input as unparsable. In anticipation of the upcoming sections, we note here that WCDG allows a grammar writer to extend the coverage of a grammar by means of soft constraints while providing a ranking of the (now possibly ambiguous) admissible analyses at the same time (and by the same means).

Different goals can be pursued when modeling a grammar. For linguistic theories, the classical levels of adequacy [Bußmann 1990, Adäquatheitsebenen, p. 46] of Chomsky [1965] distinguish between observational, descriptive and explanatory adequacy, depending on whether the sentences, the underlying structures or the process of learning the language respectively can be predicted by a theory of grammar. Natural language parsing as an important subtask of natural language processing usually aims at finding a structural interpretation for the relevant sentences that is sufficient for a subsequent task-dependent processing. Also, engineers of NLP systems must often pay attention to more practical properties such as efficiency, good empirical coverage and maintainability that are essential in real systems for automatic language analysis or generation. In this work, we are interested in language modeling for analysis, i.e., for parsing. We focus on syntax rather than on phonetics and prosody or semantics and discourse. However, we try to incorporate knowledge from these levels into the syntactic analysis (see, for instance, Sections 3.3 and 3.4.6). Different grammatical frameworks have been developed for language modeling, among which context-free grammars (CFG) or phrase-structure grammars (PSG) are the most widely adopted for actual applications. Context-free languages are the type-2 languages in Chomsky's hierarchy of formal languages [Chomsky 1956; Hopcroft & Ullman 1979]. Context-free grammars are characterized by rewrite rules with one non-terminal symbol on the left and a sequence of symbols

2This work assumes that a grammar is language specific although one might use the same formalism to describe universal characteristics and provide parameters for specific languages.
3See, for instance, the international workshops on controlled language applications (CLAW).

(terminals and non-terminals) on the right. Although there are some syntactic phenomena that are inherently not context-free, e.g., discontinuous structures in German, nesting of subordinated clauses in Swiss German [Gazdar & Mellish 1989, p. 133, for instance] or cross-serial dependencies in center-embeddings in Dutch [Rambow & Joshi 1994, p. 12], it is generally accepted (at least for languages like English, not so much for languages with free word order) that the power of CFGs is basically sufficient for natural language analysis.4 CFGs have been used quite early in natural language parsing, for instance, in definite clause grammars (DCG) [Pereira & Warren 1980]. Their predominant position still persists, although CFGs are nowadays often enriched with additional formal frameworks, such as feature unification which was first widely used in PATR-II [Shieber 1984]. Figure 3.1a presents a context-free analysis of an example sentence.

(a) Context-free analysis   (b) Dependency analysis

Figure 3.1: Analyses of the example sentence "Wir vereinbaren einen neuen Termin." ("We arrange a new appointment.") in the frameworks of context-free and dependency grammars.

Phrase structure rules are also a central part in transformational grammars [Chomsky 1957; Chomsky 1965] but they are augmented by transformations which map between the deep structure that has been generated by the PSG rules and the surface structure which, among other things, describes the output sentence. The successor to transformational grammar, government and binding theory [Chomsky 1981; Haegeman 1991], reduces the number of transformations to one: move-α. This completely underspecified transformation allows anything to be moved anywhere [Sells 1987], but its application is restricted by universal constraints which are part of a universal grammar. Parameters then tailor constraints to specific languages. Transformational grammars have played (and still play) a very influential role in theoretical linguistics. However, the focus is not so much on actually occurring language data or even applications, which is partly due to the generative view on grammar. Although some prototypical implementations exist [Fong 2000, for instance], their purpose is rarely a natural language application but mostly a system to verify specific theoretical assumptions about grammar. Weighted constraint dependency grammars ([W]CDG) – as the name suggests – rely upon a different framework, dependency grammars (DG). The latter are based on the concepts of connexion, junction and translation [Weber 1992]. Connexion relates to hierarchical dependency relations between elements of the sentence, i.e., the assumption that one of a pair of directly related words is the governor and the other is the dependant. Figure 3.1b shows an admissible

syntactic dependency structure or – more precisely – the connexion relations of the dependency tree. Junction and translation are less central in that they are only introduced for the specific phenomena of coordination and category conversion respectively. Both junction and translation are not explicitly supported by particular formal constructs in [W]CDG but have to be implemented by means of regular constraints and possibly additional levels. Note that especially a dependency model of coordination [Kunze 1972; Hesse & Küstner 1985] is inherently difficult because information about the coordinated constituents has to be made available at the word node of the coordination, which is lexically largely underspecified but must represent the whole coordinated phrase. No satisfactory treatment of coordinations which fulfills all requirements is available so far within the WCDG framework. A grammar writer will almost always refer to an established dependency theory [Tesnière 1959; Kunze 1975; Hudson 1990], but the formalism of [W]CDG does not enforce a specific one. [W]CDGs may be used as long as only relations between word forms5 are considered, only a limited number of edge labels is involved and the resulting structures obey the tree property. This last restriction to tree structures could even be relaxed but is usually useful for grammar modeling. Other semantic representations like the meaning text model [Mel'čuk 1988; Kahane forthcoming] employ semantic units more complex than word forms which therefore cannot be modeled directly. Hence, semantic modeling in [W]CDG is restricted, e.g., to functor-argument structures. [W]CDGs do not support 'competing dependencies', i.e., multiple governors for a single word on a single level, which are proposed for the treatment of discontinuities and non-projective structures in word grammar [Hudson 2000, p. 19]. However, words can have different governors on different levels so that an equivalent account can be modeled.

4Compare to Joshi, Vijay-Shanker & Weir [1991].

3.3 WCDG grammar formalism revisited

A brief informal introduction and discussion of the grammar formalism of weighted constraint dependency grammar (WCDG) is presented. For a formal definition see Section 2.3.

3.3.1 Linguistic structures and disambiguation

Assuming a simplified set of three different dependency relation types subj(ect), obj(ect) and aux(iliary), Figure 3.2a on the facing page shows all nine combinatorially possible dependency edges6 for each word in a simple four-word example sentence. This set is a compact representation of all conceivable structures, only a small fraction of which are connected trees. In [W]CDGs, there is no need for an additional component (such as context-free rules) that generates structures which are checked against compatibility constraints. This is a fundamental difference to other constraint-based approaches [Shieber 1992] such as definite clause grammar [Pereira & Warren 1980], functional unification grammar [Kay 1984], PATR-II [Shieber 1984], HPSG [Pollard & Sag 1987; Pollard & Sag 1994] and general feature unification formalisms [Shieber 1986]. Instead, constraints have to be used to exclude some of the theoretically possible structures, either by discarding them in the first place using hard constraints or by penalizing them with soft constraints so that the desired structures are ranked higher.

5Dependency is often defined in purely linguistic terms and captures syntactic as well as semantic aspects of an utterance. The resulting relations of subordination are usually anti-symmetrical.
6Additionally, a word can be the root of the tree, which is not shown explicitly in Figure 3.2 on the next page.

(a) Space of possible dependency arcs   (b) Final solution for the example

Figure 3.2: Totally underspecified initial space of possible subordinations for the example "Haben Sie es notiert?" ("Did you write it down?") consisting of 72 edges, and a fully specified unambiguous dependency analysis as one possible outcome of the disambiguation process (cf. Figure 3.3 on the next page).

It is this eliminative nature of WCDGs that requires a substantially different approach to grammar development. For a context-free grammar to cover new phenomena it is necessary to add rules, while constraints have to be retracted or softened for a WCDG to analyze additional constructions correctly:7

    CFG: L(G) ⊆ L(G ∪ {r})    (adding a rule r can only extend the language)
    CDG: L(G ∪ {c}) ⊆ L(G)    (adding a constraint c can only restrict the language)

This counter-intuitive behavior makes it essential to verify all analyses and not just the new ones when extending the coverage. One such example turns up when one wants to allow non-projective structures in exceptional constructions but still requires projectivity to hold in the general case (cf. Section 3.4.5).

7Recent work on grammar induction [Schröder 1996; White 2000] focuses on the concept of support for a configuration rather than on that of elimination of unwanted structures. Under this view the traditional perception of grammar extension can be restored.

Figure 3.2b on the preceding page (a different depiction is given in Figure 3.3) shows the remaining edges after applying all constraints; the tree actually describes the unique best solution to this parsing problem.

Figure 3.3: A syntactic dependency tree (labels aux, subj, obj on the SYNTAX level) for the example "Haben Sie es notiert?" ("Did you write it down?").

The labels in this example describe syntactic functions, e.g., subj marks the subject and aux connects the finite auxiliary verb with the past participle verb form. The finite verb does not depend on any other word form and is the root of the tree.

Traditionally only a single tree is considered during dependency analysis, namely the syntactic structure. WCDGs, in contrast, are not limited to one structural level or, more vividly, WCDGs can be used to build any number of dependency structures in parallel, not just a single syntactic one. Figure 3.4, for instance, describes some functor-argument relations for the previous example where the dependent is considered to be the argument.

Figure 3.4: A semantic analysis (labels tense, agent and theme on the SEMANTICS level) belonging to the syntactic analysis of Figure 3.3.

Within the framework of WCDG, the two levels are independent in the sense that each can be constructed without recourse to the other one. However, they can be interrelated by giving appropriate constraints that couple them more or less tightly. During parsing, the structures on all levels are actually built simultaneously so that syntactic evidence not only influences the semantic structure but also vice versa. This architecture is compatible with the assumption that the human language system is organized as independent modules which interact with each other [Trueswell, Tanenhaus & Garnsey 1994].

As we shall see later, the ability to build multiple levels of description in parallel can be used for quite different purposes. Besides maintaining independent representations on a number of primary levels such as semantics (cf. for example Section 3.4.6), auxiliary ones can be used (cf. for example Section 3.4.1) to further constrain the structures on a main level. Another generalization of WCDGs, if compared to traditional dependency grammars, is that a WCDG is quite flexible with regard to the target structures. While usually all dependency edges on each level should form a single tree, this is not enforced by WCDGs. If no single tree can be found, say, for the syntax level, an analysis may consist of disconnected subtrees each representing a partial parse. Section 3.5.4 addresses this issue in greater depth.

3.3.2 Constraints

A formal definition of constraints can be found in Section 2.5. Constraints are propositions about the well-formedness of local configurations of edges in a dependency tree. They are used to encode grammatical restrictions that should hold for a natural language utterance.

C-7 {X} : SubjectVerbNoun : Subject : 0.0 :
X.level=SYNTAX & X.label=subj -> X@cat=NOUN & X^cat=FINVERB
The subject is a noun and depends on a finite verb.

The most important part of a constraint is a logical formula expressing the condition imposed by the constraint. This formula is given in the second line of Constraint C-7.8,9 When testing a dependency edge for compliance with a constraint, it is bound to the variable X and the formula is evaluated following the rules of an extended version of propositional logic. If it evaluates to false, the constraint is said to be violated, not fulfilled or not satisfied. The formula of Constraint C-7 can be paraphrased as follows: For all dependency edges it must hold that if the considered edge is on level SYNTAX and the label of the edge is subj, then the feature cat of the dependent word form of the edge must have the value NOUN and the feature cat of the governing word form must have the value FINVERB. Or, following the literal formula less strictly: Subjects are nouns and modify finite verbs. Different types of information about the edge can be accessed in constraints. Figure 3.5 on the following page shows the available information of one particular example edge that fulfills Constraint C-7 because both the premise (subject edge on syntax level) and the conclusion (noun depending on finite verb) of the formula are satisfied.
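The evaluation of such a formula can be pictured with a small sketch in Python (purely illustrative; the edge representation and function names are invented and do not reflect the actual parser's data structures):

# Minimal sketch: an edge records its level, label, dependent (@) and governor (^).
edge = {
    "level": "SYNTAX",
    "label": "subj",
    "dep": {"word": "Hunde",  "cat": "NOUN"},     # X@...
    "gov": {"word": "bellen", "cat": "FINVERB"},  # X^...
}

def c7_subject_verb_noun(x):
    # Formula of Constraint C-7 as a propositional check over one edge.
    premise = x["level"] == "SYNTAX" and x["label"] == "subj"
    conclusion = x["dep"]["cat"] == "NOUN" and x["gov"]["cat"] == "FINVERB"
    return (not premise) or conclusion    # material implication

print(c7_subject_verb_noun(edge))   # True: premise and conclusion both hold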

C-8 {X, Y} : SubjectBeforeObject : WordOrder : 0.1 :
X.level=SYNTAX & Y.level=SYNTAX & X^id=Y^id & X.label=subj & Y.label=obj -> X@pos < Y@pos

8The first line in Constraint C-7 serves more technical purposes by introducing a variable that is a placeholder for a dependency edge (X), assigning a name to the constraint (SubjectVerbNoun), defining a group the constraint belongs to (Subject) and specifying the constraint weight (0.0). Finally, an optional comment is given underneath the constraint. 9In order not to complicate the representation, we will not quote too complicated constraints here. If the context is clear, we will sometimes leave out extra information like level restrictions.

Figure 3.5: A dependency edge which satisfies Constraint C-7 on the page before: in the sentence "Hunde bellen." ("Dogs bark.") the noun Hunde (cat: NOUN, case: nom, number: pl, gender: masc) depends with the label subj on the finite verb bellen (cat: FINVERB, number: pl, person: third) on the SYNTAX level.

The subject precedes the object. Constraint C-8 on the preceding page looks at two dependency edges: the subject and the object edge (bound to variables X and Y respectively) of one and the same finite verb. It is violated in configurations where the subject follows the object as in Example E-41 (repeated from Example E-15 on page 4).10

E-41 Nice things you’ve been telling me!
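Binary constraints are evaluated over pairs of edges in the same way. The following sketch (again with invented, hand-built edge records; the word positions loosely follow Example E-41) tests the subject-before-object condition of Constraint C-8 for two edges that share the same governing verb:

def c8_subject_before_object(x, y):
    # Formula of Constraint C-8: if X is the subject edge and Y the object
    # edge of the same governor, the subject must precede the object.
    premise = (x["level"] == "SYNTAX" and y["level"] == "SYNTAX"
               and x["gov_pos"] == y["gov_pos"]
               and x["label"] == "subj" and y["label"] == "obj")
    return (not premise) or x["dep_pos"] < y["dep_pos"]

# "Nice(1) things(2) you(3) 've(4) been(5) telling(6) me(7)!":
# the object head precedes the subject, so the soft constraint is violated.
subj_edge = {"level": "SYNTAX", "label": "subj", "dep_pos": 3, "gov_pos": 6}
obj_edge  = {"level": "SYNTAX", "label": "obj",  "dep_pos": 2, "gov_pos": 6}
print(c8_subject_before_object(subj_edge, obj_edge))   # False: constraint violated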

Note that constraints only carry out a passive check of compatibility. In contrast, operations like unification create new objects and insert them into the space of possible structures. However, [W]CDGs depend on a selection from a uniform set of subordination possibilities which is pre-defined independently of the constraints themselves. Therefore, no operations such as value assignment or update are available. This restriction means that the compatibility of two substructures, i.e., their potential for unification or unifiability, can indeed be checked while the actual unification is not possible (cf. Section 3.4.2).

3.3.3 Critical appraisal of WCDG constraints

This section discusses several aspects that arise from how we defined constraints.

10This constraint may be part of a constraint cluster that is used to enforce the SVO order of languages like English. However, exceptions have to be included (probably as soft constraints) in order to facilitate an analysis of Example E-41.

Restricted constraints

In Section 2.6 we motivated the limitation – from a computational point of view – that any one constraint must not relate more than two dependency edges. This restriction to unary and binary constraints together with the lack of an assignment operation is a severe limitation since usually one wants to be able to write constraints about an arbitrary (and possibly unlimited) number of connected dependency edges in order to cope with some natural language phenomena such as non-local dependencies. Figure 3.6 gives a German example where certain dependencies can only be dealt with by long chains of edges. In order to constrain the agent of the infinitive fahren, at least three edges have to be inspected. Possible solutions which allow one to cope with such situations are based on a combination of techniques discussed in Sections 3.4.6 and 3.5.5. Despite these mitigating techniques the restriction to binary constraints turns out to be a problematic limitation when modeling certain phenomena.

Figure 3.6: A dependency tree for the German control verb construction "Ich behaupte das Auto fahren zu können." ("I claim to be able to drive the car."; labels subj, modal, inf, part_zu, dobj, det).

As stated earlier, constraints are implicitly universally quantified. Not allowing direct existential quantification and limiting the arity of constraints are closely related. Intuitively, it is clear that universal constraints can easily be checked locally while an existence condition requires looking at the intermediate result from a global point of view. The same holds for the arity of constraints. The larger the arity of a constraint, the larger the substructures that have to be checked. In general, it is possible to express existence conditions with universally quantified constraints when the arity of the constraint equals or exceeds the number of edges in the parsing problem. Such constraints look at all edges at the same time and therefore can enforce the existence of arbitrary

features. The following equivalence demonstrates this for the simplified case of only two word nodes w1 and w2:

    ∃x P(x)   ⟺   ∀x ∀y : x ≠ y → ( P(x) ∨ P(y) )        with x, y ∈ {w1, w2}
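Since the domain is finite, the equivalence can be verified exhaustively; the following throw-away check (illustrative only) enumerates all four valuations of P over the two nodes:

from itertools import product

domain = ["w1", "w2"]

for p_w1, p_w2 in product([False, True], repeat=2):
    P = {"w1": p_w1, "w2": p_w2}
    exists = any(P[x] for x in domain)
    # Universally quantified binary formula over pairs of distinct nodes.
    forall = all(P[x] or P[y]
                 for x in domain for y in domain if x != y)
    assert exists == forall

print("equivalence holds for all four valuations of P")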

Although the grammatical framework neither allows an arbitrary arity nor existential conditions, some grammatical restrictions nevertheless need an existence quantifier:

£ There must be a subject for a finite verb.

£ Some nominal phrases need a determiner.

£ The main part of a discontinuous word form (such as the French negation ne ... pas or the German circumposition um ... willen) must be accompanied by the other part.

Fortunately, it is possible to work around these limitations. Additional levels allow one to cope with situations where constraints of a higher order would be needed, and to indirectly (but not inelegantly) express existential requirements (cf. Section 3.4.1).

Domain of locality

The domain over which various dependencies can be expressed in a grammar formalism is usually called the domain of locality [Joshi & Schabes 1997]. In context-free grammars (CFG), it spreads over a local tree, i.e., a single rule.11 Tree-adjoining grammars [Joshi & Schabes 1997] feature an extended domain of locality because more complex trees are the basic elements of the formalism. As always, an extended expressiveness has its price: The larger the domain of locality, the more basic elements exist and the harder it is to enumerate them all. In the extreme case of data-oriented parsing [Bod 1995], this results in an exponential number of subtrees that – at least in theory – have to be considered when parsing a sentence. The domain of locality in the WCDG framework is both larger and smaller than that for CFG. On the one hand, the scope is rather limited because at most two dependency edges can be put in relation by any one constraint. On the other hand, constraints can relate any two edges, even if they are structurally and linearly distant from each other which is not directly possible in general even for tree-adjoining grammars. It is, for instance, no problem in WCDG to look at a node and the governor of its governor at the same time. Furthermore, WCDGs allow the grammar to define additional auxiliary levels which can serve as 'short-cuts' to transport and concentrate information about the sentence at certain points (cf. Section 3.4.7).

Separation of structure building and assessment

In WCDG parsing, the component that builds up structural hypotheses and the knowledge base that assesses those hypotheses are strictly separated. This property is shared with optimality theory [Prince & Smolensky 1993] but contrasts with the common practice in probabilistic grammars to associate judgments directly with grammar rules, for instance, in probabilistic context-free grammars [Charniak 1993; Krenn & Samuelsson 1996] which define probability distributions over all sets of rules which share a common left-hand symbol. Actually there is no device in WCDG that proposes feasible structures. It is simply assumed that all theoretically possible dependency trees have to be considered (which is equivalent to the assumption of a completely uninformed structural component). Separating these two aspects has a number of advantages:

£ The independence of the assessment from constructive rules allows for arbitrarily fine-grained formulation of well-formedness conditions. It is even possible to postulate (partially) contradicting preferences for one and the same linguistic phenomenon.

11We do not consider the fact that CFGs that are enriched with feature-unification can freely percolate arbitrary information in the tree.

£ As a related issue, the combinatorial explosion of structural rules that becomes necessary when combining different judgments with a single rule can be avoided [Menzel 2000, p. 7].

£ Most importantly, parsing is not a constructive process any longer but a procedure of selection of the best among a set of alternatives. This eliminative nature can be used to guarantee that an analysis always comes up with a solution, even if it is heavily penalized.

3.4 General techniques for selected natural language phenomena

We will now describe how WCDG constraints can be used to model various grammatical phenomena. Fortunately, a great number of well-formedness conditions can easily be captured in a natural way, despite the imposed limitations. Other cases may require more complicated constraints or additional auxiliary structures that have no intuitive counterpart. Some phenomena can only be described approximately, i.e., a constraint grammar will cover most but not all instances of a certain construction. Finally, some kinds of phenomena cannot be modeled by restricted constraints at all unless an inordinate number of additional variables is introduced.

This section features mainly crisp constraints because we feel that the underlying principles can be better imparted if the additional complexity of exceptional cases and soft weights is omitted at first. However, we will point out candidates for soft constraints from time to time.

3.4.1 Valence

The important concept of valence models the observation that certain words tend to govern specific other words. To properly describe the valence of a lexeme, three conditions have to be considered:

£ Valence possibility: A lexeme selects its argument, i.e., it determines what properties another lexeme must have to be a suitable dependent.

£ Valence uniqueness: An argument position (or slot) must be uniquely specified, i.e., no two lexemes can fill the same slot of one lexeme.

£ Valence necessity: While some arguments are optional, in many cases an argument is obligatory, i.e., there must be exactly one lexeme in an analysis that fills a specific argument position.

All three conditions can be ensured by WCDG constraints.

Valence possibility

In the following Example E-42 on the next page the verb verschieben opens three slots to be filled by appropriate arguments.

E-42 Wir verschieben das Treffen [auf die nächste Woche]. We postpone the meeting [until the next week]. ‘We postpone the meeting until next week.’

Figure 3.7 on the facing page shows the desired analysis of Example E-42 including the correct lexical entries for all word forms. Modeling the valence possibility of a verb12 like this means describing what kinds of words can fill the government requirements of the verb. Lexically, the subcategorization frame is encoded in the args feature. In this case, the first two argument positions model the obligatory nominal constituents while the last one refers to an optional prepositional phrase. Since WCDGs have no notion of unification (which would come in handy for checking the compatibility of the verb with its arguments), constraints have to be formulated to guarantee that the arguments fulfill all of the individual requirements of the verb.

C-9 {X} : ValenceFirstArg : Valence :
X.label=subj -> X^args:1:cat=X@cat
             & X^args:1:morph:case=X@morph:case
             & X^args:1:morph:number=X@morph:number
             & X^args:1:morph:person=X@morph:person
The first argument has to fulfill the valence requirements.

Constraint C-9, for example, checks the compatibility of the first argument with its governor. Note that the constraint is quite generic because it is almost completely lexically driven. Though the constraint writer has to explicitly specify all features that should (potentially) agree, the individual characteristics of the argument, e.g., its category, and the specific requirements of the governor are completely lexicalized. Assigning a soft score to the constraint would allow the violation of the valence requirements by a deviant utterance.
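The lexicalized compatibility test of Constraint C-9 can be pictured as follows (a simplified sketch: the dictionaries are hypothetical stand-ins for lexicon entries, and only a flat feature set is checked):

def c9_valence_first_arg(edge):
    # Check the dependent against the first slot of the governor's
    # subcategorization frame, in the spirit of Constraint C-9.
    if edge["label"] != "subj":
        return True                        # premise not met
    slot = edge["gov"]["args"][0]          # X^args:1
    dep = edge["dep"]
    return all(dep.get(feat) == value for feat, value in slot.items())

# Toy entries loosely modeled after Example E-42 (not the real lexicon).
verb = {"word": "verschieben",
        "args": [{"cat": "PRONOUN", "case": "nom", "number": "pl", "person": 1}]}
pronoun = {"word": "wir", "cat": "PRONOUN",
           "case": "nom", "number": "pl", "person": 1}

print(c9_valence_first_arg({"label": "subj", "gov": verb, "dep": pronoun}))  # True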

Uniqueness of valence fillers

In the previous section, Constraint C-9 was used to enforce conditions that an argument must fulfill, such as agreement. However, the constraint cannot ensure the uniqueness of certain arguments. Nothing so far prevents an analysis with, e.g., the syntactic function subj assigned to two different lexemes if they both carry the required case, i.e., nominative, as in Example E-43.

E-43 Wir       treffen   die              Kollegen
     Pro,nom   Vfin      Det,{nom,acc}    Noun,{nom,gen,acc,dat}
     We        meet      the              colleagues.

Only Constraint C-10 on page 56 ensures that no two dependency edges assign the same function subj to different lexemes.

12For ease of presentation, we restrict ourselves to the valence of verbs; the same pattern applies to different categories with valences as well, e.g., nouns.

Figure 3.7: Dependency analysis of Example E-42 with correctly chosen lexical entries for all word forms.

C-10 {X, Y} : SubjUnique : Valence :
X.label=subj & X^id=Y^id -> Y.label!=subj
The subject for a given word form is unique.

Note that while it was sufficient to examine a single edge in order to check, for example, agreement, one has to constrain pairs of dependency edges for the uniqueness property.

Valence necessity

Modeling valence possibilities and the uniqueness of valence fillers is not sufficient. What we cannot express up to now is the condition that there is at least one instance of a specific argument type. Only by adding this kind of restriction to the grammar can we make sure that exactly one word form exists that fills a specific valence slot. It is not possible to write a constraint that directly expresses this condition in the restricted formalism: The presence of an edge with specific properties could only be enforced by examining all edges at once or by employing an existential quantifier, whereas our restricted constraints may only relate two edges at most and must be universally quantified. One can overcome this problem by introducing an auxiliary level. Its purpose is to establish the inverse edge between the two lexemes, i.e., whenever a governor dominates its argument on the syntax level, the argument will dominate the governor on the auxiliary level [Maruyama 1990a]. A constraint can now check whether the valence slot of a lexeme is filled by testing whether the lexeme depends on another form on the auxiliary level. Note that as many of these auxiliary levels are required as there are different valence requirements of a single lexeme. These additional levels are just a technical means to compensate for the lack of an existence quantifier. Generally, they do not have a linguistic interpretation. Figure 3.8 on the facing page shows how the syntactic level and an auxiliary level are used in combination. Since the edges of the auxiliary level do not build complex tree structures, it is possible to represent them as simple arcs drawn below the syntactic tree. Figure 3.9 on page 58, therefore, contains the same information as Figure 3.8 on the facing page. Finally, the syntactic dependency edge and its inverse edge on the auxiliary level need to be coupled to ensure that they can only occur together. This is achieved by Constraint C-11: A lexeme selects another lexeme as its first argument on the auxiliary level if and only if there is an inverse dependency edge on the syntactic level with the label subj.

C-11 {X, Y} : Aux1SyntaxMapping : ValenceExistence :
X.level=SYNTAX & Y.level=AUX1 -> ( X^id=Y@id & X.label=subj <-> X@id=Y^id )
The first syntactic argument (subject) is mirrored on the first auxiliary level.

Constraint C-12 restricts the scope of the auxiliary level to those lexical items that are specified for a first argument by forcing the link to the root of the tree if no argument is possible.

C-12 {X} : Aux1Root : ValenceExistence :
X.level=AUX1 & ~exists(X@args:1) -> root(X^id)
Words without a first complement point to the root on the AUX1 level.

Constraint C-13 on the next page ensures that the subject slot of a finite verb is filled.

Figure 3.8: Dependency analysis of "Haben Sie es notiert?" for the main level SYNTAX (labels aux, subj, obj) and an auxiliary level AUXILIARY 1 on which the inverse edge subj_inverse points from the finite verb back to its subject.

C-13 {X} : Aux1Existence : ValenceExistence :
X.level=AUX1 & X@args:1:optional=no -> ~root(X^id)
Words with obligatory first arguments must modify the latter on the first auxiliary level.

Although these constraints jointly express an existence condition13, they fulfill the conditions stipulated above: At most two edges are considered, and no existential quantifier is used. Again, introducing soft weights for certain constraints would facilitate the analysis of sentences with missing arguments.
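The joint effect of Constraints C-11 to C-13 can be illustrated with a toy consistency check (invented data structures; only one direction of the biconditional in C-11 is tested, and C-12 is simplified, so this is a sketch of the coupling idea rather than the parser's implementation):

def subject_slot_filled(words, syntax_edges, aux1_heads):
    # words: id -> lexical info; syntax_edges: (dependent, governor, label) triples;
    # aux1_heads: id -> governor id on the auxiliary level AUX1 (None = root).
    for dep, gov, label in syntax_edges:
        # One direction of C-11: a subj edge on SYNTAX must be mirrored
        # by an inverse edge on AUX1 (the governor points back to its subject).
        if label == "subj" and aux1_heads.get(gov) != dep:
            return False
    for wid, info in words.items():
        obligatory = info.get("args1_optional") is False
        if not obligatory and aux1_heads.get(wid) is not None:
            return False   # toy version of C-12: no obligatory slot -> point to root
        if obligatory and aux1_heads.get(wid) is None:
            return False   # C-13: an obligatory first argument must be filled
    return True

# "Haben(1) Sie(2) es(3) notiert(4)?": the finite verb (1) requires a subject.
words = {1: {"args1_optional": False}, 2: {}, 3: {}, 4: {}}
syntax = [(2, 1, "subj"), (3, 4, "obj"), (4, 1, "aux")]
aux1 = {1: 2, 2: None, 3: None, 4: None}          # the verb points back to its subject
print(subject_slot_filled(words, syntax, aux1))   # True

aux1_unfilled = {1: None, 2: None, 3: None, 4: None}
print(subject_slot_filled(words, [(3, 4, "obj"), (4, 1, "aux")], aux1_unfilled))  # False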

3.4.2 Feature percolation

Feature transport refers to the process of carrying some kind of information along dependency chains of possibly arbitrary length. This mechanism is required to describe natural language phenomena where constraining information is applied at a particular node but originates from a structurally distant one.

13As a side effect, Constraints C-11 on the facing page and C-13 also guarantee the uniqueness of the argument, which has already been enforced by the binary Constraint C-10 on the facing page.

Figure 3.9: Dependency analysis with an alternative presentation of the first auxiliary level: the AUXILIARY 1 edge is drawn as an arc below the syntactic tree.

Figure 3.10: Feature percolation in a dependency tree for the example "Der Montag würde sich anbieten." ("Monday suggests itself.").

In Figure 3.10, it is necessary to have access to the person feature of the reflexive pronoun sich at a node where its antecedent can be identified in order to establish the required agreement. The information has to percolate all the way up the verb chain. Because WCDG is based solely on passive feature checking, it provides no direct mechanism for feature transportation. Therefore it is not possible for a node to 'inherit' properties from another node (which could only be achieved through some kind of assignment or unification). The dashed arrows indicate the way that the person feature of the word sich would have to travel to be matched against the word Montag, but no transport of this information can take place. One possibility to overcome this deficiency is to introduce labels for the dependency edges that encode more information than just the syntactic function. The labels obj, inf and subj could be augmented to the labels obj_3rd, inf_3rd and subj_3rd so that the person agreement could be checked at both ends of the chain. There are several disadvantages to this approach. Since labels can be accessed only as atomic values, constraints would become much more complicated, having to encode and decode the information contained in the labels several times. While this is more of a technical complication, the considerable increase in the number of possible labels is highly problematic since n · m different labels are necessary, where n is the number of syntactic functions and m is the number of different values the percolating feature may take. Since the computational complexity grows polynomially in the number of labels, this method is rejected; Sections 3.4.7 (additional levels) and 3.5.5 (approximation of high-order constraints) introduce mechanisms that are better suited to solve the problem of feature transportation.
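The blow-up can be made explicit with a short enumeration (toy label and feature sets, not the grammar's actual inventory):

functions = ["subj", "obj", "inf", "det"]       # n syntactic functions (toy set)
person_values = ["1st", "2nd", "3rd"]           # m values of the percolating feature
augmented = [f"{fn}_{p}" for fn in functions for p in person_values]
print(len(augmented))                           # 12 = n * m labels instead of the original 4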

3.4.3 Word order

Word order is a complex phenomenon and some of its aspects are (partly implicitly) dealt with in other sections in this chapter, e.g., vorfeld realization in Section 3.4.4, linear ordering of complements in Sections 3.3.2, 3.5.1 and 3.5.2 and projectivity in Section 3.4.5. Here, we restrict ourselves to simple aspects of word order. The standard linear ordering of an adjective and its governing noun, for example, differs for German and French in that in German the adjective precedes the noun while it often is the other way around in French (cf. Examples E-44 and E-45).

E-44 Ein großer Hund ...
     A big dog ...

E-45 Le traitement automatique ...
     The processing automatic ...
     'The automatic processing ...'

Since the restriction on word order is directly connected to the dependency between adjective and noun, it is easy to formulate Constraint C-14 that excludes constructions with a wrong word order from the set of well-formed German sentences. Section 3.5.1 extends this example to soft constraints.

C-14 {X} : AdjNounWordOrder : WordOrder :
X.level=SYNTAX & X@cat=ADJ & X^cat=NOUN -> X@pos < X^pos
An attributive adjective precedes its governing noun.

3.4.4 Vorfeld realization

The word order at sentence level in German is characterized by so-called topological fields (topologische Felder or Satzfelder) [Grewendorf, Hamm & Sternefeld 1987] which correspond roughly to positions in a sentence which again may be filled by no, one or more constituents.

Vorfeld    | Left Sentence Bracket | Mittelfeld | Right Sentence Bracket | Nachfeld
Ich        | habe                  | den Mann   | gesehen.               |
Den Mann   | habe                  | ich        | gesehen.               |
Wen        | habe                  | ich        | gesehen?               |
Ich        | habe                  |            | gesagt                 | daß ich nach Berlin komme.

Figure 3.11: Topological fields in German sentences.

Figure 3.11 contains several examples of the assignments of constituents to the topological fields. The vorfeld rule of German states that in a declarative main clause, the finite verb is preceded by exactly one constituent. This condition can be formalized by Constraints C-15 through C-17. These constraints ensure that a finite verb directly governs exactly one node on its left. Note how an auxiliary level VORFELD is used to ensure that the vorfeld slot is filled in much the same way as in Section 3.4.1. First the Constraint C-15 forces each finite verb to select a word to its left as its vorfeld. Then constraints C-16 and C-17 make sure that the main level SYNTAX is properly coupled with the auxiliary level VORFELD.

C-15 {X} : VorfeldInit : Vorfeld :
X.level=VORFELD ->
  ( X@cat=FINVERB -> ~root(X^id) & X^pos<X@pos ) &
  ( X@cat!=FINVERB -> root(X^id) )
Only finite verbs must point to the vorfeld, which lies to their left.

C-16 {X, Y} : UnderVerbVorfeld : Vorfeld :
X.level=SYNTAX & Y.level=VORFELD & X^id=Y@id & X@pos<X^pos -> Y^id=X@id
A constituent depending on the verb from the left must occupy the vorfeld.

C-17 {X, Y} : VorfeldUnderVerb : Vorfeld :
X.level=SYNTAX & Y.level=VORFELD & Y^id=X@id -> X^id=Y@id
The vorfeld constituent depends on the finite verb.

Note that a complete vorfeld treatment with subordinate clauses and non-projective constructions requires more complicated constraints with exceptions and possibly soft constraints. Separating word order phenomena from dependency structures was first suggested by Bröker [1998] by postulating an additional so-called word order domain structure. An extension of the Stellingen grammar to an entire model of topological fields is subject to further research. In particular, a treatment along the lines of Duchier & Debusmann [2001], who propose a complete topological dependency tree in addition to the tree of syntactic dependencies, is compatible with our approach.14 An alternative suggestion is due to Gerdes & Kahane [2001] who postulate an

intermediate phrase structure to model topological fields. Assuming such a model of topological fields is feasible within the framework of WCDG, the integration of supplementary preferences about the word order within the middle field (Mittelfeld) [Uszkoreit 1987] is straightforward.

14A formalization is given in Duchier [2001].

3.4.5 Projectivity

A general rule about constituents is that they are nested structures of adjacent words, i.e., a constituent may be contained within another, but two constituents may not partially overlap. This rule can be formulated in terms of projectivity: A dependency tree is projective if every word node can be projected to the base of the tree without crossing a dependency edge. An analysis is projective if and only if a projective dependency tree can be drawn for that analysis. Throughout this chapter, we present projections as dotted lines and dependency edges as solid lines in the figures. For example, Figure 3.12 shows a non-projective dependency tree for a non-projective analysis, with the instances of projectivity violations marked by circles.

(Dependency edge labels in the figure: modal, subj, wh_mod, obj.)

Wann wollen wir uns treffen? When want−to we us meet (infinitive)? When do we want to meet?

Figure 3.12: A non-projective dependency tree.

For efficiency reasons, one often wants to restrict the syntactic analyses to projective ones [Kahane, Nasr & Rambow 1998; Sleator & Temperley 1991; Courtin & Genthial 1998]. Nevertheless, there are some natural language phenomena for which it is difficult to establish a projective analysis, such as the surface strings resulting from wh-movement. In general we demand projectivity, but allow non-projective structures for some exceptions (such as modal verb constructions like the one in Figure 3.12). For connected dependency trees, projectivity can be enforced by the following constraints C-19 and C-20 on the following page.15 For reasons of simplicity, we have eliminated those parts of the constraints that deal with exceptional cases for which we explicitly allow non-projectivity.
C-19 {X, Y} : ProjectivityDirect : Projectivity : X.level=SYNTAX & Y.level=SYNTAX & X^id = Y@id -> Y^pos <= min(X@pos, Y@pos) | Y^pos>=max(X@pos, Y@pos)
A dependency cannot cover an ancestor.

15Maruyama [1990a] used a similar Constraint C-18 for the same purpose: C-18 {X, Y} : Maruyama : Projectivity : X^pos < Y@pos & Y@pos < X@pos -> X^pos<=Y^pos & Y^pos<=X@pos Hudson [2000, p. 30] uses a combination of no-tangling, no-dangling and sentence-root principles and restricts their applicability to the surface structure.

C-20 {X, Y} : ProjectivityNoCrossing : Projectivity : X.level=SYNTAX & Y.level=SYNTAX & min(X@pos, X^pos) < min(Y@pos, Y^pos) -> max(X@pos, X^pos) >= max(Y@pos, Y^pos) | max(X@pos, X^pos) <= min(Y@pos, Y^pos) Dependency edges do not cross. Figure 3.13 on the facing page proves that these constraints forbid all cases of non-projectivity. Note that constraint violations for non-projective structures are not always visible at the same location in the tree where we mark them with circles. The non-projectivities in Figure 3.12 on the page before, for instance, are both covered by a single violation of Constraint C-19 on the preceding page.16 When the edge from wann to treffen labeled wh_mod is bound to variable X and the edge from treffen to wollen labeled modal is bound to Y the term Y^pos (position of wollen) is neither smaller than the minimum nor larger than the maximum of X@pos (position

of wann) and Y@pos (position of treffen). This corresponds to case (Ia) in Figure 3.13 on the facing page, with the three nodes representing wann, wollen and treffen. Relative clauses with inverted subject-object word order as in Figure 3.14 on page 64 are instances where the non-projectivity is detected by Constraint C-20. In this case the dependency edges that cause the violation are not connected (case (Ib) in Figure 3.13 on the facing page).
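To make the projectivity condition more tangible, the following sketch (in Python, purely illustrative and independent of the WCDG constraint formulation above) checks a dependency analysis procedurally: an analysis is projective exactly if, for every edge, all words between dependent and governor are transitively subordinated under that governor.

    def is_projective(heads):
        """heads[i] is the position of the governor of word i, or None for the root."""
        def dominates(g, d):
            # Follow the chain of governors upwards from d and look for g.
            while d is not None:
                if d == g:
                    return True
                d = heads[d]
            return False
        for dep, gov in enumerate(heads):
            if gov is None:
                continue
            lo, hi = min(dep, gov), max(dep, gov)
            # Every word strictly between dependent and governor must be
            # (transitively) subordinated under the governor.
            if not all(dominates(gov, k) for k in range(lo + 1, hi)):
                return False
        return True

    # The non-projective analysis of Figure 3.12 (word positions 0-4):
    # wann -> treffen, wollen = root, wir -> wollen, uns -> treffen, treffen -> wollen
    print(is_projective([4, None, 1, 4, 1]))   # False

The procedural check mirrors the declarative Constraints C-19 and C-20: a failure corresponds to a word whose projection line would cross a dependency edge.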

3.4.6 Thematic roles

The representation of semantic information in CDG is hampered by the fact that only direct relations between word forms can be established. A grammar that employs only binary constraints cannot always investigate an entire constituent, but only its head word. Since features from subordinated words are not available at the head word, no compositional semantic structure can be built. Nevertheless, it is possible to model a number of selection and subcategorization phenomena. Furthermore, special problems of semantics, e.g., reference resolution, can often be solved by postulating additional levels of analysis (cf. Section 3.4.7). In principle, the very same mechanisms that enable the analysis of syntactic head-complement structures can be used to establish a kind of functor-argument structure. There are, however, a number of different possibilities for representing such a structure by means of dependency relations. Since in many cases, especially with complex verb groups, the syntactic and semantic structures differ considerably (cf. Figures 3.3 on page 48 and 3.4 on page 48), an obvious solution would be to create an autonomous semantic level in analogy to the syntactic one. Semantic relations between word forms are represented as dependency edges on that level, and the kind of relation (e.g., the thematic role) can be notated as the label of the edge.17 This approach is conceptually simple, but introduces an additional problem. While syntactic analyses are traditionally trees, i.e., each word form has a unique governor, such an assumption is not necessarily justified for semantics. In contrast to syntactic subordination, it is quite common for a word form to fill more than just one semantic slot. In Example E-46 the word form Mann fills at least two roles: It is the agent both for the ‘telling’ and the ‘laughing’ event. All attempts to represent both relations in a single tree are artificial and lead to additional problems, e.g., by introducing extra labels (cf. Section 3.4.2).

16It is questionable whether it is a shortcoming to detect only a single violation although two or possibly more intersections of projection lines and dependency edges exist (circles in Figure 3.12 on the page before). Arguably one can consider this a single non-projectivity caused by the subordination of the first under the last word. 17 Word grammar [Hudson 1990; Hudson 2000] uses a similar approach but allows extra edges in the syntactic structure itself, rather than introducing additional levels.


(Figure 3.13 first defines the relations used in the proof: the direct subordination relation, its transitive closure (the partial order of subsumption), and the total order of linear precedence on the word nodes of the tree, where a < b means that a appears to the left of b. A first diagram shows all possible configurations (I)–(IV) of three linearly ordered nodes a < b < c; only configuration (I) is non-projective. A second diagram expands configuration (I) into the three possible cases (Ia), (Ib) and (Ic) for the subordination path: the offending edges in case (Ia) are forbidden by Constraint C-19 on page 61, those in case (Ib) by Constraint C-20 on the facing page, and case (Ic) is symmetric with configuration (Ib).)

Figure 3.13: Proof (by case differentiation) that Constraints C-19 on page 61 and C-20 on the facing page ensure projectivity of dependency trees.

(Dependency edge labels in the figure: rel, subj, det, aux, obj.)

... der Termin, den wir vereinbart haben. ... the appointment that we agreed on have. ... the appointment that we have agreed on.

Figure 3.14: A non-projective dependency tree violating Constraint C-20 on page 62.

E-46 Der Geschichten erzählende Mann lacht. The stories telling man laughs. ‘The man telling stories laughs.’

Heinecke & Nolda [1998], for instance, keep the concept of a single semantic tree but allow reversed semantic edges with appropriate labels in cases where a word fills more than one semantic slot. An alternative solution is to distribute the semantic representation across a number of additional levels (one for each type of semantic role) on which the functor is the dependant, i.e., the relation is reversed if compared to the analysis in Figure 3.4 on page 48. Now the unique specification of the governor can be maintained. The disadvantage of requiring an entire set of new levels is compensated for by the natural and flexible representation of semantic relations. On each level, word forms that have a corresponding semantic slot to fill, ‘depend on’ or select the slot filling word forms. Figure 3.15 on the next page shows three semantic edges originating from two semantic levels below the syntax tree. The use of arrows in Figure 3.15 on the facing page also emphasizes that word forms with open semantic slots select or point to the word forms that fill the slots.

3.4.7 Reference resolution

As already mentioned in Section 3.4.6, many different problems of natural language analysis can be addressed by the introduction of additional levels in WCDG. One such problem is feature percolation (cf. Section 3.4.2) that can be addressed by directly connecting the source and the destination of a percolating feature on a separate level. Another example is reference resolution, i.e., identification of those word forms that anaphoric pronouns refer to. On a special level, say REF, referring pronouns select their antecedents (cf. Section 3.4.6 for an in-depth explanation of the general technique of introducing an additional level on which specific selectional processes are carried out).

(Figure: the sentence ‘Der Geschichten erzählende Mann lacht.’ (‘The man telling stories laughs.’) with syntactic edges labeled subj, mod, det and obj and thematic-role edges THEME, AGENT and AGENT on two additional levels below the syntax tree.)

Figure 3.15: A syntactic dependency tree with two additional levels for thematic roles.

Constraints dealing with properties of relations between referent and antecedent can then easily be formulated for dependency edges on that level.

In Example E-47 the relative pronoun der has two possible antecedents: Ehemann and Frau.

E-47 Ich sehe den Ehemann der Frau, der schläft Det,acc,masc Noun,acc,masc Det,gen,fem Noun,gen,fem Rel,nom,masc I see the husband (of the) woman who sleeps. ‘I see the sleeping husband of the woman.’

Constraints like C-21 have access to dependency edges on the level REF and constrain them to those where, for example, the pronoun and the noun agree with respect to gender and number.

C-21 {X} : RelativeGenderNumber : RelativeSentence : X.level=REF & X@cat=RELPRONOUN -> X@gender=X^gender & X@number=X^number

Relative pronoun and antecedent agree with respect to gender and number.

Figure 3.16 on the next page presents a possible analysis for Example E-47. Note that this structure departs from the classical analysis of relative clauses [Tesnière 1959] in so far as it does not have to assume a double position for the relative pronoun, both as a governor of the whole relative clause and as a regular dependent pronoun [Gerdes & Kahane 2001, Section 2.9].

Other restrictions, such as those involving quantifier scope or discourse [Kamp & Reyle 1993], cannot easily be modeled because firstly the domain of a WCDG analysis is usually a single sentence and secondly the logical inventory is generally too weak.

(Figure: the sentence ‘Ich sehe den Ehemann der Frau der schläft.’ with syntactic edges labeled subj, obj, det, gen_attr and rel, and an edge on the REF level connecting the relative pronoun to its antecedent.)

Figure 3.16: An analysis of a sentence with a relative clause including reference resolution.

3.5 Modeling gradation using weighted constraints

In order to emphasize certain techniques of WCDG grammar modeling in the previous section, we only used constraints as strict conditions, i.e., a valid dependency structure is not allowed to violate any of these constraints. We now come back to gradation in natural language which can only be (or is at least much better) described in terms of soft constraints that encode preferences rather than absolute conditions (cf. Section 1.1). Grammatical conditions can then be subdivided into a number of classes, partially along the dimension of constraint strength, partially depending on their interaction with related constraints.

Certainty: Conditions that must hold in any case in a valid solution are modeled as hard constraints. Even if one allows almost all grammatical rules to be violable, hard conditions are used to define the ‘fall-back’ cases; otherwise too many alternative equivalent structures would be possible in case of an erroneous utterance.

Well-formedness: Grammatical rules are formulated as soft constraints that receive a small weight (near zero). This ensures that language errors in utterances do not lead to a break- down of the system, but also takes into consideration that a violation is a severe fault in language use.

Preference: Preferential information enters the system in the form of constraints with high weights (near one). They allow the distinction between otherwise equivalent candidates but (usually) cannot overrule a well-formedness constraint.

Spurious ambiguity: Some constraints with weights very close to one have the sole purpose of avoiding spurious ambiguities that do not correspond to any real linguistic ambiguity in the sentences.

Markers: Constraints with a weight of exactly one do not influence the score of a structure. Their only purpose is that their violation indicates that a certain condition is fulfilled, such as the existence of a relative clause in the sentence. This completes the hierarchy of constraints that is defined by an increasing weight, from zero (certainty) to one (markers).

Information fusion: Constraints that integrate external judgments, such as acoustic scores from a speech recognizer or part-of-speech tag probabilities, are special because their weights must be carefully balanced with the rest of the grammatical constraints. No general rule about the score range can be given for them. See Section 5.6 for an example.

Defaults: Default values are a mixture of previous types: The space of possibilities is described with certainty conditions and all values except the default are dispreferred using preference constraints.

Approximation of higher order constraints: Soft constraints are well suited for encoding complex conditions that could otherwise not be modeled with the restricted constraints.

Dynamic constraints: Dynamic constraints do not have a fixed weight, but receive different scores depending on the context in which they are evaluated (cf. Section 2.5.1).

A WCDG with soft constraints defines a hierarchy of possible parses, each one annotated with a score and the list of soft constraints that are violated by the particular structure. Soft constraints have the additional advantage that they allow one to integrate all kinds of preferences into the analysis, so that even if the utterance is really ambiguous the parsing system can rank its parses based on the preferences of the grammar. A practical system may then choose to either use only the best solution or more than one parse, whichever seems more appropriate depending on the particular application and on the scores of the structures. Note that the introduction of weighted constraints can have two almost contradictory consequences:

£ If hard constraints are changed so that they may be violated, then those utterances that violate these constraints become parsable. Of course, the search space for the parsing problem increases as more dependency edges remain acceptable.

£ On the other hand, ambiguous analyses for an utterance can be eliminated by introducing additional constraints with high scores. Such constraints can assign slightly different scores to competing analyses so that only one of them is selected as the solution. These prefer- ences also guide the analysis process so that the most promising hypotheses are tried first. Thus, the expansion of the space of possible solutions (see above) can be counterbalanced by a more goal-oriented disambiguation.
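How such a hierarchy of parses arises can be illustrated with a small sketch (Python, purely illustrative; the constraint names and weights are invented, and the actual aggregation used by WCDG is the one defined in Sections 2.2.4 and 2.3.3): the score of a candidate structure is the product of the weights of all constraints it violates, and the violated constraints are reported alongside the score.

    def score(structure, constraints):
        # constraints: list of (name, weight, predicate); predicate(structure)
        # is True if the structure satisfies the constraint.  A weight of 0.0
        # marks a hard constraint, a weight of 1.0 a pure marker.
        total, violated = 1.0, []
        for name, weight, holds in constraints:
            if not holds(structure):
                total *= weight
                violated.append(name)
        return total, violated

    # Invented toy constraints over a structure represented as a dictionary.
    constraints = [
        ("SubjectBeforeObject", 0.9, lambda s: s["subj_pos"] < s["obj_pos"]),
        ("NumberAgreement",     0.1, lambda s: s["subj_num"] == s["verb_num"]),
    ]
    topicalized = {"subj_pos": 3, "obj_pos": 1, "subj_num": "sg", "verb_num": "sg"}
    print(score(topicalized, constraints))     # (0.9, ['SubjectBeforeObject'])

Candidates are then ranked by descending score; a hard violation (weight 0.0) forces the score to zero regardless of all other constraints.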

3.5.1 Well-formedness

Although speakers most often follow the rules of a language, there are always exceptions: Agreement requirements are ignored, words are omitted, sentences are truncated, word order regularities are not satisfied etc. Constraints in the previous section encode crisp ‘grammatical’ rules. In order to be able to parse deviant utterances these constraints have to be extended by exceptions and non-zero weights. For instance, Section 3.4.3 featured Constraint C-22 (repeated from Constraint C-14 on page 59) that forces an adjective to precede its governing noun in a German noun group.

C-22 {X} : AdjNounWordOrder : WordOrder : X.level=SYNTAX & X@cat=ADJ & X^cat=NOUN -> X@pos < X^pos
An adjective precedes its governing noun. In actual language use, however, adjectives do occur to the right of the noun they modify, as Examples E-48 through E-51 illustrate.

E-48 (a) Students have to do a year’s preparation before they start the degree course proper. (b) She hadn’t had a proper holiday for years. (c) A hammer is the proper tool for the job. E-49 The task could not be finished in the time available. E-50 Forelle blau Trout blue ‘Truite bleu’ E-51 Es gibt hier keine Bücher... alte. It exists here no books... old. ‘There are no books here, old ones.’

Of course, it is always possible to argue for alternative (more appropriate) analyses of these examples that differ from that of an adjective modifying a noun: One might want to provide multiple lexical entries for the word proper because it has a different meaning in Example E-48a than in Examples E-48b and E-48c; an analysis of Example E-49 as a reduced relative clause might seem more appropriate;18 Example E-50 could be considered a fixed term originating from the French translation19; the adjective in the spontaneous utterance of Example E-51 could be analyzed as extraposed. However, offering a special treatment does not solve the problem completely. Alternative analyses can only be provided for anticipated constructions. But even for foreseen deviations from the standard, a grammar writer may abstain from arranging a special treatment for them, for instance, because the construction is computationally expensive to parse or difficult to express in the formalism. The point here is that a simple standard analysis can easily be extended to deviating cases so that even unforeseen constructions can be parsed (although maybe in a linguistically suboptimal way). Constraint C-23 on the next page demonstrates two major methods to extend the coverage of the grammar to non-standard utterances. Firstly, and most importantly, the constraint gets a soft weight of 0.1, making post-modifying adjectives possible but penalized. Secondly, an exception has been coded into the constraint. Adjectives that are explicitly marked as post-modifying in the lexicon (such as proper) are not penalized by the constraint.

18There are also cases that do not allow an adjective to be placed before the noun. The adjective asleep in Example E-52 only allows a predicative use, thus supporting an analysis as a reduced relative clause.

E-52 The fire alarm saved a man asleep.

19The word order of the adjective proper when placed after the noun is probably also influenced by the word order of the French adjective propre.

C-23 {X} : AdjBeforeNoun : WordOrder : 0.1 : X.level=SYNTAX & X@cat=ADJ & X^cat=NOUN & X@type!=postmodifier -> X@pos < X^pos
An adjective that is not lexically marked as post-modifying precedes its governing noun.

Note that Example E-51 on the facing page shows an additional difficulty. The adjective alte does not agree with the noun, neither with regard to case as in ‘Es gibt hier keine alten Bücher.’ nor with regard to adjective type (‘Es gibt hier alte Bücher.’). The next paragraph shows how the sentence can nevertheless be analyzed in the desired way. In Section 3.4.1, the hard Constraint C-9 on page 54 was suggested to enforce the complete set of lexical valence requirements (category & morpho-syntactic agreement) in a head-argument structure at once. Assigning a non-zero weight facilitates the successful analysis of utterances with agreement errors. A much better idea, however, would be to split it into a set of smaller soft constraints, each of which constrains only a single feature, e.g., as in Constraints C-24 through C-27.

C-24 {X} : ValenceFirstArgCat : Valence : 0.1 : X.label=subj -> X^args:1:cat=X@cat

C-25 {X} : ValenceFirstArgCase : Valence : 0.2 : X.label=subj -> X^args:1:morph:case=X@morph:case

C-26 {X} : ValenceFirstArgNumber : Valence : 0.2 : X.label=subj -> X^args:1:morph:number=X@morph:number

C-27 {X} : ValenceFirstArgPerson : Valence : 0.2 : X.label=subj -> X^args:1:morph:person=X@morph:person

The smaller constraints have the advantage that the weights can be assigned in a more fine-grained fashion and that a sentence with a single error, e.g., wrong case only, can be treated differently from utterances with multiple faults. Additionally, errors can be detected individually, which is crucial for diagnosis of language faults (cf. Section 5.4).

3.5.2 Preference constraints

Besides contributing to a robust analysis, soft constraints allow the integration of preferential knowledge into the parsing system. By preferences, we mean information about the structure of natural language utterances that is often, but not always, true and that helps to find the most plausible analysis of an ambiguous sentence. In German, for example, the subject usually precedes the object, but this is not as strict as it is in English, for instance. If the speaker wants to focus on the object or just get the attention of the hearer, it is perfectly grammatical to move the object to the beginning of the sentence as in Example E-53 (cf. Example E-15 on page 4).

E-53 Diesen Termin mag ich nicht. This appointment like I not. ‘This appointment, I do not like.’

Thus, Constraint C-28 (repeated from Constraint C-8 on page 49) is too strict for German. Softening the constraint with a weight of, say, 0.9, preserves the preferential characteristic of the condition without leading to a system failure in the topicalized case.

C-28 {X, Y} : SubjectBeforeObject : WordOrder : 0.9 : X.level=SYNTAX & Y.level=SYNTAX & X^id=Y^id & X.label=subj & Y.label=obj -> X@pos < Y@pos
The subject precedes the object.

Preferential knowledge can also be encoded directly in lexical entries, using the valence representation introduced in Section 3.4. For instance, the word form paßt (‘suits’) expresses approval. It takes an indirect object that indicates who gives the approval. In the Verbmobil domain this will most often be the speaker and sometimes the hearer. The lexical entry for the word form may include this domain-specific knowledge as a list of preferred types for its second argument (cf. Figure 3.17). Note that the least preferred sort is anything that subsumes every other sort so that a second argument that is not a human being only leads to a penalization with a weight of 0.5.

 

  

word: paßt
cat:  finite
args:
  1:  cat: nominal
      case: nom
      number: sg
  2:  cat: nominal
      case: dat
      sort: speaker 1.0   human 0.8   anything 0.5

Figure 3.17: A lexical entry with preferences for the type of its second argument.

3.5.3 Spurious ambiguity

The weakest kind of constraints is used to avoid spurious ambiguities in natural language analyses that can be traced back to the failure of the formalism to cope with underspecification.

E-54 Ich dachte noch in der nächste Woche. Adj,{nom,acc} Noun,dat I thought still in the next week. ‘I thought still next week.’

Example E-54 (n001k001-1 from the Stellingen corpus) violates an agreement constraint because the dative noun Woche does not agree with its adjective with respect to case. But nächste is ambiguous in that it can be nominative or accusative case. In order to avoid an enumeration of these readings, a constraint very slightly prefers nominative case to the other cases. Thus, we get an analysis with nominative assigned to nächste (and, of course, identifying that a case violation occurred). This kind of preference corresponds to the linguistic preference for the unmarked nominative case. Nevertheless a speaker of a language would not assign nominative case to nächste but instead note that nächste should be replaced by the correct form nächsten. The preference is here used as a technical means to overcome a technical problem that humans do not even have.

3.5.4 Default subordination

Usually, the finite verb is considered the root of the dependency tree. However, in cases where no consistent structure can be found, it may be desirable that word forms of other categories form the root. This may be because no adequate governor is available for them or because the finite verb is missing from the utterance. Allowing word forms of all categories to form the root of a tree considerably contributes to the robustness because unknown constructions and fragmentary input can be at least partly analyzed. Mitjushin [1992] identifies the general underlying cause why dependency analysis is well-suited for partial parsing:

“In our opinion, dependency structures are better adapted to partial parsing than constituent structures. The reason is that the dependency structure of a segment is the same both when the segment is considered as isolated and when it is considered as a part of some sentence (by “segment” we understand any sequence of words).” —— Mitjushin [1992, p. 1]

The Constraints C-29 and C-30, for example, demonstrate the method of an isolated analysis for adverbs. C-29 {X} : AdverbInit : Adverb : 0.0 : X@cat=ADVERB -> X^cat=VERB | X^cat=ADJ | X^cat=ADVERB | root(X^id) An adverb is subordinated under a verb, an adjective or another adverb or may be the root of the tree. C-30 {X} : AdverbNoRoot : Adverb : 0.1 : X@cat=ADVERB -> ~root(X^id) Most often, an adverb is not the root of the tree. Constraint C-29 generally allows the subordination of adverbs under verbs, adjectives and ad- verbs, as well as an analysis of adverbs as the root of the tree. Constraint C-30 penalizes the root analysis with a weight of 0.1. Therefore, the subordination under verbs, adjectives and adverbs is highly preferred, but in case that no such dependency can be found, an analysis as root is acceptable as well.

E-55 (Wie hat Schalke die Bayern geschlagen?)

(How has Schalke defeated the Bavarians?) Haushoch! Resoundingly!

In Example E-55 on the preceding page, the single adverb haushoch can be analyzed even though no complete sentence, especially no finite verb, can be found. While this example is trivial the very same mechanism allows an analysis of fragmentary input (cf. Example E-56).

E-56 Ich ... wie ist es am Donnerstag? I ... how is it on Thursday? ‘How about Thursday?’

Obviously (at least for human beings), the speaker started a sentence with ich, but changed his mind, aborted the sentence and uttered a different question. This kind of self-correction is found quite often in spontaneous speech and practical systems have to deal with it. Figure 3.18 shows a possible dependency analysis for this utterance.

(Dependency edge labels in the figure: pmod, wh_mod, subj, prep.)

Ich... Wie ist es am Donnerstag I... How is it on Thursday How about Thursday?

Figure 3.18: An analysis of an utterance with self-correction.

The personal pronoun ich is subordinated under the root of the tree because it does not fit very well into the dependency tree for the rest of the utterance. Such a structure can not only be used by a subsequent processing stage to identify the structure of the main sentence but also (in conjunction with a corresponding constraint violation) to determine that there was some kind of self-repair. However, some additional component has to detect that the overall low score of the analysis can solely be attributed to the sentence restart; the structure of the sentence proper (without the leading superfluous words) probably receives a high score.

Hudson [2000, p. 24] suggests the similar sentence-root principle in the context of word grammar to allow categories other than finite verbs to be analyzed as the root element. However, this principle does not solve the problem of fragmentary input because it still limits the number of root elements to one.

The additional subordinations under root allow for a fall-back in case nothing better can be found. This applies not only to anticipated cases like the missing finite verb but leads to a general fall-back strategy by partial parsing. Although this suffices to cover simple cases of self-repair such as the one above, partial parsing cannot handle all problems that arise from self-corrections, especially when later parts of the utterance refer to the aborted sentence or when the correction takes place in the middle of the utterance.

A similar pattern can be used to cope with cases of category conversion such as the one in Example E-57 where the preposition ohne depends on another preposition inklusive although the latter usually only takes nouns as complements.20

E-57 Inklusive ohne drei Monate Grundgebühr! Including without three months basic fee! ‘Including free basic fee for three months!’

Soft constraints which strongly prefer nouns as dependants of prepositions but also allow other categories such as prepositions can be used to find a connected analysis for the example. Of course, a partial analysis analogous to the one in Figure 3.18 on the facing page with both prepositions subordinated under the root is possible as well.

3.5.5 Approximation of higher order constraints

As some examples in previous sections have shown, it is sometimes desirable to relate more than only two dependency edges using a single constraint. Different methods to circumvent this problem have been introduced so far, including feature percolation (cf. Section 3.4.2) and the addition of specialized levels (cf. Section 3.4.7). For some phenomena, however, it is not necessary to make the constraints bullet-proof because the alternative solutions are so rare that they do not merit the effort. In these cases an approximation of higher order constraints comes in handy. In order to correctly model the restrictions of the reflexive pronoun in Figure 3.10 on page 58, a constraint must access at least three dependency edges simultaneously. Constraint C-31 is especially tailored for this problem, but unfortunately it is a ternary one and therefore fails to meet the requirements for constraints we set up in Section 2.6 (cf. Section 3.3.3).

C-31 {X, Y, Z} : ReflexivePronounAntecedent : Ternary : 0.0 : X.level=SYNTAX & Y.level=SYNTAX & Z.level=SYNTAX & X^id=Y@id & Y^id=Z^id & X.label=obj & Y.label=inf & Z.label=subj & X@cat=REFLEXIVE -> X@morph:person=Z@morph:person
A reflexive pronoun agrees with its antecedent with respect to the person feature. Approximations of higher order constraints can take two forms:

£ Only a subset of the configurations that are not well-formed is forbidden by the approximation. The rest may actually occur so seldom that it is not worthwhile writing rules to forbid them.

£ A superset of the invalid configurations is penalized by the approximating constraints, i.e., there are some well-formed structures that spuriously make constraints fail. Most often, a grammar writer wants to use soft constraints in this case because, then, only the score of a structure decreases, but it is still possible to analyze the utterance correctly.

20The example has been taken from an advert for ‘Genion’.

Constraint C-32 is of the second kind. The rationale behind it is that, if there is a subject to an auxiliary verb and a reflexive pronoun depends on an infinitive verb to the right of the first verb, we assume that the infinitive verb directly or indirectly depends on the auxiliary verb.

C-32 {X, Y} : ReflexivePronounAntecedent : Approximation : 0.5 : X.level=SYNTAX & Y.level=SYNTAX & X.label=subj & Y.label=obj & X^pos < Y^pos & Y@cat=REFLEXIVE -> X@morph:person=Y@morph:person
The reflexive pronoun agrees with the subject with respect to the person feature.

3.5.6 Dynamic constraints

Dynamic constraints do not have a weight that is fixed by the grammar writer. Instead they depend on the sentence context in which they are evaluated. Dynamic constraints can, of course, be used for all purposes listed in the previous sections. Typical examples are constraints that handle the preferred maximum length of certain dependency links, e.g., article-noun dependencies such as in Constraint C-33.

C-33 {X} : LengthArticleNoun : Length : [ .2 + .8 * 5/[4 + abs(distance(X@id, X^id))] ] : X.level=SYNTAX & X.label=det -> false
An article modifies a nearby noun.
(The accompanying plot shows the dynamic weight as a function of the distance between article and noun: it starts at 1.0 for adjacent words and approaches the lower bound 0.2 for large distances.)
Usually short distances are favored and the longer the distance, the higher the penalty. The particular form of the penalty term in Constraint C-33 ensures a lower bound of 0.2 since grammatical sentences exist with arbitrarily long distances between article and governing noun (cf. Example E-1 on page 2).
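The behavior of the dynamic weight in Constraint C-33 can be reproduced with a few lines of code (Python, purely illustrative; only the penalty formula is taken from the constraint):

    def article_noun_weight(distance):
        # Dynamic weight of Constraint C-33: 1.0 for adjacent words,
        # approaching the lower bound 0.2 as the distance grows.
        return 0.2 + 0.8 * 5 / (4 + abs(distance))

    for d in (1, 2, 5, 10, 20):
        print(d, round(article_noun_weight(d), 3))
    # 1 1.0   2 0.867   5 0.644   10 0.486   20 0.367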

Another type of dynamic constraint is one with a penalty term with reference to the lexicon, e.g., the penalty in Constraint C-34 which enforces a dynamic penalty for missing arguments (cf. Constraint C-13 on page 57).

C-34 {X} : ObligatoryArgument : Argument : [ X@args:2:obl ] : X.level=AUX2 & exists(X@args:2:obl) -> ~root(X^id) A missing argument is penalized as specified in the lexicon.

Constraints like Constraint C-34 are supplemented by lexical entries with specifications as to how a missing argument should be penalized (cf. Figure 3.19 on the facing page). Completely optional arguments can be described by a very high value for the corresponding lexical feature, for instance 0.9 for the feature args:3:obl in Figure 3.19 on the next page. Strictly speaking, dynamic constraints do not have to be soft constraints since it is also possible to specify the value zero in the lexicon. Our grammar, for instance, contains hard dynamic constraints for the obligatoriness of the coordinated components for particular coordination particles. Most of the dynamic constraints nevertheless have soft weights.


 

  

word: verschiebst
cat:  FINVERB
morph:  number: sg
        person: second
args:
  1:  realized: yes
      obl: 0.1
      cat: Nominal
      morph:  number: sg
              person: second
              case: nom
      role: AGENT
      sort: human
  2:  realized: yes
      obl: 0.1
      cat: Nominal
      morph:  case: acc
      role: THEME
      sort: appointment
  3:  realized: yes
      obl: 0.9
      cat: Adposition
      loctype: lative
      prepsort: temporal

Figure 3.19: A lexical entry with penalties for missing arguments specified in the lexicon.

3.5.7 Outvoting

Multiple violated soft constraints induce a more severe fault than any violation of the individual constraints alone. Furthermore, multiple less important constraints can even ally against a single more important constraint as long as the single constraint is not weighted with zero in which case no solution at all can violate this constraint.

This behavior is due to the employed aggregation function (cf. Sections 2.2.4 and 2.3.3). Penalties of violated constraints are multiplied and, hence, the combined effect of multiple soft constraints can outvote a different single constraint whose weight is lower (i.e., which is more important) than any of the other weights alone.

This behavior is in conflict with the strict dominance hypothesis of optimality theory [Prince & Smolensky 1993, cf. Section 1.4] which postulates the lexicographic aggregation of Section 2.2.4.
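The arithmetic behind outvoting is easy to reproduce (Python, with invented weights): under multiplicative aggregation, several mild penalties can jointly fall below a single more severe penalty, so the analysis that violates only the single, more important constraint is preferred; under the lexicographic aggregation assumed by strict dominance, the comparison would come out the other way.

    # Analysis A violates one important soft constraint (weight 0.75).
    score_a = 0.75
    # Analysis B violates three less important constraints (weight 0.9 each).
    score_b = 0.9 * 0.9 * 0.9        # = 0.729
    print(score_a > score_b)          # True: the three weak violations together
                                      # are worse than the single stronger one.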

The following artificial variants of the impeccable Example E-58 can help to illustrate both effects. In all sentences, the word Mann is supposed to be analyzed as the subject.

E-58 Der Mann(sg) kauft(sg) Autos(pl). The man buys cars. E-59 Autos kauft der Mann. E-60 Mann kauft Autos. E-61 Der Mann kaufen Autos. E-62 Autos kauft Mann. E-63 Autos kaufen Mann.

In Example E-59, the word order preference subject-before-object is not satisfied. Example E-60 violates the weak constraint that countable singular nouns must be accompanied by an article. Example E-61 contains a number agreement violation. Example E-62 violates both constraints of E-59 and E-60. Example E-63 finally combines all three constraint violations. Given the results of an uncontrolled experiment to rank the six sentences21 it can be assumed

that each example can be judged worse than its predecessor except that the result for E-61 ≻ E-62 is not significant. The assumption that Example E-62 is worse than both E-59 and E-60 supports the claim that constraint violations are cumulative. The placement of E-61 after both E-59 and E-60 but (at least) on a par with E-62 indicates that weak constraints can gang up against a stronger one and finally overrule the latter. This example is by no means meant as a proof but only as an illustration. Keller [2000b], however, gives extensive experimental psycholinguistic evidence for outvoting linguistic constraints. Note that his notion of hard constraints differs from ours; his hard constraints correspond more closely to constraints with low weights in our framework.

“Furthermore, we found evidence against strict domination among hard constraints. The constraints Res is ranked higher than both Inv and Agr. However, the com- bined violations of Inv and Agr are as unacceptable as a single violation of Res (cf. Figure 3.8). Such a ganging up of constraint violations should be impossible under strict domination; the combination of two lower ranked violations should not compensate for a single violation of a higher ranked constraint.” —— Keller [2000b, p. 95]

21Fifteen native speakers of German were requested to intuitively rank the variants with regard to quality or acceptability under the condition that the man (Mann) is the one who buys cars. The following results were

obtained (≻ means ‘is ranked higher’):

word order vs. unmarked               E-58 ≻ E-59   12    80%
article vs. unmarked                  E-58 ≻ E-60   13    87%
word order vs. word order+article     E-59 ≻ E-62   15   100%
article vs. word order+article        E-60 ≻ E-62   13    87%
word order vs. number                 E-59 ≻ E-61   15   100%
article vs. number                    E-60 ≻ E-61   14    93%
number vs. word order+article         E-61 ≻ E-62    8    53%

His findings also support the property of the multiplicative aggregation function that the combined score of two violated constraints is lower than any of the individual scores.22

“We found evidence for the hypothesis that constraint violations are cumulative: the more constraints a structure violates, the higher its degree of unacceptability.” —— Keller [2000b, p. 95]

3.6 Related work

Dependency grammar as a grammatical theory has been introduced to modern linguistics by Tesnière [1959] though he did not provide a formalization. Subsequent dependency grammar approaches include a grammatical account for German [Kunze 1972; Kunze 1975], word grammar [Hudson 1990; Hudson 2000], meaning text theory [Mel’čuk 1988; Kahane forthcoming] and dependency unification grammar [Hellwig 1988]. Weber [1992] presents a textbook for dependency grammars. A systematic approach to German word order in dependency grammar was suggested by Bröker [1998] and has thereafter been modified within various frameworks [Gerdes & Kahane 2001; Duchier & Debusmann 2001; Duchier 2001]. An English version [Järvinen & Tapanainen 1997, EngCG-2] of constraint grammar [Karlsson 1990; Karlsson et al. 1995] which aims at dependency structures, features a large coverage and robust analysis.23 The ParseTalk model [Bröker, Hahn & Schacht 1994; Bröker et al. 1997; Bröker et al. 1996] is a cognitively motivated parsing system that draws on a lexicalized performance grammar aiming at dependency structures. Here, parsing is understood as parallel communication (based on message passing) between linguistic objects. In the context of speech parsing as word graph filtering, Harper & Helzerman [1995, p. 47] developed a CDG for English which uses hard constraints only and is still under further development [Harper et al. 1999a, Section 3.1]. Our WCDG (Stellingen grammar, cf. Section 5.2.1), which employs most of this section’s techniques in one way or another, was inspired by an earlier grammar [Heinecke & Schröder 1998; Heinecke & Nolda 1998; Nolda & Heinecke 1999] but knowingly departs from the latter in some fundamental aspects. In particular, it puts the focus on distributing structure over several levels, on weighted constraints and therefore on graded assessment and robust analysis.

3.7 Summary

This section demonstrated major techniques for [W]CDG modeling that have been employed in a grammar for German (cf. Section 5.2.1) but are more generally applicable.

22See also Keller [2000a] and Keller [2001]. Fanselow & Féry [forthcoming a, Section 6], in contrast, argue that multiple constraints are conjoined into a new constraint which is more important than each of the original constraints. Of course, cumulative acceptability ratings can always be explained even without cumulative aggregation functions by postulating aggregated constraints, in the extreme case the power set of the basic constraints. 23Conexor functional dependency grammar, the successor of EngCG-2, is now a commercial product. An online demo is available at http://www.conexoroy.com/products.htm.

The first part emphasized techniques for crisp constraints. Linguistic phenomena such as agreement, uniqueness and necessity of arguments, feature percolation, word order and projectivity as well as semantic aspects like thematic roles and reference resolution have been addressed. To do so, a rich inventory of methods such as feature agreement, uniqueness conditions for certain types of dependency edges, introduction of auxiliary levels mirroring edges on a main level, enrichment of edge labels with additional information, access to positional variables and introduction of specialized auxiliary levels has been employed. Subsequently, we focused on weighted constraints. Various types of constraints depending on their strength as well as on their purpose have been identified. Different aspects such as word order errors, agreement violations, preferential knowledge, spurious ambiguities, partial parsing and the approximation of complex constraints can be modeled with soft constraints such that the robustness of the system improves while at the same time the disambiguation power of the grammar increases. The cumulative nature of constraint violations and the possibility of constraint outvoting have been linguistically motivated. It was argued that hard and soft constraints together can handle a wide variety of linguistic phenomena, originating from both the competence system and the performance system of a speaker. However, we also identified limitations of the expressiveness of the formalism. While the lack of existential quantifiers can be elegantly (if not also automatically) overcome, the restriction to at most binary constraints which originates from procedural considerations is a more severe one. The eliminative nature of [W]CDG turned out to be the reason for an unfamiliar style of grammar development where old constraints have to be revised when extending the coverage of the grammar (for instance, making hard constraints soft under certain conditions such as exceptions to the projectivity condition). But exactly this selection mechanism from a predefined set of possibilities also enables a fall-back strategy that increases the robustness of a grammar and leads to a fail-soft behavior. The introduction of soft constraints in all their different manifestations has been identified as the main cause for increased robustness and coverage. This claim will be verified quantitatively in Section 5.3.

Chapter 4

Algorithms

4.1 Overview

In Chapter 2, weighted constraint dependency grammars (WCDG) have been presented including a mapping from the WCDG parsing problem to constraint satisfaction problems (CSP). This chapter presents general algorithms for CSP, classifies them according to specific properties and describes and evaluates special variants thereof for WCDG parsing. Section 4.2 looks at solving CSPs from an informal point of view. Human heuristics are mentioned, and the importance of a good problem model and effective heuristics is emphasized. The subsequent Section 4.3 introduces algorithms for general CSP and CSOP solving. Criteria with regard to soundness, completeness, temporal behavior (efficiency, incrementality, anytime) and parallelism that are developed in Section 4.4 are applied to the algorithmic schemes in Section 4.5 and advantages and disadvantages of the algorithms are discussed. Sections 4.6 through 4.9 then go into detail regarding variants of these algorithms for WCDG parsing. An implementation is mentioned in Section 4.10 and results from experiments with these methods are described in Section 4.11. An incremental mode of the WCDG parsing algorithms is briefly discussed in Section 4.12, and program optimizations are presented in Section 4.13. Related work is reviewed in Section 4.14 and a summary is given in Section 4.15.

4.2 Solving a constraint satisfaction problem

There exist sound and complete algorithms to solve a CSP as defined in definition 2.1, i.e., as the problem of consistently assigning values to variables. Since the number of possible combinations is finite,1 all of them can be generated and tested for consistency. This naïve approach – known as generate-and-test – of course suffers from combinatorial explosion: The number of possible combinations grows exponentially in the number of constraint variables and renders the problem intractable even for moderate problem sizes.

1More precisely, the number of solution candidates for a CSP as defined in definition 2.1 is the product of the sizes of all domains.


  v)%$&$&$&'¥ *

Ò ÓÀÔ Õ µ v)%$&$&$&'¥

* 8

Ò ÓÀÔ Õ Ö¶×ÙضØ

(@¦ 7

Ý © Û

¦ 6

Ú Ú

ÿ

* Uu  v)%$&$&$&'¥ * Uv  #)%$&$&$&'¥

5

Ó¿Ô Õ Ò

Ò ÓÀÔ Õ

×ÙÞ 4

©•Û Û

¦¨ Ü& (@¦

Ûd

4 (@¦

3 (

ÿd( 2 ¥

Y candidates ¥ Y candidates 1 a b c d e f g h (a) (b) (c)

Figure 4.1: Two CSP models for the ¥ -queens problem. (a) naïve model with the field numbers (from ÿ

1 to ¥ ) used as values for the variables, (b) modeling that exploits problem specific information ¤ and uses column numbers as values, (c) example solution for ¥G . combinations grows exponentially in the number of constraint variables and renders the problem intractable even for moderate problem sizes. Although the human capabilities to solve combinatorial problems are limited because usually a large number of possibilities have to be evaluated (in parallel),2 human beings manage to decide on a solution by applying a number of techniques:

£ Problem specific information: Often domain specific knowledge can be used to dramatically reduce the set of solution candidates. For instance, when solving the n-queens problem, i.e., the task to place n queens on an n×n checkerboard so that no queen threatens another one, it is a good idea to assume (without loss of generality) that queen number i is in column i (cf. Figure 4.1 and the sketch after this list).

£ Early constraint checking: The naïve generate-and-test procedure always generates complete solution candidates before checking the first constraints. Obviously, a lot of work is wasted3 if an inconsistency is due to values of variables that are assigned early. A better strategy is to apply constraints as soon as possible, i.e., already while building a solution candidate. Two kinds of interleaved constraint checking can be distinguished.

– Prospective checks: Whenever a new value is assigned, it is checked whether all future variable domains contain values that are consistent with the newly assigned value. If all values from at least one domain are inconsistent, another value is tried. One algorithm based on this method is forward checking (cf. Section 4.3.1). – Retrospective checks: New values are checked for consistency with values that have already been assigned. If an inconsistency is detected, a different value is tried. The most basic algorithm using this technique is known as backtracking (cf. Section 4.3.2).

£ Heuristic information: Heuristic knowledge is different from problem specific information because it is not used to reduce the problem before actually solving it. Instead, the

2Try to solve the cryptarithmetic puzzle from Section 2.2.2 without using a pen and paper! 3However, there exist successful algorithms that are based on the random generation of candidates [Minton et

al. 1992, p. 3, Las Vegas algorithm for n-queens problem]. Problems where most of the candidates are actually solutions, i.e., only a small portion is not allowed by constraints, can easily be solved by randomly picking a candidate until a solution is found.

search process itself is modified. Miguel & Shen [2001a, Section 2] list some of the general heuristics. Consider again the cryptarithmetic puzzle from Section 2.2.2 and Figure 2.1 on page 17 (repeated as Figure 4.2). People tend to start with variables for which there is a lot of constraining information available in order to narrow down the set of possible values for these variables. From these islands of relative certainty, they then try to proceed to the next variable. The general principle applied is to re-order the constraint variables so that problem solving becomes easier. While this rearrangement of variables can have a great impact on the efficiency, solutions cannot be cut off nor does the model have to change.

£ Heuristic search: Sometimes the set of possibilities is so large that it is prohibitive to look at it systematically. However, a problem solver might be willing to apply a rule of thumb and use his experience to make local decisions without being sure that the globally correct branch is taken. A traveling salesperson, for example, who wishes to minimize the length of a round trip through a number of cities, might assume that he or she has to visit two nearby cities directly one after the other because experience shows that it is a good strategy (or heuristic) in most cases. Note, however, that the truly shortest path may include an arbitrary number of intermediate cities between the two nearby ones. In contrast to heuristic information which simply changes the way the search space is traversed, heuristic search really cuts away branches which might contain the solution. Therefore, heuristic search is neither sound nor complete. Real-world problems are nevertheless often solved this way. Reasons why this is acceptable include absence of complete information and the readiness to live with outcomes that are close to the overall best solution.
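The benefit of a problem specific model can be quantified with a small sketch (Python, purely illustrative; the board size n = 4 and all identifiers are chosen for the example): in the naïve model every queen ranges over all n² fields, whereas the informed model fixes queen i to its own column so that only the row remains to be chosen.

    from itertools import product

    def safe(placement):
        # placement: (column, row) pairs; no two queens may share a row,
        # a column or a diagonal.
        return all(r1 != r2 and c1 != c2 and abs(r1 - r2) != abs(c1 - c2)
                   for i, (c1, r1) in enumerate(placement)
                   for (c2, r2) in placement[i + 1:])

    n = 4
    # Naive model: each of the n queens may sit on any of the n*n fields.
    fields = [(c, r) for c in range(n) for r in range(n)]
    print(sum(safe(p) for p in product(fields, repeat=n)))              # 48 hits in 65536 candidates
    # Problem specific model: queen i sits in column i, only the row is chosen.
    rows_choices = product(range(n), repeat=n)                          # only 256 candidates
    print(sum(safe(list(enumerate(rows))) for rows in rows_choices))    # 2 distinct solutions

The naïve model not only searches a vastly larger candidate space, it also reports every solution n! times because the interchangeable queens are treated as distinguishable.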

(a) the puzzle SEND + MORE = MONEY, (b) the puzzle with carry variables C3, C2, C1 (here 0, 1, 1), (c) the solution 9567 + 1085 = 10652.

Figure 4.2: A cryptarithmetic puzzle is a typical CSP problem (repeated from Figure 2.1 on page 17). Humans typically start by narrowing down the set of possible values for the letter M because it seems to be highly constrained in a direct way. After proving that only one or a small number of values are possible for the letter M, this additional (formerly implicit but now unveiled) information is used to proceed to the next variable.

The importance of, e.g., interleaved constraint checks and application of heuristic information is demonstrated by two Prolog [Clocksin & Mellish 1987] programs in Figure 4.3 on the next page which both solve the cryptarithmetic puzzle of Figure 4.2. Prolog generally employs a top-down left-to-right systematic depth-first search (cf. Figure 4.5 on page 85 in Section 4.3.2). In order to activate the retrospective checks, the corresponding clauses have to be textually interleaved with clauses which generate choice points. Heuristic information is applied by means of variable re-ordering: The variables S and M which seem to be constrained in the most direct way are tried first. Table 4.1 compares the two procedures with respect to the required computational resources.

send_more_money_naive(S,E,N,D,M,O,R,Y) :-
    % First generate a complete solution candidate...
    member(S, [0,1,2,3,4,5,6,7,8,9]),
    member(E, [0,1,2,3,4,5,6,7,8,9]),
    member(N, [0,1,2,3,4,5,6,7,8,9]),
    member(D, [0,1,2,3,4,5,6,7,8,9]),
    member(M, [0,1,2,3,4,5,6,7,8,9]),
    member(O, [0,1,2,3,4,5,6,7,8,9]),
    member(R, [0,1,2,3,4,5,6,7,8,9]),
    member(Y, [0,1,2,3,4,5,6,7,8,9]),
    % ... then apply the general uniqueness constraint
    no_duplicates([S,E,N,D,M,O,R,Y]),
    % ... and finally the problem specific constraint.
    0 is S*1000+E*100+N*10+D + M*1000+O*100+R*10+E - (M*10000+O*1000+N*100+E*10+Y).

no_duplicates([E|Es]) :-
    not(member(E, Es)),
    no_duplicates(Es).
no_duplicates([]).

(a) Naïve generate-and-test procedure

send_more_money_interleaved_heuristic(S,E,N,D,M,O,R,Y) :-
    % Start with directly constrained variables,
    member(S, [0,1,2,3,4,5,6,7,8,9]),
    % ... apply the uniqueness constraint as soon as possible
    member(M, [0,1,2,3,4,5,6,7,8,9]), not(member(M, [S])),
    member(C3, [0,1]),
    % ... and also utilize specific constraints early.
    M is (S+M+C3) // 10, O is (S+M+C3) mod 10,
    member(O, [0,1,2,3,4,5,6,7,8,9]), not(member(O, [M,S])),
    member(E, [0,1,2,3,4,5,6,7,8,9]), not(member(E, [M,S,O])),
    member(C2, [0,1]),
    C3 is (E+O+C2) // 10, N is (E+O+C2) mod 10,
    member(N, [0,1,2,3,4,5,6,7,8,9]), not(member(N, [M,S,O,E])),
    member(R, [0,1,2,3,4,5,6,7,8,9]), not(member(R, [M,S,O,E,N])),
    member(C1, [0,1]),
    C2 is (N+R+C1) // 10, E is (N+R+C1) mod 10,
    member(D, [0,1,2,3,4,5,6,7,8,9]), not(member(D, [M,S,O,E,N,R])),
    C1 is (D+E) // 10, Y is (D+E) mod 10,
    member(Y, [0,1,2,3,4,5,6,7,8,9]), not(member(Y, [M,S,O,E,N,R,D])).

(b) Interleaved constraint checking and variable re-ordering

Figure 4.3: Two Prolog programs for the example cryptarithmetic puzzle.

The following section looks at different solution methods in greater detail and discusses how the general intuitive techniques described in this section can be realized in precise algorithms.

              CPU Seconds      Inferences   # Solutions
naïve            3,332.33   1,927,424,291            25
interleaved          0.03          14,322            25

Table 4.1: Interleaving constraint checks with variable assignments as well as re-ordering of variables can have a great impact on the efficiency of the algorithms. The table shows CPU time (on a SUN Ultra5 360 Mhz workstation) and number of logical inferences needed by the SWI Prolog system [Wielemaker 2000] as well as the number of solutions produced for the example programs from Figure 4.3 on the preceding page.

4.3 Algorithms for (partial) constraint satisfaction problems and constraint optimization

While some aspects from the previous section are problem specific so that one cannot deal with them in general, various algorithms have been developed based on general insights in CSP solving.

4.3.1 Consistency techniques

Consistency or filtering algorithms achieve a state of consistency by removing those values from the domains of variables that are incompatible with some other ‘neighboring’ variables. Generally, the problem becomes smaller and easier to solve when fewer values have to be considered because they were previously removed. By proving local inconsistency for the removed values, it is guaranteed that the reduced problem is equivalent to the original problem with regard to the solutions. However, a consistent problem may still have no, one or more solutions so that other techniques must finally be used to solve the now smaller problem (see below for other methods). The notion of local consistency was first introduced by Waltz [1975] who was looking for a way to reconstruct the three-dimensional composition of a blocks’ world scene from a simple line drawing (see Figure 4.4 on the following page for an example). Each line (variable) should be tagged with a label that describes its type (value), e.g., whether an edge divides an object from the background or whether the angle at that edge is concave or convex. There exist only a limited number of junctions so that not all combinations of edge types can actually occur (constraints). Local consistency has been described in more general frameworks, and algorithms to achieve local consistency of different degrees have been designed [Mackworth 1977]. The following definition

describes local consistency formally.

Definition 4.1 A set of constraint variables is k-consistent if for each set of k−1 variables that have been assigned values that satisfy all constraints among them, it is possible to find a value for an arbitrary k-th variable such that all constraints among these k variables are satisfied. A set of variables can be k-consistent without being (k−1)-consistent [Freuder 1988, p. 349]. If the set is j-consistent for all j ≤ k then it is strongly k-consistent.

(Figure: a line drawing of blocks with edges labeled + for convex, − for concave, and arrows for obscuring edges.)

Figure 4.4: Reconstruction of three-dimensional information from line drawings by constraint satisfaction. The edges of two objects have already been classified; the remaining one in the middle is truly ambiguous.

Strong 1-consistency is known as node consistency, strong 2-consistency as arc consistency and strong 3-consistency as path consistency.

There are algorithms that make an arbitrary CSP k-consistent by removing inconsistent values from the domains [Freuder 1985]. For instance, the temporal and spatial complexity of the optimal arc consistency algorithm AC-4 is O(e·d²), while PC-4 achieves path consistency with a temporal and spatial complexity of O(n³·d³) [Mohr & Henderson 1986], where e is the number of constraints, n the number of variables and d the domain size. Generally, the consistency methods for path consistency and above are too expensive in practice [Miguel & Shen 2001b, p. 261]. Freuder [1985] gives a general algorithm for k-consistency, but it has exponential complexity.
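To make the filtering step concrete, the following sketch shows a queue-based arc consistency routine in the style of AC-3 (illustrative code only, not the AC-4 or PC-4 procedures cited above; the problem representation and all names are our own):

from collections import deque

def ac3(domains, constraints):
    """Enforce arc consistency on a binary CSP by deleting unsupported values.

    domains:     dict mapping each variable to a set of admissible values
    constraints: dict mapping an ordered pair (x, y) to a predicate p(vx, vy)
                 that is True iff the two values are compatible
    Returns False if some domain becomes empty, i.e., no solution can exist.
    """
    queue = deque(constraints)                     # all arcs still to be revised
    while queue:
        x, y = queue.popleft()
        compatible = constraints[(x, y)]
        # values of x without any supporting value left in the domain of y
        unsupported = {vx for vx in domains[x]
                       if not any(compatible(vx, vy) for vy in domains[y])}
        if unsupported:
            domains[x] -= unsupported              # the local filtering step
            if not domains[x]:
                return False
            # arcs pointing at x must be revised again: their support may be gone
            queue.extend(arc for arc in constraints if arc[1] == x)
    return True

# Toy problem: three pairwise different variables over at most two values
domains = {"a": {1, 2}, "b": {1, 2}, "c": {1}}
different = lambda u, v: u != v
constraints = {(x, y): different for x in domains for y in domains if x != y}
print(ac3(domains, constraints))                   # False: filtering alone proves unsolvability

The reduced (or, as here, emptied) domains are exactly the kind of problem shrinkage that makes a subsequent search step cheaper.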

Definition 4.2 An ordered CSP is a CSP with a total order on the constraint variables. The width of a variable in an ordered CSP is the number of constraints between that variable and variables that precede it in the ordering. The width of an ordering is the maximum width of the variables and the width of a CSP is the minimum width of all orderings.

A theorem by Freuder [1982] states that a strongly k-consistent problem that has a width smaller than k can always be solved without backtracking.4 The problem with this approach is that achieving k-consistency is, for larger k, very expensive, even more expensive than simple backtracking. That means that while it is generally infeasible to solve larger CSPs through consistency techniques alone, they are well suited for reducing the problem size before actually applying a search algorithm (cf. Section 4.3.3). One particular problem with consistency-based solution methods and PCSPs is that it is often not obvious what incompatible means in this context. Since occasional constraint violations have to be tolerated, a single conflict is not a sufficient condition for eliminating a possible value (cf. Section 4.6).5

4The original algorithm of Waltz [1975] achieves strong 2-consistency while the problem of identifying the types of lines in line drawings (cf. Figure 4.4) has a width of 2 even for moderately complicated scenes. Therefore at least strong 3-consistency is required to guarantee a backtrack-free extraction of solutions [Freuder 1988, p. 352].
5Compare the notion of arc consistency counts in Miguel & Shen [2001a, 5.2.1].

4.3.2 Retrospective techniques

Retrospective search algorithms successively assign values to constraint variables and maintain a consistency measure for the values assigned so far. Their key characteristic is that they always look at the assignments made so far (and not at those that have to be made in the future) and try to avoid unnecessary work when it becomes clear that the partial assignment under consideration cannot lead to a global solution. Backtracking is the simplest such technique. The order in which the variables are assigned is fixed and whenever a freshly chosen value emerges as inconsistent with the previous bindings, the latest assignment is reverted and an alternative value is tried. If no more alternatives for a given variable exist, the value assignment for the preceding variable is questioned. The time complexity is exponential, of course, but the algorithm only needs space linear in the number of variables because only a single path is considered at any one moment. This simplest variant is known as chronological backtracking because the assignments are reverted in the exact reverse chronological order. Branch-and-bound is the analogue to backtracking for the PCSP case. While a search path in a crisp CSP can be truncated when the first constraint violation is found, a certain degree of inconsistency has to be accepted in the PCSP case. Only if the path falls short of a lower bound of consistency can it be ignored. A lower bound may be defined externally and can be updated, i.e., increased, whenever a new complete solution has been reached. Backtracking and branch-and-bound in their simplest forms both traverse the search tree in a depth-first manner which is the reason why their spatial complexity is linear. However, both techniques can also be used with a breadth-first or best-first agenda strategy.
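The following sketch makes both techniques tangible for a PCSP (purely illustrative code, not the Netsearch procedure of Section 4.8; variable order, problem representation and names are our own). It performs chronological branch-and-bound over the number of violated binary constraints; with the initial bound set to 1 it degenerates to plain chronological backtracking, because every partial assignment with a violation is pruned immediately.

def branch_and_bound(variables, domains, constraints, initial_bound=float("inf")):
    """Find an assignment that minimizes the number of violated binary constraints."""
    best = {"cost": initial_bound, "assignment": None}

    def cost(assignment):
        # violations among the variables assigned so far (monotone in the search depth)
        return sum(1 for x, y, ok in constraints
                   if x in assignment and y in assignment
                   and not ok(assignment[x], assignment[y]))

    def extend(i, assignment):
        if cost(assignment) >= best["cost"]:        # bound: the cost can only grow
            return
        if i == len(variables):                     # complete assignment reached
            best["cost"], best["assignment"] = cost(assignment), dict(assignment)
            return
        var = variables[i]                          # fixed variable order
        for value in domains[var]:                  # fixed value order
            assignment[var] = value
            extend(i + 1, assignment)
            del assignment[var]                     # revert the latest assignment

    extend(0, {})
    return best["assignment"], best["cost"]

# Over-constrained toy problem: three variables, two colors, all pairwise different
variables = ["u", "v", "w"]
domains = {x: ["red", "green"] for x in variables}
different = lambda a, b: a != b
constraints = [("u", "v", different), ("v", "w", different), ("u", "w", different)]
print(branch_and_bound(variables, domains, constraints))   # one violation is unavoidable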

One and the same search tree is traversed in different orders depending on the agenda strategy of the search procedure. The numbers next to the node identifiers denote the number of violated constraints up to that point (A: 0, B: 1, C: 3, D: 2, E: 6, F: 4, G: 5, H: 5, I: 7, J: 4, K: 6).

Breadth-first: A B C D E F G H* I J* K
Depth-first: A B D H* !! E# !! C F J* !! G#
Best-first: A B D C F J* H G# E#

An asterisk '*' marks a new solution state, an exclamation mark '!' indicates backtracking and a hash sign '#' signals that the search space has been pruned, i.e., a whole branch of the search tree has been cut off.

Figure 4.5: Three agenda strategies for systematic tree search: breadth-first, depth-first and best- first.

Figure 4.5 gives examples of the sequences of visited nodes while traversing the search tree with the three named strategies. Obviously, the linear space bound is sacrificed when using a breadth-first or best-first strategy since the agenda must hold an exponential number of nodes in the worst case. The breadth-first strategy guarantees that the 'smallest' solution is found first, which is important when search trees with varying depths for the leaf nodes or even cyclic search graphs are considered. This is not the case in standard constraint satisfaction problems since the number of variables determines the uniform depth of the search tree. A best-first strategy always selects the most promising internal nodes for the next node expansion; therefore detours can be avoided when the search space is 'well-behaved' (compare the number of nodes visited in Figure 4.5 on the page before). These two basic algorithms have been extended by Freuder & Wallace [1992]. Backjumping, for instance, tries to avoid duplicate work when a failure somewhere in the search tree has its ultimate cause in an assignment much higher in the search tree. The algorithm then jumps directly to that variable instead of the preceding one. Backmarking, on the other hand, tries to minimize the number of constraint checks by maintaining additional data about the level of backtracking and re-using results from previous constraint checks (cf. Section 4.8).

4.3.3 Mixed techniques

The two approaches to solving a CSP that have been introduced so far can be combined and, in fact, a whole spectrum of algorithms has been devised, each of which combines a systematic tree search algorithm with intermediate filtering steps which ensure a certain degree of local consistency. Nadel [1988], for instance, compares the performance of six algorithms that successively include a higher degree of consistency: generate-and-test, simple backtracking, forward checking, partial lookahead, full lookahead and really full lookahead. Nudel [1983] investigates the expected complexity of such algorithms and derives specific heuristics for different classes of CSP. Forward checking [Haralick & Elliot 1980] is particularly simple and successful at the same time: Each time a value is assigned to a variable, the domains of all future, i.e., yet unassigned, variables are checked for compatibility with the current value. If at least one domain contains no consistent value, the current assignment is reverted and backtracking starts. Freuder & Wallace [1992] compare the performance of a set of mixed-techniques algorithms based on a number of artificially designed partial constraint satisfaction problems.
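A minimal sketch of the forward checking step just described (illustrative code under our own problem representation, not the implementation studied in the cited work): after each assignment the domains of the future variables are filtered against the new value, and the branch is abandoned as soon as one of them runs empty.

import copy

def forward_checking(variables, domains, constraints, assignment=None):
    """Return one consistent complete assignment of a crisp binary CSP, or None.

    constraints: dict (x, y) -> predicate(vx, vy); both orderings are listed.
    """
    assignment = dict(assignment or {})
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)     # static ordering
    for value in domains[var]:
        future_domains = copy.deepcopy(domains)
        future_domains[var] = [value]
        dead_end = False
        for future in variables:
            if future == var or future in assignment:
                continue
            compatible = constraints.get((var, future))
            if compatible is None:
                continue
            # keep only values that are compatible with the fresh assignment
            future_domains[future] = [w for w in future_domains[future]
                                      if compatible(value, w)]
            if not future_domains[future]:
                dead_end = True                     # some future domain ran dry
                break
        if not dead_end:
            result = forward_checking(variables, future_domains, constraints,
                                      {**assignment, var: value})
            if result is not None:
                return result
    return None

# Map-coloring fragment: three mutually adjacent regions, three colors
variables = ["WA", "NT", "SA"]
domains = {v: ["red", "green", "blue"] for v in variables}
different = lambda a, b: a != b
constraints = {(x, y): different for x in variables for y in variables if x != y}
print(forward_checking(variables, domains, constraints))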

4.3.4 Heuristic search

Heuristic search [Selman 1999] is fundamentally different from systematic search in that it does not explore the search space systematically. The main reason is that the search space is not a tree as in tree search (hence, the name) but a graph, which leads to greatly increased costs for keeping records about which parts of the search space have been visited. Mostly it is even intractable to store complete information about the seen parts, so that no systematic and complete exploration is feasible. The search states in heuristic search are complete assignments or solution candidates. This is in contrast to tree search where search states consist of partial assignments. Heuristic search methods move from the current solution candidate to that solution candidate among all candidates in the neighborhood which seems most promising with regard to a specific measure or heuristic. Since the next search state is always a neighbor of the current search state, the general search scheme is also called local search. The decision of which neighbor to move to depends on the variant of local search [Hromkovič 1998, p. 177]: steepest ascent selects the neighbor that improves the assessment by a maximal amount while greedy search takes the first improving step that can be found.
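A compact sketch of the min-conflicts idea for the n-queens problem that also appears in Figure 4.6 (an illustrative re-implementation with invented names, not the code of Minton et al.; note that Figure 4.6 uses a column-swap neighborhood, whereas the sketch below uses the more common variant that moves a queen within its column):

import random

def min_conflicts_queens(n, max_steps=10_000):
    """One queen per column; board[col] is the row of the queen in that column."""
    board = [random.randrange(n) for _ in range(n)]    # complete random assignment

    def conflicts(col, row):
        # number of other queens attacking the square (col, row)
        return sum(1 for other in range(n) if other != col
                   and (board[other] == row
                        or abs(board[other] - row) == abs(other - col)))

    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(c, board[c]) > 0]
        if not conflicted:
            return board                               # perfect solution found
        col = random.choice(conflicted)                # pick a variable in conflict
        # move its queen to the row with the fewest conflicts (min-conflicts step)
        board[col] = min(range(n), key=lambda row: conflicts(col, row))
    return None                                        # step budget exhausted / stuck

print(min_conflicts_queens(8))

The external step budget is exactly the kind of meta-control discussed below: without it, the purely local moves give no termination guarantee.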

[Figure: sequence of 8-queens boards with 28, 16, 8, 4, 3 and 1 conflicts, together with the column swaps chosen at each step.]

Figure 4.6: Example of how min-conflicts tries to solve the 8-queens problem: Starting from an arbitrary (here regular) assignment, the best improving swap of two columns (marked by an asterisk) is chosen at each time. However, after swapping two columns five times the process is stuck in a local optimum and no progress can be made although the last assignment is not a valid solution.

Figure 4.6 illustrates the progress of the algorithm for the 8-queens problem. Since the method always greedily selects the locally most attractive move and only allows improving steps, the process can still be caught in local optima. Some variations of the basic algorithm have been proposed. While walk-minconflicts adds random (possibly degrading) steps, retry-minconflicts occasionally restarts the process from a random assignment. The search space is explored unsystematically by heuristic search procedures and therefore it is hard to provide a termination criterion except when a perfect solution has been found. Usually an external termination criterion must be met. Hence no generally valid analysis of the worst-case complexity can be made without reference to the employed termination criterion. The time complexity of the local search procedure itself of course depends on the definition of the neighborhood and the nature of the scoring function. Therefore, even the worst-case complexity of local search alone is exponential in general. The art of designing a successful heuristic algorithm includes the definition of a neighborhood that is small enough to be searched in polynomial time and that is large enough so that the global optimum can be reached from arbitrary points in the search space in a reasonable amount of time (cf. Hromkovič [1998, Section 3.6.3]). Since only a single candidate is kept during local search the space complexity is only linear in the number of variables.6

4.4 Evaluation criteria for CSP solution methods

This section develops a set of criteria that can be used to classify the different algorithms for CSP solving with regard to specific application requirements. Besides the classical features of soundness, completeness, termination and complexity we discuss four more properties that determine aspects of the temporal behavior of an algorithm.

£ Efficiency refers to practical runtime in contrast to the theoretical worst case considerations that are examined under the keyword complexity.

£ Anytime algorithms have the ability to dynamically improve the quality of a solution during processing, thus delivering an approximate solution quickly and better alternatives if more time is available.

£ Incrementality is a mode of operation where an algorithm starts while the problem still changes, e.g., before the complete input is available.

£ Parallelism refers to the potential of an algorithm to be split into a number of subtasks that can be executed simultaneously and (relatively) independently on separate processors.

4.4.1 Soundness, completeness and termination

The properties of soundness and completeness are fundamental aspects of algorithmic analysis but have their roots in logical calculus [Schöning 1992a; Schöning 1992b].7

Definition 4.3 A (non-deterministic) algorithm is called sound if it yields only correct results.

Definition 4.4 A (non-deterministic) algorithm is called complete if it yields (at least) every correct result.

Given the above definitions, one might be tempted to demand that all algorithms fulfill these conditions. While this is justified for ‘standard’ algorithms, there are exceptions:

£ Approximations per se only return approximate results, often within some guaranteed bounds. Strictly speaking, the results differ from the ideal, thus preventing soundness. Example: Newton's approximation for finding a zero of a function [Kiyek & Schwarz 1989, p. 237].

6Of course, different meta-heuristics may need additional space, for instance, for a tabu list in tabu search.
7Compare the terms soundness and completeness to the terms of accuracy and recall in Figure 5.1.

£ Probabilistic algorithms can guarantee that the returned result is correct with a certain probability. Nevertheless not all output is correct. Example: Rabin primality test [Rabin 1980].

£ Heuristic search usually cannot guarantee that the returned result is correct and has even more difficulty enumerating all correct results. Example: cf. Figure 4.6 on page 87.

Definition 4.5 An algorithm is terminating if it finishes its computation on arbitrary input in finite time.

We refrain from requiring an algorithm to be terminating (as is sometimes done) since there exist algorithms that are designed to run forever, e.g., operating system procedures or monitoring software. Algorithms that are not terminating can be controlled by a kind of meta-control mechanism, e.g., a timeout.

4.4.2 Complexity and efficiency

Complexity theory [Schöning 1992b, for instance] determines upper and lower bounds for the amount of computational resources (depending on the input length n) that are required for specific problems or algorithms. Standard computational resources are processing time or processing steps and memory.

Since the general CSP is an NP-complete problem (cf. Section 2.8.2), it can be assumed that every sound and complete algorithm has an exponential worst-case time complexity. Of course, this does not mean that algorithms that fail to meet the soundness or completeness criteria also have an exponential time complexity. For practical systems, it is often more important how an implemented algorithm behaves on real input data. We will use the term efficiency for the practical runtime of alternative algorithms. Obviously, runtime can only be used to compare solution procedures if the input data and the experimental settings are kept constant.

4.4.3 Anytime property

The term anytime algorithm 8 [Hertzberg 1995, for instance] was coined by Dean & Boddy [1988] for application in the domain of time-dependent planning. They identify three characteristic properties for this class of algorithms:

“The most important characteristics of these algorithms are that (i) they lend them- selves to preemptive scheduling techniques (i.e., they can be suspended and resumed with negligible overhead), (ii) they can be terminated at any time and will return some answer, and (iii) the answers returned improve in some well-behaved manner as a function of time.” —— Dean & Boddy [1988, p. 52]

8Horvitz [1987] uses the term flexible computations for a similar concept.

Obviously, the first point is less useful for purely computational processes because modern operating systems already interrupt processes preemptively with negligible overhead on a very frequent basis. The authors [Dean & Boddy 1988] distinguish different kinds of compromises between complexity and the standard measure for algorithmic quality, i.e., correctness:

£ Monte Carlo simulations randomly generate data samples using the presumably 'true' probability distribution and derive the results by generalization from these samples. They are likely correct and guaranteed polynomial. Example: Data-oriented parsing [Bod 1995; Bod 1998].

£ Las Vegas algorithms are randomized algorithms that are guaranteed correct and likely polynomial but exponential in the worst case, i.e., randomization concerns runtime, not correctness. Example: Las Vegas algorithm for the n-queens problem [Minton et al. 1992, p. 3].

£ An approximation is guaranteed optimal within some error margin and guaranteed polynomial. Example: Newton's approximation for finding a zero of a function [Kiyek & Schwarz 1989, p. 237].

£ An anytime algorithm can guarantee to return a (not necessarily correct) solution in the time available.

Two prerequisites for anytime algorithms can be identified (cf. Figure 4.7 on the facing page).

1. The (time independent) utility or value of a computation result must gradually increase with increasing quality (cf. Figure 4.7a on the next page).
2. The value of a solution decreases or is discounted with increasing computation time (cf. Figure 4.7b on the facing page).

Under these circumstances, the utility reaches an optimum at a certain point in time, at which the computation should be stopped at the latest (cf. Figure 4.7d on the next page). Deliberation scheduling [Boddy & Dean 1994], i.e., meta-reasoning about how much of the resources one should spend for computation and how much for action in dynamic environments, concerns the distribution of available computational resources to reasoning procedures depending on what kind of impact on the system's performance can be expected from these computations. In contrast to their original presentation [Dean & Boddy 1988], the authors make clear that they focus on probabilistic performance profiles:

“Instead, we employ algorithms for which the expected value increases monotonically. For some algorithms, there is no difference: a probabilistic algorithm for identifying large prime numbers has a “value” that is strictly increasing, in the sense that the probability of a correct identification is guaranteed to improve by a known amount with increasing computational effort. [...] The statistics are summarized in what are called performance profiles... [...] Knowing the expected value is in general insufficient; there are situations where what is needed is a complete probability distribution.” —— Boddy & Dean [1994, pp. 254-255] 4.4 Evaluation criteria for CSP solution methods 91

[Figure panels: (a) value as a function of quality, (b) discount over time, (c) quality over time, (d) value over time.]

Figure 4.7: A utility that grows with the quality of the solution (a) and a discount that shrinks over time (b) are prerequisites for anytime algorithms. An anytime performance profile (c) relates the solution quality to runtime so that an optimal point in time for the termination of the computation can be determined (d).
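The interplay of the four curves in Figure 4.7 can be made concrete with a small numeric sketch (purely illustrative; the particular quality and discount curves are invented and the multiplicative combination is only one simple choice): given a performance profile q(t) and a discount d(t), the time-dependent utility is value(q(t)) · d(t), and the best stopping point is its maximum.

import math

def optimal_stopping(quality, value_of_quality, discount, horizon=1.0, steps=100):
    """Return (t, utility) maximizing value_of_quality(quality(t)) * discount(t)."""
    best_t, best_u = 0.0, float("-inf")
    for i in range(steps + 1):
        t = horizon * i / steps
        u = value_of_quality(quality(t)) * discount(t)
        if u > best_u:
            best_t, best_u = t, u
    return best_t, best_u

# Hypothetical curves: quality saturates over time, value grows with quality,
# and results lose value exponentially as time passes.
quality = lambda t: 1.0 - math.exp(-5.0 * t)       # anytime performance profile (c)
value_of_quality = lambda q: q ** 2                # value of a result of quality q (a)
discount = lambda t: math.exp(-3.0 * t)            # time pressure (b)

print(optimal_stopping(quality, value_of_quality, discount))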

The latter point is illustrated by Figure 4.8 on the following page. Although both probability distributions exhibit the same mean, they cannot be considered equivalent in the presence of a deadline. Solutions after the deadline have no value and the distribution with greater variance assigns a greater probability mass to those events (shown as the grey area in Figure 4.8 on the next page). A related point is made by Menzel [1995] by distinguishing between weak and strong anytime algorithms. For weak anytime algorithms, performance profiles are instance-specific, i.e., a probabilistic profile for them can be given only on the basis of a given set of input instances (which might be more or less representative). Strong anytime algorithms, on the other hand, have performance profiles that are generally valid for all kinds of input. Figure 4.9 on the following page shows why caution must be exercised when dealing with probabilistic performance profiles. Probabilistic performance profiles only describe the expected quality after a certain time span. Russell & Wefald [1991] give a cognitive motivation for limited rationality under bounded resources:

“For us, the promise of AI lies in explaining how it might be possible for humans, computers, and other animals, with their slow and tiny calculating equipment, to

[Figure: two Gaussian probability density curves with equal mean but different variances, plotted over a time axis from 1 to 9.]

Figure 4.8: Two Gaussian distributions with the same mean but different variances illustrate that the mean as a criterion for performance profiles is not enough in general if one considers absolute deadlines (adapted from Boddy & Dean [1994]).

[Figure: three-dimensional plot of probability over quality and time.]

Figure 4.9: Probabilistic performance profiles: The expected quality increases monotonically over time (visible as the contour lines in the xy-plane). Better or worse results are nevertheless possible, although with lower probabilities.

cope successfully with a relatively huge world of blooming, buzzing and often urgent confusion. Rather than achieving some absolute standard of performance with un- limited amounts of resources and simple algorithms, intelligence seems linked with doing as well as possible given what resources one has; good design, elegance and efficiency ought to get in somewhere.” —— Russell & Wefald [1991, p. 11-12]

and

“The original formulation of perfect rationality allowed the rational agent to make ‘mistakes’ – that is, choices with undesired actual outcomes – through ignorance. A bounded optimal agent can also make mistakes through stupidity, because it can’t calculate the rational action fast enough.” —— Russell & Wefald [1991, p. 28]

In this chapter, we focus on the question of whether the algorithms per se exhibit the anytime property. Section 5.5 calls this kind of anytime behavior intrinsic in order to distinguish it from extrinsic anytime behavior which is based on meta-control mechanisms or modifications of the basic algorithms.

4.4.4 Incrementality and parallelism

Incrementality is a mode of operation in which the computation starts on a partial problem before the problem is completely available. Although incremental algorithms are often used that way, they are not strictly required to output (partial) results before the input is complete. Typical examples are filter programs, e.g., compressors or word counters, which do not need all of the data to start processing the first items. Incrementality plays an important role in natural language analysis and we will discuss it more thoroughly in Section 4.12.

Parallelism of an algorithm refers to the ability to execute subtasks independently and (potentially) simultaneously. A parallel program can theoretically run n times faster on n processors than on a single one. Practically, this speedup is often limited by additional factors, such as communication overhead and latency. We will only look at the potential of algorithms to be executed in parallel environments. Further experiments are subject to future research.

4.5 Classification of CSP algorithms

This section evaluates the algorithms presented so far with regard to the criteria of the previous section: soundness, completeness, termination, complexity, efficiency, anytime behavior, incrementality and parallelism. We will use a standard academic problem, the traveling sales person (TSP) problem, as an example throughout the section. The generic TSP problem is defined as the task to find the shortest cyclic path through a weighted graph so that every node is visited exactly once. The Euclidean TSP problem assumes that the nodes are points in a Euclidean space and that the weight of an arc between two nodes is their Euclidean distance. The TSP problem is a well-known NP-complete optimization problem; the problem of finding a Hamiltonian circuit is a special case thereof [Hopcroft & Ullman 1979].

The TSP is a typical constraint satisfaction optimization problem but only has a single soft constraint which judges a complete assignment. Although this does not match the WCDG parsing case very well, which uses grammars with hundreds of soft constraints, it is nevertheless well suited for our discussion because, in this section, we are not interested in exactly how an assessment is determined but only in the algorithms, which use a generic evaluation function.

This well-understood example has been chosen for its simplicity and clarity of presentation. Sections 4.6 through 4.9 will give details about WCDG parsing procedures which are experimentally evaluated in Section 4.11.

(a) A TSP problem instance: 100 points in a 100 × 100 Euclidean space.
(b) A random sequence of edges. Distance 5,558.8.
(c) A path resulting from always visiting the nearest yet unvisited city next. Distance 859.3.
(d) A route resulting from successive extension. Distance 790.5.
(e) A path found by repeated local search with node exchange. Distance 1,184.3.
(f) A round trip found by a single local search with edge exchange. Distance 843.2.
(g) A path found by repeated local search with edge exchange. Distance 779.2.
(h) The (probably) shortest round trip. Distance 764.9.

Figure 4.10: A sample Euclidean traveling sales person problem and some solution candidates obtained with different methods. Our still naïve implementation of branch-and-bound was not able to prove optimality for the example TSP problem in reasonable time. There are, however, specialized algorithms that can compute the optimal path even for problems consisting of thousands of cities; see, for instance, the CONCORDE project (http://www.math.princeton.edu/tsp/concorde.html).

Figure 4.10 shows a sample problem instance of the Euclidean TSP and some solution candidates found by different solution methods. In particular, we look at the following algorithms:9

Random sequence: Start with an arbitrary node; choose a random next node; finally go back to first node.

9The software used in this demonstration can be downloaded from http://nats-www.informatik.uni-hamburg.de/~ingo/prg/.

Nearest neighbor: Start with a random node; always choose the nearest yet unvisited node as the next node.

Successive extension: Start with a random edge; choose next node randomly and insert it into the intermediate path so that the resulting path is minimal.

Node exchange: Start with random path; local search: repeatedly exchange two nodes so that the path is minimal (among the reachable paths) after the exchange; re-start with new random path.

Edge exchange: Start with random path; local search: repeatedly exchange two edges so that the path is minimal (among the reachable paths) after the exchange; re-start with new random path.

Biased depth-first search: Conduct a depth-first search as explained in Section 4.3.2 but choose nearby neighbors first at each choice point.

Random sequence is used to find a (weak) lower bound. Nearest neighbor and successive extension are randomized constructive methods that extend partial solution candidates using a heuristic approach. Node exchange and edge exchange are heuristic search variants with different definitions of neighborhood. Biased depth-first search is a systematic tree search method with heuristic value ordering. All methods except the last one are non-deterministic in that they use one or the other random component. Therefore, it makes sense to restart the process, in particular the local search variants, from time to time in order to reach different positions in the search space (multistart approach).

Of course, the systematic search is the only method that is sound and complete, but it also has an exponential time complexity. Figure 4.11 on the following page shows experimental results of the relation between problem size and computational resources needed (time and recursion steps) for the biased depth-first solution procedure. Clearly, the requirements increase exponentially (note the logarithmic scales), which must be expected since the procedure must potentially explore all (n−1)!/2 distinct round trips for a problem of size n. In practice, some branches are cut early during the search whenever it becomes clear that they cannot lead to a solution candidate better than the one already found (cf. branch-and-bound in Section 4.3.2). All other solution methods are incomplete and not sound since they lack a systematic exploration of the search space. Random sequence, nearest neighbor and depth-first search are terminating or can easily be modified to do so. Successive extension, node exchange and edge exchange rely to a higher degree on randomness and may require an external control that terminates the algorithm after a certain period of computation or when a satisfactory solution has been found.
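For concreteness, the edge exchange neighborhood corresponds to the well-known 2-opt move; the following sketch (illustrative code, not the demonstration software from footnote 9) repeatedly reverses a segment of the tour as long as that shortens it.

import math
import random

def tour_length(points, tour):
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(points, tour):
    """Greedy edge exchange: reverse tour segments while an improvement is found."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 2, len(tour) + 1):
                candidate = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_length(points, candidate) < tour_length(points, tour):
                    tour, improved = candidate, True
    return tour

random.seed(0)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(40)]
tour = list(range(len(points)))
random.shuffle(tour)                                   # random initial round trip
print(round(tour_length(points, tour), 1), "->",
      round(tour_length(points, two_opt(points, tour)), 1))

Restarting such a local search from new random tours (the multistart approach mentioned above) corresponds to the 'repeated local search' variants of Figure 4.10.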

4.5.1 Efficiency and anytime behavior

It is practically impossible to prove optimality of a path using depth-first search for anything but small TSP instances. Therefore, one might be willing to accept suboptimal solution candidates if they can be found fast. Figure 4.12 on the next page relates the problem size to the quality, i.e., the length of the path in this example, that can be achieved by different methods in a limited amount of time (here 30 seconds on a 450 MHz Intel Pentium III).

[Figure: log-scale plot of computation time (ms) and number of recursive steps against the number of points (0 to 18) for depth-first search.]

Figure 4.11: Computational resources (number of recursive steps and time needed) vs. problem size during depth-first search. The combinatorial explosion of the search space clearly turns up in the experimental setting. For problem sizes smaller than 8, the accuracy of the time measurement prevents the practical curve from following the theoretical one.

While the random and the node exchange methods perform much worse than the rest, depth-first search and nearest neighbor are intermediate, and successive extension and edge exchange are by far the most efficient methods. It can be seen that depth-first search fails to find optimal solutions in the limited time starting at a problem size of 32 nodes.

Figure 4.13 on page 98 shows how the solutions improve over a period of 60 seconds for the methods successive extension, edge exchange and depth-first search and problem sizes of 64 and 256.10

It can be observed that not only the (intermediate) results are better for non-systematic methods, but that the heuristic variants also improve their results at a much higher rate.

Freuder & Wallace [1992] claim that branch-and-bound is well-suited for anytime applications, and this can be confirmed in a limited way, although the other methods seem much better with regard to their anytime behavior (early first solution, steady improvement, good final results).

“Circumstances may also impose resource bounds. In particular, real time process- ing may require immediate answers, that can be refined later if time allows. The branch-and-bound process is well suited to providing resource-bounded solutions. We can simply report the best solution available when, for example, a time bound is exceeded. The branch-and-bound process is also clearly well-suited to support an anytime algorithm, which can repeatedly provide a “best-so-far” answer when queried. It can quickly provide some answer, with a better one perhaps to follow as time allows.” —— Freuder & Wallace [1992, p. 29]

10The absolute values for the distances in Figures 4.12 on the next page and Figure 4.13 on page 98 differ because the underlying TSP problems have been freshly generated. Of course, the problems have been kept constant within the experiments themselves.

[Figure: path length (200 to 2,000) against problem size (8 to 512 points) for the methods random, nearest neighbor, successive extension, node exchange, edge exchange and biased depth-first.]

Figure 4.12: Quality (length of path; shorter paths are better) of TSP solutions obtained within 30 seconds for various problem sizes and various solution methods. Figures are averaged over multiple runs.

We conclude that for large problems the heuristic search methods can be much more efficient when appropriate neighborhoods are chosen (compare node exchange and edge exchange) and either computation time is a concern or optimality cannot be proven in reasonable time. Not only do local search algorithms return approximate solutions earlier, but the quality of the delivered results is often much better than those of systematic methods. This favorable behavior, according to Minton et al. [1992], is due to the useful information contained in suboptimal complete solution candidates as opposed to partial solutions in systematic tree search. Although our findings for a single example cannot easily be generalized, the above observations carry over quite well to the application of WCDG parsing (cf. Section 4.11).

4.5.2 Incrementality and parallelism

In principle, constraint satisfaction problems are well-suited for incremental solution methods because the set of consistent assignments for a subset of variables is always a superset of the corresponding set in all complete solutions. An intuitive approach to constraint solving that starts by looking at the set of consistent values for a small subset of variables and proceeds by extending this set of variables step by step (Section 4.2) can already be considered incremental because the complete set of variables does not have to be known in advance. A systematic depth-first tree search can also start to process the first variables even if the complete problem is not yet available. In Section 4.12 we will explain why the constraint satisfaction problems that correspond to WCDG parsing problems are more difficult to solve incrementally. Two alternative approaches will be discussed.

Consistency-based methods have already been parallelized [Helzerman & Harper 1992; Harper et al. 1995]. Furthermore, both search variants, systematic and local, are well-suited for parallel processing [Hromkovič 1998, Section 7.5]. For tree search it is conceivable either to keep the agenda, from which subtasks receive nodes for further processing, centralized, or to divide the search tree into disjoint partitions early and let each subtask work on one of them. Local search methods can start from different initial states or diverge from each other by means of different parameterization [Yannakalis 1997].

[Figure: best round-trip length so far against computation time (0 to 60,000 ms) for successive extension, edge exchange and biased depth-first with 64 and 256 nodes.]

Figure 4.13: Improvements of best round trip lengths so far depending on computation time and search method for traveling sales person problems with 64 and 256 nodes. Only improving steps are shown, i.e., the biased depth-first method, for instance, has not been able to improve the result for the 64 nodes problem after 18,310 ms. Figures have been averaged over five runs.

4.5.3 Summary

Table 4.2 on the next page summarizes the properties of the three classes of solution methods: consistency-based algorithms, systematic tree search and heuristic search. Note that the consistency-based algorithms do not completely solve the CSP problem but only reduce its size (cf. Sections 4.3.1 and 4.6). Furthermore, they cannot handle soft constraints very well (cf. Section 4.6).

Criterion | Consistency | Systematic search | Local search
Soundness | yes | yes, as long as an unlimited agenda is used; often this is not the case for efficiency reasons | no
Completeness | yes | yes, same restriction as for soundness | no
Termination | yes | yes | external
Complexity | polynomial: O(e·d²) for arc consistency in general and O(n⁴) for CDG (incl. SLCN) | exponential | depends on neighborhood definition and commitment strategy; largely externally controlled
Efficiency | very fast | varying; slow for very large problems; partially shows exponential worst-case complexity in practical cases | moderate; termination criteria not clear; quickly finds a first solution
Anytime property | not relevant | theoretically yes, but depends on agenda strategy and sparseness of soft solutions | yes
Incrementality | not relevant | well-suited in general, but difficult for parsing (cf. Section 4.12) | not directly
Parallelism | yes | various realizations possible | various realizations possible

Table 4.2: Summary of evaluation of algorithmic classes for CSP solving.

4.6 Consistency-based parsing

Consistency-based methods are the algorithms of choice for both Maruyama [1990b], the 'founder' of CDG, and the researchers at Purdue University [Harper et al. 1994; Harper & Helzerman 1994; Harper & Helzerman 1995; Harper et al. 1999b]. Both restrict themselves to crisp constraints and propose a limited number of representational levels besides the syntactic one, mainly for modeling existential conditions. Their motivation, of course, is to keep the CSP relatively small and hence the computation efficient.

While the suggestion of Maruyama [1990a] only covered the analysis of sequences of words where all words have unique unambiguous lexical entries, Helzerman & Harper [1996] extend the algorithms to so-called spoken language constraint networks (SLCN) which can handle word graphs and lexical ambiguity [Zoltowski et al. 1992; Harper et al. 1999b]. The proposed algorithm achieves arc consistency for constraint networks where only a subset of the variables are part of the final solution, thus making it possible to ignore superfluous words in the input. The worst-case complexity is O(n⁴),11 where n is the number of words in the input. This seemingly increased complexity, when compared to that of arc consistency in general (cf. Section 4.3.1, where the number of constraint variables enters instead), is due to the fact that words are needed as both variables and values in CDG.

Harper et al. [1999a] use SLCN as a post-processing component for a speech recognizer in order to filter out ungrammatical sentences from word graphs. For this application, the use of crisp constraints is appropriate and the system works satisfactorily, although the consistency-based methods are not always able to reduce the remaining ambiguity enough [Harper et al. 1995]. Helzerman & Harper [1992] developed a massively parallel version of their algorithms.

As soon as one looks at soft constraints, however, a fundamental difficulty emerges. Consistency methods heavily rely on the notion of local consistency, a concept which refers to the isolated compatibility of the domains of a single constraint variable, a pair of variables or more variables (cf. Section 4.3.1). For hard constraints, it is easy to define that domains are consistent when for each value at least one combination exists so that no constraints are violated. For soft constraints, however, such a definition is much more problematic. Value combinations are now assessed with a numeric score between zero and one. In general, only a score of zero justifies the deletion of values from domains, which in turn leads to an insufficient reduction of the domains because almost all values receive a score greater than zero when a lot of soft constraints are used.12 Alternative definitions of inconsistency have been tried for the treatment of soft constraints, but with limited success [Menzel & Schröder 1998c].

£ The simplest form uses an external absolute threshold for the assessment of values and value tuples. Of course, this is only a very coarse approximation since firstly, the absolute score is nearly irrelevant when looking for the best alternative, and secondly, even a value that has a low score locally can be part of the overall solution. Additionally, it is difficult for a user to determine reasonable thresholds.

£ A slightly better strategy is to remove a certain percentage of the values from the domain of a variable since it at least considers which direct alternative values are available and does not blindly assume an absolute bound.

£ A similar strategy for value pairs compares the assessment of a value pair to the assessments of pairs with one value replaced by an alternative one. The same objections as before hold.

£ A variant of the previous strategy computes the sum of squares of the assessments of similar value pairs [Schröder 1996].

No attempted strategy is satisfactory for soft constraints, i.e., either the heuristic deletes values that it should not remove, or the ambiguity remains too high.

11More precisely, the complexity of achieving arc consistency for an SLCN contains an additional factor for the number of constraints in the grammar.
12de Givry & Verfaillie [1997] identified the same problem for the general VCSP framework (cf. Section 2.2.4): “The filtering methods, also called constraint propagation or consistency enforcing methods (arc, path, i, i-j-consistency (Freuder 1978)) can be extended to the VCSP framework, but only when the aggregation operator is idempotent.” —— de Givry & Verfaillie [1997, p. 3]

It can be concluded that consistency-based methods are not appropriate for weighted CDG parsing since no reliable local consistency measure can be defined.

4.7 Heuristic consistency: Pruning

The procedure Pruning [Schröder 1997c] is similar to consistency-based methods in that it uses the support a specific value gets from other variables as a criterion for the decision of which values to discard and which to keep. It also runs in a repeated loop where information about deleted values is propagated. However, exactly one value is removed in each iteration and the selection scheme is more flexible so that fine-tuned procedures can be devised that select the next 'doomed' value. Figure 4.14 gives the simple procedure for pruning in pseudo code notation.

1 procedure Pruning
2   while the word graph still admits more than one path or some variable still has more than one value do
3     select a value to be deleted
4     delete that value from the domain of its variable
5   done

Figure 4.14: Procedure Pruning in pseudo code. Values are successively deleted from the domains of variables in a loop based on a selection heuristic.

Obviously, the selection heuristic is the crucial part of the procedure. Simple selection functions only consider the minimum support a value gets from another variable [Menzel 1994]. They combine the mutual compatibilities of the value under consideration and all possible values for another variable. Then the minimum support for each value is determined and finally the value with the least support is selected for deletion. In other words, the value for which at least one variable's values have only low or no support is ruled out as a possible solution. However, these selection functions suffer from the same problem as the consistency-based methods. The context considered is too small for global correctness and the absolute numeric value of an assessment is an unreliable measure. More complicated selection methods take into account at least the difference between lack of support due to grammatical incompatibilities and lack of support due to temporal alternatives in word graphs. But even these fail to analyze word graphs with overlapping edges (cf. Figure 4.15) correctly because they have to approximate the path criterion for efficiency reasons.13 Menzel & Schröder [1998c] describe the issue in greater detail. The selection heuristics generally lie between two extreme points. Either they only consider local information and can thus operate very fast, or more global aspects are regarded at the cost of prolonged computation time.
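A schematic rendering of such a minimum-support selection (illustrative code only; the score function and the way supports are combined are placeholders, not the actual heuristics of the implemented Pruning procedure) that could serve as the selection step inside the loop of Figure 4.14:

def prune_once(domains, score):
    """Delete the single value with the least support from its domain.

    domains: dict variable -> set of values
    score:   function score(x, vx, y, vy) in [0, 1] assessing how compatible
             value vx for variable x is with value vy for variable y
    Returns the removed (variable, value) pair, or None if nothing can be pruned.
    """
    worst = None                                   # (support, variable, value)
    for x, dom_x in domains.items():
        if len(dom_x) < 2:
            continue                               # never empty a domain completely
        for vx in dom_x:
            # support from each other variable: the best score any of its values yields
            supports = [max(score(x, vx, y, vy) for vy in dom_y)
                        for y, dom_y in domains.items() if y != x]
            support = min(supports) if supports else 1.0
            if worst is None or support < worst[0]:
                worst = (support, x, vx)
    if worst is None:
        return None
    _, x, vx = worst
    domains[x].discard(vx)
    return x, vx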

13It seems feasible, however, to extend the pruning method in a similar way as the basic arc consistency algorithm has been extended to word graphs [Helzerman & Harper 1996].

[Figure: word graph over positions 1 to 5 with overlapping edges labelled 'a', 'child', 'these', 'houses' and 'laughs'.]

Figure 4.15: A word graph with overlapping edges.

We can conclude that although the pruning procedure leaves more room for powerful selection schemes, it cannot fulfill the overall requirements of an efficient and reliable solution method. Similar remarks as for the consistency-based methods hold here as well.

4.8 Systematic tree search: Netsearch

The procedure Netsearch is a variant of the systematic tree search algorithm branch-and-bound with best-first agenda strategy and node as well as value ordering for WCDG parsing. Additional checks ensure acyclicity and compliance with the path criterion for word graphs. Figure 4.16 on the next page lists the procedure in pseudo code notation. It uses the following variables:

best: best score found so far; lower bound
solutions: set of (currently) best solution candidates
agenda: contains search states, i.e., partial assignments together with their scores
unassigned: set of yet unassigned variables
score: score of a partial assignment

It selects the correct path in a word graph by assigning the dummy value [ to variables cor- responding to unused arcs and assigns a structure to the words in the unambiguous path by binding appropriate values to the variables. The procedure is sound, complete and fast for small to medium-sized parsing problems. For larger problems, however, it often shows the exponential worst-case complexity. In these cases it is indispensable to limit the size of the agenda so that the search space is pruned based on intermediate heuristic information. As a consequence, the procedure loses its soundness and completeness, not only theoretically but also for real parsing problems. Fünning [2001] extends the procedure to backmarking which does not, however, improve the runtime because the additional administration effort on the meta-level is larger than the time saved through fewer constraint checks (on the basic level). 4.9 Heuristic search: Frobbing and Gls 103

1  procedure Netsearch
2    set best := 0                                        ; best score so far
3    set solutions := {}                                  ; set of solutions
4    set agenda := { empty assignment }                   ; initialize agenda
5    while agenda is not empty do                         ; process agenda
6      get best item (assignment, score) from agenda      ; best first
7      if score < best then continue fi                   ; bound
8      if unassigned is empty then                        ; complete assignment?
9        if score = best then                             ; best so far?
10         add (assignment, score) to solutions           ; equally good
11       else
12         set solutions := { (assignment, score) }       ; better
13         set best := score
14       fi
15     else
16                                                        ; node ordering
17       select a variable var from unassigned            ; try next free variable
18       set unassigned := unassigned \ { var }
19                                                        ; value ordering
20       for all values val in the domain of var do       ; try all values including the dummy value
21         if incompatible(var := val, assignment) then
22           continue
23         fi
24         set assignment' := assignment + { var := val }
25         compute score' of assignment'
26         if score' >= best then                         ; already worse?
27           add (assignment', score') to agenda
28         fi
29       done
30       truncate agenda (if desired)
31     fi
32   done
33   return solutions

Figure 4.16: Systematic tree search procedure Netsearch in pseudo code.

4.9 Heuristic search: Frobbing and Gls

The search states in heuristic search are complete solution candidates, i.e., complete analyses in the parsing case. The process moves from one such candidate to a better one in the neighborhood until no further improvement can be made. Different variants of the basic algorithm can be distinguished based on their definitions of neighborhood and on the means that are employed to escape local optima. The latter techniques are called meta-heuristics.

For WCDG parsing, admissible operations that transform one candidate into a neighboring one include at least the following:

£ Exchange a lexical reading of a word with an alternative one.

£ Change the label of a dependency edge.

£ Change the governor of a word.

£ Choose an alternative edge from the word graph which is not part of the current interme- diate candidate.

Note that some operations entail subsequent modification steps in order to arrive at a feasible candidate again. Choosing an alternative edge, for instance, requires not only changing all assignments that use the old edge but also ensuring that the new edge is part of a complete path through the word graph. In extreme cases this necessitates a complete change of the assignments. Figure 4.17 presents two steps from an intermediate candidate (a) to the final solution (c) for Example E-64 (a typical German garden path sentence).

E-64 Er gibt das Buch der Frau dem Mann.
                       {gen,dat} {dat}
     He gives the book the woman the man.
     'He gives the book of the woman to the man.'

In the first step, the governor of the word Mann and the label of the edge are changed. The second step involves the exchange of the lexical entry for the word Frau (from dative to genitive) and again a switch of an edge label. It can be seen that the intermediate candidate (b) violates at least one hard constraint so that its score falls to zero. It is sometimes said that the local search procedure has to 'walk through a valley' in order to arrive at a new better candidate.14 Different variants of the pure local search algorithm use different means to deal with these frequent local optima in the search space. Two have been realized for WCDG parsing. They employ either a sophisticated neighborhood definition (cf. Section 4.9.1) or a modification scheme for the scoring function itself (cf. Section 4.9.2).

[Figure: three dependency analyses of E-64; (a) a fragmented analysis with score 0.1, (b) an analysis with a duplicate indirect object and score 0.0, (c) the correct analysis without violated constraints.]

Figure 4.17: Heuristic search for CDG parsing: Transforming one analysis into another one.

In order to be able to reasonably compare two solution candidates which both violate hard constraints, we have to generalize the candidate scoring function. Given a weighted constraint dependency grammar WCDG with its weighted constraints and a solution candidate SC, the extended candidate scoring function φ'(SC) is defined as follows:

φ'(SC) = ∏ w(c) − h(SC)

where the product ranges over all soft constraints c that are violated by SC (each with weight w(c) between zero and one) and h(SC) is the number of hard constraints violated by SC.

The idea behind this definition is simple: all soft constraints together account for a range of (0, 1], and for each violated hard constraint the score is decremented by one. The resulting range is (−m, 1], where m is the maximum number of hard constraint violations, which depends on the grammar and the utterance. The score of a solution candidate that violates at least one hard constraint is always lower than the score of an alternative solution candidate that violates only soft constraints, i.e., a number of weak soft constraints can overrule a single not so weak soft constraint, but a hard constraint can never be overruled by soft constraints.

14This is even more important for grammars with interdependent levels and for word graphs as input where one change can trigger a whole set of necessary follow-up changes.
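In code, the extended score can be computed along the following lines (a sketch under stated assumptions: constraints are available as predicates over a complete candidate together with their weights, and hard constraints are marked by the weight 0.0; this is not the data model of the actual parser):

def extended_score(candidate, weighted_constraints):
    """Soft-constraint product minus the number of violated hard constraints.

    weighted_constraints: list of (weight, satisfied) pairs where satisfied(candidate)
    is True iff the candidate fulfills the constraint; weight 0.0 marks hard constraints.
    """
    soft_product, hard_violations = 1.0, 0
    for weight, satisfied in weighted_constraints:
        if satisfied(candidate):
            continue
        if weight == 0.0:
            hard_violations += 1        # each hard violation costs a full point
        else:
            soft_product *= weight      # soft violations scale the product in (0, 1]
    return soft_product - hard_violations

Two candidates that both violate hard constraints can then still be ranked against each other, which is exactly what the local search needs while it 'walks through a valley'.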

Figure 4.18 on the next page lists the general heuristic search procedure in pseudo code notation.

4.9.1 Min-conflicts with tabu list

The procedure Frobbing [Foth, Menzel & Schröder 2000] is a variant of the local search algorithm for WCDG parsing with a sophisticated neighborhood function. It is based on the min-conflict algorithm [Minton et al. 1992, see above], i.e., it primarily considers those solution candidates as neighbors to the current one that manage to remove at least one of the constraint violations.

Additionally it maintains a list of recently removed conflicts that are not allowed to be re-introduced in neighboring states, thus avoiding cycles in the traversal of the search space. This technique is borrowed from tabu search [Glover & Laguna 1997; Glover & Laguna 2000; Hertz, Taillard & de Werra 1997; Domschke, Klein & Scholl 1996], another variant in the local search framework that uses a dynamic, adaptive memory, a so-called tabu list, to store features of search states that are prohibited (or taboo). Of course, depending on the size of the tabu list the linear space complexity of the algorithm may increase.

1  procedure HeuristicSearch
2    set SC := random initial solution candidate          ; start randomly
3    set BSF := SC                                        ; remember best so far
4    while not finished do                                ; external termination criteria
5      repeat                                             ; local search
6        set IMPROVED := false
7        for all SC' neighboring SC do                    ; steepest ascent
8          if φ'(SC') > φ'(SC) then                       ; compares favorably
9            set NEW := SC'                               ; found better state
10           set IMPROVED := true
11         fi
12       done
13       if IMPROVED = true then                          ; better neighbor
14         set SC := NEW                                  ; store new state
15       fi
16     until IMPROVED = false
17     if φ'(SC) > φ'(BSF) then                           ; compare local opt. to best so far
18       set BSF := SC                                    ; found new best state so far
19     fi
20                                                        ; escape local optimum
21     apply meta-heuristics (new SC, modified neighborhood or modified φ')
22   done
23   return BSF

Figure 4.18: Heuristic search in pseudo code: local search + meta-heuristics.

The third extension to pure local search is a careful ranking of neighboring states according to the limits (similar to but more powerful than support in consistency-based algorithms) that are attached to the basic blocks of the solution candidates. Foth [1999] is the authoritative reference to the Frobbing procedure as a WCDG parsing method.

4.9.2 Guided local search

Guided local search [Voudouris & Tsang 1995] takes a different approach to the problem of local optima in heuristic search. Instead of 'walking through the valley', i.e., allowing occasional deteriorating steps, it uses an auxiliary scoring function as the objective function which is initially identical to the candidate scoring function φ' [Voudouris & Tsang 1995, p. 9]. While monitoring the progress, the auxiliary scoring function is modified such that local optima simply disappear for the adapted versions of the auxiliary function. Where Frobbing puts recently resolved conflicts, i.e., formerly violated constraints at a specific position, on a tabu list, here violated constraints are used as solution features that characterize certain properties of search states [Voudouris & Tsang 1995, Section 6.1]. The scoring function is augmented such that states with solution features that have often been observed in local optima are downgraded.

Figure 4.19 on the facing page illustrates how a local minimum is 'filled up' for revised versions of the augmented scoring function h(x) so that the local search can finally escape from the local minimum and find a better candidate.

[Figure: assessment plotted over solution candidates; the original scoring function g(x) and three successively modified versions of the augmented scoring function h(x).]

Figure 4.19: Guided local search: The augmented scoring function h(x) is modified in local optima so that the optimum vanishes for the updated function. Figure adapted from Schulz & Menzel [submitted].

For the final decision of which candidate to return, the original scoring function is of course still available. The procedure Gls employs a guided local search strategy for WCDG parsing but instead of the multiplicative scoring function it uses an additive cost measure as the auxiliary function in order to allow multiple soft constraints to overrule a hard constraint when trying to escape a local optimum. Schulz [2000] is the authoritative reference. Gls cannot analyze word graphs with overlapping edges (such as the one in Figure 4.15 on page 102) at the time of this writing; it is therefore limited to a subset of simpler parsing problems compared to the other methods.
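The bookkeeping at the heart of guided local search can be sketched as follows (a generic formulation over penalized solution features, not the actual Gls implementation by Schulz [2000]; in particular, the original scheme selects the features to penalize by a utility criterion, whereas this sketch simply penalizes all features of the local optimum):

from collections import defaultdict

class GuidedCost:
    """Additive auxiliary cost for guided local search over violated constraints."""

    def __init__(self, base_cost, features, influence=0.3):
        self.base_cost = base_cost        # original cost of a candidate (lower is better)
        self.features = features          # features(candidate) -> set of violated constraints
        self.influence = influence        # weight of the penalty term (the GLS lambda)
        self.penalty = defaultdict(int)   # accumulated penalty per feature

    def augmented(self, candidate):
        """The cost that the local search actually optimizes."""
        return self.base_cost(candidate) + self.influence * sum(
            self.penalty[f] for f in self.features(candidate))

    def escape(self, local_optimum):
        """Called when the search is stuck: penalize the optimum's features."""
        for f in self.features(local_optimum):
            self.penalty[f] += 1          # the optimum is 'filled up' under augmented cost

For the final ranking of candidates the unmodified cost is used, as described above.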

4.10 Implementation

A computer system [Schröder 1997d; Schröder 1997a; Foth, Schröder & Schulz 1998; Foth et al. 1999] for experiments with weighted constraint dependency grammars has been implemented15 that includes the parsing methods described in Sections 4.6 to 4.9. Besides this core functionality it provides facilities to develop and maintain grammars, i.e., sets of weighted constraints, lexica, ontologies and annotated corpora. A user can interact with the system through an integrated command-line tool or a graphical user interface (cf. Figure 4.20 on the following page).16 The core functionality can also be accessed through an application programming interface and a library [Foth 1998a].

15The software, documentation and example grammars can be downloaded from http://nats-www.informatik.uni-hamburg.de/~ingo/dawai/. 16I am grateful to various people who have contributed to the WCDG parsing system. In particular, I would like to mention Kilian Foth and Michael Schulz. The latter also realized the graphical user interface XCDG.

Figure 4.20: Graphical user interface XCDG. The left figure shows the main view with subsections for word graphs, constraint networks, dependency trees, databases of stored analyses and grammar information. A syntactic analysis is pictured in the right figure.

The system has been extensively used for experiments which are partly described in this thesis (cf. Section 4.11 and Chapter 5). It has also been embedded in a prototypical system for computer-aided language learning [Menzel & Schröder 1998f; Menzel & Schröder 1999, cf. Section 5.4].

4.11 Experiments

A number of experiments have been run in order to empirically examine the properties of the algorithms for WCDG parsing. The Stellingen corpus of approximately 220 utterances has been analyzed with a grammar which makes extensive use of soft constraints (cf. Section 5.2.1 for a description of the grammar and the corpus). No word graphs were used but instead linear sequences of possibly ambiguous words.17 For all runs the computation time has been limited per problem instance. All experiments reported in this section were run on a PC with a 1.1GHz Intel Pentium III processor under the Linux operating system. Most results are given separately for varying problem sizes. To do so we split the corpus into partitions with different numbers of possible constraint values (after unary constraints); Table 4.3 on the next page lists the ranges and the number of instances per class.18 The mean number of values per class is used as an x-value in the figures of this section.

4.11.1 Tree search

Figures 4.21 on the facing page through 4.23 on page 111 give experimental data for the systematic tree search procedure Netsearch. Figure 4.21 on the facing page presents the influence of the agenda size on accuracy and efficiency. The agenda is restricted in size in order to limit the memory requirement of the algorithm.

17See the remark in the previous section about the incapability of Gls to handle 'real' word graphs. Note that a procedure that is prepared to handle word graphs can possibly do worse on linear sequences of words than a similar procedure that cannot handle word graphs at all. However, no evaluation of the increase of the parsing difficulty due to word graphs has been conducted so far. 18A parsing problem which consists of more admissible values is not necessarily more difficult to solve than one with fewer. In practice, the figure is nevertheless a good approximation for parsing difficulty.

Range #val.              No.   Avg. #nodes   Min. #val.   Max. #val.   Avg. #val.
        #val. <=   100     9        55.000           49           95       77.556
  100 < #val. <=   300    73        91.301          100          299      205.301
  300 < #val. <=   500    47       130.702          304          492      386.723
  500 < #val. <=   700    22       156.636          500          699      596.273
  700 < #val. <=   900    17       178.529          701          876      769.059
  900 < #val. <= 1,200    15       193.533          905        1,144    1,018.400
1,200 < #val. <= 1,600    13       231.154        1,226        1,591    1,445.538
1,600 < #val. <= 2,500    14       240.643        1,612        2,456    1,880.714
2,500 < #val.             14       338.000        2,810       10,047    4,275.857

Table 4.3: Problem classes based on number of constraint values: ranges for the number of constraint values; number of problem instances per range; average number of nodes in these ranges; average, minimum and maximum number of values in these ranges.

[Figure: 'Agenda size Netsearch (square, 100s, prio)'; termination time [s] and overall accuracy [%] plotted over problem size for agenda sizes 100, 500, 1k, 5k, 10k, 20k and 50k.]

Figure 4.21: Accuracy and efficiency depending on agenda size for the solution method Netsearch.

Although it is of course hoped for, nothing guarantees that only those parts of the search space are pruned that do not contain solutions. A clear dependency of the required processing time on the agenda size can be observed: the smaller the agenda, the faster the search. The general principle that larger agenda sizes lead to better accuracies because fewer solutions are pruned by mistake can also be confirmed, at least for agenda sizes smaller than 10,000. The larger agenda sizes 20,000 and 50,000 do not differ significantly in efficiency and accuracy, which can be explained by the fact that the limited processing time becomes the primary restriction.

[Figure: 'Time limitation Netsearch (square, 20.000, prio)'; termination time [s] and overall accuracy [%] plotted over problem size for time limits of 6s, 60s, 600s and 6000s.]

Figure 4.22: Accuracy and efficiency depending on the pre-set time limitation for the solution method Netsearch.

Figure 4.22 presents results from an experiment about the impact of the pre-set time limit on accuracy and efficiency. The figure suggests that the procedure Netsearch in fact suffers from the exponential explosion that is typical for combinatorial processes. While nearly no differences among the various runs are visible for the small problem classes, the large problems take much longer for those runs where a high time limit has been set. However, the additional time spent is not wasted; long-lasting runs score better accuracy than runs that are prematurely stopped. Figure 4.23 relates three types of time measures and two types of accuracy. Soft time is the amount of time needed to find a first solution candidate that does not violate a hard constraint. Solution time is the time until the final solution has been found, and termination time is the total time until the process terminates. It can be seen that the final solution is found shortly after the first soft solution and that the majority of the time is spent between the final solution and the termination. This confirms the claim [Freuder & Wallace 1992] that branch-and-bound can supply a solution relatively early; nevertheless no anytime property can be diagnosed because the solution does not steadily improve over time but stays the same for long periods of time (cf. Section 5.5). Accuracy can be given as the mean over all levels or for a particularly important level such as syntax. It can be seen that the accuracy for syntax is slightly worse than the overall one. This can be explained by the fact that all words must have a meaningful governor on the syntax level; in contrast, auxiliary levels such as semantic ones (cf. Section 3.4.6) often only contain meaningful links for a subset of the words. Finally, it should be mentioned that the decreased accuracy for large problems is mainly due to failures to find a solution at all. For all classes but the largest problem size the procedure was able to find solutions for all problems. However, 14.3% of the problems in the last class could not be solved at all by the procedure, which corresponds quite well to an overall accuracy of 84.5%.

[Figure: 'Time/Accuracy Netsearch (square, 20k, prio, 600s)'; soft time, solution time and termination time [s] as well as overall and syntax accuracy [%] plotted over problem size.]

Figure 4.23: Three types of time measures and two types of accuracy for Netsearch: Soft time, solution time and termination time as well as overall accuracy and accuracy for the syntax level alone.

4.11.2 Heuristic search with tabu list

Figure 4.24 on the next page gives, for the heuristic search procedure Frobbing, experimental data analogous to that of Figure 4.23 for Netsearch. It stands out that the quality of the procedure is very high, even for large problems.19 The syntax accuracy is again worse than the overall accuracy, but this time a difference of four percent can be found, which once more can be explained by the different types of failures. While Netsearch fails to find some solutions for the largest class of problems, the heuristic search methods always yield a result. For Netsearch, the lack of a solution candidate results in a smoothing of the differences in parsing difficulty among the levels. Therefore, the gap between syntax and overall accuracy is larger for the heuristic search methods. The percentage of time spent after the final solution is found is smaller than that for the Netsearch procedure.

4.11.3 Heuristic search as guided local search

Figure 4.25 on the following page gives the corresponding experimental data for the heuristic search method Gls.

19Note the different scales in Figures 4.23, 4.24 on the following page and 4.25 on the next page.

[Figure: 'Time/Accuracy Frobbing (0.0)'; soft time, solution time and termination time [s] as well as overall and syntax accuracy [%] plotted over problem size.]

Figure 4.24: Three types of time measures and two types of accuracy for Frobbing: Soft time, solution time and termination time as well as overall accuracy and accuracy for the syntax level alone.

[Figure: 'Time/Accuracy Gls (-g 3)'; soft time, solution time and termination time [s] as well as overall and syntax accuracy [%] plotted over problem size.]

Figure 4.25: Three types of time measures and two types of accuracy for Gls: Soft time, solution time and termination time as well as overall accuracy and accuracy for the syntax level alone.

In general, the shape of the curves is similar to that for Frobbing. However, the procedure tends to run faster at the cost of delivering a slightly worse accuracy than Frobbing. Again, the discrepancy between the overall accuracy and that for syntax alone increases.

4.11.4 Comparison

Figures 4.26 and 4.27 on the following page finally compare the three solution methods directly.

[Figure: three panels showing soft time, solution time and termination time [s] plotted over problem size for Netsearch 60s, Netsearch 600s, Frobbing and Gls; settings: Netsearch (sq., 20k, prio, 60s) / Netsearch (600s) / Frobbing (0.0) / Gls (-g 3).]

Figure 4.26: Comparison of soft, solution and termination time for various solution methods.

The runtime of the systematic tree search Netsearch grows combinatorially with the problem size. Only with the runtime limited to a reasonable amount of time (here 60s) can the procedure compare favorably to the heuristic search methods Frobbing and Gls. It then achieves the best soft time as well as the best solution time and is on par with Gls with regard to the termination time. Note that this is not only an artificial effect due to the failure of Netsearch to provide a solution at all. Only for the largest problems do such failures exist (28.6% for Netsearch 60s, 14.3% for Netsearch

600s). If the runtime is not limited tightly enough, the time needed grows quickly with the problem size (see Netsearch 600s). The heuristic search variants Frobbing and Gls are more 'well-behaved' with respect to scalability. Even though their runtimes grow considerably when the problem size increases, they do not show the combinatorial explosion typical for exponential processes. In general, Gls seems to run faster than Frobbing.

[Figure: two panels showing overall accuracy [%] and syntax accuracy [%] plotted over problem size for Netsearch 60s, Netsearch 600s, Frobbing and Gls; settings: Netsearch (sq., 20k, prio, 60s) / Netsearch (600s) / Frobbing (0.0) / Gls (-g 3).]

(a) Overall accuracy (Netsearch 60s/600s, out of displayed range: 71.3%/85.0%).

(b) Syntax accuracy (Netsearch 60s/600s, out of displayed range: 70.9%/82.7%).

Figure 4.27: Comparison of accuracy for various solution methods.

Figure 4.27 compares the algorithms with regard to the achieved accuracy. The systematic search procedure Netsearch scores the best possible accuracy for small and medium-sized problems (with one exception). It achieves results comparable to those of Frobbing for the second largest problem class, but its accuracy breaks down for the class of the largest problems, which is due to the failure to produce a solution at all for some problems within the time allowed (see above).20 The heuristic search variants Frobbing and Gls both show more constant results, even for the largest problems. However, neither one reaches the absolute correctness of a systematic search for small and medium-sized problems. In general, Frobbing achieves a slightly higher accuracy than Gls. For both heuristic search variants, the distance in accuracy to that of the systematic tree search even increases when moving from the overall accuracy to the syntax accuracy.

4.11.5 Summary

In conclusion, it can be confirmed that the parsing methods for WCDG show the characteristic behaviors of the algorithmic paradigms they are based on. A grammar writer and experimenter can choose the algorithm that best suits his needs. For small and medium-sized problems the systematic tree search method delivers accurate results quickly. For the largest problem class, the heuristic search procedure Frobbing manages to find very good results in a reasonable amount of time, and if time is absolutely crucial Gls returns results that are still good even faster.

4.12 Incremental mode of operation

Section 4.5.2 outlined a general incremental mode of operation for a solution method for constraint satisfaction problems. The rationale behind such a procedure is that values that are judged inconsistent based on the constraints among a subset of variables can never become consistent again, even if more variables have to be considered. This monotonic behavior of constraint satisfaction makes CSPs promising candidates for incremental solution methods. In the case of natural language parsing, however, an incremental mode not only means that not all variables are available when the computation starts, it also means that not even all values for the available variables are known. Variables roughly correspond to words in an utterance; values to dependency relations between words. A dependency link of the first word in a sentence to the last word can only be postulated and checked after the last word is available.21 This intertwining of variables and values complicates the incremental case for WCDG parsing. The general case of dynamic CSP [Dechter & Dechter 1988], where arbitrary elements of the CSP can change over time, can, for instance, be solved by a variant of the dynamic backtracking algorithm [Verfaillie & Schiex 1994a]. Mitjushin [1992] identifies the reason why dependency structures are well-suited for incremental analysis:

“The reason is that the dependency structure of a segment is the same both when the segment is considered as isolated and when it is considered as a part of some

20Note that Netsearch never delivers solutions which violate any hard constraints while the heuristic search procedure does. Consequently Netsearch always scores zero percent accuracy for these problems, although it could easily get at least 47.9% which is the easily achievable baseline accuracy for this kind of measure (cf. Section 5.9.4).

21This double role of words both as variables and as values is also the reason why consistency-based methods based on arc consistency (cf. Section 4.6) need O(n^4) time where n is the number of words, while arc consistency itself can be achieved in O(m^2) where m is the number of constraint variables.

sentence [...] In our opinion, this comparison demonstrates that, in a certain sense, dependency structures reflect the incremental nature of sentence comprehension from left to right better than constituent structures do.” —— Mitjushin [1992, p. 930]

Incremental natural language parsers can be divided into two classes: either they always pursue all structural possibilities and only extend the set of hypotheses, or they commit themselves relatively early to one or a small set of possibilities and remove the remaining ones. Chart parsers [Wirén 1992; Amtrup 1999] are typical examples of the first class. Deterministic parsers [Marcus 1987] and parsers with an early commitment strategy as well as repair mechanisms [Lombardo 1992] fall into the second. In the context of WCDG, Schulz [1999] realized a parser of the first kind that utilizes an underspecified word node which temporarily represents all future possibilities. A WCDG parser of the second kind, described by Foth et al. [2000], robustly parses larger and larger prefixes of an utterance into intermediate structures. It takes advantage of the optimal structure of the previous smaller prefix to efficiently compute the structure of the next larger prefix by only allowing a limited amount of modifications and extensions of the previous structure. Verfaillie & Schiex [1994b] propose a similar re-use of intermediate solutions in the context of general dynamic constraint satisfaction problems. Figure 4.28 presents some results of such a parser with regard to the achieved accuracy and the computation time needed.
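Before turning to the results, the prefix re-use idea of the second, early-commitment kind of parser can be sketched as follows. This is a hedged illustration only: the function parse_prefix and the seeding interface are assumptions of this sketch, not the interface of the parser described by Foth et al. [2000].

```python
def parse_incrementally(words, parse_prefix):
    """Early-commitment incremental parsing sketch: each new word triggers a
    re-analysis of the grown prefix, but the search is seeded with the optimal
    structure of the previous, smaller prefix so that only a limited amount of
    modification and extension needs to be explored.
    parse_prefix(prefix, seed) -> best structure for the prefix (assumed helper).
    """
    structure = None                      # no analysis yet
    prefix = []
    for word in words:
        prefix.append(word)
        # re-use the previous optimal structure as the starting point
        structure = parse_prefix(list(prefix), seed=structure)
        yield structure                   # intermediate result per increment
```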

[Figure: bar chart showing the time of the incremental method relative to the non-incremental one, the absolute time for the incremental method (sentence length 16-20 set to 100%) and the accuracy, grouped by sentence length (1-5, 6-10, 11-15, 16-20, overall).]

Figure 4.28: Incremental WCDG parsing: Processing time and accuracy vs. sentence length. Numbers extracted from Foth et al. [2000].

The following observations can be made:

£ The accuracy of the incremental parser is a little worse than that of the non-incremental one. The difference can be explained by the fact that only a subset of all structural

possibilities are taken into account during re-analysis in incremental mode. The solution to the previous prefix and the update heuristics determine which dependency links are not even considered. Naturally, the heuristics make mistakes occasionally so that solutions are missed.

£ For small problems, the incremental parser is less efficient than the non-incremental one. The reason for this behavior is, of course, that a non-incremental analysis is reasonably fast for small problems. Only for larger problems can the additional effort that is needed for re-analysis in the incremental case be compensated for by the reduction of work that results from only smaller problems having to be solved at each step. In these important cases, however, a considerable amount of time can be saved: the incremental procedure is ten times faster than the non-incremental one for sentences longer than sixteen words.

£ Even with the incremental parser, longer sentences take considerably longer to parse than shorter ones. Of course, no final statement can be made from only four data points, but it seems as if the growth rate of the runtime is super-linear, i.e., at least polynomial if not even exponential. This means that, although the suggested modification is a step towards the goal of a parser whose runtime grows only linearly with the number of words, this goal has not yet been achieved.

4.13 Optimizations

This section describes various modifications of the program system that improve the practical runtime. Their applicability depends on the specific application, and they cannot change the theoretical order of growth of the solution methods. Nevertheless, they are indispensable not only for real applications but also for systematic experimentation, because only the resulting speed-up makes it possible to put the program system to practical use. They allow the user to run experiments much more efficiently. We merely touch the surface of program optimizations in this section since literally hundreds of code improvements have found their way into the system during the period of its still ongoing development. Note that not all steps that seem promising at first automatically lead to an improved runtime. Only experimental evaluations can determine the actual impact and thus the practical use.

4.13.1 Sorting constraint nodes

A well-known heuristic for improving runtime efficiency listed in the constraint satisfaction literature [Meseguer 1989, for instance] is to change the order in which the constraint nodes are considered for assignment (cf. heuristic information in Section 4.2). Only the branch-and-bound search method is influenced by these heuristics because the heuristic search methods do not have a notion of a strict sequential assignment of values to nodes. Three different orderings have been tried.

£ Natural order: The constraint variables are assigned in the order in which their corresponding words appear in the word graph. For one and the same word, the nodes are sorted in the order in which the levels are declared in the grammar.

£ Domain size: In this ordering, the variable with the smallest number of domain values is used first, and the more values are available the later the node is accessed. The motivation for this heuristic is that one tries to keep the search tree narrow at the top and wider at the bottom in order to reduce the number of search nodes.

£ Level priority: As discussed in, for instance, Section 3.4.1, not all levels are equally important. Only some carry real linguistic information, and it is up to the grammar writer to decide which levels are of central importance. Therefore, the user of the system can associate priorities with levels, and these priorities are used in this ordering scheme. A grammar writer might decide to include more (constraining) information on a syntactic level; in this case the syntactic nodes should be assigned first so that the effective constraints can be applied early.
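The three orderings above amount to different sort keys over the constraint nodes. The following sketch makes this explicit; the attribute names (position, level, domain) and the level_priority mapping are assumptions of this illustration, not the actual data structures of the system.

```python
def order_nodes(nodes, strategy, level_priority=None):
    """Order constraint nodes for assignment in the tree search.
    Each node is assumed to carry .position (word position in the word graph),
    .level (level name) and .domain (list of admissible values); sorting is
    stable, so ties keep the order in which the nodes were supplied."""
    if strategy == "natural":
        # order of appearance in the word graph, levels as declared
        return sorted(nodes, key=lambda n: n.position)
    if strategy == "domain":
        # smallest domain first: keep the search tree narrow at the top
        return sorted(nodes, key=lambda n: len(n.domain))
    if strategy == "priority":
        # user-defined level priorities, most constrained levels first
        return sorted(nodes, key=lambda n: level_priority[n.level])
    raise ValueError("unknown ordering strategy: %s" % strategy)
```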

Figure 4.29 presents the impact of the different node orderings on efficiency and accuracy.

[Figure: 'Node ordering Netsearch (sq., 20k, 100s)'; time [s] and accuracy [%] plotted over problem size for the natural, domain size and priority orderings.]

Figure 4.29: Impact of various node ordering strategies for Netsearch.

Surprisingly, the general heuristic of assigning variables with small domains first is clearly the worst of all alternatives. It not only needs more computation time but also achieves only a reduced accuracy. The node ordering that is determined by the level priority22 yields the best results. The natural ordering is slightly worse than the best ordering for both efficiency and accuracy.

22Priorities have been determined by the number of binary constraints that are applicable to the corresponding level in order to approximate the constrainedness and importance of the level. For the Stellingen grammar, this resulted in the following order: SYN (syntax), OBL4, OBL1, OBL2, OBL3 (auxiliary levels for obligatoriness conditions), VORFELD (word order), SUB (subordinate clauses), AGENT, EXP, PATIENT, QUALITY, THEME (thematic levels).

4.13.2 Agenda normalization

In Section 4.3.2 where systematic tree search was introduced we discussed three basic agenda strategies (cf. Figure 4.5 on page 85):

£ A queue as the agenda leads to a breadth-first traversal of the search tree, which is not appropriate for WCDG parsing since all incomplete analyses have to be constructed before the first complete one is encountered, which makes the method intractable for large problems.

£ Depth-first search only needs a lean stack as the agenda. However, it cannot be used for WCDG parsing because the valuable information about the score of partial solutions is almost completely ignored (except for value ordering within the constraint variables).

£ An agenda that is sorted by the score of the partial assignments results in a best-first search which is generally employed in the systematic tree search procedure Netsearch.

A number of feasible variations of the naïve best-first method take into account that two partial solution candidates with equal score but different numbers of still missing assignments should be treated differently since the expected penalties resulting from future constraint violations of yet unassigned variables are different.

Three adjustment functions have been tried that map the score s of a partial assignment to a score f(s) used for agenda sorting.

£ Off: This is the naïve best-first approach: f(s) = s.

£ Linear: Since a partial assignment accumulates the violations of all assignments, this strategy multiplies the score by the number of assignments made so far: f(s) = n * s, where n is the number of assignments made so far.

£ Square: This approach uses the n-th root since penalties are aggregated multiplicatively in the WCDG framework: f(s) = s^(1/n), where n is the number of assignments made so far.
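The three adjustments can be written down directly as a small helper used when sorting the agenda. This is a plain illustration of the formulas above; the function and parameter names are chosen for this sketch only.

```python
def adjusted_score(score, n_assigned, scheme="square"):
    """Map the multiplicative score of a partial assignment (in [0, 1],
    higher is better) to the value used for agenda sorting;
    n_assigned is the number of assignments made so far."""
    if scheme == "off":                    # naive best-first
        return score
    if scheme == "linear":                 # weight by number of assignments
        return n_assigned * score
    if scheme == "square":                 # n-th root of the accumulated score
        return score ** (1.0 / max(n_assigned, 1))
    raise ValueError("unknown adjustment scheme: %s" % scheme)
```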

Figure 4.30 on the next page summarizes the impact of the various normalizations on efficiency and accuracy.

Although the results are not as dramatic as for the node ordering schemes, they nevertheless provide a clear picture. The adjustment that uses the n-th root of the assignment score as the basis for agenda ordering outperforms the alternatives in terms of both efficiency and accuracy. The linear adjustment gets a medium placing.

The search algorithm obviously benefits from a strategy that mixes aspects of depth-first and best-first search. The more relatively complete assignments are preferred, the more the behavior resembles depth-first search. The n-th root method is closer to depth-first than the linear adjustment, which in turn is closer than the naïve approach.

[Figure: 'Agenda normalization Netsearch (100s, 20k, prio, 60s)'; time [s] and accuracy [%] plotted over problem size for the off, linear and square adjustments.]

Figure 4.30: Impact of various agenda normalization schemes for Netsearch.

4.13.3 Constraint indexing

Constraints are usually tailored to specific levels. General constraints that are valid for all or at least some levels are relatively rare.23 Therefore, constraints usually take the form of an implication where the premise ensures that the currently tested dependency edges belong to the correct level.

C-35 {X, Y} : UnderVerbVorfeld : Vorfeld :
  X.level=SYNTAX & Y.level=VORFELD & X^id=Y@id & X@pos < X^pos & Y@cat=FINVERB
  -> Y^id=X@id

Constraint C-16 on page 60, repeated here as Constraint C-35, is a typical example. The constraint is only relevant if the first edge is a syntactic one and the second one belongs to the level VORFELD. In all other cases the constraint is trivially satisfied because the premise evaluates to false. Since the levels of the dependency edges are known during constraint checking, constraints can be indexed by their levels before the actual evaluation. Then, only the relevant constraints are touched and only a small portion of the constraint evaluations has to be carried out. This optimization cannot change the operational semantics of the system since irrelevant constraints are always satisfied and do not contribute to the assessment of a configuration.24

23One exception to this rule are constraints that deal with general properties of structures like acyclicity or projectivity (cf. Section 3.4.5). 24Side effects actually can lead to a modified behavior of the indexed constraints. The predicate print() (cf. Section 2.5.2) is the only source of side effects in the constraint language. It is mainly used while debugging a grammar.

23One exception to this rule are constraints that deal with general properties of structures like acyclicity or projectivity (cf. Section 3.4.5). 24Side effects actually can lead to a modified behavior of the indexed constraints. The predicate print() (cf. Section 2.5.2) is the only source of side effects in the constraint language. It is mainly used while debugging a grammar. 4.13 Optimizations 121

This idea can be extended beyond an index based on levels. Often a specific spatial configuration of a dependency edge or of two dependency edges is a prerequisite for a constraint to be applicable. In addition to the level membership, this information can be used as an index key.
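The indexing scheme can be pictured as a simple lookup table from configuration keys to the constraints that can possibly fail there. The sketch below is an illustration only; the attribute names (.levels, .connexion) and the catch-all handling are assumptions and simplify the actual implementation.

```python
from collections import defaultdict

def build_constraint_index(constraints):
    """Index constraints by the levels and the configuration mentioned in
    their signature so that only potentially violable constraints are
    evaluated for a given (pair of) dependency edge(s).  Each constraint is
    assumed to expose .levels (tuple of level names, length 1 or 2) and
    .connexion (one of ',', '/\\', '\\/', '||', '==', '~=', '\\', '/')."""
    index = defaultdict(list)
    for c in constraints:
        index[(c.levels, c.connexion)].append(c)
    return index

def relevant_constraints(index, levels, connexion):
    """Constraints that may fail for this configuration; all others are
    trivially satisfied and can be skipped.  The unrestricted connexion ','
    acts as a catch-all for the given levels."""
    return index.get((levels, connexion), []) + index.get((levels, ','), [])
```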

Extension of the constraint language

The syntax of constraints (cf. Table 2.2 on page 31) has been slightly extended to separate the information that can be used for indexing from the main formula. Conditions regarding levels or spatial configurations are moved into the signature of the constraints. Table 4.4 lists the extensions in EBNF.

Connexion ::= ',' | '/\' | '\/' | '||' | '==' | '~=' | '\' | '/'
Direction ::= ':' | '|' | '\' | '/' | '!'
Formula   ::= Variable Connexion Variable | Variable Direction
VarInfo   ::= Variable Direction String
VarList   ::= VarInfo | VarInfo Connexion VarInfo

Table 4.4: Extension (in EBNF notation) of the syntax of the constraint language for constraint indexing. These rules supplement or replace those in Table 2.2 on page 31. Given this extension, the limitation to at most binary constraints (cf. Section 2.6) is anchored even in the syntax of the constraint language.

The signature of a constraint not only introduces the placeholders for the variable bindings (often X and Y), but also restricts the applicability of the constraint to certain configurations, e.g., to variables belonging to a certain level. Two kinds of restrictions can be used in the signature, and both have an equivalent formulation in the formula body. If the syntactic element VarInfo (cf. Table 4.4) in the signature of a constraint has, for instance, the form X!FOO and the formula is F, then the restriction of the signature can also be written as (X! & X.level=FOO) -> F, i.e., the information is a prerequisite for the constraint to be violated at all. These limitations of the orientation of a dependency edge are called directions. In binary constraints, the signature may contain information about the spatial configuration of the two involved edges. A restriction X:FOO /\ Y:BAR in a signature can be moved to the formula by using (X.level=FOO & Y.level=BAR & X^id=Y^id) -> F instead of the original formula F. These limitations of the configuration of two dependency edges are called connexions.25

Direction

Directions describe the spatial orientation of a single dependency edge. The following enumeration lists the possible forms and the equivalent formulations.

£ X| (equivalent to root(X^id)) The edge has no governor, i.e., the corresponding word is the root.

25Note that the element 'connexion' of our constraint language is related but not identical to the general concept that has been introduced in Section 3.2.

£ X! (equivalent to ~root(X^id)) The edge has a real governor, i.e., it is not the root.

£ X/ (equivalent to distance(X^id, X@id)<0) The edge leads to the right, i.e., the governor is a word to the right of the dependant.

£ X\ (equivalent to distance(X^id, X@id)>0) The edge leads to the left, i.e., the governor is a word to the left of the dependant.

£ X: (equivalent to true) No further restriction.

Connexion

Connexions describe the 'spatial' configuration of two dependency edges. The following enumeration lists the possible forms, equivalent formulations in parentheses, descriptions and example constraints. For some cases a violating configuration is provided as a side figure.

£ X/\Y (equivalent to X^id=Y^id) The governors of both edges are identical.

C-36 {X!SYN/\Y!SYN} : subj_uniq : 0.0 :
  X.label=subj -> Y.label!=subj;
At most one subject per verb.
[Side figure: violating configuration 'Peter loves Clara' with two subj edges X and Y under the same verb.]

£ X\/Y (equivalent to X@id=Y@id) The dependants of both edges are identical.

C-37 {X:THEME\/Y!OBL2} : THEME_OBL2_parallel : 0.1 :
  Y@args:2:role=THEME -> X==Y;
An auxiliary level edge for obligatory complements follows a thematic edge (if required by the lexical entry).
[Side figure: violating configuration 'Peter cooks the dinner' with a THEME edge X and an OBL2 edge Y that are not parallel; syntactic labels subj and obj.]

£ X||Y (equivalent to X^id!=Y^id & X^id!=Y@id & X@id!=Y^id & X@id!=Y@id) All involved dependants and governors are disjoint.

C-38 {X!SYN||Y!SYN} : projectivity : 0.0 :
  min(X@from, X^from) < min(Y@from, Y^from) ->
    max(X@from, X^from) > max(Y@from, Y^from) |
    max(X@from, X^from) < min(Y@from, Y^from);
Dependency edges do not cross (cf. Constraint C-20 on page 62).
[Side figure: violating configuration '... den wir gemeint haben ...' (that we meant have (aux); '... that we have meant') with crossing edges X and Y.]

£ X\Y (equivalent to X^id=Y@id)

The governor of X is identical to the dependant of Y, i.e., there exists a dependency chain X@id -> X^id = Y@id -> Y^id.

C-39 {X!SYN\Y!SYN} : subj_case_art_verb : 0.1 :
  X@cat=ART & X^cat=NOUN & Y.label=subj ->
    Y^subcat:subj:case=X@case;
The article of a (subject) complement noun (maybe underspecified for case) agrees with the verb with regard to case (cf. Section 3.4.2 on page 57).
[Side figure: violating configuration 'Den (acc) Hund bellt' ('The dog barks') with a det edge X and a subj edge Y.]

£ X/Y (equivalent to X@id=Y^id)

The governor of Y is identical to the dependant of X, i.e., there exists a dependency chain Y@id -> Y^id = X@id -> X^id.

£ X==Y (equivalent to X^id=Y^id & X@id=Y@id) The edges are parallel edges on different levels (cf. Constraint C-37 on the preceding page).

£ X~=Y (equivalent to X^id=Y@id & X@id=Y^id) The edges are inverse to each other.

£ X, Y (equivalent to true) No further restriction.
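The equivalences listed above can be stated as executable checks over two edges. The following sketch is for illustration only; it assumes that each edge is represented as a pair (governor id, dependant id), i.e., (X^id, X@id), which is a simplification of the actual data structures.

```python
def connexion_holds(conn, x, y):
    """Check a connexion between two dependency edges, each given as a
    (governor_id, dependant_id) pair, i.e. (X^id, X@id)."""
    (xg, xd), (yg, yd) = x, y
    checks = {
        '/\\': xg == yg,                                        # same governor
        '\\/': xd == yd,                                        # same dependant
        '||': xg != yg and xg != yd and xd != yg and xd != yd,  # fully disjoint
        '\\': xg == yd,                                         # X@id -> X^id = Y@id -> Y^id
        '/': xd == yg,                                          # Y@id -> Y^id = X@id -> X^id
        '==': xg == yg and xd == yd,                            # parallel edges
        '~=': xg == yd and xd == yg,                            # inverse edges
        ',': True,                                              # no restriction
    }
    return checks[conn]
```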

Example

Constraint C-40 is semantically equivalent to Constraint C-35 on page 120 but has some information moved from the formula body to the signature of the constraint. The complete information of level membership, spatial configuration of the edge X and how the two edges are arranged can then be used to compute an index so that only relevant edges have to be checked for compatibility with this constraint.

C-40 {X/SYNTAX\Y:VORFELD} : UnderVerbVorfeld : Vorfeld : Y@cat=FINVERB -> Y^id=X@id

Evaluation

Figure 4.31 on the following page shows the results of an experiment that deals with the impact of two different types of constraint indexing on the efficiency of the system. No indexing is used as a baseline to which a level indexing and a full indexing scheme are compared. The comparison is done for three operations (application of unary constraints only and complete analysis using tree search as well as heuristic search), and two measures (time and number of constraint evaluations) are given. Taking into account that 12 levels were used in the grammar and assuming an (over-optimistic) uniform distribution of constraints to levels, the theoretical lower bound for the level indexing method is 8.3% for unary constraints and 0.7% for binary ones (systematic tree and heuristic search). It turns out that these minima cannot be reached due to asymmetries in the grammar, but a huge gain in efficiency can still be observed. Of course, binary constraint evaluations benefit much more than unary ones because there are many more possible indices if one considers the combination of two levels and configurations instead of only a single level and configuration. The bigger part of the improvement is already achieved by the level index, but the full index nevertheless facilitates an additional enhancement. The number of constraint evaluations benefits more than the time requirement because the indexing directly reduces the number of evaluations, while the practical runtime is influenced by other factors as well.

[Figure: bar charts comparing no indexing (null), level indexing and full indexing for unary constraints, tree search and heuristic search with tabu list; panel (a) relative time requirements, panel (b) relative number of constraint evaluations.]

Figure 4.31: Relative runtime and relative number of constraint evaluations for two types of constraint indexing compared to the baseline with no indexing. The black bars represent the means over a corpus while the small gray bars on the left and the right give the minima and maxima respectively.

The heuristic search method benefits most from this optimization because it always evaluates the full set of constraints for a given configuration, even if a hard constraint is already violated. In contrast, the systematic tree search immediately abandons the branch when a hard constraint is violated.

4.13.4 Further optimizations

This section describes some additional optimizations that have been implemented in the WCDG grammar system.

£ Compact level values: Every implementation of the solution methods must have a representation of the combination of a word form (including the lexical entry) and a level. These items are the elementary building blocks from which the dependency edges are constructed. The constraint variables of Section 2.4 on page 27 are closely related to these units, too. However, for a large number of words, lexical variations exist that only differ in a small number of features. For instance, run as a finite verb occurs with the morphosyntactic feature combinations 1st-sg, 2nd-sg, 1st-pl, 2nd-pl and 3rd-pl. A similar situation can occur in word graphs where edges that cover the same time span resemble each other, such as the personal pronouns she and he that are distinguished by a single feature gender. Furthermore, only a small portion of the levels (or more precisely: constraints defined for these levels) refer to these distinguishing features. Therefore, it makes sense to fold those units consisting of a lexical entry and a level where the lexical entries are equivalent with respect to the level, i.e., no constraint exists that behaves differently for the lexical entries on this level. A good example is provided by semantic homonyms, i.e., word forms with identical syntactic features but different semantic features like bank (financial institution and riverside). All levels that do not deal with semantic information can safely fold the two readings, thus saving a lot of work. Foth [1998b] gives a detailed description of compact levels and how optimal partitions of lexical entries can be constructed.

£ Indexing of lexical information: Recent years have seen a trend to lexicalization of natural language grammars [Pollard & Sag 1994, for instance] and WCDGs also offer this possibility to the grammar writer (cf. Section 3.4.1 on page 53 for an example). Lexical entries are recursive data structures in our system (cf. Section 2.3.2 on page 22) and constraint evaluations often require the extraction of certain pieces of information that are embedded in these entries. In order to avoid the expensive recursive descent during constraint evaluation, all lexical entries are flattened (cf. Figure 2.3b on page 24) while they are read into the system and all paths within lexical entries and within constraints are mapped to unambiguous indices. This technique reduces the extraction to a simple array lookup. A similar technique (but in a different context) is used for the compilation of attribute-value matrices of the grammar formalism ALE [Carpenter & Penn 1994] into first order terms of the programming language Prolog.

£ Constraint compiler: Constraint evaluations are one of the most frequently executed operations in a constraint system. Hence, they are often used as a measure of how efficient a specific solution method is. In the WCDG system, for instance, an analysis may need several millions of these evaluations. Since the constraints (or more exactly: their formulae) are recursively defined (cf. Table 2.2 on page 31), an evaluation usually is the application of an also recursively defined interpretation function to the formula and the dependency edge(s) to be tested. This interpretation process returns whether the constraint is satisfied or violated. In order to accelerate the constraint evaluation, a compiler has been implemented which translates the constraints into C functions, then calls an appropriate C compiler and dynamically loads the object code into the address space of the WCDG parser. Thereafter all constraints are evaluated by the native functions instead of the

recursive interpretation process. The integration of compiled constraints is seamless, i.e., compiling, loading and unloading are all online operations that do not require any modification of the system, in particular no re-compilation. Switching back and forth between interpreted and compiled mode as well as mixing compiled and interpreted constraints is possible. Pre-compiled object files of constraints can be re-used in later parser runs and are automatically recognized during the load operation.26 Details about the implementation can be found in Hagenström [2001].

£ Symbol table: Finally, as a further example of program optimization, it should be mentioned that the whole WCDG system operates on symbols (stored in a system-wide table) instead of strings. Symbols are efficient representatives of strings and, in particular, can be compared in time that is independent of the string length. Since string comparisons are very frequent, a considerable speed-up can be observed.

£ Hashing constraint evaluation: Constraint evaluation is a frequent operation. Hashing the results and returning the pre-computed information if the information is requested a second time therefore saves some computation time. The different solution methods benefit to different degrees.
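The last of these optimizations, hashing constraint evaluations, is essentially memoization and can be sketched as follows. The wrapper and its argument conventions are assumptions of this illustration, not the actual implementation; in particular it presupposes that evaluation is deterministic and side-effect free and that the arguments are hashable.

```python
def make_cached_evaluator(evaluate):
    """Wrap a constraint evaluation function with a result cache so that a
    repeated request for the same (constraint, edges) pair returns the
    stored result instead of re-evaluating the formula."""
    cache = {}
    def cached(constraint, edges):
        key = (id(constraint), edges)
        if key not in cache:
            cache[key] = evaluate(constraint, edges)
        return cache[key]
    return cached
```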

Figure 4.32 on the facing page presents comparative results for the impact that the compact level values (labeled II), the constraint compiler (labeled III) and the indexing of the lexical information (labeled IV) have on the efficiency of the system. The numbers are relative to a baseline with all optimizations disabled (labeled I). This baseline already corresponds to the best results of Figure 4.31 on page 124. The improvements are at least partially orthogonal to each other, as can be seen from the results for the combined optimizations (labeled V). Two measures for the efficiency are used again: runtime in Figure 4.32a on the facing page and number of constraint evaluations in Figure 4.32b on the next page. Of course, only the compact representation of level values actually reduces the number of constraint evaluations; the indexing of lexical information and the constraint compiler merely speed them up. Since the runtime of the parsing has been limited in the evaluations, the number of constraint evaluations sometimes actually increases for all optimizations other than the compact representation of level values: more constraints can be evaluated in the fixed amount of time since the evaluation itself is faster.

4.14 Related work

Algorithms for general constraint satisfaction problems and constraint satisfaction optimization problems have been studied extensively. There exists an abundance of literature, for instance general overview articles [Meseguer 1989; Kumar 1992; Miguel & Shen 2001b; Miguel & Shen 2001a], monographs about systematic search [Kanal & Kumar 1988] as well as about heuristic search [Aarts & Lenstra 1997b; Voß et al. 1999] and textbooks [Tsang 1993; Hromkovič 1998], all of which contain further references. Wallace & Freuder [1995a] compare heuristic (min-conflicts, walk-minconf, retry-minconf) and systematic (branch-and-bound, forward checking) methods

26Harper et al. [1995, p. 848] already describe a similar compiler for their CDG constraints which, however, seems to lack a seamless integration into the parser.

[Figure: bar charts comparing the optimizations I-V for unary constraints, tree search and heuristic search with tabu list; panel (a) relative time requirements, panel (b) relative number of constraint evaluations.]

Figure 4.32: Relative runtime and relative number of constraint evaluations for three optimizations and for the combination of them: (I) no optimization (baseline), (II) compact level values, (III) constraint compiler, (IV) indexing of lexical information and (V) combined (II)-(IV). The wide black bars are the means over a corpus and the small gray bars indicate the range [μ-σ, μ+σ], where μ is the mean and σ is the standard deviation.

for CSP solving with regard to their anytime properties. Probabilistic performance profiles are given. They [Wallace & Freuder 1995b] suggest using heuristic methods first to find a good approximation and continuing with systematic techniques to ensure optimality. Gomes, Selman & Kautz [1998] add randomization to complete, systematic, backtrack-style search procedures in order to counter the unpredictability of the running time, which may seem promising for the application to WCDG parsing as well. If the heuristic function ranks a number of possible actions (e.g., extensions of a partial solution) as lying within a given percentage of the best assessment, then the next action is selected randomly among those actions. Additionally a cutoff parameter is used to restart the search whenever it seems to be stuck. Bookkeeping makes sure that no part of the search space is explored more than once. The cutoff parameter, however, does limit the size of the space that can be searched exhaustively between restarts. The authors give a formal account for so-called heavy-tail problems and for the restart strategy.

Waltz [1975] first suggested the filtering algorithm that Maruyama [1990b] proposed for CDG parsing. But only the extension to multiply-segmented constraint satisfaction problems [Zoltowski et al. 1992; Helzerman & Harper 1996; Harper et al. 1999b] made the application to lexically ambiguous sentences and word graphs feasible for the consistency-based methods (cf. Section 4.6).

4.15 Summary

Algorithms for WCDG parsing have been studied. Standard techniques of CSP and CSOP solving have been described: consistency-based methods, systematic tree search and heuristic search. Criteria for an evaluation of algorithms have been developed and typical characteristics of these algorithmic families have been identified. The standard techniques have then been applied to and adapted for the WCDG parsing problem; experimental evaluations have been conducted.

£ Consistency-based methods (cf. Section 4.6) have proven problematic for WCDG parsing since no definition of consistency is available for soft constraints which, on the one hand, filters out a considerable amount of structural alternatives and, on the other hand, does not wrongly delete (too many) values based on local decisions.

£ Heuristic pruning (cf. Section 4.7) is a mixture between local consistency and global search in that it takes more information into account for the decision which candidate to discard next but cannot guarantee global correctness. Although different pruning schemes were tried and limited success was achieved, heuristic pruning has not been identified as a satisfactory solution method for WCDG parsing.

£ Systematic tree search (cf. Section 4.8) can solve small and medium-sized WCDG problems quickly and absolutely reliably. For the largest class of problems, however, the theoretical exponential complexity shows up in practice, and an external limitation of the processing time, of the agenda size, or of both is required in practical cases.

£ Heuristic search (cf. Section 4.9) solves even large WCDG parsing problems in moderate time and yields only slightly suboptimal quality for all problem classes. However, it cannot prove the optimality of a solution and depends on external termination criteria.

An application can therefore base its decision of which search method to employ on the varying characteristics that have been found and on the actual requirements. Preliminary experiments with an incremental mode of operation showed that the early commitment strategy can lead to increased system performance without a considerable loss of accuracy. Finally, a set of necessary optimizations of the algorithms has been described and evaluated.

Chapter 5

Experimental Results

5.1 Overview

This chapter presents results of various experiments within the weighted constraint dependency grammar (WCDG) framework. It will be shown that the proposed approach to gradation in natural language is well suited to handling the soft conditions of natural language listed in the introduction, such as grammatical deviation, preference, multi-level interdependence and (external) uncertain information. Some attractive properties that are due to the concept of gradation can be confirmed for WCDG parsing.

Section 5.2 describes the experimental environment; the corpus, the grammar and evaluation criteria that were used in most of the experiments throughout this chapter are introduced.

Section 5.3 deals with robust parsing with WCDGs.

In Section 5.4, a diagnosis component capable of detecting errors in language use is discussed that is based on a robust WCDG parser.

The anytime property of algorithms that we discussed in Section 4.4.3 is revisited in Section 5.5 and experiments with both the basic methods and meta-control mechanisms are described.

The subsequent Section 5.6 presents details about what influence the integration of acoustic weights from a speech recognizer has on the parsing process. Similarly, Section 5.7 outlines the integration of another external component, namely a probabilistic part-of-speech tagger, into the parsing system.

In Section 5.8, the weights of a small number of conflicting soft constraints that model the isolated linguistic phenomenon of extraposition of German relative clauses are estimated from an annotated corpus.

Section 5.9 describes an experiment in automatic learning of a complete set of WCDG constraint weights.


5.2 Experimental environment

5.2.1 Stellingen grammar and corpus

This section introduces the grammar and the corpora we used in the experiments described in Section 4.11 and throughout this chapter. Two factors prevent the use of standard corpora, such as the English Penn Tree Bank [Marcus, Santorini & Marcinkiewicz 1993] or the German NEGRA corpus [Skut et al. 1997b; Brants, Skut & Uszkoreit 1999], as the domain of WCDG parsing in this thesis:

1. Style and depth of annotation: An analysis with weighted constraint dependency grammars parses a natural language utterance into multiple dependency analyses on different representational levels. While the problem of a dependency model being different from the phrase structure model of standard corpora can be overcome by a mapping [Collins 1996; Collins 1997], the amount of annotated information cannot easily be enlarged. In particular, information about the semantic structure needed for a WCDG annotation is not available. Even worse, nearly no assumption about the domain of general-purpose corpora can be made so that an inclusion of domain knowledge is rarely feasible.

2. Grammar development: Designing and implementing a WCDG is a labor-intensive task that essentially still has to be done manually (but see Sections 5.8 and 5.9). While very large corpora are an advantage for automatically generated grammar systems (such as probabilistic grammars) because they allow for a more precise training of the parameters, we had to limit the set of linguistic phenomena present in the corpus in order to keep manual grammar development feasible.

Therefore we developed our own grammar and corpus, named the Stellingen grammar and corpus, for testing of and experiments with the WCDG parsing system. The grammar is based on the general techniques described in Chapter 3. It is the successor to an earlier WCDG grammar [Heinecke & Nolda 1998; Nolda & Heinecke 1999; Heinecke & Schröder 1998] but departs from its predecessor in so far as it focuses less on the traditional notion of crisp grammaticality and is more concerned with actual language data, practical solutions and, in particular, soft constraints and preferences. It does not try to accept the sentences of the German language but aims at assigning the most plausible structural description to any input. The expected input format is a sequence of words, i.e., unambiguous word forms (containing no acoustic alternatives) which, however, have been extracted from spoken language dialogues. That does not mean that the grammar or parser is limited to word sequences. In fact, the sequences have been enriched with alternatives originating from a finite-state chunker which identifies likely non-recursive phrases, mainly for date and time expressions. Hence, the typical input consists of small word graphs with some overlapping arcs labeled with artificial words such as DATE. The grammar features twelve representational levels: syntax (SYN), four auxiliary levels for modeling obligatoriness (OBL1, OBL2, OBL3, OBL4), word order of the vorfeld (VORFELD), an auxiliary level for subordinate clauses and miscellaneous purposes (SUB) and five semantic levels modeling thematic roles (AGENT, EXP, PATIENT, QUALITY, THEME). Some 500 constraints are contained in the grammar; 44% are hard, 56% soft constraints; 14% of the soft constraints are dynamic constraints; 65% of the constraints are unary, 35% binary.

The lexicon consists of some 1,400 word forms which expand into 7,000 unambiguous lexical entries, i.e., the average lexical ambiguity on the feature level is 4.97 entries per word form.

The employed test corpus consists of approximately 220 utterances which have been taken from the Verbmobil [Wahlster 2000] corpus of appointment-scheduling dialogues. However, the Verbmobil turns of spontaneously spoken language have been hand-segmented into 'sentence-like' sections. Interjections and punctuation marks (if any were introduced by the annotators) have been removed. For instance, Examples E-66 and E-67 are extracted from the original Verbmobil annotation g460ac of Example E-65.

E-65 ja , Wochenende wei"s ich noch nicht . ich hab’ mal meinen Kalender <:<#> mitgebracht:> . schauen wir mal , wann Sie Zeit haben . <#> E-66 ich habe mal meinen Kalender mitgebracht E-67 schauen wir mal wann Sie Zeit haben

The sentences have an average length of 8.7 words (sd 3.96, min 3, max 29). There are 442 word types and 1,941 word tokens in the corpus. The corpus has a mean lexical ambiguity of 3.53 lex- ical entries per word form. See Section 5.7.2 for further corpus statistics.

The grammar has been designed to minimize modeling errors, i.e., it assigns the highest grammar score to the desired structure that has been included in the annotations. However, it could not be guaranteed that the scores of two arbitrary suboptimal structures always reflect their ranking with respect to their accuracies when compared to the annotation. All experiments (except when integrating external assessments) therefore assume that a reduced accuracy is due to search errors but not to modeling errors.

5.2.2 Criteria

This chapter mainly uses accuracy and efficiency (cf. Section 4.4) as evaluation criteria. Unless otherwise stated, average processing time per sentence on a 1.1GHz Pentium III running the Linux operating system will be reported as an efficiency measure. However, the focus will be on comparative results and care has been taken not to confuse results from different experimental settings, such as different machines or grammars. Parsing usually is limited to 60 seconds per utterance (cf. Section 4.11) unless otherwise stated. In most of the experiments, labeled accuracy on the syntax level is used as an accuracy measure. If, however, the experimental settings do not allow such a measure, we will report sentence accuracy, e.g., in Section 5.3, or f-measure (cf. Figure 5.1 on the next page), e.g., in Sections 5.6 and 5.9.

For an evaluation, the guesses $A$ of a solution procedure for a problem set can be compared to a reference $R$. Both $A$ and $R$ are subsets of the set of all possible outcomes $O$, which is the cross product of the problem instances and the problem solutions.

Two criteria are widely used for evaluation. Recall refers to the completeness of a solution, while precision means correctness of the provided solutions. A frequently used combination is the f-measure.

\[
\text{recall} = \frac{|A \cap R|}{|R|} \qquad
\text{precision} = \frac{|A \cap R|}{|A|} \qquad
\text{f-measure} = \frac{2 \cdot \text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}}
\]

It is clear that an ideal recall of one can easily be achieved by identifying $A$ with $O$. Analogously, a precision of one is always possible with $A = \emptyset$ (by definition). However, the goal is to achieve both a high recall and a high precision.

Figure 5.1: Recall and precision (the plot shows the f-measure as a function of recall and precision).
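These definitions translate directly into code. The following is a minimal sketch, assuming that guesses and reference are given as sets of (problem instance, solution) pairs; the function name and the conventions for empty sets are illustrative assumptions:

    def evaluate(guesses, reference):
        """Recall, precision and f-measure as defined above."""
        guesses, reference = set(guesses), set(reference)
        correct = guesses & reference
        recall = len(correct) / len(reference) if reference else 1.0
        precision = len(correct) / len(guesses) if guesses else 1.0
        f = (2 * recall * precision / (recall + precision)
             if recall + precision > 0 else 0.0)
        return recall, precision, f

    # toy example: one of two guessed solutions is correct,
    # one of three reference solutions is found
    print(evaluate({("s1", "a"), ("s1", "b")},
                   {("s1", "a"), ("s1", "c"), ("s1", "d")}))  # 0.33, 0.5, 0.4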

5.3 Robustness

This section looks at the property of robustness in greater detail and describes experiments which investigate the claim that weighted constraint dependency grammars can contribute to a robust system behavior.

5.3.1 Motivation

Although we mentioned robustness several times already (for instance, in Sections 3.5.2 and 3.5.4), no explicit definition has yet been given. A crisp or even formal definition for robustness seems difficult because the term inherently refers to the notion of unseen, unexpected and unpredictable events which are in turn difficult to grasp.

System robustness is sometimes defined as the

“... extent of the ability of a system [...] to continue to function despite the existence of faults in its component subsystems or parts.”

—— T1A1 [2000, system robustness]

This definition addresses an internal robustness where the faults occur internally to the system while the environment or type of input stays constant. In the context of natural language processing, however, robustness usually means robustness against changes of the external world:

£ A grammar writer cannot possibly anticipate the complete variability of future inputs, e.g., ungrammatical input or constructions that are not yet covered by the grammar. A robust system manages to deal with such cases (maybe with degraded performance). The rest of this section focuses on this aspect of robustness.

£ Some input sources for NLP systems are potentially faulty, e.g., recognition uncertainty in noisy environments for speech or blurred writing for optical character recognition. Section 5.6 deals with the additional ambiguity in the input of a parser when dealing with speech.

£ Sometimes additional external requirements are demanded dynamically from a system, e.g., temporal pressure or a request for correction or a more detailed analysis. Section 5.5 discusses how a component can react meaningfully to temporal requirements.

We follow Chanod [2001] in the particular respect that the scope of robustness is not limited to parsing of linguistic phenomena that are currently not covered by the grammar. This is in contrast to approaches that exclusively deal with the latter aspect under the term of extra-grammaticality [Weng 1993].

“Robustness is about exploring all constructions humans actually produce, be they grammatical, conformant to formal models, frequent or not. Linguistic phenomena, regardless of their oddity or frequency, account for the meaning of whatever segment of text they appear in. [...] In this view, robustness is a matter of breadth and depth of linguistic analysis. Altogether.” —— Chanod [2001, p. 188]

In particular, the following questions are investigated in this section:

£ What are the prerequisites for robust parsing?

£ Does the inclusion of graded preferences lead to increased robustness?

£ Can WCDGs cope with errors in the linguistic input in the sense that small deviations do not hamper the correctness of the output and only multiple faults lead to more and more undesired results?

5.3.2 Means to achieve robustness

At least two properties play an important role in robust system behavior: the ability to accommodate inconsistent evidence and the use of multiple knowledge sources.1

1. Inconsistencies are ubiquitous in natural language (cf. Introduction), not only in ungrammatical input, but also within the grammatical system. The soft constraints of WCDG allow the grammar writer to incorporate fall-backs for faulty input and uncertain knowledge as well as different kinds of preferences (cf. Section 3.5). All these conditions can be violated during the analysis of an utterance. Furthermore, the weights facilitate a fine-grained gradation so that constraint violations are treated differently depending on the severity of the error (a small sketch of this scoring scheme follows the list).

2. The second major reason for robust behavior besides weighted constraints results from the inclusion of (partially redundant) information originating from different knowledge sources, which is made possible by the simultaneous analysis on multiple levels (cf. Section 3.4.7). Figure 5.2 on the facing page gives an example of the fruitful mutual influence of different levels.2 On the syntax level, the utterance shows an ambiguity with regard to the attachment of the prepositional phrase. If it can be assumed, however, that the system has strong evidence for the domain information that only Peter has access to the telescope, this knowledge can propagate through the levels and finally determine the syntactic PP-attachment. The inclusion of redundant information is necessary to compensate for errors or unexpected cases. If only pre-defined, unambiguous and well-formed sentences are to be analyzed, additional levels such as domain information about ownership only lead to a greater depth of analysis but this could also be achieved by a sequential architecture.
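As mentioned in the first item, the graded assessment of constraint violations is what makes such fall-backs possible. The following minimal sketch illustrates the multiplicative combination of constraint weights; the Constraint class and the candidate structures are illustrative simplifications, not the actual parser data structures:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Constraint:
        name: str
        weight: float                     # 0.0 = hard, 0 < weight < 1 = soft
        holds: Callable[[object], bool]   # test applied to a candidate structure

    def score(structure, constraints):
        """Multiply the weights of all violated constraints.  A violated hard
        constraint drives the score to zero, a violated soft constraint only
        penalizes the candidate, so deviant input can still be analyzed."""
        s = 1.0
        for c in constraints:
            if not c.holds(structure):
                s *= c.weight
        return s

    def best_analysis(candidates, constraints):
        """Return the candidate structure with the highest combined score."""
        return max(candidates, key=lambda c: score(c, constraints))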

5.3.3 Experiments

In order to verify the claim that WCDGs are well suited for robust analysis, an experiment has been conducted [Menzel & Schröder 1998a] where both the error degree of the input and the amount of supportive evidence in the grammar have been artificially controlled. Syntactic errors have systematically been introduced into a single utterance (cf. Example E-68).

E-68 Der Mann$_{nom}$ besichtigt den Marktplatz$_{acc}$. The man visits the marketplace.

The following error metric has been used; a small illustrative sketch follows the list.

£ Case agreement: Confusing nominative and accusative case counts as a double fault for both the subject and the object while an erroneous genitive case is a single error. Dative case was omitted because it produces the same results as genitive.

1Chanod [2001, p. 196] speaks about inclusion of background knowledge and tractable constraint relaxation mechanisms in this context.
2Compare this to the use of semantic constraints proposed by Maruyama [1990a] and Harper et al. [1993] as described in Section 2.7.

Figure 5.2: Dependency structures for three levels of analysis of the sentence ‘Peter sees Mary with a telescope.’ (edge labels det, subj, obj, prep and pmod on SYNTAX, agent, patient and instrument on SEMANTICS, owner on DOMAIN): Certain information from the domain level propagates through semantics to the syntactic level.

£ Number agreement: Violating the number agreement between verb and subject again counts as two error points.

£ Word order: The word order (subject before object) is not mandatory although preferred in German. We therefore increase the error count by only one in the case of a marked word order.
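The announced sketch of the error metric, assuming each sentence variant is represented by a small feature dictionary (the feature names are illustrative):

    def error_count(variant):
        """Error points of one distorted variant of Example E-68:
        a nominative/accusative confusion costs two points per affected
        argument, an erroneous genitive one point, a verb-subject number
        mismatch two points and a marked word order one point, so the
        maximum error count is seven."""
        expected = {"subject": "nom", "object": "acc"}
        points = 0
        for arg in ("subject", "object"):
            case = variant[arg + "_case"]
            if case != expected[arg]:
                points += 1 if case == "gen" else 2
        if variant["subject_number"] != variant["verb_number"]:
            points += 2
        if variant["word_order"] == "marked":
            points += 1
        return points

    # the undistorted sentence receives zero error points
    print(error_count({"subject_case": "nom", "object_case": "acc",
                       "subject_number": "sg", "verb_number": "sg",
                       "word_order": "unmarked"}))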

The resulting 72 variants (with error counts from zero to seven) have then been analyzed in three different contexts where domain and semantic information either support, contradict or are neutral with respect to the desired outcome. Figure 5.3 on the following page summarizes the results. The use of semantic and domain-specific knowledge greatly enhances the syntactic robustness in the supportive case. While 50% of the results for sentences with an error measure of three were wrong when only a syntactic level was represented, even 83% of the utterances with an error count of five are interpreted correctly when enough semantic and domain-specific support is available. Naturally, the robustness is reduced if the additional evidence contradicts the intended interpretation. Although we emphasized the robustness against syntactic deviations, the robust behavior is symmetric with respect to the different levels. Thus, positive information on the syntactic level also helps to find a semantic interpretation. This can be best seen in the three-dimensional Figure 5.4 on page 137 where the accuracy is shown depending on syntactic and, at the same time, on semantic and domain-specific deviation. Whether semantics helps syntax or vice versa is just a matter of interpretation depending on the point of view. Different levels are mutually beneficial.

Figure 5.3: Percentage of correct analyses (sentence accuracy) depending on the number of syntactic errors (zero to seven) and, additionally, parameterized by the kind of semantic and domain-specific support (supporting, neutral or violating). The number of sentences for each number of syntactic errors (2, 6, 12, 16, 16, 12, 6, 2) is given above the bar chart.

Foth, Menzel & Schröder [submitted] replicated the experiment for the larger Stellingen grammar and real utterances that have been artificially distorted by changing the inflection of words, deleting a word from the sentence and reversing the word order of two siblings in a constituent. They confirm the increased robustness due to soft constraints.

However, a word of caution is appropriate here. The synergistic effect of different representational levels is, of course, highly dependent on the actual grammar. While the experiment described in this section has proven that truly redundant information from different knowledge sources can have an important beneficial impact on the robustness of the overall system, such behavior cannot be found for all kinds of grammars. The levels of syntax and semantics of the Stellingen grammar, for instance, are highly interwoven so that not all the levels sustain their independence. Additionally, the semantic evidence is very poor in the domain of the corpus used, i.e., appointment scheduling dialogues, since semantically weak verbs are used almost exclusively and domain information, such as information about the calendars of the participating dialogue partners, is completely omitted. When trying to analyze regular, i.e., largely grammatical sentences from that domain using the Stellingen grammar but switching off the semantic levels, one finds that the accuracy is not necessarily lower than with semantic levels. Table 5.1 on the facing page gives experimental results which confirm that the quality of a syntactic analysis even increases slightly for the Stellingen grammar when semantic levels are omitted from the analysis. It has to be assumed that in this case the robustness does not benefit from the semantic levels either. WCDGs with tightly coupled levels behave in this respect similarly to integrated approaches to syntax and semantics as found, for instance, in HPSG.

Of course, adding semantic levels always provides a richer or deeper analysis.

Figure 5.4: Percentage of correct analyses (sentence accuracy) depending on the number of syntactic errors (zero to seven) and on a semantic and domain-specific error count. A violation of sortal restrictions or domain information counts two error points while the absence of semantic and domain knowledge is counted as one error respectively. This results in semantic judgments in the range from zero to four error points. The different levels of representation are treated in a completely symmetrical way.

                        Tree search                 Heuristic search with tabu list
                        with sem.   without sem.    with sem.   without sem.
    Accuracy (all)      90.85%      93.35%          97.46%      97.65%
    Recall (all)        90.89%      93.44%          97.44%      97.65%
    F-measure (all)     90.87%      93.40%          97.45%      97.65%
    Accuracy (solved)   97.43%      97.29%          97.46%      97.65%
    Recall (solved)     97.48%      97.39%          97.44%      97.65%
    F-measure (solved)  97.46%      97.34%          97.45%      97.65%

Table 5.1: Accuracy, recall and f-measure (for the syntactic level) for experiments with and without semantic levels using the Stellingen grammar.

5.3.4 Related work

The problem of robustness in natural language processing is nowadays mainly tackled through partial parsing techniques [Abney 1996a] and probabilistic grammars [Charniak 1993; Young et al. 1997] in computational linguistics.

Finite-state automata and finite-state transducers [Abney 1990; Abney 1996a; Karttunen 1993; Karttunen et al. 1997; Karttunen, Gaál & Kempe 1997; Koskenniemi 1990; Koskenniemi, Tapanainen & Voutilainen 1992] recognize structures and transform them into larger ‘chunks’. These chunks often cover the input sentences only partially and are defined not to include recursive elements. Finite-state techniques process texts extremely fast because usually no backtracking is allowed.3

Shallow parsing [Federici, Montemagni & Pirelli 1996], in contrast, leaves certain decisions underspecified which are difficult to make locally, for instance the attachment of prepositional phrases. Early versions of the formalism of constraint grammar [Karlsson 1990; Karlsson et al. 1995], which is based on manual finite-state rules and aims at syntactic dependency structures, also omitted the precise attachment position in some cases by, for instance, only constraining the governor of a word to be a noun to the left. The successor of the constraint grammar parser for English (ENGCG) is not subject to this drawback; it establishes complete linkages between words [Tapanainen & Järvinen 1997; Järvinen & Tapanainen 1997].

Chart parsers [Earley 1970; Kay 1986] in bottom-up mode also lend themselves to partial parsing because partial structures are already contained in the chart when a complete analysis fails [Ehrlich & Hanrieder 1996].

Probabilistic approaches to NLP [Chelba & Jelinek 1998; Collins 1999; Charniak 1993; Magerman 1994; Ratnaparkhi 1998; Charniak 2000; Brill 1996; Bangalore & Joshi 1999] are blossoming because they overcome some of the intrinsic problems of manual and non-probabilistic grammars. They always yield the ‘most plausible’ result based on some kind of probability model. Hence they usually achieve complete disambiguation and feature a high degree of error tolerance. One of the most important advantages is that probabilistic grammars are usually automatically induced from (annotated) corpora, thus solving the problem of grammar acquisition and easing portability.

All of the above approaches share one drawback: They cannot tell how good an input utterance is. They hide the notions of deviation and grammatical error either behind too small a depth of analysis or behind the term of joint probability which does not allow a distinction between unlikely and deviant events. WCDGs, in contrast, have an explicit notion of grammatical rules and preferences which easily identify faults (cf. Section 5.4).

This capability is shared by some extensions of standard parsing which either employ a robust version of a grammar [Goeser 1992; Uszkoreit 1991; Erbach 1993] or augment the parsing procedure with additional operations such as generalized rules in a chart parser [Mellish 1989] or word deletions and insertions in a GLR parser for speech applications [Rosé & Lavie 1997; Rosé & Lavie 2001].

Using semantic and possibly domain-dependent information to improve parsing is a well-established technique. Head-driven phrase structure grammar [Pollard & Sag 1987; Pollard & Sag 1994, HPSG], for instance, tightly couples syntactic and semantic aspects in a single feature structure so that on the one hand constraints about the semantic type of complements are checked during structure building but on the other hand the slightest incompatibility leads to parsing failures. The consideration of semantic aspects within the CDG framework has already been suggested by Maruyama [1990b, p. 35] and also by Harper et al. [1993]. However, both only add semantic constraints which take effect directly on the syntax structure and cannot be violated by a valid solution. Thus semantic information is used to reduce the ambiguity, but no independent semantic structure is built. Autonomous parallel structures for syntax and semantics in the context of CDG have first been proposed by Menzel [1995].

3Instead, a repair mechanism is suggested in some cases [Abney 1990]. Note that the use of finite-state machinery does not automatically guarantee efficient algorithms. Two-level morphology [Koskenniemi 1983; Karttunen & Beesley 1992], for instance, is computationally expensive because there are choice points in the processing of the input, i.e., the underlying machinery needs to be non-deterministic [Barton, Berwick & Ristad 1987, Chapter 5].

5.3.5 Summary

We identified two prerequisites for robust behavior: (a) the ability to deal with inconsistencies between the input and the grammar and (b) the simultaneous consideration of all available knowledge sources so that the loss of syntactic evidence (possibly due to inconsistencies) does not lead to parsing failures. The potential of the WCDG parsing approach for robust parsing has been investigated in an artificial framework of controlled but highly systematic linguistic deviations. Even without any semantic support, the grammatical system showed a remarkable degree of robustness against syntactic errors. The inclusion of additional semantic and domain-specific support further boosted the immunity against syntactic faults. Not unexpectedly, semantic evidence that contradicts the desired outcome hampers the system with regard to robustness. We have found that the relation between syntactic and semantic evidence is one of mutual support, i.e., not only is the syntactic analysis easier with semantic backup but a semantically odd or new sentence can also be analyzed correctly when the syntactic evidence is clear. Such situations arise when new statements in a discourse violate the expectations or assumptions of a hearer. A replication of the experiment with a larger corpus is desirable. However, corpora of utterances which show systematic faults are not available, especially when the deviations have to be annotated. Foth, Menzel & Schröder [submitted] try to overcome this data acquisition problem by using a procedure which semi-automatically introduces a number of error types into a given corpus of sentences.

5.4 Error diagnosis

5.4.1 Motivation

Computer assisted language learning (CALL) systems should help the learner to practice the use of a foreign language and to overcome individual difficulties.4 Goals pursued include, for instance, vocabulary drill, practicing of isolated grammatical constructions and – more sophisticated – exercising free rephrasing of statements or interactive pronunciation training [Menzel et al. 2000; Menzel et al. 2001]. Off-the-shelf systems usually employ simple techniques such as multiple-choice, fill-in-the-gaps or other methods that only require a small list of valid answer keys. If a CALL system is supposed to give comprehensive and highly accurate feedback, however, a parsing component must be included which meets at least two requirements:

4Cf. European Association for Computer Assisted Language Learning (EUROCALL) (http://www.eurocall.org/) which, among others, publishes the ReCALL journal (partially available from their web page) and organizes the annual EUROCALL conferences.

Robustness: Since it must be expected that the learner enters highly deviating sentences, the system must be able to cope with such input and it must provide a reasonable interpretation even for non-standard utterances.

Diagnosis: A suitable parsing component must not only be capable of processing deviant input, it must also be able to provide error detection and error explanation facilities at the same time. Robust parsers based on statistical models or on neural networks, for instance, often cannot offer this additional meta-information because they lack an explicit notion of linguistic rules.

Good user feedback is only feasible if, on the one hand, the intended meaning of an utterance can be inferred and if, on the other hand, the location and type of mistakes can be determined.5 This section tries to answer the question of whether and how the WCDG parser can be used as a diagnosis component in a CALL system.

5.4.2 Prototype

WCDGs provide good support for both tasks that have been identified as prerequisites for advanced CALL systems. Robustness is an inherent property of WCDGs (cf. Section 5.3) and diagnosis is easily achieved with WCDGs because grammatical faults more or less directly correspond to constraint violations. Figure 5.5 on the next page presents the outline of an architecture for a CALL system based on a WCDG parsing component that yields two types of results, a linguistic structure and a set of constraint violations. A prototypical CALL system [Menzel & Schröder 1998b; Menzel & Schröder 1998a; Menzel & Schröder 1999] has been implemented based on this idea. A picture of a small town scene (cf. Figure 5.6a on the facing page) has been used as the contextual embedding for the user but a textual representation is equally possible. The contextual information has also been encoded as constraints so that the language system could use some domain information as background knowledge. Since grammatical and domain knowledge are processed with the same mechanisms of the parser, and constraint violations are used to indicate faults, not only syntactic and semantic errors but also errors regarding the propositional content of the context can be diagnosed. The following list from Menzel & Schröder [1998b] gives examples of student utterances and diagnoses that are reported by the system in the form of constraint violations at specific positions in the sentence:

£ Partial analysis: Utterances should be complete main clauses (cf. Section 3.5.4).

E-69 Das alte Rathaus. The old town hall.

⇒ Missing verb

£ Agreement: A noun agrees with its article and adjectives with respect to gender, number and case. In general, governors may impose certain restrictions on (morphological) features of their complements (cf. Section 3.4.1).

5Depending on the type of error detected, a good feedback probably should not list the errors of the user but instead give positive reinforcement or provide additional exercises.

Figure 5.5: Architecture of a CALL system using a WCDG grammar component. The user’s utterance is mapped to the exercise context and analyzed by the WCDG parser; the resulting solution candidate and constraint violations are passed to a feedback control component which returns feedback and a new exercise to the user.

(a) Pictorial embedding: picture of a small town scene (not reproduced here)

(b) Textual embedding: “A man visits a new town. He parks his car in the parking lot. The old church is located at the central marketplace. A butcher’s shop is next to the church and a bakery is opposite to the butcher.”

Figure 5.6: The exercise context was made accessible to the student using a picture. Different exercise types might alternatively (or even additionally) require a textual presentation (read to or by the student). In all cases, the relevant contained information must also be encoded in terms of constraints so that the diagnosis component can exploit the ‘common background’.

E-70 Die$_{fem}$ große Mann$_{masc}$ schläft. The large man sleeps.

⇒ Missing gender agreement for the article ‘die’

£ Word order: Word order rules should be observed; preferences can sometimes be violated and do not always count as errors (cf. Section 3.4.3).

E-71 Der Mann die Stadt besichtigt. The man the town visits. ‘The man visits the town.’

⇒ Wrong word order, verb not in second position

£ Auxiliary verb selection: Verbs determine which auxiliary verb, haben (have) or sein (be), is used for the perfect form.

E-72 Der Mann ist$_{sein}$ die Stadt besichtigt$_{haben}$. The man is the town visited. ‘The man has visited the town.’

⇒ Wrong auxiliary verb form

£ Case frame: The obligatory case frame of verbs should be filled (cf. Section 3.4.1).

E-73 Der Mann besichtigt. The man visits.

⇒ Missing second argument

£ Sortal restrictions: The semantic classes of arguments often need to conform with verbal requirements (cf. Section 3.4.6).

E-74 Die Stadt schläft. The town sleeps.

⇒ Violation of sortal restrictions for the first verb argument

£ Contextual restrictions: The student’s utterances should not be in conflict with the given contextual information. Such constraint violations may indicate that the student has comprehension problems reading an introductory text.

E-75 Die Bäckerei liegt neben der Fleischerei. The bakery is next to the butcher.

⇒ Propositional content not supported by the context

The detected faults must, of course, be mapped to an appropriate feedback for the user (cf. Figure 5.5 on the page before). Such a mapping may merge multiple errors into one explanation or drop faults altogether because no pedagogically suitable interaction is available.
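A minimal sketch of such a mapping step, assuming the parser reports each violation as a dictionary with a constraint identifier, a sentence position and the constraint weight, and assuming a small hand-written table of feedback templates (all names are illustrative, not the actual prototype code):

    # hypothetical table from constraint identifiers to pedagogical messages
    FEEDBACK = {
        "agreement_gender": "Check the gender of the article at position {pos}.",
        "verb_second":      "In a German main clause the finite verb comes second.",
        "case_frame":       "The verb at position {pos} needs a further argument.",
    }

    def feedback_for(violations, max_messages=2):
        """Turn constraint violations into user feedback: violations without
        a pedagogically suitable message are dropped, and only the most
        severe ones (smallest weight) are reported."""
        messages = []
        for v in sorted(violations, key=lambda v: v["weight"]):
            template = FEEDBACK.get(v["constraint"])
            if template:
                messages.append(template.format(pos=v["position"]))
            if len(messages) == max_messages:
                break
        return messages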

5.4.3 Related work

Surprisingly, the research communities of CALL and computational linguistics are largely disjoint although it seems natural for an advanced CALL system to include a sophisticated parsing component. This is due to the partially different focus. A CALL engineer has to take additional issues into account, such as a model of the student’s current state, pedagogically adequate error explanations, an inspiring user interface and a reasonable and motivating sequence of exercises. On the other hand, computational linguists have failed to come up with a high-precision, general-purpose parser that reliably diagnoses errors. Zock [1996] discusses the relation of CALL and computational linguistics.

The simplest CALL methods still widely in use are based on multiple choice and fill-in-the-blank. Diagnosis only involves a comparison with a stored set of answer keys or pattern matching techniques at most. Two techniques are widely used in advanced parsing components in CALL systems: error rules (generally called ‘bug libraries’) and constraint retraction. The former are overgenerating grammar rules that model typical cases of students’ faults [Weischedel, Voge & James 1978]. Since the robust behavior is not derived from a grammar that models correct language use, all possibly occurring errors have to be anticipated by the grammar writer [Yazdani 1986]. In contrast, constraint retraction [Uszkoreit 1991; Erbach 1993] starts from a correct grammar and softens certain constraints if no complete parse can be found. Often the two techniques are combined, for instance, error rules for word insertions, word deletions and errors of word order and constraint retraction for agreement errors [Schwind 1988; Schwind 1990; Schwind 1995].

A model-based diagnosis [Struss 1992; Davis 1994] that is solely based on descriptions of structure and behavior is difficult to transfer to a diagnosis of errors in natural language utterances since the identification of the internal structure of a sentence is part of the parsing process proper which must be guided by expectations about possible errors. Menzel [1988; 1990] demonstrates the general feasibility of the idea but his approach is limited to a small number of pre-defined structural possibilities. Self [1992] applies the same idea to the general problem of student modeling (cognitive diagnosis).

5.4.4 Summary

The WCDG parsing system has been used as a diagnosis component in a prototypical CALL application. Although no systematic evaluation in the form of a usability study has been conducted, the prototype proves the appropriateness of the approach for language learning exercises, at least in the limited framework that was used. Nevertheless, in order to extend the system to a full-featured CALL system, additional components, such as sophisticated exercise control, student models, tools and utilities for teachers and authors, etc. are required. A particularly helpful module is a correction facility which can generate a corrected version of the student’s input. Although generally adequate, the WCDG framework offers only little support for this task. A combination of a WCDG parser and model-based diagnosis [Menzel & Schröder 1998f; Menzel & Schröder 1998d] seems to be a promising approach. The model-based approach not only offers a mechanism of constructing a corrected version of deviant input but additionally extends the range of grammatical faults that can be diagnosed. It needs a structural hypothesis, however, which fortunately can be found by the WCDG parser.6

6Stockfleth [2000] demonstrates that such a combination is feasible and describes a prototypical implementation.

5.5 Anytime behavior

This section motivates why anytime behavior is important in the context of natural language analysis and investigates whether WCDG parsing obeys the anytime property (cf. Section 4.4.3).

5.5.1 Motivation

An application of natural language processing that interacts with a human user is a kind of real-time system since the pace of the dialogue is largely determined externally, i.e., by the user. Either the system manages to react within a limited amount of time or the user loses his patience and aborts the communication. The more natural such a conversation is supposed to be, the tighter the temporal margin within which the system must act. This is due to the turn-taking behavior of humans who do not pause after a contribution but usually immediately start to formulate an answer. Sometimes it is even appropriate to start one’s own contribution before the partner’s utterance has ended. However, such a behavior can only be reproduced when parsing occurs incrementally (cf. Section 4.12) and anytime algorithms are employed. The upcoming subsections present results from WCDG parsing experiments for two kinds of anytime behavior: intrinsic and extrinsic. Anytime behavior as an intrinsic property of the basic solution methods has already been briefly discussed in Chapter 4. An extrinsic anytime property is based on meta-control mechanisms which modify parameters of the basic algorithms so that they show the desired behavior.

5.5.2 Intrinsic anytime behavior

In Section 4.5.1, we realized that the heuristic search methods are generally well-suited for anytime requirements since they immediately start with a first (suboptimal) solution candidate. This has been confirmed with respect to WCDG parsing for the methods of Frobbing and Gls in Section 4.11 in so far as both are able to find a first approximate solution within a fraction of the time that they need to terminate. Also, following Freuder & Wallace [1992], we conjectured that branch-and-bound should be able to deliver an approximation early and to improve on this intermediate result when the score of a newly encountered candidate falls below the current bound. The fact that a first solution without violated hard constraints can be identified relatively soon (Figure 4.23 on page 111) is a first evidence that this property can actually be found in problems of natural language parsing. The rest of this section tries to support these claims experimentally.

Tree search

Figure 5.7 on the next page shows how the accuracy of intermediate solutions develops depending on processing time when using the systematic tree search method (cf. Section 4.8).7

7Wallace [1996] also suggests the distance of an intermediate solution candidate to the optimal solution as a measure for quality in anytime curves.

Figure 5.7: Accuracy (as percentage of correct links) of intermediate solutions depending on processing time for the solution method Netsearch (cf. Section 4.8). Problems of varying difficulty (as identified by the number of constraint values, cf. Section 4.11) have been distinguished.

It can in fact be confirmed that approximate solutions start to become available after a short period of time. Note however that an accuracy of some 50% after five seconds for the class of the hardest problems can mean quite different things: Either half of the problem instances have been solved correctly or all problem instances have solutions which are only 50% correct. Only the latter case can be counted as anytime behavior because only then a quality improvement over time can be assumed for individual problem instances.

    Time [s]              0.05   0.1   0.5    1     2     5    10    20    30    52
    Tree Search [%]       62.5  68.8  81.3  87.9  91.5  94.6  96.4  97.3  98.2  98.7
    Heuristic Search [%]  90.6  98.7   100   100   100   100   100   100   100   100

Table 5.2: Fraction of problems with soft solutions after a certain period of time.

Table 5.2 lists the percentage of problems for which a first solution is available after a certain period of time. Although a number of over 94.6% after five seconds may seem high at first glance, it also means that no solution has been found for the remaining 5.4%. It must be assumed that no real anytime behavior can be attributed to the tree search because no solution at all can be found for some problems even after a long time of computing.

To summarize: It has to be concluded that the predicted suitability of branch-and-bound for anytime behavior [Freuder & Wallace 1992] can only partially be confirmed for the problem of WCDG parsing. Two reasons can be identified:

1. The Netsearch procedure immediately terminates a branch in the search space when a hard constraint is violated. Hence, it can only deliver intermediate results that already satisfy all hard constraints. This is the main reason for the observed delay in producing a first solution.

2. Netsearch employs an optimized agenda strategy which is a mixture between depth-first and best-first (cf. Section 4.3.2). Different variants of agenda strategies are imaginable and the more it resembles depth-first, the earlier one can expect a first solution.

Heuristic search

We now turn to heuristic search variants and focus on the procedure Frobbing (cf. Section 4.9.1) which uses the min-conflicts heuristic and a tabu list. Figure 5.8 on the next page shows how the accuracy of the intermediate solutions develops depending on processing time when using the heuristic search. Note that the accuracy range in Figure 5.8 on the facing page differs from the one in Figure 5.7 on the page before. It can be clearly seen that almost immediately after the start of the procedure a first solution is available that already features a considerably high accuracy of over 80%. For smaller problems, the maximum accuracy is reached after less than 10 seconds while it seems to still improve after 60 seconds for the largest problems. Considering also the information in Table 5.2 on the preceding page, it can be concluded that the parsing procedure sustains the intrinsic property of anytime behavior typical for heuristic search methods. It should be noted, however, that there is still a lot of variance in the quality of a solution after a specific period of processing time. The performance profiles for WCDG parsing are instance-specific and a probabilistic performance profile can only be given for a corpus of utterances (cf. Section 4.4.3). Figure 5.9 on page 148 gives examples of the development of the solution quality over time for individual parsing problems. The problems in Figures 5.9a through 5.9c have been selected so that they have approximately the same number of constraint values and are, therefore, of comparable difficulty. Accuracy is given depending both on absolute time in a constant range of 10 seconds and on relative time, i.e., relative to the time needed until the particular parsing process terminates. For comparison, the score of the intermediate solution candidate is also shown. Note that a solution procedure has access only to the score but not the accuracy of intermediate solutions. The accuracies for sentence n002k015-1 in Figure 5.9a and sentence n004k023-1 in Figure 5.9b both show a similar run of the curves over relative time. However, the absolute runtime is quite different: The second example needs only one fifth of the time of the first example to find the solution although they seem to be of similar difficulty judging from the number of constraint values.

Figure 5.8: Accuracy (as percentage of correct links) of intermediate solutions depending on processing time for the solution method Frobbing (cf. Section 4.9.1). Problem classes of varying difficulty (as identified by the number of constraint values, cf. Section 4.11) have been distinguished.

Figure 5.9c on the following page shows that the quality of intermediate solutions can be quite irregular over time. That does, of course, not mean that the quality of a solution that is returned later must be worse than that of an earlier one: the search procedure can always remember good candidates and deliver the best intermediate solution found so far. It has to use the score as the only criterion, however, since the accuracy is not available to the search method.
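This bookkeeping can be pictured as a thin wrapper around any search procedure that emits intermediate candidates; the generator interface and the names are illustrative assumptions:

    import time

    def anytime_best(candidate_stream, score, deadline):
        """Keep the best-scoring candidate seen so far and return it when
        interrupted.  Only the score is available at runtime; the accuracy
        of a candidate is unknown to the search."""
        best, best_score = None, float("-inf")
        for candidate in candidate_stream:
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
            if time.monotonic() >= deadline:
                break    # interrupted: deliver the best intermediate solution
        return best, best_score

A caller would pass, for instance, deadline = time.monotonic() + 60 to reproduce the 60-second limit used in most of the experiments.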

Finally, Figure 5.9d on the next page shows an unrelated phenomenon that can easily be observed in some performance profiles: The solution method Frobbing obviously finds the same intermediate candidates multiple times as can be seen from the regular run of the curves to the right of the figure. Although a tabu list (cf. Section 4.9.1) is used to avoid short cycles, it cannot guarantee that the procedure does not return to previously visited candidates, a property that is typical for heuristic search variants.

To summarize: The solution method Frobbing shows a runtime characteristic that is typical for weak anytime algorithms. Approximate solutions are available almost immediately. Although probabilistic performance profiles (based on a given corpus) can be derived, the profiles of individual problems vary to a large degree.

Figure 5.9: Individual performance profiles (accuracy and score over absolute and relative processing time): (a) individual performance profile (n002k015-1, 1,012 values); (b) differing performance profile (n004k023-1, 1,129 values); (c) irregular performance profile (n005k016-2, 1,021 values); (d) performance profile showing cycles (n002k026-1, 290 values).

5.5.3 Extrinsic anytime behavior

A system with an extrinsic anytime property is characterized by a two-stage architecture. A meta-control mechanism supervises the underlying algorithm and controls its behavior, e.g., by adjusting parameters when some modification of the parsing process seems appropriate in order to achieve anytime behavior. Parameters can either belong to the basic algorithm itself or to the grammar employed.8 Intuitively one would think that the meta-control mechanism requires some leeway in the processing characteristics so that it can apply some kind of pressure to the underlying system to accelerate the computation.

8Given this admittedly vague definition, even the heuristic search variant guided local search (cf. Section 4.9.2), for instance, might be characterized as a two-stage architecture because a mechanism modifies the augmented assessment function and controls the basic local search. Actually, guided local search is often called a meta-heuristic [Voudouris 1998]. However, we will consider here only systems where the meta-control modifies the underlying process in order to achieve a better anytime behavior.

For WCDG parsing, we are willing to sacrifice parts of the robustness that the system shows in normal operation mode (cf. Section 5.3) in order to improve the efficiency of the system under temporal pressure. In this case, only the standard sentences will be parsed correctly; deviating utterances will not be analyzed accurately (or not at all) any longer. This behavior is consistent with human language understanding under difficult conditions, such as reading in a hurry or comprehension difficulties when listening to someone speaking in a foreign language. A series of experiments were run to verify whether the WCDG parsing system is suitable for such a trade-off where a speedup is traded against decreased robustness. An actual meta-control mechanism was not built because a component that dynamically adjusts the parameters of the system only makes sense when temporal pressure can be observed. This however is only feasible (or natural) in natural language parsing when the analysis works incrementally and the distance between the position of the currently analyzed word and the horizon of the language input becomes too large. Incremental analysis with WCDG, however, is still in its infancy (cf. Section 4.12).

For the systematic tree search procedure Netsearch two means of applying pressure to the system have been tried:

£ Tighter grammar: All constraints with a weight smaller than a given threshold are made hard, i.e., their weight is re-adjusted to zero (a small sketch of this step follows the list). This has two consequences:

1. All sentences that do not violate any of the constraints that encode knowledge about severe violations, i.e., constraints whose weight lies below the threshold, can be analyzed faster than before because deviating local alternatives can be discarded early during the analysis.

2. The system parses deviating utterances that violate one or more of the tightened constraints into a wrong structure or not at all because the correct analysis has been removed by the now hard constraint.

£ Pre-setting the bound: A pre-set bound value allows branch-and-bound search methods to prune parts of the search space early. In particular, those problems for which it takes a long time to find a first solution and therefore a first bound can benefit from such a step.
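The grammar-tightening step announced above amounts to a single pass over the constraint weights; a minimal sketch, assuming the grammar is available as a mapping from constraint names to weights (an illustrative representation):

    def tighten(weights, threshold):
        """Make every soft constraint whose weight lies below the threshold
        hard by setting its weight to zero.  Deviating local alternatives can
        then be pruned early, at the price of losing the correct analysis for
        utterances that actually violate one of the tightened constraints."""
        return {name: (0.0 if 0.0 < w < threshold else w)
                for name, w in weights.items()}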

Figure 5.10 on the following page relates a parameter, which represents either the threshold for grammar tightening or the pre-set bound, to the achieved accuracy and the computation time needed. First of all, the expected behavior that the quality decreases and the efficiency grows with an increased parameter can be observed for both variants of applying pressure.9

9Note that the results reported in this section largely depend on the grammar employed. A plot corresponding to Figure 5.10a on the next page (time and accuracy over the threshold, Netsearch (off, natural, 120s, 10k)) has also been produced using an old version of the Stellingen grammar and a subset of the utterances. Although the principal observations are the same, the effects are even more pronounced there. The main reason for these differences is the extended corpus of utterances.

(a) External time pressure through tightened grammar: time and accuracy over the threshold.

(b) External time pressure through pre-set bound: time and accuracy over the pressure.

Figure 5.10: Systematic tree search procedure (Netsearch, sq., prio, 120s, 10k, word graphs) under external time pressure, shown separately for problem classes of different numbers of constraint values.

But only the grammar threshold (cf. Figure 5.10a) shows the favorable behavior that the accuracy degrades to a lesser degree and later than the efficiency improves. The run of the curves as a hysteresis guarantees that a considerable speedup can be achieved at the cost of negligible losses in accuracy. The pre-set bound in Figure 5.10b does not show such a behavior.

For the method of Frobbing, an internal parameter determines how long detours may get during a local search. Figure 5.11 gives the achieved accuracy as well as the time needed depending on the applied pressure.

Figure 5.11: Heuristic search method with tabu list under external time pressure: time and accuracy over the applied pressure, shown separately for problem classes of different numbers of constraint values.

Obviously, a moderately increased pressure leads to a minor decrease in accuracy and almost no speedup for small problems. For the interesting cases of larger problems, however, a huge speedup and again almost no decrease in accuracy can be observed. This behavior is welcome since smaller problems are solved quickly anyway while large problems benefit most from the applied pressure. The outline of the curves again resembles a hysteresis. Since a suitable parameter is available that a meta-control can use to fine-tune the temporal behavior of the algorithm, the property of extrinsic anytime behavior can be confirmed for the heuristic search method when applied to the problem of WCDG parsing.

5.5.4 Related work

Various classification schemes for anytime algorithms have been proposed in the literature. Wahlster & Tack [1998] distinguish two types of resource-sensitive algorithms. Resource-adapted means that the reasoning about the strategy takes place during the design phase; the required resources are fixed at runtime. In contrast, a variable amount of resources is necessary for resource-adaptive and resource-adapting processes. The former embarks on a fixed processing strategy while the latter may change its processing regime during runtime. The classes of resource-adaptive and resource-adapting procedures can be roughly mapped to the interruptible and contract anytime algorithms (as defined by Russell & Zilberstein [1991]) respectively. However, contract algorithms need to know ahead of time how much time is available while resource-adapting methods can change their parameter during runtime. We add the terms of intrinsic and extrinsic anytime algorithms which again roughly correspond to the distinction between interruptible and contract algorithms but emphasize the separation between basic algorithm and meta-reasoning device. We call an algorithm an intrinsic anytime algorithm when the basic method can be interrupted at any time and still yields a reasonable result. In contrast, an extrinsic anytime algorithm offers parameters that enable an external component to control the trade-off between speed and quality. Parameters may have to be adjusted before the start of the algorithm or can be adapted during runtime. Figure 5.12 summarizes these classification schemes.

Figure 5.12: Classification of anytime algorithms. Wahlster & Tack [1998] divide resource-sensitive algorithms into resource-adapted ones (fixed resources) and those requiring variable resources, the latter into resource-adaptive (fixed strategy) and resource-adapting (meta reasoning) procedures; these roughly correspond to the interruptible and contract algorithms of Russell & Zilberstein [1991] and to the new terms intrinsic and extrinsic, respectively.

Menzel [1994] introduced the distinction between weak anytime algorithms whose performance profiles are instance-specific and strong anytime algorithms with generally valid performance profiles.

In the context of natural language processing (NLP), Wahlster [1993] proposes anytime algorithms for NLP modules in the speech-to-speech translation system Verbmobil.10 Relatively few references to the anytime property can be found for natural language parsing. Menzel [1994] suggests anytime algorithms in the context of CDG for the first time. However, he focuses on the view of CDG parsing as successive ambiguity reduction which facilitates an introspection of the remaining work load and therefore a goal-oriented scheduling of delayed constraint applications. Unfortunately, the corresponding techniques have proven not to be directly applicable for WCDG parsing. Görz & Kessler [1994a] investigate the potential of chart parsing techniques as anytime algorithms.11 However, their view is quite narrow; they only focus on possible points of interruption for a chart parser: after application of the fundamental rule, after a complete unification etc. One particular problem with their approach is that a chart parser obviously needs a hardly predictable amount of time before a first solution can be returned. Ehrlich & Hanrieder [1996, p. 29] claim that their island chart parser obeys the anytime property simply because it acknowledges a timeout. Their notion of anytime algorithms is of course only a superficial one since it is in no way guaranteed (or made plausible) that the (expected) quality of the procedure actually degrades gracefully under tightened timeout values.

The anytime property has been more widely investigated in the area of dialogue planning and natural language generation. The goal of the READY project [Wahlster et al. 1995] is to model the cognitive resources of the human dialogue partner. The cognitive capacity of humans during a natural language dialogue is modeled by a temporal Bayes network [Schäfer & Weyrath 1996]. Weis [1997] presents a framework for the evaluation of action plans and utterances which verbalize these plans. It is based on hierarchical planning using plan operators and can be used as a basis for resource-adaptive action and dialogue planning. The system is also presented from the point of view of user interface design [Jameson et al. 1999] with examples from a car repair scenario in which a naïve driver is to be instructed over a mobile phone [Weyrath 1997]. Wahlster et al. [1998] describe the system REAL which answers where-questions under time pressure, i.e., it must first locate the object under consideration and then verbalize the location in a suitable way. The system is time sensitive in two different ways: Firstly, it must take the resource limitation of the user and the situation into account, e.g., the proximity of the next crucial situation for a traffic information system in a fast-moving car, and secondly, the algorithms themselves should have the anytime property. An alternative approach to anytime dialogue planning is presented by Carletta [1992] who bases her computational model on extensive corpus analyses. Carletta, Caley & Isard [1993] try to simulate hesitations and self-repairs in human dialogues through anytime techniques. Donaldson & Cohen [1997] develop an anytime model for turn-taking in discourse. Interestingly, they use similar techniques (dynamic constraint satisfaction problems, local search) in their work.

10As far as we know the ideas were first mentioned by Wahlster [1992] at his invited lecture at COLING-92.
11Görz & Kessler [1994b] also give details about an underlying protocol for interfaces to anytime algorithms.

5.5.5 Summary

The potential of WCDG parsing for the anytime property has been experimentally investigated. Two types of anytime behavior are distinguished.

£ The intrinsic anytime behavior refers to the capability of the basic algorithm to suggest an approximate solution early and improve on that intermediate candidate if more time is available. For both search variants, tree search and heuristic search, this characteristic can be confirmed when a corpus as a whole is considered. Tree search, however, fails to provide approximate solutions soon for some very large problems and can therefore only be partially classified as an anytime algorithm. Heuristic search, in contrast, improves first candidates from the very beginning. Both algorithmic types show great variances over different problem instances so that they have to be classified as weak anytime algorithms.

£ The extrinsic anytime behavior refers to the availability of parameters in the basic algorithm that allow a meta-level method to control the trade-off between speed of computation and quality of results. Suitable parameters have been identified in the form of a threshold for grammar weights in the case of tree search and in the form of a limitation of the length of detours in the case of heuristic search. In contrast, a pre-set bound for the branch-and-bound method does not show the favorable behavior of large efficiency improvements being achieved by sacrificing only a small amount of accuracy.

5.6 Interfacing a WCDG with a speech recognizer

5.6.1 Motivation

Analyzing spoken language instead of written language is a challenging task because the acoustic ambiguity, i.e., the uncertainty which words have actually been uttered, is added to the already ambitious process of finding the underlying structure of a sequence of words. Figure 5.13 on the facing page shows the input and output of a typical speech recognizer in which a digitized waveform of the speech segment is analyzed into a weighted directed graph.12 This so-called word hypotheses graph consists of ordered vertices and edges which are labeled with words and weighted with acoustic scores.13 Overlapping edges represent alternative interpretations of the same speech segment. The cheapest path through the graph is the acoustically preferred sentence hypothesis. Word graphs are widely used because they allow for a compact representation of recognition ambiguities. Section 2.3.1 describes word graphs from the point of view of WCDGs. This section suggests an architecture for the integration of acoustic scores into the WCDG parsing system and describes results from a number of experiments in which different instances of such a combination are explored. The following questions are of particular interest:

£ Is it at all tractable that a WCDG parser processes realistic word graphs?

£ Can acoustic scores help the grammatical system to select the correct sentence structure?

£ Can a WCDG parser successfully re-rank the sentence hypotheses of a recognizer?

5.6.2 Combination of acoustic scores and constraint weights

Although the WCDG system uses word hypotheses graphs as standard input and can therefore be classified as a sequentially coupled system, we have so far ignored the question of how to combine acoustic scores in a word graph and constraint weights from a (manual) WCDG.

12State-of-the-art speech recognizers convert the digitized waveforms first into multi-dimensional feature vectors over a small window of samples and then compute the most probable paths through a large hidden Markov model that has words as states and outputs the observed feature vectors. The details of speech recognition systems are beyond the scope of this thesis [Rabiner 1990; O’Shaughnessy 1990; Eppinger & Herter 1993; Schukat-Talamazzini 1995].

13Typical speech recognizers nowadays employ two statistical models based on acoustic probabilities and sentence prior probabilities respectively. This is feasible because the probability $P(W \mid A)$ (where $W$ is a sequence of words and $A$ is a sequence of acoustic observations) that is to be maximized [Schukat-Talamazzini 1995, for instance] can be re-written as follows (using the Bayes rule):

\[
\operatorname*{argmax}_W P(W \mid A) \;=\; \operatorname*{argmax}_W \frac{P(A \mid W) \cdot P(W)}{P(A)} \;=\; \operatorname*{argmax}_W \underbrace{P(A \mid W)}_{\text{acoustic model}} \cdot \underbrace{P(W)}_{\text{language model}}
\]

The acoustic model contains information about the pronunciation of the basic unit (often phones) while the language model takes the word sequence information into account. The latter is often realized through an $n$-gram model, although it is of course generally agreed that an $n$-gram is too weak a model of natural language. Hence, it is not quite precise to talk about an acoustic score since partially grammatical evidence from the language model is included. To ease the presentation, we nevertheless use the term ‘acoustic score’ for the assessment of a hypothesis edge originating from a speech recognizer. Although possible in general, the speech recognizer used for the experiments in this section does not offer to entirely switch off the language model.

[Figure 5.13: Word graphs as an interface to a speech recognizer. (a) Input: waveform (air pressure vs. time in ms) of a speech signal. (b) Output: word hypotheses graph with the acoustically preferred path marked by black edges.]

In principle, one should be able to multiply the acoustic scores directly with the CDG weights since ideally the former are the combined transition and output probabilities of the Markov model. Two factors, however, argue against this:

£ Speech recognition systems often output scores that are no longer real probabilities. Due to different normalizations and simplifications, the scores should merely be used to rank the competing alternatives.

£ Acoustic scores are smaller than usual constraint weights by several orders of magnitude.14 This means that their scale usually does not fit the scale of a constraint system. While the difference between, say, two scores from the word graph in Figure 5.13b on the page before might be small enough for a speech recognizer to include both hypotheses for the same speech segment in a relatively small word graph, it is highly unlikely that a grammatical score resulting from constraint violations reaches the same numerical range.

We therefore tried different methods for the integration of acoustic scores.

Let b be the best and w the worst score in a word graph. An original acoustic score s is adjusted by applying a function f : [w, b] → [0, 1].

£ œ £ó Linear: Normalize the emitted scores in the range ò .

Linear adjustment

”€R÷

 1

ô ¯O

N — — 21 €R÷ 0.8 The graph to the right shows the transformed 0.6 0.4

scores of the example word graph. The adjusted score smaller acoustic scores are much smaller than 0.2 0 the largest one and as a consequence the ma- 1e-160 1e-130 1e-100 1e-70 1e-40 1e-10

jority of the values is very close to zero. acoustic score

£ œ £ó Logarithmic: Normalize the logarithms of the scores in the range ò .

Logarithmic adjustment

34

 ÷«O

G 1

N ®

354

ô ¯OS

N —

0.8

G ÷zO

N ®

0.6

This function distributes the acoustic scores 0.4

adjusted score œ £ó much more evenly over the interval ò . 0.2

Only the worst scores are close to zero. This 0 kind of adjustment is used in all subsequent 1e-160 1e-130 1e-100 1e-70 1e-40 1e-10 acoustic score experiments. The normalized word graph score is made available to the constraint system through the function acoustics() which has been added to the constraint language (cf. Section 2.5.3). Constraint C-

41 is a constraint template (with parameter T ) that rescales the acoustic score so that it always

TS )$ œ‹ó T falls into the interval ò . A small value for leads to a greater influence of the acoustic

score while a larger value T makes the inevitable constraint violations caused by the acoustics constraint less decisive. A value of one renders the constraint ineffective because the score of

the constraint becomes one in any case then.

ëè87 ÝWê96¯î C-41 {X} : Acoustics : [ 6 + *acoustics(X@id)] : X.level=SYNTAX -> false; Integrate the adjusted acoustic score. By using the truth constant false in the constraint body and restricting the application to a single level, it is guaranteed that every (adjusted) acoustic score is integrated exactly once.
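To make the adjustment and rescaling concrete, the following sketch illustrates the logarithmic normalization and the [α, 1] rescaling performed by Constraint C-41. It is an illustration only, written in Python rather than in the constraint language; it assumes probability-like recognizer scores where larger values are better, and the function names are not part of the WCDG system.

import math

def normalize_log(scores):
    # Map raw acoustic scores into [0, 1] by normalizing their logarithms
    # between the worst (w) and the best (b) score of the word graph.
    b, w = max(scores), min(scores)
    if b == w:                        # degenerate graph: all edges score equally
        return [1.0 for _ in scores]
    lb, lw = math.log(b), math.log(w)
    return [(math.log(s) - lw) / (lb - lw) for s in scores]

def c41_weight(normalized_score, alpha):
    # Score of Constraint C-41: alpha + (1 - alpha) * acoustics(), i.e. a value
    # in [alpha, 1]; alpha = 1 switches the acoustic influence off completely.
    return alpha + (1.0 - alpha) * normalized_score

# toy scores spanning many orders of magnitude, as in the example word graph
raw = [1e-160, 1e-70, 1e-40, 1e-10]
for s, n in zip(raw, normalize_log(raw)):
    print(s, round(n, 3), round(c41_weight(n, 0.9), 3))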

14 Often the acoustic scores ‘behave’ like negative logarithmic probabilities [Nöth & Plannerer 1993] to prevent arithmetic underflow in a computer program.

5.6.3 Reduction of word graph density

It is intuitively clear that (ceteris paribus) larger word graphs represent harder problem instances because the additional recognition uncertainty has to be compensated by the parser. A common measure for the size of word graphs is the word graph density, which is defined as the ratio between the number of hypothesis edges in the word graph and the number of words in the reference sentence. In general, a word graph with a higher density contains more alternative word sequences and thus a greater recognition uncertainty than one with a low density. Conversely, the probability that a word graph contains the reference utterance decreases when the density is low.15

In order to be able to systematically compare the efficiency and accuracy of the parser under varied word graph sizes, we created word graphs for a small corpus of 25 test sentences that were taken from the Stellingen corpus (cf. Section 5.2.1) and are thus covered by the grammar. The Verbmobil [Wahlster 2000] recognizer for German [Waibel et al. 2000] was configured to output word graphs with an average density of 10, and the manually segmented utterances were read rather than spontaneously spoken. The resulting word graphs have an actual mean density of 8.5 (sd 4.9). It is quite likely (although unknown to the author, cf. Section 5.6.5) that the sentences used (the word sequence, not the waveform) are contained in the training material of the speech recognizer since both are taken from the standard Verbmobil corpus. This, however, does not constitute a serious problem here because we are only interested in the impact of using word graphs (of different size) on our parser, not in a general evaluation of the speech recognizer or the overall system.

For each of the initial word graphs a sequence of smaller and smaller word graphs is created where the less dense word graphs are subgraphs of the larger ones. Starting from the maximum word graphs, the density has been successively reduced by removing word hypotheses with a low

acoustic score. The employed algorithm (cf. Figure 5.14 on the next page) uses the following utility functions on word graphs (cf. Definition 2.3 on page 22): f (forward score), b (backward score) and p (best path score).16

f(e) = \begin{cases} 0 & \text{if } e \text{ ends at the final node of the word graph} \\ \min_{e':\, \mathrm{start}(e') = \mathrm{end}(e)} \big( a(e') + f(e') \big) & \text{otherwise} \end{cases}

b(e) = \begin{cases} 0 & \text{if } e \text{ starts at the initial node of the word graph} \\ \min_{e':\, \mathrm{end}(e') = \mathrm{start}(e)} \big( a(e') + b(e') \big) & \text{otherwise} \end{cases}

p(e) = b(e) + a(e) + f(e)

Here a(e) denotes the acoustic score of edge e. The forward score f(e) is the accumulated score of the cheapest path, i.e., minimal with regard to the acoustic weight, starting at the end node of edge e and ending at the end of the word graph. Analogously, the backward score b(e) is the accumulated score of the cheapest path from the start of the word graph to the start node of the edge, and the best path score p(e) is the accumulated score of the minimal path that goes through edge e. The idea behind this definition is that the absolute weight of an edge is rather inadequate for assessing its quality; the best path score is better suited since it incorporates the topological context of the edge.

15 Amtrup [1999, p. 38] suggests a different measure for the size of a word graph that is based on the number of paths in the word graph. In general, the two measures correlate to quite some degree.
16 Sixtus & Ortmanns [1999] use similar definitions.
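The recursive definitions translate into a simple dynamic program over the ordered vertices of the word graph. The following sketch illustrates this in Python; it is not the actual WCDG implementation, the tuple-based edge representation is an assumption, and the acoustic score is treated as an additive cost (smaller is better) in line with the definitions above.

from collections import defaultdict

def best_path_scores(edges, first_vertex, last_vertex):
    # edges are (start, end, word, acoustic_cost) tuples over ordered vertices;
    # returns p(e) = b(e) + a(e) + f(e) for every edge.
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for e in edges:
        out_edges[e[0]].append(e)
        in_edges[e[1]].append(e)
    f, b = {}, {}
    # f(e): cheapest continuation from the end node of e to the end of the graph
    for e in sorted(edges, key=lambda e: -e[0]):
        f[e] = 0.0 if e[1] == last_vertex else \
            min((s[3] + f[s] for s in out_edges[e[1]]), default=float("inf"))
    # b(e): cheapest path from the start of the graph to the start node of e
    for e in sorted(edges, key=lambda e: e[1]):
        b[e] = 0.0 if e[0] == first_vertex else \
            min((p[3] + b[p] for p in in_edges[e[0]]), default=float("inf"))
    return {e: b[e] + e[3] + f[e] for e in edges}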

1 procedure ReduceWordGraphDensity(G, R, M)
2   ; G word graph, R reference, M maximum density
3   while |E(G)| / |R| > M do                                   ; small enough?
4     set D := { e ∈ E(G) : p(e) is the worst best path score }  ; edges to be removed
5     set E(G) := E(G) \ D
6     remove dangling edges from G
7   done

Figure 5.14: Reduction of word graph density in pseudo code.

Note that it is not always possible for the algorithm to return a word graph with the exact requested density because the removal of one hypothesis may trigger the deletion of additional edges that are not members of a complete path anymore (‘loose ends’ in the graph). This post-processing is not included in Figure 5.14. Figure 5.15 on the facing page shows a selection of reduced word graphs that are derived from the word graph in Figure 5.13b on page 155. The word graph in Figure 5.15a on the facing page, for instance, has a density smaller than the requested value of four due to the optimization described above.
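Building on the previous sketch, the density reduction of Figure 5.14 can be written as the following loop (again illustrative Python, not the actual implementation). The tie handling, the dangling-edge cleanup and the assumption that the accumulated acoustic score behaves like a cost (larger = worse) are simplifications of this sketch.

def reduce_density(edges, first_vertex, last_vertex, reference_length, max_density):
    # Remove the edge(s) with the worst best path score until the density
    # (#edges / #words in the reference) no longer exceeds max_density.
    edges = set(edges)
    while len(edges) / reference_length > max_density:
        p = best_path_scores(edges, first_vertex, last_vertex)
        worst = max(p.values())        # largest accumulated cost = worst edge
        edges -= {e for e in edges if p[e] == worst}
        edges = prune_dangling(edges, first_vertex, last_vertex)
    return edges

def prune_dangling(edges, first_vertex, last_vertex):
    # Drop 'loose ends': edges that are no longer part of a complete path.
    changed = True
    while changed:
        starts = {e[0] for e in edges} | {last_vertex}
        ends = {e[1] for e in edges} | {first_vertex}
        kept = {e for e in edges if e[1] in starts and e[0] in ends}
        changed = kept != edges
        edges = kept
    return edges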

Of the 25 word graphs with minimal density of (up to) one, 64% already contain the reference utterance, i.e., the sentence that was read to the speech recognizer. This figure increases to 80% for densities of up to five where it reaches its maximum; conversely, 20% of the utterances are not contained in any of the corresponding word graphs. The relatively high sentence accuracy means that the recognizer would return the correct sentence in 64% of the cases when asked for the best sentence hypothesis (i.e., a linear word graph). It should be noted that even an ideal parser can only improve the sentence accuracy from 64% to 80% at best unless it adds new arcs to the word graph. This relatively small margin means that we should not expect too big an improvement with respect to sentence accuracy from an integration of a parser, i.e., it is difficult for a parser to correct recognition errors.

5.6.4 Impact of recognition uncertainty on parsing

This section describes some experimental findings regarding efficiency and accuracy using the word graphs with increasing density.

The move from smaller word graphs to larger ones has two contrary effects with respect to parsing accuracy.

£ On the one hand, larger word graphs contain more alternatives and are therefore more difficult to analyze. Multiple word sequences that differ considerably from the reference sentence may be grammatical and some may even be preferred by the grammar. This tendency leads to a decreased accuracy.

[Figure 5.15: Word graphs of different densities. Larger word graphs contain more acoustic alternatives. (a) Density 3.857143 (27/7, requested 4). (b) Density 2.0 (14/7). (c) Density 1.0 (7/7).]

£ On the other hand, the more alternative paths in a word graph, the higher the probability that the correct one is included (which is the motivation for using a word graph with ambiguities in the first place).

Table 5.3 on the next page gives the f-measure values for different acoustics integration schemes, different word graph densities and two solution methods.


Acoustics              α = 0.1                     α = 0.9                     no integration
Density        all     1     5  max.       all     1     5  max.       all     1     5  max.
# Problems     182    25    17    25       182    25    17    25       182    25    17    25
# TS Sol.      104    21     8    13       128    24    10    16       128    24    10    16
# HS Sol.      182    25    17    25       182    25    17    25       182    25    17    25
Time TS [s]     70    29   104    50        77    10    91    90        77     9    82    86
Time HS [s]    160    42   199   177       116    19   132   138       114    12   121   146
TS Sol.  SYN  84.7  92.7  77.4  77.8      76.6  91.4  59.1  76.8      73.9  91.4  54.9  75.9
         SEM  88.5  95.1  85.6  83.8      81.4  92.8  73.0  77.8      78.2  92.8  68.5  76.6
         TOT  89.0  94.8  86.0  85.6      82.7  92.5  74.4  80.9      79.9  92.5  69.9  79.7
TS All   SYN  48.4  77.8  36.4  40.5      53.8  87.8  34.8  49.1      52.0  87.8  32.3  48.6
         SEM  50.6  79.9  40.3  43.6      57.3  89.1  43.0  49.8      55.0  89.1  40.3  49.0
         TOT  50.8  79.6  40.5  44.5      58.2  88.8  43.8  51.8      56.2  88.8  41.1  51.0
HS       SYN  66.3  90.8  62.9  56.7      68.6  89.6  61.8  66.9      66.3  89.9  55.5  63.7
         SEM  77.5  91.8  79.2  70.5      77.9  90.9  76.2  74.9      75.4  90.9  72.5  72.9
         TOT  77.8  92.3  78.5  71.7      78.4  91.3  76.1  76.8      75.9  91.4  71.2  74.3

Table 5.3: Parsing f-measure (in percent) and average computation time for experiments with

varying degrees of integration of acoustic scores. Values of α refer to Constraint C-41 on page 156. For densities, ‘all’ means the mean over all problem instances; 1 and 5 mean the average f-measure of those word graphs with requested density of 1 and 5 respectively; ‘max.’ means the problem instances with maximum density per utterance (which may be smaller than 5). Note that different numbers of problem instances are available for the different word graph densities (in particular for density 5). ‘Time’ is the average computation time needed per solved instance. ‘TS’ stands for systematic tree search, ‘HS’ for heuristic search with tabu list. ‘All’ means f-measure relative to all instances, ‘Sol.’ means f-measure relative to the solved instances only. SYN is the f-measure for syntax edges only. SEM is the mean f-measure of all semantic edges (AGENT, EXP, PATIENT, QUALITY and THEME). TOT is the mean f-measure of all edges on all levels.

The following observations can be made:

£ œ$? The integration of acoustic scores with the parameter T set to leads almost always to worse results compared to the lack of integration. Obviously, the acoustic scores get too

large weights in this case leading to neglect of grammatical information.

£ –œ$o¥ A parameterization with T is (almost) consistently better than no integration. Acoustics and grammar are better balanced in this case.

£ Significantly better f-measure values are achieved for minimal word graphs (with a density of one) compared to those with a density of five or maximum density. This result is somewhat disappointing since we had hoped that the combination of acoustics and grammar would not only reproduce the result from a sequential architecture but improve on it. In an experiment of re-ranking paths in word graphs using a CDG with hard constraints only, Harper et al. [1999a] could achieve an increase in sentence accuracy from 90% to 94% when using a random sample of word graphs and from 0% to 37% when using only those word graphs that contain the target sentence but not as the highest ranked one.

£ Since some word graphs have a maximum density smaller than five, only a reduced number of utterances are taken into account for the columns with density 5 which is why values for density 5 often do not lie between those for density 1 and maximum density.

£ Although the semantic levels SEM score higher values than the syntactic level SYN and slightly lower values than all levels together, no principal difference in f-measure can be found for the different levels.

£ Larger word graphs take more than linearly longer to parse than smaller ones except when using a parameter setting of α = 0.1. This can best be seen when comparing the average computation time for densities one and five. The word graphs are five times greater (in terms of word edges) but the procedures only need 3.59 to 4.74 times longer for α = 0.1. For α = 0.9 the tree search method requires 9.1 times longer while the heuristic search needs 6.95 times longer. This observation is consistent with our findings in Section 4.11 that the heuristic search methods scale better for large problems. Heuristic search takes 10.1 times longer for word graphs of density five when no integration of acoustic scores is done (compared to 6.95 times with α = 0.9), i.e., there is a relative speedup for larger problems when integrating the acoustic score. In absolute numbers, the heuristic search still takes slightly longer with acoustic integration than without in most cases.

£ Finally, it can be seen that the heuristic search achieves a higher f-measure when all problem instances are considered. Again the reason is that the transformation-based search can always return an analysis while the tree search fails to do so sometimes. For those cases where a solution could be found, the tree search scores higher f-measure values (but only because it omits the difficult instances).

It should be noted that Table 5.3 on the facing page only gives average numbers; individual problem instances may vary considerably. Figure 5.16 on the next page shows parsing time and f-measure against word graph density for an individual non-representative sentence (n001k011-1 from the Stellingen corpus). Here the integration not only leads to a much better f-measure but also achieves a considerable speedup. While the time needed for the analysis seems to ‘explode’ when not considering the acoustic scores, only a moderate increase in parsing time can be observed with acoustic scores integrated.

5.6.5 Evaluation with unknown sentences

It has already been noted that 64% of the minimal word graphs contain (‘are’) the reference utterance and that this figure increases to 80% for a density of five. Although this leaves some room for improvement for the parser, the baseline is remarkably high. Reasons probably include the following:

£ All utterances in our corpus have been manually split into ‘sentence-like’ segments. Consequently, they are much shorter (namely, 8 words on average per utterance) than the average turn length of 22 words per turn [Niemann et al. 1998, p. 2] in Verbmobil.17 Shorter utterances, of course, have a much greater probability of being returned as the best sentence hypothesis by a speech recognizer.

17 They are still longer than the utterances of the ATIS (Air Travel Information System) corpus [Marcus, Santorini & Marcinkiewicz 1993].

[Figure 5.16: Parsing time and f-measure depending on the word graph density for sentence n001k011-1. The plot shows time [ms] and f-measure [%] against word graph density for the settings α = 0.9 and α = 1.0.]

£ In our experimental setting, the sentences have been read instead of spontaneously spoken. Additionally, no context neighboring the sentence was used and considerable effort has been spent to avoid non-speech noise.

£ The sentences used in the experiment are taken from the first parts of the Verbmobil corpus. Therefore, it is highly likely that the employed speech recognizer had already incorporated them into its language model.

At least the last point seems counterproductive for our purpose if we want to verify whether the additional grammatical information used by the parser can help to correct errors made by the speech recognizer. If the recognizer already has access to that knowledge, the effect cannot be as pronounced as in the case where new knowledge is used. Therefore, a second experiment has been conducted where 50 additional utterances that are not included in the language model18 have been analyzed with the same recognizer which, however, has been configured for an even higher word graph density of 30. The resulting word graphs have an average density of 24.7 (sd 22.6). In 31 of the 50 cases, the reference utterance is contained as the acoustically preferred hypothesis, i.e., the sentence accuracy is almost as high (62% compared to 64%) as in the previous experiment. Furthermore, the reference is identical to

18 More specifically, the first 50 utterances (manually segmented) from dialogue g460ac that is part of the official test set of the speech recognizer (Hagen Soltau, personal communication) have been used in this experiment.

the best hypothesis in three more cases if one ignores minor spelling variations (unsere → unsre, ganze → Ganze, ...) resulting in a sentence accuracy of 68%. Seven word graphs do not contain the reference and for the remaining nine instances the reference is contained in the word graphs but not as the best hypothesis. An ideal parser can, therefore, be expected to improve the sentence accuracy from 68% to at most 86%. However, this is of course a theoretical upper bound. It is also likely that grammatical sentences in the larger word graphs that are different from the reference will be selected by the parser. In order to get a more realistic upper bound of what our parser can achieve on the data, the new lexical items and grammatical constructions from those nine word graphs which contain the reference as a suboptimal path have been added to the grammar and the parser was run on them. Note that this experiment is too optimistic since only those sentences are considered that bear some potential for improvement. It is not at all easy to avoid errors in the other cases.

parser runs (tree search vs. heuristic search, different acoustics integration, TR̝$o¥ ). The values for the best (acoustic) hypothesis is also given for comparison.

Word graph   Density   Length   Parser Accuracy   Parser Recall    Best hyp. Accuracy   Best hyp. Recall
24           21        8        46.2 - 66.7%      75.0%            66.7%                75.0%
25           12        6        50.0%             50.0%            83.3%                83.3%
28            7        11       90.0 - 90.9%      81.8 - 90.9%     90.9%                90.9%
29           23        8        37.5 - 55.6%      37.5 - 62.5%     71.4%                62.5%
36            3        6        66.6 - 83.3%      66.6 - 83.3%     83.3%                83.3%
40            2        6        83.3 - 100.0%     83.3 - 100.0%    83.3%                83.3%
41            6        8        71.4 - 85.7%      62.5 - 75.0%     85.7%                75.0%
45            2        12       90.9%             83.3%            80.0%                66.7%
49            4        5        75.0%             60.0%            100.0%               60.0%

Table 5.4: Word accuracy and recall for unknown utterances.
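Footnote 19 states that word accuracy and recall are based on a standard Levenshtein distance with equal weights for substitutions, deletions and insertions. One common way to obtain such figures is sketched below in Python; the exact definitions of accuracy (insertions count as errors) and recall (fraction of reference words recovered) are assumptions of this sketch, not a specification of how the values in Table 5.4 were produced.

def levenshtein_counts(reference, hypothesis):
    # Align two word sequences with unit costs and return the numbers of
    # substitutions, deletions and insertions of a cheapest alignment.
    n, m = len(reference), len(hypothesis)
    d = [[i + j if i == 0 or j == 0 else 0 for j in range(m + 1)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(diag, d[i - 1][j] + 1, d[i][j - 1] + 1)
    i, j, subs, dels, ins = n, m, 0, 0, 0
    while i > 0 or j > 0:              # backtrace to count the edit operations
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1]):
            subs += reference[i - 1] != hypothesis[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return subs, dels, ins

def word_accuracy_and_recall(reference, hypothesis):
    subs, dels, ins = levenshtein_counts(reference, hypothesis)
    n = len(reference)
    return (n - subs - dels - ins) / n, (n - subs - dels) / n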

The outcome is not too promising. Although there are some cases where the parser improves the suboptimal best hypothesis of the speech recognizer, the general conclusion is that it is still better to use the best hypothesis than to operate on larger word graphs. Reasons for this behavior include the following.

£ The quality of the produced word graphs is still very high so that the improvement that a parser can even potentially achieve is limited. Amtrup, Heine & Jost [1997] report word accuracies for speech recognizers on the Verbmobil corpus of 70% to 84% for word graph densities of one which go up to 96% for densities of ten (and larger) while the word graphs used in the experiments of this section showed a sentence accuracy of up to 68%. The question of whether the WCDG parser can yield more favorable results when using more realistic word graphs is subject to further research. But it must be taken into account

19 Word accuracy and recall are based on a standard Levenshtein distance with equal weights for substitutions, deletions and insertions.

that favorable results can be achieved with word graphs that are similar in quality to ours: The word graphs used in Harper et al. [1999a, p. 3] contain the correct sentence in 98% of the cases (compared to 80% in our case) and in 83% (compared to 64%) the acoustically preferred path is the correct sentence.

£ Some of the parser failures are due to search errors rather than modeling errors. That means that if we were not limiting the computation time, the correct solution would eventually emerge from the process.

£ The interface between acoustics and grammar is still naïve. More work must go into fine-tuning the interface.

£ Most importantly, the parser and the grammar have not been designed with speech parsing in mind but were written for the analysis of linear sentences (cf. Section 5.2.1). In particular, phenomena of spontaneous speech (such as self-corrections, hesitations, repetitions and sentence abortions) and word recognition errors (for instance, wir → sie, sie → die) have not been taken into account when modeling the grammar.20 Therefore no constraints exist that discriminate between different words that are temporal alternatives.

5.6.6 Related work

Harper et al. [1994] discuss different degrees of tightness in coupling a speech recognizer with a parsing component. Tight integration means that the individual processes and knowledge bases cannot be separated while the latter are relatively independent in a loosely integrated system. In a tightly coupled architecture, recognition and parsing are interleaved so that the predictions of the language component can complement the acoustic scores in the search for the best sequence

of words. A language model based on n-grams is still often used in tightly and moderately coupled architectures, but lately alternative approaches have received more attention, for instance, filtering of bi-gram follow-sets using unification-based CFG [Hauenstein & Weber 1994] as well as direct use of stochastic CFG [Jurafsky et al. 1995] or headword dependencies [Chelba & Jelinek 1998]. Tight (and partially moderate) integration, however, turns out to be difficult for at least three reasons:

£ The complexity of the parser is incorporated into the simpler (with regard to complexity) stage of recognition. This can lead to efficiency problems.

£ It is not always clear which probabilities to use for the language part and how to train them.

£ Only a subset of grammar models and parsing strategies are appropriate for the task since they must operate in a time-synchronous left-to-right manner. In particular, they must be able to assign probabilities to prefix analyses.

A loose coupling is achieved when the recognizer passes a space of hypotheses (in the form of a word graph) to the parser and the latter chooses the best one or acceptable ones from this set.

20 The parsers in van Noord [2001] and Rosé & Lavie [2001], for instance, explicitly deal with speech phenomena. Heeman [2001] argues for speech recognizers that model spontaneous speech events and pass output to a parser that is enriched with information about such events.

Chappelier et al. [1999] describe a general framework for such a sequential coupling. Note that this more closely resembles the standard view of sequential processing stages where syntactic parsing strictly follows speech analysis without feedback [Amtrup & Weber 1998; Amtrup 1999; Amtrup 2000; Rosé & Waibel 1996; Rosé & Lavie 2001; Bod 1998]. It sometimes seems to be more a question of personal research preferences of whether one considers a parser a more sophisticated form of language model for a speech recognizer or a speech recognizer a dedicated device that delivers (ambiguous and possibly erroneous) input for a parser. Harper et al. [1999a] report that a loose coupling of a speech recognizer [Harper et al. 1994] and a manual CDG (with only hard constraints) results in an improvement in sentence accuracy. Using an automatically induced CDG [Harper, Hockema & White 1999; White 2000] further boosts the selection power [Harper et al. 2000]. Also see Johnson [2000, Sections 5.5 and 5.6] for experiments with a hand-written as well as induced CDG grammars. In addition to the loose-tight distinction, coupled recognizer-parser combinations can be distin- guished depending on whether they offset the acoustic with the grammatical scores. Systems where the acoustic weight is actually combined with a grammar score of some kind are difficult to find. Ehrlich & Hanrieder [1996, p. 31] combine the acoustic assessment with a score that describes how well a hypothesis fits pragmatic predictions based on dialogue knowledge [Görz & Hanrieder 1995]. Multidimensional weights consisting of, for instance, acoustic score, number of

skip edges, number of maximum projections and n-gram statistics are suggested by van Noord [2001] along with appropriately defined update functions and orderings. However, instead of a uniform framework a cascaded decision procedure is used.

5.6.7 Summary

For the first time an interface between a speech recognizer and the WCDG parser has been designed and experimentally investigated. Although the time needed for analysis increased more than linearly in the size of the word graph, the parser nevertheless succeeded in parsing word graphs with densities ranging from 3 to 18.25. With up to 184 arcs per word graph, the number of word arcs increased considerably compared to the case of linear sentences. The results suggest that the parser benefits from the integration of acoustic scores from the speech recognizer, independently of the employed solution method. However, as can be expected from the results in Section 4.11, the heuristic search component is superior to the tree search method for large word graphs. At least in the limited setting that has been assumed in these experiments, it was not possible for the parser operating on word graphs to reach the sentence accuracy of the acoustically preferred path. For the time being, it is still better to let the parser run on the best hypothesis of the speech recognizer and ignore the ambiguity contained in the word graph. This last result should be seen in the context of the favorable results that Harper et al. [2000] achieved using a word graph filtering method based on a hard CDG. A direct comparison of the two approaches is not feasible for several reasons:

£ The corpora differ. While Harper et al. [2000] use the (extended) resource management corpus, a subset of the Verbmobil corpus of appointment-scheduling dialogues has been used in this section.

£ The speech recognizers differ. In particular, they yield different accuracies and varying word graph densities.

£ The evaluation metrics differ. While we measured parsing accuracy on a completely disambiguated sentence, Harper et al. [2000] report sentence accuracies on the reduced word graphs.

Although these differences render the experiments hardly comparable, a closer investigation is subject to further research.

5.7 Integrating a part-of-speech tagger

5.7.1 Motivation

Part-of-speech (POS) taggers assign one or more category tags to words in a sentence. Very different architectures for POS taggers have been derived from a manifold set of frameworks: manual finite state rules [Voutilainen 1995], neural networks [Schmid 1994a], decision trees [Magerman 1995; Schmid 1994b], Markov models [Rabiner 1990; Cutting et al. 1992; Charniak 1993; DeRose 1988], maximum entropy models [Ratnaparkhi 1997b; Ratnaparkhi 1996], error-driven learning of transformation rules [Brill 1993; Brill 1995; Brill 1997; Ngai & Florian 2001] and example-based reasoning [Daelemans et al. 1996; Zavrel & Daelemans 1999]. Implementations for these architectures exist that achieve an accuracy of at least 95% on previously unseen data.

A subsequent parser can benefit from a tagger because either the former can then operate directly on a sequence of category symbols instead of ambiguous word forms or it can at least prune its search space by deleting hypotheses that are incompatible with the categories predicted by the tagger.

This section describes results from experiments with the integration of a part-of-speech tagger into the WCDG parsing system. In particular, we are interested in an interface which takes into account that the tagger prediction may be uncertain or even wrong in some cases and which therefore treats the tagging classification as ‘soft’ information. The following questions are of particular interest:

£ Can a tagger reduce the ambiguity so that the parsing process itself becomes faster?

£ Is the information provided by a tagger accurate enough to improve the parsing accuracy?

£ What are upper bounds for efficiency and accuracy improvements?

£ Does it help the parser when the tagger emits more than one symbol per word if the decision is unclear?

Tagging word graphs is more complicated than tagging word sequences [Samuelsson 1997a; Samuelsson 1997b]. Unfortunately, no taggers for word graphs are available so that we restricted the experiments in this section to linear sequences of possibly ambiguous words.

5.7.2 Experimental setup

The TnT tagger [Brants 2000], a state-of-the-art Markov tagger, has been deployed for all experiments. The tagger was trained on the NEGRA corpus version 2 [Skut et al. 1997b; Brants, Skut & Uszkoreit 1999] which uses the Stuttgart-Tübingen tagset (STTS) [Schiller et al. 1995]. The category inventory of the Stellingen grammar and lexicon is based on the STTS, but departs from it in some details so that a mapping of tags becomes necessary. Some STTS tags, e.g., FM (foreign language), NE (name) or punctuation marks, have been omitted from the Stellingen tag set because they do not occur in the corpus. If such tags are output by the tagger they

are automatically mapped to appropriate tags by default rules, e.g., NE → NN (proper noun). Most STTS tags correspond to Stellingen tags in a one-to-one way. We make the additional distinction between definite and indefinite articles that the STTS does not show. But a correct mapping from the underspecified STTS tag for determiners to Stellingen tags is feasible because articles belong to the group of closed-class categories and the cases can thus be enumerated. In exceptional cases it was necessary to include specialized mapping rules for particular words. For instance, the word group of beid- (both) is treated as an adjective in the Stellingen grammar but as an attributive indefinite pronoun with determiner in the STTS. Ambiguous mapping rules are used in cases where the STTS uses a unique tag for words for which the Stellingen grammar provides more than one. For instance, the STTS lists the word group of haben (to have) as an auxiliary verb only while the Stellingen grammar additionally allows main verb forms. These ambiguous mapping rules lead to incomplete disambiguation even if the tagger emits exactly one tag per word. Table 5.5 on the next page compares the Stellingen corpus to the NEGRA corpus with regard to POS ambiguity.

Two classification figures are given to characterize the corpora.

POS tag entropy is defined as the entropy of the tag unigram distribution P(t) and determines the minimum number of bits per word necessary to encode the tag sequence in the corpus given the probability distribution of tag unigrams:

H = - \sum_{t} P(t) \log_2 P(t)

Average tag ambiguity is simply the mean tagging ambiguity in a corpus w_1 \ldots w_n under the (unrealistic) assumption that a complete lexicon is available:

A = \frac{1}{n} \sum_{i=1}^{n} \left| \{\, t : t \text{ is an admissible tag for } w_i \,\} \right|

It can be seen that the tag set size for the Stellingen corpus is reduced from 55 to 33 tags. The Stellingen corpus is more ‘uniform’ than the NEGRA corpus since both the mean ambiguity per word (1.23<1.61) and the entropy (3.76<4.27) are smaller for the Stellingen corpus. Of course, the corpus used for training is much larger than the one used for testing.
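Both figures are straightforward to compute from a tagged corpus and a lexicon; the following minimal Python sketch assumes a list of tags for the corpus and a lexicon mapping each word form to its set of admissible tags (the data structures are illustrative assumptions).

import math
from collections import Counter

def pos_tag_entropy(tag_sequence):
    # H = - sum over tags t of P(t) * log2 P(t), using unigram tag frequencies.
    counts = Counter(tag_sequence)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def average_tag_ambiguity(tokens, lexicon):
    # A = mean number of admissible tags per token, assuming a complete lexicon.
    return sum(len(lexicon[w]) for w in tokens) / len(tokens)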

(a) Stellingen corpus:
  224 sentences, spontaneous dialogues; 442 word types, 1,941 word tokens; 33 tags;
  average ambiguity A = 1.230602; POS tag entropy H = 3.761071

  #T     Type no.   Type perc.   Token no.   Token perc.
   1          357       92.727       1,467       79.598
   2           25        6.494         327       17.743
   3            3        0.779          49        2.659
  total       385      100.000       1,843      100.000

(b) NEGRA corpus:
  20,602 sentences, newspaper text; 51,272 word types, 355,096 word tokens; 55 tags;
  average ambiguity A = 1.611544; POS tag entropy H = 4.273873

  #T     Type no.   Type perc.   Token no.   Token perc.
   1       49,189       95.937     238,545       67.178
   2        1,884        3.675      45,586       12.838
   3          164        0.320      46,789       13.176
   4           32        0.062      20,090        5.658
   5            1        0.002       2,715        0.765
   6            1        0.002       1,363        0.384
   7            1        0.002           8        0.002
  total    51,272      100.000     355,096      100.000

Table 5.5: POS statistics for the Stellingen and the NEGRA corpus.

The TnT tagger achieves 96.7% accuracy [Brants 2000] on the NEGRA corpus. Its quality has been independently confirmed by Megyesi [2001]. On the Stellingen corpus it achieves an accuracy of 93.489% when emitting exactly one tag per word. The tag ambiguity in the output nevertheless is 1.07 tags/word because of the ambiguous mapping rules (see above). The accuracy on the Stellingen corpus is reduced in comparison to that on the NEGRA corpus because firstly the corpora for training and testing are different and secondly because the differing tag sets require a mapping that inevitably loses some information. When allowing multiple tags per word (beam 1,000), the accuracy increases to 98.318% with a tag ambiguity of 1.51 tags/word.

5.7.3 Interface

In Section 5.6.2 it was discussed how an acoustic score can be integrated into the WCDG parsing system. Here we adopt a similar approach but can omit the adjustment of scores because the tagger scores are in the same range as the WCDG scores. Again, we use a constraint template

(with parameter α) which allows flexible adjustment of the weight the tagger information should receive by projecting the tagger score to the interval [α, 1] (cf. Constraint C-42). By setting α = 0.0, possible structures that are not licensed by the tagger can be eliminated; for values α > 0.0, unsupported selections are only penalized but not completely discarded.

C-42 {X} : Tagging : [ α + (1-α)*pts(X@id) ] : X.level=SYNTAX -> false;
Integrate the tagger score.

The function pts() (POS tag score) makes the score that a tagger assigns to a category available to the grammar writer. The function must be added to the constraint language (cf. Section 2.5.3).

By restricting the constraint to the syntactic level, it is guaranteed that the score is integrated exactly once for each word. Table 5.6 exemplarily shows the relation between tagger output, tagging score function pts() and constraint weight for the example sentence n001k000-2.

word     tagger output: tags & prob.              X@cat             pts(X@id)   weight
wann     ADV 0.862   PWAV 0.148                   X@cat=ADV           0.862     0.8758
                                                  X@cat=PWAV          0.148     0.2332
                                                  else                0.0       0.1
wäre     FIN 0.500   VAUXFIN 0.500                X@cat=FIN           0.500     0.55
                                                  X@cat=VAUXFIN       0.500     0.55
                                                  else                0.0       0.1
es       PPER 1.00                                X@cat=PPER          1.0       1.0
                                                  else                0.0       0.1
Ihnen    PPER 1.00                                X@cat=PPER          1.0       1.0
                                                  else                0.0       0.1
denn     ADV 0.873   KON 0.124   KOKOM 0.003      X@cat=ADV           0.873     0.8857
                                                  X@cat=KON           0.124     0.2116
                                                  X@cat=KOKOM         0.003     0.1027
                                                  else                0.0       0.1
recht    ADJP 0.617   ADV 0.375   NN 0.008        X@cat=ADJP          0.617     0.6553
                                                  X@cat=ADV           0.375     0.4375
                                                  X@cat=NN            0.008     0.1072
                                                  else                0.0       0.1

Table 5.6: Tagger output, POS tag score function and constraint weight for α = 0.1.

Note that the value of the POS tag score function pts() is zero for all categories that are not predicted by the tagger. In particular, if the tagger emits a single category the function value is one for this category and zero for all others.
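To illustrate how a tagger's output turns into the pts() scores and constraint weights of Table 5.6, the following Python sketch builds the score table for one word and evaluates the weight of Constraint C-42. The beam handling (keeping tags whose probability is within a factor of 1,000 of the best tag) is a simplifying assumption of this sketch.

def pts_table(tagger_output, beam=1000.0):
    # tagger_output: list of (tag, probability) pairs for one word, best first;
    # categories not predicted by the tagger implicitly get a score of 0.0.
    best = tagger_output[0][1]
    return {tag: p for tag, p in tagger_output if p > 0 and best / p <= beam}

def c42_weight(category, table, alpha=0.1):
    # Score of Constraint C-42: alpha + (1 - alpha) * pts(category).
    return alpha + (1.0 - alpha) * table.get(category, 0.0)

table = pts_table([("ADV", 0.873), ("KON", 0.124), ("KOKOM", 0.003)])  # the word 'denn'
for cat in ("ADV", "KON", "KOKOM", "NN"):
    print(cat, round(c42_weight(cat, table), 4))
# prints 0.8857, 0.2116, 0.1027 and 0.1 respectively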

5.7.4 Experimental results

This section describes the experiments that were run to examine the impact of tagger information on the WCDG parsing system. The heuristic search method with tabu list (cf. Section 4.9.1) is used in this evaluation and the runs have been limited to 60 seconds processing time. Accuracy of unlabeled dependency edges is reported. We will look at efficiency and accuracy for four types of tagging modes:

1. No tagger is used in a baseline experiment.

2. The ideal tagger ‘knows’ the correct tags; it is used to approximate an upper bound of what we can expect from integrating a tagger device.

3. A single tag tagger emits a single tag for each word.

4. The multi tag tagger may emit more than one tag (along with a probability estimation) for a single word when the probabilities of more tags are within a certain beam (here 1,000).

For the last three modes, experiments with the parameter α of Constraint C-42 on page 168 set to 0.0, 0.1 and 0.9 respectively have been conducted. 224 utterances from the Stellingen corpus have been analyzed for all settings. We excluded five

sentences from the evaluation because they were not solved either in the baseline experiment

$rµœ$ œ or in one of the experiments with T . We did so in order to simplify the evaluation and

21 œ enhance the comparability of the results. The parameter setting of T¤Vœ$ requires special mentioning because the information provided by a tagger is integrated as a hard constraint in this case and the search space is pruned rigorously. This results in 31.5% unanalyzed sentences for the single tag tagger and 8.7% failures for the multi tag tagger. Figure 5.17 on the next page looks at accuracy and efficiency for the idealized assumption that we have a perfect tagger available which always predicts the correct tag. The experiment can tell us what we can expect from a tagger integration at best. The system with the ideal tagger integrated is both faster and more accurate. In Figure 5.17a on the facing page, it is clearly visible that information about the correct part-of-speech can help to

make the parsing problems easier to solve. The largest effect can be observed for a setting with α = 0.0; the computation time is approximately halved. This comes as no surprise because all alternatives with wrong part-of-speech tags are strictly discarded from the space of admissible hypotheses in this case. For a soft integration with parameter α set to 0.1, the effect is still very pronounced while no final conclusion can be drawn for the weakest kind of integration (with α = 0.9). Figure 5.17b on the next page shows a similarly favorable outcome for the accuracy. The achieved results are generally better when the tagging information is incorporated; the best accuracy is achieved for values 0.0 and 0.1 for the parameter α.

Figure 5.18 on page 172 gives analogous information for the single tag tagger.

 œ$ œ

In Figure 5.18a on page 172, it stands out that the hard integration with T again speeds

T ¾œ$o¥ up the parsing process. For the soft variants with Tξœ$? or , the parsing time does not show significant differences to the baseline with no tagger. Figure 5.18b on page 172 clearly shows that although a hard integration of a single tag tagger can in fact accelerate the analysis, the tagger also makes too many errors so that the accuracy drops dramatically. However, if the predicted tags are considered soft information, the accuracy tends to increase when compared to the baseline. Figure 5.19 on page 173 gives the experimental data for the multi tag tagger.

This time not even the hard integration leads to a speedup (Figure 5.19a on page 173). All

 œ$ œ attempted variants are similarly efficient with only slight advantages for the T setting. The crisp integration again results in a greatly reduced accuracy while the soft integration variants in turn tend to improve the accuracy. Finally Figure 5.20 on page 174 contrasts the four tagging modes directly by repeating the

results for parameter T fixed at 0.1. Of course, the ideal tagger is unrivaled, both for efficiency and accuracy. The multi tag tagger is slightly faster and more accurate than the single tag tagger. Only the ideal tagger is significantly faster than the rest. Single tag tagger, multi tag tagger and baseline are approximately of the same speed. The baseline experiment results in the worst accuracy.

21 The named procedures all solved more than 98.6% of the utterances (i.e., less than three unsolved sentences). The only procedure that managed to solve all problems, surprisingly, is the single tag tagger with a soft integration, not the ideal tagger.

[Figure 5.17: Efficiency and accuracy of the WCDG parser using the ideal tagger. (a) Efficiency: time [s] against problem size. (b) Accuracy: accuracy [%] against problem size. Curves: baseline, ideal 0.0, ideal 0.1, ideal 0.9.]

[Figure 5.18: Efficiency and accuracy of the WCDG parser using the single tag tagger. (a) Efficiency: time [s] against problem size. (b) Accuracy: accuracy [%] against problem size. Curves: baseline, single 0.0, single 0.1, single 0.9.]

[Figure 5.19: Efficiency and accuracy of the WCDG parser using the multi tag tagger. (a) Efficiency: time [s] against problem size. (b) Accuracy: accuracy [%] against problem size. Curves: baseline, multi 0.0, multi 0.1, multi 0.9.]

[Figure 5.20: Direct comparison of four tagging modes: no tagger, ideal tagger, single tag tagger and multi tag tagger. (a) Efficiency: time [s] against problem size. (b) Accuracy: accuracy [%] against problem size. Curves: baseline, ideal 0.1, single 0.1, multi 0.1.]

5.7.5 Related work

Part-of-speech tagging is a popular research area because it is both simple and well-understood as well as challenging. Although accuracies of over 90% can easily be achieved with very simple techniques, building a perfect tagger nevertheless requires a complete understanding of natural language, thus rendering the problem ‘AI-hard’. Nearly all current tagging approaches are based on local contexts, i.e., no global or structural information is used for the decisions. Taggers are useful for the semi-automatic construction of tree banks [Marcus, Santorini & Marcinkiewicz 1993; Skut et al. 1997b], information extraction and of course as a preprocessing step before parsing. Charniak et al. [1996] investigate the impact of POS taggers on accuracy and efficiency of a probabilistic context-free grammar chart parser. They conclude that the gain in accuracy achieved by a multi tag tagger is not worth the increased work load when compared to the performance of a single tag tagger and therefore recommend the latter. This is partially incompatible with our findings although one has to consider that the parsing frameworks are completely different.

5.7.6 Summary

Hard and soft integration of an external part-of-speech tagger into the WCDG parsing system has been examined. The potential of such a combination is clearly demonstrated by an ideal tagger which always predicts the correct tag. In this case the accuracy increases by 5 percentage points while the computation time is reduced by 50%. The experiments with a realistic tagger show that a hard integration, where only those lexical entries are considered by the parser that are compatible with the prediction of the tagger, leads to a large decrease in accuracy (and partially an increase in efficiency). This disadvantage may be mitigated in the future when taggers with state-of-the-art precisions of 98% and above are applied to WCDG parsing. A soft integration on the other hand causes an increased accuracy. In short, a soft integration of tagging information allows for more accurate parsing runs but no significant differences in efficiency. The multi tag tagger showed even better results than the single tag tagger. A seamless integration of exchangeable POS tagging modules into the WCDG system is forthcoming. Hagenström [forthcoming] describes the integration in detail, both conceptually and technically.

5.8 Adjusting constraint weights using corpus data

5.8.1 Motivation

This section describes an experiment where the weights of conflicting constraints are adjusted so that the combined impact of the constraints mimics the observations from a corpus of annotated utterances as closely as possible. The purpose here is two-fold:

£ Firstly, we try to answer the question of whether weighted constraints can be used to model linguistic phenomena when inherently conflicting tendencies as well as performance aspects of language production and comprehension are involved.

£ Secondly, the parameters of our model, i.e., the weights of the constraints, should be approximated using a corpus of annotated sentences.

It should be noted that although we investigate a linguistic phenomenon largely influenced by performance aspects we do not lay claim to psycholinguistic adequacy. That does not mean that it is impossible to base a psycholinguistic theory on WCDG but such an undertaking would require additional theoretical work and experiments which are beyond the scope of this thesis. This section only looks at a single phenomenon and only simulates the effects of the small constraint cluster responsible for the assessment of extraposition of relative clauses. We model the observations found for a linguistic phenomenon, but these findings will not directly improve the parsing performance of the grammatical system because the competing alternatives can usually be clearly identified on the surface of the utterance so that they cannot be confused. However, the discussed constraints can prefer one or the other reading when more and different analyses are possible. Section 5.9 adopts a different approach and estimates the weights of a given grammar using a corpus of completely annotated WCDG structures. It optimizes the parsing performance instead of the prediction quality for the extraposition of relative clauses. Hence, the evaluation criteria are more coarse-grained since only accuracy and recall measures are used and no individual phenomena are examined.

5.8.2 Extraposition of relative clauses

“Leser schwitzen nicht, aber sie sind auch nicht trainiert, zwölf Wörter zu speichern; eher die Hälfte. 32 dieser 38 kostbaren Vokabeln also baumeln im Nirgendwo – außer bei jenen sieben Getreuen, die sich verzückt an eine nochmalige Lektüre machen.” (‘Readers do not sweat, but neither are they trained to store twelve words; more like half that. So 32 of these 38 precious words dangle in nowhere, except for those seven faithful ones who rapturously set about a second reading.’) —— Schneider [1999]

The potential extraposition of relative clauses from German discontinuous verb groups has been

chosen as the grammatical phenomenon to be scrutinized.

E-76 Kaum jemand trauert dem alten Parteichef

nach , gegen dessen Umgebung eine Reihe von Korruptionsprozessen läuft.

Rarely anybody mourns the former party leader mournprefix against whose environment a series of corruption trials is pending.

‘Rarely anybody mourns the former party leader against whose environment a series of corruption trials is pending.’

Example E-76 (cf. Figure 5.21 on the next page22) has been taken from the training corpus (cf. next section) whereas its counterpart without extraposed relative clause is given as Example E-77 on the facing page.

22 This figure has been produced with our own negra2png tool (available from http://nats-www.informatik.uni-hamburg.de/~ingo/prg/) which has been extensively used for visualization during the experiments.


E-77 Kaum jemand trauert dem alten Parteichef ,

gegen dessen Umgebung eine Reihe

von Korruptionsprozessen läuft, nach .

Figure 5.21: Discontinuous phrase structure tree.

Both variants are grammatical. One and the same principle is violated (and also satisfied) in both cases: Parts of a sentence that belong together should occur close to each other. Either the relative clause is separated from its antecedent (as in the first case) or the separable verb prefix appears only long after the verb itself. It is beyond dispute that other factors influence the potential for extraposition as well, such as the nature of linguistic material between noun phrase and relative clause (verbal words only vs. nouns, pronouns etc.), information structure, thematic roles, definiteness, animacy and prosody in the case of speech data [Uszkoreit et al. 1998a; Uszkoreit et al. 1998b]. The linear distance, however, is a particularly simple criterion and is considered exclusively here. We follow Uszkoreit et al. [1998b] who empirically confirm Hawkins [1994]’s principle which predicts a linear sequence of phrases in a sentence such that the processing of the phrases can start as soon as possible. For German relative clauses, Hawkins [1994] therefore considers the distance between relative pronoun and antecedent as well as the length of the relative clause the two main criteria.

5.8.3 Corpus data

As in Section 5.7, the NEGRA corpus version 2 [Skut et al. 1997b; Brants, Skut & Uszkoreit 1999] has been used for these experiments. It consists of German newspaper sentences annotated with part-of-speech tags [Schiller et al. 1995], morphological information and a discontinuous phrase structure (see Figure 5.21 for an example). We neglect the differences between spontaneous language and newspaper text which has been modified by the editorial departments of the publisher. The corpus contains 20,602 sentences of which 2,332 contain relative clauses. One (0.04%) sentence contains four relative clauses, nine (0.39%) three, 118 (5.06%) two and 2,207 (94.51%) one. Like Uszkoreit et al. [1998b], we only consider relative clauses that depend on a noun phrase (node label NP) or a prepositional phrase (node label PP) and that are themselves embedded in a sentence.

This excludes 39 relative clauses because they depend on the whole sentence (or other phrases) and 106 relative clauses because they are not (or not directly enough) embedded in a sentence. We consider a relative clause extraposed if the governing phrase contains a gap immediately before the relative clause; otherwise the relative clause is not extraposed. Relative clauses (962 occurrences) that are located at the end of the sentence and adjacent to the noun phrase are further excluded from consideration because the two cases cannot be distinguished properly. This leaves us with 622 cases of extraposed relative clauses and 742 cases of counter examples.
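The extraction criterion can be made concrete with a small sketch. The span-based representation of NEGRA constituents used here (a set of token positions per constituent) is a simplification assumed for illustration, not the actual annotation format, and punctuation handling is ignored.

def is_extraposed(relative_clause, governing_phrase):
    # Both arguments are sets of 0-based token positions covered by the
    # constituent; discontinuous constituents cover a non-contiguous set.
    # Extraposed: the governing NP/PP has a gap immediately before the clause.
    return (min(relative_clause) - 1) not in governing_phrase

def extraposition_distance(relative_clause, governing_phrase, sentence_length):
    # Number of words between the governing phrase and the clause (extraposed
    # case) or between the clause and the end of the sentence (embedded case).
    rc_start, rc_end = min(relative_clause), max(relative_clause)
    if is_extraposed(relative_clause, governing_phrase):
        np_end = max(p for p in governing_phrase if p < rc_start)
        return rc_start - np_end - 1
    return sentence_length - rc_end - 1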

[Figure 5.22 plots: (a) Length: number of sentences and percentage of extraposition against relative clause length; (b) Distance: number of sentences and percentage of extraposition against extraposition distance; each showing extraposed and embedded clauses and their ratio.]

Figure 5.22: Frequency of extraposed and embedded relative clauses depending on (a) the length of the relative clause and (b) the (potential) distance between embedded and extraposed location (limited to lengths and distances not larger than 20).

Figure 5.22a shows the distribution of relative clause lengths for both cases. The relative clauses (up to length 20) have a mean length of 8.565 for the embedded case (limited to length up to …: 9.155; 30: 9.009; 25: 8.873; 15: 8.022) and of 9.471 for the extraposed sentences (limited to length up to …: 10.537; 30: 10.132; 25: 9.852; 15: 8.545).

The extraposition distance is the number of words (excluding punctuation) between the end of the governing noun phrase and the beginning of the extraposed relative clause or between the end of the relative clause and the end of the embedding sentence. Figure 5.22b gives an overview of how the two distinguished cases are distributed for different extraposition distances. The mean distance is 1.542 for extraposed clauses (with 66.9% having distance 1, 22.2% distance 2, 6.3% distance 3 and only 4.7% having a distance larger than 3). For embedded clauses, the distance is more evenly distributed with only 16.6% having a distance smaller than 4 and a mean of 9.806. These corpus statistics suggest that both length and distance influence the acceptability of extraposed relative clauses. However, the latter seems to be a much stronger indicator.

5.8.4 Constraints

The two surface criteria described in Uszkoreit et al. [1998b] and confirmed for our corpus in the last section should now be formulated using (slightly simplified) constraints.

C-43 {X, Y} : RCLength : 0.5 : X.level=RCEND & Y.level=SYNTAX & X@id=Y@id & X@cat=RELPRONOUN & distance(X^id, X@id)>1 -> distance(Y@id, Y^id)>threshold
Extraposed relative clauses tend to be long.

Assuming an additional level RCEND (comparable to level REF in Section 3.4.7 but pointing to the end of the noun phrase), we can use Constraint C-43 to express our finding that the extraposed relative clauses are longer on average. However, such a simple approximation is relatively coarse because only two cases are distinguished: lengths that exceed the threshold and those that do not. The same objection holds for Constraint C-44 which penalizes extrapositions over long distances.

C-44 {X} : RCDistance : 0.5 : X.level=RCEND & X@cat=RELPRONOUN -> distance(X^id, X@id)<threshold

Going back to Figure 5.22b on the facing page, we can conjecture that the ‘extraposition rate’ follows an exponentially decaying function of the form g(d) = b^(-c*d), where d is the extraposition distance. For the other criterion (cf. Figure 5.22a on the preceding page), the picture is not as clear but we will assume a linear relation f(l) = a*l + b for simplicity reasons here, where l is the length of the relative clause.

The free parameters of f and g can be approximated using the nonlinear least-squares Marquardt-Levenberg algorithm [Williams & Kelley 1998, for instance] and the example cases of the corpus. The expected function value has been set to one for extraposed examples and to zero for embedded clauses.23 This results in concrete formulae for f(l) and g(d) whose fitted constants appear as the dynamic weights of Constraint C-45 (for the length criterion) and Constraint C-46 (for the distance criterion) below.

So far, the criteria have been considered in isolation. For comparison, we define two more functions, a(d,l) and m(d,l), that combine the two in an additive and in a multiplicative way; their parameters have been optimized in combination. These functions can now be used as dynamic weights as in Constraint C-45 and Constraint C-46 on the following page. Note that we neglect normalization here.

C-45 {X, Y} : DynRCLength : [ 0.038 * distance(Y@id, Y^id) ] : X.level=RCEND & Y.level=SYNTAX & X@id=Y@id & X@cat=RELPRONOUN & distance(X^id, X@id)>1 -> false;
The longer an extraposed relative clause the better.

23Using the corpus examples directly instead of the percentage of extraposed clauses (as in Figure 5.22 on the facing page) should reflect the fact that we are not going to model the rate but find a classifier for our problem. However, various machine learning approaches, such as decision trees, would also be feasible here.

C-46 {X} : DynRCDistance : [ 1.26**(-1.08*distance(X@id, X^id)) ] : X.level=RCEND & X@cat=RELPRONOUN & distance(X@id, X^id)>1 -> false;

The shorter the extraposition distance the better.
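The parameter estimation sketched above can be reproduced with any standard least-squares package; the following sketch uses SciPy's curve_fit, which defaults to the Levenberg-Marquardt algorithm for unconstrained problems. The data points are invented for illustration, and the exponential form b^(-c*d) is fitted in the equivalent one-parameter form exp(-k*d) with k = c*ln(b).

    # Sketch of the nonlinear least-squares fit of f(l) and g(d); the data
    # below are invented and only illustrate the procedure.
    import numpy as np
    from scipy.optimize import curve_fit

    def g(d, k):
        # b**(-c*d) rewritten as exp(-k*d) with k = c*ln(b)
        return np.exp(-k * d)

    def f(l, a, a0):
        # linear approximation for the clause-length criterion
        return a * l + a0

    # expected function value: 1 for extraposed examples, 0 for embedded ones
    dist   = np.array([1, 1, 2, 3, 5, 8, 10, 12], dtype=float)
    length = np.array([12, 15, 9, 11, 6, 7, 5, 8], dtype=float)
    target = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

    (k_hat,), _        = curve_fit(g, dist, target, p0=[1.0])
    (a_hat, a0_hat), _ = curve_fit(f, length, target, p0=[0.05, 0.0])
    print(k_hat, a_hat, a0_hat)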

Figure 5.23 shows the results of an experiment in which the impact of such constraints has been simulated using the examples from the corpus, i.e., the above constraints have been applied to each example and the threshold on the x-axis splits the range into those examples for which an extraposition is preferred and those where it is not. The resulting accuracy on the corpus data is the dependent value. This procedure corresponds to a WCDG which decides on the extraposition using only the two mentioned constraints.
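A minimal sketch of this simulation (classifier values are assumed to lie in [0, 1]; the tiny example data are invented):

    # Sketch of the threshold sweep behind Figure 5.23: an example is
    # predicted 'extraposed' if its classifier value exceeds the threshold;
    # the accuracy over all corpus examples is recorded for each threshold.
    def accuracy_curve(values, labels, steps=100):
        """values: classifier outputs in [0, 1]; labels: 1 = extraposed, 0 = embedded."""
        curve = []
        for i in range(steps + 1):
            theta = i / steps
            correct = sum(1 for v, y in zip(values, labels)
                          if (v > theta) == bool(y))
            curve.append((theta, correct / len(labels)))
        return curve

    values = [0.9, 0.7, 0.55, 0.4, 0.3, 0.2]
    labels = [1, 1, 1, 0, 0, 0]
    best_theta, best_acc = max(accuracy_curve(values, labels), key=lambda p: p[1])
    print(best_theta, best_acc)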

[Figure: accuracy in per cent (y-axis, 45–90) plotted against the decision threshold (x-axis, 0–1) for the six classifiers.]

Figure 5.23: Accuracy of six classifiers for relative clause extraposition. The isolated classifiers f(l) and g(d) correspond to the criteria of relative clause length and extraposition distance. a(d,l) and m(d,l) refer to the (additive and multiplicative) combinations. Finally, f(l)+g(d) and f(l)*g(d) are combinations where the components have been optimized individually, but not in combination.

The following observations can be made. As expected from the corpus statistics, the extraposition distance is a much stronger indicator than the relative clause length (and thus should have a much more influential constraint weight). The latter achieves an accuracy of only 56.5% which is hardly above the baseline of 54.4%. The former, however, predicts the correct outcome in 90.4% of the cases. It is only surpassed by the additive combination a(d,l) which reaches 90.6%.

All classifiers that include the distance criterion achieve at least 85% accuracy for one or the other threshold. In order to verify the general validity of the approach, the set of examples has been split into disjoint training (1,106 cases), held-out (122 cases) and test (136 cases) sets. The function parameters have been re-adjusted using the same method but using the training set only. Table 5.7 summarizes the results which confirm that even this simple approach manages to generalize from training examples to unseen instances. However, the only reliable estimator for which the threshold parameter stays constant for the three sets is the distance criterion alone. Here almost 90% accuracy can be achieved on the test set using the threshold from the held-out data. All other

estimators seem to be ‘polluted’ by the unreliable clause length criterion.

                                      g(d)    f(l)    ---- combined classifiers ----
Threshold for training set            0.54    0.62     0.46    0.27    0.87    0.13
Threshold for held-out data           0.54    0.36     0.52    0.22    0.84    0.1
Threshold for test set                0.54    0.59     0.49    0.35    0.95    0.17
Accuracy at training threshold        89.7%   61.8%    86.8%   83.0%   87.5%   79.4%
Accuracy at held-out data threshold   89.7%   52.9%    89.0%   81.6%   85.3%   72.8%
Accuracy at test set threshold        89.7%   63.2%    90.4%   89.7%   88.2%   86.0%

Table 5.7: Accuracy on the test set.
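The evaluation protocol behind Table 5.7 can be sketched as follows: the decision threshold is selected on one data set and the resulting accuracy is reported on another (all names and data below are placeholders):

    # Sketch of the train/held-out/test protocol: choose the threshold on
    # one set, report accuracy on another. Data are placeholders.
    def accuracy(values, labels, theta):
        return sum((v > theta) == bool(y)
                   for v, y in zip(values, labels)) / len(labels)

    def best_threshold(values, labels, steps=100):
        return max((i / steps for i in range(steps + 1)),
                   key=lambda t: accuracy(values, labels, t))

    held_out_values, held_out_labels = [0.8, 0.6, 0.3, 0.2], [1, 1, 0, 0]
    test_values, test_labels         = [0.9, 0.5, 0.4, 0.1], [1, 1, 0, 0]

    theta = best_threshold(held_out_values, held_out_labels)
    print(theta, accuracy(test_values, test_labels, theta))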

5.8.5 Related work

The work described in this section is inspired by the first part of Uszkoreit et al. [1998b], a case study about extraposition of German relative clauses that combines methods of linguistic theory and corpus linguistics in order to better understand performance aspects of language generation and perception. Müller [1999] includes an HPSG analysis for extraposed relative clauses in German which employs performance constraints to restrict the ambiguity. The idea of estimating grammar parameters from annotated corpora is, of course, fundamental in most probabilistic grammars (cf. Sections 1.4 and 5.3.4). Alshawi & Carter [1994] describe an automatic method to balance the contributions of a set of preference functions that rank competing analyses from a natural language system. They report that although scale factors found by using a least-squares optimization on a corpus improve on hand-tuned parameters, the quality can be even further improved when the optimization is followed by a hill-climbing procedure.

5.8.6 Summary

This section demonstrated that conflicting criteria for particular linguistic phenomena taken from the linguistic literature can be transferred to a WCDG. In this case dynamic constraints are utilized to encode preferences about the acceptability of the extraposition of relative clauses. The distance between the relative pronoun and the governing noun phrase as well as the length of the relative clause have been used as the main influential parameters. Constraint weights have been estimated from an annotated corpus, resulting in a prediction accuracy of almost 90%. It could be observed that the extraposition distance is a much stronger indicator, which may be traced back to the choice of the simple linear approximation function used for the clause-length condition. The focus in this section was on formulating known linguistic conditions as WCDG constraints and on the general possibility of determining constraint weights automatically.

5.9 Learning constraint weights

This section is based on a paper published in the proceedings of the conference ‘Recent Advances in Natural Language Processing 2001’ [Schröder et al. 2001a].

5.9.1 Motivation

This section describes an experiment in automatic learning of the weights of a WCDG using an evolutionary optimization technique. Because there are usually very many constraints and because the consequences of their interaction are hardly foreseeable, the manual assignment of appropriate weights to constraints is a labor-intensive and error-prone task. Furthermore, since constraints forbid (or at least disprefer) certain configurations instead of allowing additional constructions (as is the case for, e.g., context-free rules) the extension of a given grammar requires revising old constraints and re-adjusting their weights. An automatic procedure that optimizes the weights for a given grammar can therefore help a grammar writer to focus on the constraints themselves without having to worry that the weights are unbalanced. However, the learning of constraint scores is difficult mainly because constraints can generally encode a large range of conditions, the represented restrictions can overlap or even contradict each other and in general the effects of interactions of weights are complex. Hence, probabilistic models which usually rely on a decomposition of the overall probability into smaller stochastically independent parameters can hardly be applied to this problem. This section aims at answering the following questions:

£ What kind of properties must an optimization technique have in order to be applicable to the problem at hand?

£ Is it tractable in practice to run such an optimization process?

£ Can the procedure improve the manually assigned weights of a given grammar?

£ Can the procedure find good weights starting from (almost) random weights?

5.9.2 Optimization techniques for constraint weights

Standard estimation techniques for probabilities do not qualify for learning the weights of a given WCDG (see above). Optimizing constraint scores can also be viewed as a constraint satisfaction optimization problem (cf. Section 2.2.4) itself. The constraints are the variables which get weights assigned to them; the scoring function is defined by an objective function which takes into account the accuracy and the efficiency of a candidate on a fixed corpus. Obviously this constraint satisfaction optimization problem is much harder to solve than the WCDG parsing problems that we investigated so far. Firstly, the values are real-valued numbers instead of discrete symbols which define a dependency edge.24 Secondly, the objective function is extremely expensive to calculate since one evaluation involves solving all WCDG parsing problems from the chosen corpus. Clearly, it is prohibitive to ask for a complete systematic search procedure (cf. Section 4.3.2) and even the heuristic search (cf. Section 4.3.4) variants that we considered so far are not applicable because either no real conflicts are available (for min-conflicts, cf. Section 4.9.1) or the definition of an appropriate neighborhood is difficult (for guided local search, cf. Section 4.9.2). Given the gigantic search space and an expensive objective function, a search method is needed which on the one hand sparsely draws samples from the search space and on the other hand allows fast moves within the set of candidates. Genetic algorithms [Holland 1975; Michalewicz 1996] are an optimization technique inspired by biological systems.25 They maintain an evolving population of candidates, so-called individuals, throughout an iterative process, i.e., from generation to generation. The initial population is either an educated guess or consists of randomized individuals. Each new generation is created by first adding individuals resulting from recombination and mutation to the population and then selecting the ‘fittest’ ones for the next cycle. Evolutionary optimization techniques seem appropriate for the problem at hand for two reasons. On the one hand, genetic algorithms explore only a small portion of the neighborhood. It is this selective nature that makes it feasible to find an approximate solution in a reasonable amount of time. On the other hand, the randomness in genetic algorithms allows jumps in the search space, making it possible to reach regions of the search space that other methods would miss when they get stuck in small isolated areas.
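A minimal skeleton of such an evolutionary loop could look as follows (all concrete operators are stubs here; possible renderings of the operators and of the selection step are sketched after Figure 5.24 and after the selection list below):

    # Skeleton of the genetic algorithm described above: an evolving
    # population of candidate weight assignments, extended by crossover and
    # mutation and pruned by fitness-based selection. All details are stubs.
    import random

    def evolve(population, generations, evaluate, crossover, mutate, select):
        for _ in range(generations):
            offspring = [crossover(random.choice(population), random.choice(population))
                         for _ in range(len(population))]
            mutants = [mutate(ind) for ind in population]
            candidates = population + offspring + mutants
            scored = [(evaluate(ind), ind) for ind in candidates]   # the expensive step
            population = select(scored, size=len(population))
        return population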

5.9.3 Genetic operators

Each individual consists of the whole set of weights for one particular grammar instance, i.e., a list of real-valued numbers, one for each constraint.26 Two different genetic operators are applied

to the population (cf. Figure 5.24): crossover and mutation.27

[Figure: two example individuals with weight vectors (.1, .6, .3, .4) and (.8, .2, .3, .2) and the offspring produced from them by arithmetic crossover (AC), multi-point crossover (MC) and mutation (M).]

Figure 5.24: Illustration of different genetic operators.

Crossover is the rough analog of reproduction in nature, and for each pair of individuals two different crossover operators are used, each applied with its own probability:

24Strictly speaking, constraint satisfaction optimization problems do not allow real-valued numbers as values because the domains have to be finite. Of course, this could be overcome by assuming a large but limited set of possible constraint weights.

25See, for example, Gottlieb [2001] for an overview. Stützle & Hoos [2001] describe the newer ant colony optimization approach to combinatorial optimization that is also population-oriented by borrowing techniques from the behavior of ants.

26Assuming that the constraints themselves are fixed, an individual completely describes one grammar instance.

27Schröder et al. [2001b] provide a brief analysis of the effectiveness of the individual genetic operators.

£ Arithmetic crossover (AC): The descendant is the pairwise weighted arithmetic mean of the two involved individuals.

£ Multi-point crossover (MC): The genes of the two involved individuals, i.e., their lists of weights, are ‘intertwined’.

Mutation (M) is inspired by random changes in biological genes in nature, e.g., due to cosmic rays. It is applied to a single individual and can be subdivided into three types, each applied with its own probability:

£ Relative mutation: A weight is randomly increased or decreased.

£ Absolute mutation: A weight is assigned a completely random number.

£ Mutation-to-zero: The current weight is set to zero.
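A possible rendering of these operators on individuals represented as plain lists of weights is sketched below; the probabilities, the mutation step size and the per-gene ('uniform') reading of the multi-point crossover are simplifying assumptions, not the values used in the experiments. The two parent vectors correspond to the example of Figure 5.24.

    # Sketch of the genetic operators described above; individuals are
    # lists of constraint weights in [0, 1]. All numeric parameters are
    # invented for illustration.
    import random

    def arithmetic_crossover(a, b, alpha=0.5):
        # pairwise weighted arithmetic mean of the two parents
        return [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]

    def multi_point_crossover(a, b):
        # 'intertwine' the two weight lists (simplified to a per-gene choice)
        return [random.choice(pair) for pair in zip(a, b)]

    def mutate(ind, p_rel=0.05, p_abs=0.005, p_zero=0.005, step=0.1):
        out = []
        for w in ind:
            r = random.random()
            if r < p_zero:                          # mutation-to-zero
                out.append(0.0)
            elif r < p_zero + p_abs:                # absolute mutation
                out.append(random.random())
            elif r < p_zero + p_abs + p_rel:        # relative mutation
                out.append(min(1.0, max(0.0, w + random.uniform(-step, step))))
            else:
                out.append(w)
        return out

    parent1, parent2 = [0.1, 0.6, 0.3, 0.4], [0.8, 0.2, 0.3, 0.2]
    print(arithmetic_crossover(parent1, parent2))   # about [0.45, 0.4, 0.3, 0.3]
    print(multi_point_crossover(parent1, parent2))
    print(mutate(parent1))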

During the whole process, the population size is kept constant. Because the individuals produced by the genetic operators are added to the population, a mechanism is needed to select the members of the new generation from the current population.

Selection is based on the objective function or fitness measure which in our experiments is defined as a weighted mean of the f-measure (80%, cf. Figure 5.1 on page 132) and a time index28 (20%):

£ Include the five fittest individuals in the next generation.

£ Discard the five worst individuals.

£ Replenish the new population with randomly selected individuals where the chance of survival of an individual is proportional to its fitness.

This procedure guarantees that the best individuals are not lost and that genetic diversity is maintained.
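One way to spell out the fitness measure and this selection scheme is sketched below. The 80/20 weighting follows the text; that the time index enters as a penalty, i.e. as (1 - time index), is an assumption consistent with the reported fitness values, and the fitness-proportional replenishment is simplified to sampling with replacement.

    # Sketch of the fitness measure and the selection scheme described above.
    import random

    def fitness(f_measure, time_index):
        # weighted mean of accuracy (80%) and speed (20%); the time index is
        # assumed to act as a penalty, 0 meaning immediate completion
        return 0.8 * f_measure + 0.2 * (1.0 - time_index)

    def select(scored, size, elite=5, worst=5):
        """scored: list of (fitness, individual) pairs; returns the next generation."""
        ranked = sorted(scored, key=lambda s: s[0], reverse=True)
        survivors = [ind for _, ind in ranked[:elite]]      # keep the five fittest
        pool = ranked[elite:len(ranked) - worst]            # discard the five worst
        weights = [f for f, _ in pool]
        while len(survivors) < size and pool:
            # fitness-proportional ('roulette wheel') replenishment
            survivors.append(random.choices([ind for _, ind in pool], weights)[0])
        return survivors

    scored = [(fitness(random.uniform(0.93, 0.99), random.uniform(0.1, 0.3)), [i])
              for i in range(20)]
    print(len(select(scored, size=10)))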

5.9.4 Experiments

The experiments were run using the Stellingen grammar and corpus (cf. Section 5.2.1). The experimental design needed to be extremely careful because the computational load for such experiments is very high. A cluster of some 20 SUN Ultra workstations was used in parallel to compute the objective function values for the individuals of a population. Between five and ten generations were evaluated per day.
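Because each individual can be scored independently, the evaluation parallelizes trivially; the following sketch uses Python's multiprocessing module as a stand-in for the workstation cluster, with evaluate_individual as a hypothetical stub for running the parser over the training corpus:

    # Sketch of parallel evaluation of the objective function: every
    # individual (a weight vector) is scored independently of the others.
    from multiprocessing import Pool

    def evaluate_individual(weights):
        # stub: would run the WCDG parser over the training corpus with
        # these weights and combine accuracy and parsing time into a fitness
        return 0.9

    if __name__ == "__main__":
        population = [[0.5] * 485 for _ in range(30)]   # one weight per constraint
        with Pool() as pool:
            fitnesses = pool.map(evaluate_individual, population)
        print(max(fitnesses))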

28The time index is a measure of parsing speed with zero meaning immediate completion and one meaning that the process uses the maximum amount of time allowed by a hard time limit that is necessary for practical reasons.

Baseline

In order to get a feeling for what a trivial algorithm can achieve on our data, we counted the number of ‘trivial’ edges (with root attachment and null label) in the annotations. The parsing procedure actually has to tackle two tasks, finding the correct structural configuration and disambiguating lexical alternatives. It turns out that if one ignores lexical ambiguity, 83.5% of all edges point to the root and have the null (default) label. This high number is due to our design decisions when modeling the grammar, in particular the use of twelve levels most of which are auxiliary levels and therefore irrelevant for large parts of a sentence (cf. Chapter 3). However, if one chooses a random (i.e., totally uninformed) lexical possibility, the percentage of trivial edges decreases to 47.9% which may be considered the baseline for our experiments.

Improving a manual grammar

The first experiment tries to answer the question of whether it is possible to improve the weights of the carefully crafted manual Stellingen grammar with 485 constraints (free parameters). The probabilities of the five genetic operators (arithmetic crossover, multipoint crossover, relative mutation, absolute mutation and mutation-to-zero) were selected as experimental parameters in an educated guess.

[Figure: f-measure and fitness (left axis, roughly 0.93–0.985) and time index (right axis, roughly 0.15–0.2) plotted against the number of generations (0–100).]

Figure 5.25: F-measure, time and fitness for the experiment of improving manually assigned weights.

Figure 5.25 presents the results from which the following observations can be made. The f-measure increased from 96.9% to 98.4% (reduction of error rate from 3.1% to 1.6% by 48.4%).

Since processing time was measured over the corpus and the time index may be dominated by a few lengthy parsing runs, no further conclusions can be drawn with respect to the efficiency criteria.

The resulting fitness increased from 93.6% to 95.1% (reduction of ‘error rate’ from 6.4% to 4.9% by 23.4%). The improvements are small but significant and it should be noted that the original grammar was already heavily hand-tuned.

Finding soft weights automatically

The encouraging results from the first experiment suggested another, more ambitious one: Can a similar level of performance be achieved without resorting to the weights of a hand-crafted initial grammar? In this experiment, soft constraints receive random weights in the initial population; however, we assume here that the grammar designer deliberately distinguished between hard and soft constraints. Therefore, absolute mutation and mutation to zero were de-activated which

resulted in a reduction of the number of free parameters by 44%. The experimental parameters (the operator probabilities) were changed slightly multiple times after the process seemed to enter a saturation phase. In order to escape these local optima, additional randomness was introduced into the procedure. The number of generations was increased to 300 for this experiment.

[Figure: f-measure and fitness (left axis, roughly 0.75–1.0) and time index (right axis, roughly 0.2–0.55) plotted against the number of generations (0–300).]

Figure 5.26: F-measure, time and fitness for the experiment of learning soft grammar weights from scratch.

As can be seen from Figure 5.26 on the facing page, the f-measure started at 83.3%, hit 96.6% after 100 generations and finally reached 97.4% (reduction of error rate from 16.7% via 3.4% to 2.6% by 79.6% and 84.4% respectively); the total processing time decreased by 52.5% (via 36.1% at generation 100). The fitness rose from 76.3% via 90.3% after 100 generations to 92.8% (reduction of ‘error rate’ from 23.7% via 9.7% to 7.2% by 59.0% and 69.6% respectively). Two observations can be made. The learning process was able to reach a similar level of performance as the original manual grammar. Even though a considerable improvement was achieved, it was not possible to attain the quality of the optimized manual grammar weights that resulted from the first experiment.

Results on test set

The results reported so far are intra-experimental, i.e., they are measured on the training set. In order to judge the evolved grammars independently, we tested the individuals from the experiments on a held-out test set of 90 more utterances. For this problem set, the manual grammar has a fitness of 91.5%. This value increases for the final individual of the first experiment to 92.3%. The best random assignment of weights that was used to start the second experiment scored an initial fitness of 77.2% and improved to a fitness of 91.6% (cf. Figure 5.27). In both cases, a speedup of analysis as well as an increased f-measure were observed.

[Figure: f-measure and fitness (left axis, roughly 0.75–1.0) and time index (right axis, roughly 0.2–0.6) plotted against the number of generations (0–300) on the test set.]

Figure 5.27: F-measure, time and fitness of the best individuals taken from each generation of the second experiment evaluated on the test set.

In order to confirm the robustness of these results with respect to test set variations, the test set was randomly split into two subsets. Both subsets showed exactly the same tendencies and varied only within a three percent margin. A verification of these results on a larger test set seems desirable.

5.9.5 Related Work

Automatic learning of grammars and grammar weights has been investigated in a wide range of research activities. However, few reports can be found for dependency grammars, and indeed very little work deals with genetic algorithms as the primary method. Optimality theory [Prince & Smolensky 1993, cf. Section 1.4] assumes a strict ordering of constraints with respect to importance, and this hierarchy determines the grammatical utterances. It is assumed that the constraints themselves are universal while the ordering is language-dependent (cf. Section 3.5). Although it can be argued that it is a simpler problem to decide on a hierarchy of constraints than on individual numerical weights, it nevertheless remains a difficult task. As far as learning the hierarchical ordering is concerned, only selected problems within phonology have been studied, mainly from the (cognitively interesting) perspective of unsupervised learning [Tesar 1995; Tesar & Smolensky 1998]. For an alternative learning algorithm for optimality theory see Boersma & Hayes [2000] and its critical assessment in Keller & Asudeh [2002]. This work focuses on the task of learning parameters for a given grammar. The more general problem of finding a complete grammar (including the rules themselves) has been tackled by various approaches in the field of statistical NLP (cf. Sections 1.4 and 5.3.4). For probabilistic context-free grammars, for instance, the rules are extracted from an annotated corpus and the weights are computed from their relative frequencies in the corpus. If no annotated corpus is available, however, the parameters for the context-free rules can be approximated by an efficient training algorithm [Charniak 1993]. Collins [1997] describes a probabilistic model for dependency parsing based on lexical as well as tag affinities and linear distance. An evaluation on the Wall Street Journal corpus [Marcus, Santorini & Marcinkiewicz 1993] is given. Eisner [1996] additionally investigates models based on selectional preferences as well as decisions made during language production, but restricts himself to unlabeled trees. Carroll & Charniak [1992] go one step further by trying to learn complete rules for dependency grammars from unannotated corpora. However, they train from artificially created data. Lankhorst [1993] describes an approach for induction of context-free grammars using genetic algorithms, but this work deals with small artificial languages only. Keller & Lutz [1997] extend the idea to stochastic context-free grammars. Huijsen [1993] also induces formal languages, such as those recognized by push-down automata, using genetic algorithms. The main difference between statistical methods and the work reported in this section is that the former calculate the parameters more directly, i.e., in a well-defined deterministic way, and can therefore guarantee a certain rate of convergence. Probabilistic methods, however, need a well-defined probabilistic model with stochastically independent parameters (if possible) which can then be estimated from a corpus. Genetic algorithms, in contrast, rely on the randomized generation of new candidates and a powerful selection strategy. Neither is a stochastic model required, which is difficult to define for WCDG [Schröder 1996], nor is the search space systematically explored.
It is interesting to note that some parallels can be identified between how the parsing problem for WCDG is understood and how grammar weights are learned, especially in comparison to probabilistic approaches. Parsing with WCDGs is a process of selection from a pre-defined set of possibilities in a similar way that the constraint scores are selected by the genetic algorithm. In contrast, parsing with probabilistic context-free grammars, for instance, is a constructive process driven by re-write rules, and probabilities can also be constructed or deterministically derived from an annotated corpus. Constraint induction for [W]CDG has been tried in two related ways so far. Schröder [1996] uses an annotated corpus to find a set of constraint instances using a small set of constraint templates and based on smoothed relative frequencies. However, the approach suffers from too small a corpus and limited constraint templates [Schröder 1997b]. For hard constraints, White [2000] introduces so-called abstract role values which are derived from example utterances. An abstract role value covers a whole set of actual configurations and can therefore be used to generalize the characteristics of the training set.

5.9.6 Summary

The possibility of automatic learning of constraint weights for a grammar from annotated corpora has been investigated. The general expressiveness and flexibility of constraints as well as the enormous size of the search space have been identified as obstacles for the application of standard probabilistic techniques and standard search methods respectively. Although still requiring a huge amount of computational effort, genetic algorithms prove suitable for the task of optimizing constraint weights. A first experiment confirmed that the quality of a manual grammar (in terms of efficiency and accuracy) can be significantly improved (f-measure error rate reduced by 48.4%). In a second experiment, which started with almost random weights (except hard scores) assigned to the constraints, the quality reached the level of the original manual grammar but not the level of the final optimized grammar resulting from the first run. These results have been validated on a test set. The techniques developed in this section can fully replace the labor-intensive and error-prone task of manual weight assignment to soft constraints. This finding is even more remarkable as only a comparatively small training corpus has been used.

Chapter 6

Conclusion

Gradation is ubiquitous in natural language.

This is the working hypothesis that started this thesis. In the previous five chapters it has been shown that the proposed formalism of weighted constraint dependency grammars (WCDG) can serve as a unified framework for the treatment of gradation in natural language. WCDGs not only offer a well-defined formalism to concisely express what the underlying constraints of natural language understanding are but additionally allow one to ascertain the status of the constraint, be it a strict rule which describes what structures are possibly imaginable or be it a slight preference which only selects the preferred one among a set of ambiguous analyses for a sentence.

6.1 Summary

WCDGs are an extension of constraint dependency grammars (CDG) as defined by Maruyama [1990b], a grammar formalism that assigns feasible dependency structures to natural language utterances based on constraints which describe properties of grammatical utterances. In WCDGs, a constraint receives a score between zero and one depending on how important it is for an analysis to fulfill the constraint or on the dynamic degree to which the constraint is satisfied. This generalization of constraints allows the inclusion of grammatical (and other) conditions in the arbitration process that hold often but not always or that should preferably hold but are not absolutely required. Constraint violations are cumulative, as supported by psycholinguistic experiments; in particular it is possible for multiple weaker constraints to jointly overrule a stronger constraint, i.e., one whose penalty score is lower than any of the weights of the weaker constraints.
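A small numeric illustration of this cumulation, assuming the multiplicative combination of penalty scores used for WCDG analyses (the concrete values are invented):

    # Penalty of an analysis = product of the scores of all violated
    # constraints; values below are invented.
    strong = 0.10                  # one severe constraint
    weak   = [0.40, 0.40, 0.40]    # three mild constraints

    print(strong)                       # 0.1: violating only the strong constraint
    print(weak[0] * weak[1])            # about 0.16: two weak violations are still better
    print(weak[0] * weak[1] * weak[2])  # about 0.064: three weak violations jointly
                                        # outweigh the single strong constraint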

A constraint language for WCDG has been defined, both syntactically by means of rules in extended Backus-Naur form and semantically by a concise mapping of the WCDG parsing problem to the well-known class of constraint satisfaction optimization problems. Based on this

formalization of WCDG, the worst-case complexity of the WCDG recognition problem has been identified as belonging to the class of NP-complete problems and the set of languages generated by WCDGs has been classified as a proper superset of the mildly context-sensitive languages and as a proper subset of the context-sensitive languages.


WCDGs have been shown to be appropriate for modeling a large variety of linguistic phenomena, such as immediate dominance, valence, a subset of German word order, projectivity including exceptional cases and semantic aspects of thematic roles and reference resolution. However, the (computational) restriction to at most binary constraints has also been found to severely limit the expressiveness of the formalism in practical cases, and several techniques to circumvent this restriction have been discussed. The spectrum of application scenarios for weighted constraints includes well-formedness conditions (traditional grammar constraints), preferences, technical means to avoid spurious ambiguities and to ‘tag’ certain configurations, interfaces for the fusion of information from different sources, partial parsing, approximation of higher constraints and dynamic constraints which carry varying weights depending on the evaluation context. The reduction of the WCDG parsing problem to constraint satisfaction optimization problems facilitates the deployment of well-studied algorithmic schemes. Three major classes have been discussed and experimentally evaluated:

£ Consistency-based (or filtering) methods discard undesired readings by achieving a state of mutual local support that depends on the notion of inconsistency between values. For weighted constraints, inconsistency cannot easily be defined in a way that the ambiguity is effectively reduced and only suboptimal values are discarded. Therefore, it must be concluded that consistency-based methods are of limited use to the WCDG parsing problem although they have proven successful for CDGs with hard constraints only [Maruyama 1990b; Harper et al. 1999a].

£ Tree search successively extends partial solutions and systematically explores the entire search space. The methods are complete and sound but require exponential runtime in the worst case so that some kind of pruning has to be applied which of course renders the algorithm incomplete. In practical cases, tree search has been found to be superior in small and medium-sized problems while unsatisfactory in some cases of very large problems because no solution at all is found in the time available.

£ Heuristic search immediately starts with a suboptimal solution candidate and iteratively moves on to promising candidates which can be derived from the current candidate by means of simple transformations. The search space is explored unsystematically, i.e., no records about visited regions are kept, and hence an external termination criterion must be used. The accuracy of the returned solutions is high and stays high even for very large problems.

Various claims about the suitability of WCDGs for handling gradation in natural language have been experimentally verified.

£ Claim: The use of weighted constraints leads to a robust mode of parsing. Result: To corroborate the claim, an experiment has been conducted where both the grammar support and the degree of deviation in the input have been systematically controlled. WCDGs have proven to be able to deal with contradictions between the grammar and the input in a fail-soft way, i.e., although the accuracy decreases with an increasing number of inconsistencies it does so in a graceful manner. Inclusion of redundant information from different knowledge sources has been identified as a second means (besides

weighted and hence violable constraints) for robust behavior. The interaction between levels is mutually beneficial in this case.

£ Claim: WCDG can be successfully deployed in applications for natural language learning. Result: Applications for natural language learning require a diagnosis component that robustly analyzes utterances and determines the location and the type of faults. WCDG parsing fulfills both requirements: It not only yields analyses even in cases of highly distorted input but also identifies conflicts, i.e., constraint violations at specific positions, which have been used to generate user feedback in a prototypical application.

£ Claim: WCDGs allow for anytime parsing algorithms. Result: The robust behavior of WCDG parsing has been identified as a prerequisite for anytime behavior because it provides the procedure with the necessary room to react to temporal pressure. Two kinds of anytime behavior have been distinguished: intrinsic anytime behavior which refers to inherent characteristics of an algorithm and extrinsic anytime behavior which refers to the availability of parameters in the basic algorithm that allow an external component to control the behavior of the underlying procedure. While the systematic tree search method fails to meet some requirements for the former form and is only adequate for the latter, heuristic search has been found to be suitable for both kinds. These results have not only been found to be valid theoretically, but have also been confirmed in experiments.

£ Claim: Constraints in WCDGs are flexible enough to integrate external judgments into the parsing process. Result: Graded information from a speech recognizer on the one hand and from a part-of-speech tagger on the other hand has been integrated into the WCDG parsing component by means of parameterized constraints. WCDG parsing benefits from the integration of acoustic scores when confronted with large word graphs but nevertheless fails to improve on results with the acoustically preferred (and thus possibly erroneous) path of the word graph. A soft integration of tagging information allows for more accurate parsing runs but no significant differences in efficiency have been found. The multi-tag tagger yields even better results than the single-tag tagger.

£ Claim: Constraint weights for WCDGs can be estimated from annotated corpora. Result: The scores of a small constraint cluster that handles a single linguistic phenomenon, namely extraposition of German relative clauses, have been approximated from corpus data using the least-squares method. Nearly 90% of the test cases have been correctly classified using a decision procedure based on these constraints. Although this does not directly influence parsing accuracy (since extraposed clauses obviously cannot be confused with embedded ones for a given surface form), it can be assumed that such preferences can indirectly influence the quality of the parser. In a second experiment using genetic algorithms the weights of a complete WCDG have been (a) optimized starting from a manual assignment and (b) learned from scratch. In the first case a considerable improvement has been found and in the second case the final grammar has been as good as a highly-tuned manual grammar.

6.2 Further research

Not every detail of WCDG has been covered in this thesis and new open questions arise as known problems are tackled. This section is a collection of further lines of research that either deal with aspects of WCDG that have been identified as being problematic in this thesis or open up views on new applications of the WCDG parsing approach.

£ The anytime property (cf. Sections 4.11 and 4.4.3) in natural language applications is particularly interesting when investigated for an incremental parsing model (cf. Section 4.12) since an external temporal pressure can best be justified when the speed of the parsing component falls behind the rate of incoming language data. Thus, coupling the two aspects such that a meta-control mechanism switches to a faster parsing mode when it detects such time critical situations, but stays with the most thorough strategy otherwise, is especially interesting. Such a coupled feedback device is compatible with the behavior of humans who easily switch between skimming and thoroughly studying a text. Would it be possible to really react to time pressure (at the cost of decreased accuracy)? Can such a parsing component be switched back and forth between fast and slow mode by means of a simple parameter? Is it possible to achieve considerable time savings with only a graceful degradation of the quality?

£ WCDGs aim at dependency trees for a number of representational levels as results of the parsing process. These non-standard structures are the main reason why the development, the evaluation and most experiments have been conducted on a relatively small corpus that has been constructed especially for this purpose. Nevertheless, an application of the WCDG techniques to and an evaluation on standard corpora such as the Wall Street Journal [Marcus, Santorini & Marcinkiewicz 1993] or the NEGRA corpus [Skut et al. 1997a] seems highly desirable. Is it possible to not only derive syntactic dependency structures from the phrase structure annotations in these corpora but also dependency structures for other aspects such as thematic roles, at least for subsets of the corpora? How can a large coverage WCDG be built for these corpora? What results can be achieved when lexical information is ignored and a (possibly ambiguous) sequence of category symbols is used for parsing?

£ An issue related to the application of WCDGs to very large corpora is the treatment of unknown words. No attempt has been made so far to deal with word forms that do not occur in the lexicon, but eventually this task will have to be tackled. A straightforward approach can simply use an underspecified lexical item each time it encounters a new word. More sophisticated schemes can ‘learn’ from the discourse and build successively richer information structures for formerly unknown items.

£ WCDG parsing degenerates to partial parsing when no connected dependency tree can be found (cf. Section 3.5.4). Applications such as question-answering or information extraction which are interested only in specific parts of a text or utterance can exploit this behavior by using a grammar and a lexicon which quickly parse the irrelevant parts into a shallow structure while the interesting segments can be analyzed completely and in full depth. Research into these kinds of strategies for variable-depth analysis is undertaken in the PAPA1 project.

1For further information see http://nats-www.informatik.uni-hamburg.de/~papa/.

£ A semantic modeling in WCDG suffers from the limitation to models which allow only relations between word forms (cf. Section 3.4.6). It seems worthwhile, however, to couple the WCDG parsing system with a knowledge-based inference system into which the more complex resolutions and discourse representations can be transferred [Lahres 1997].

£ The WCDG paradigm is flexible enough to integrate external assessments by means of special-purpose constraints. Information fusion has been demonstrated for acoustic scores in word graphs (cf. Section 5.6) and for probabilities of part-of-speech tags (cf. Section 5.7). Both types have opened up follow-up questions.

– Although the parser benefits from taking the acoustic scores into account, the overall success of parsing word graphs is limited because higher accuracies can be achieved by simply parsing the acoustically preferred path. This result should be compared to the findings of Harper et al. [1999a] who managed to achieve a higher sentence accuracy in word graphs that have been filtered using a CDG with hard constraints only. – The largest improvements in efficiency have been achieved using a hard integration of tagger information. The accuracy, however, dropped dramatically in these cases, which can be attributed to the relatively low accuracy of the tagger on the corpus used. An interesting question is whether this relation changes when a more accurate tagger (or tag mapping) is utilized.

A further interesting candidate for information fusion is the inclusion of phrase boundary information from a finite-state ‘chunker’ [Federici, Montemagni & Pirelli 1996] or from a prosody component [Niemann et al. 1998] which could help the WCDG parser to overcome the inherent efficiency problems with non-projective structures (cf. Section 2.8.2).

£ The restriction to binary constraints (cf. Section 2.6) has been motivated by computational needs but on the other hand has been identified as a severe limitation of the expressiveness of the formalism. While this restriction is more important for consistency-based methods (cf. Section 4.6), because they have to build a data structure of size O(n^(2a)), where n is the number of words and a the maximum arity of the constraints, it is less critical for tree search and heuristic search, which only need to apply the constraints O(n^a) times. Therefore it seems feasible to allow ternary constraints or even higher order constraints, especially when used sparsely and when combined with constraint indexing techniques (cf. Section 4.13.3).

£ The solution method of systematic tree search sometimes shows unpredictable runtime requirements for WCDG parsing (cf. Section 5.5). Research in randomization techniques for systematic search [Gomes, Selman & Kautz 1998] indicates that this disadvantage can be overcome by introducing random elements into the search procedure (without sacrificing the soundness). This hypothesis should be experimentally verified.

6.3 Final remarks

WCDGs have been found to offer solutions for a series of problems in natural language processing such as robustness, integration of preferences, information fusion and explanatory clarity. So far WCDGs have been studied in an academic context but realistic applications begin to appear on the horizon, most probably in the area of computer-assisted language learning where both the domain and the coverage of the grammar can be precisely controlled.

Bibliography

Emile H. L. Aarts, Jan H. M. Korst & Peter J. M. van Laarhoven, 1997. “Simulated Annealing”. Chapter 4, pages 91–120. In Aarts & Lenstra [1997b].

Emile H. L. Aarts & Jan Karel Lenstra, 1997a. “Introduction”. Chapter 1, pages 1–17. In Aarts & Lenstra [1997b].

Emile H. L. Aarts & Jan Karel Lenstra, ed.. 1997b. Local Search in Combinatorial Optimization. Series in Discrete Mathematics and Optimization. Chichester, New York, Weinheim, Brisbane, Singapore, Toronto: John Wiley & Sons.

Steven Abney. 1990. Rapid incremental parsing with repair. In Proceedings of the 6th New OED Conference: Electronic Text Research, pages 1–9, Waterloo, Ontario.

Steven Abney. 1996a. Partial parsing via finite-state cascades. In John Caroll, ed., Proceed- ings of the Workshop on Robust Parsing at the Eighth Summer School in Logic, Language and Information (ESSLLI-96), pages 8–15.

Steven Abney, 1996b. “Statistical Methods and Linguistics”. Chapter 1. In Klavans & Resnik [1996].

Steven Abney, 1997. “Part-of-Speech Tagging and Partial Parsing”. Chapter 4, pages 118–136. Text, Speech and Language Technology, volume 2, edited by Steve Young and Gerrit Bloothooft. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Hiyan Alshawi & David Carter. 1994. Training and scaling preference functions for disambigua- tion. Computational Linguistics, 20(4):635–648.

Jan W. Amtrup. 1999. Incremental Speech Translation. Lecture Notes in Artificial Intelligence, number 1735. Berlin, Heidelberg, New York: Springer Verlag.

Jan. W. Amtrup. 2000. Hypergraph unification-based parsing for incremental speech processing. In Proceedings of the Sixth International Workshop on Parsing Technologies (IWPT-2000), pages 291–292, Trento, Italy.

Jan W. Amtrup, Henrik Heine & Uwe Jost. 1997. What’s in a word graph - evaluation and enhancement of word lattices. In Proceedings of the European Conference on Speech Communi- cation and Technology (EUROSPEECH-97), Rhodes, Greece.

Jan. W. Amtrup & Volker Weber. 1998. Time mapping with hypergraphs. In Proceedings of the Joint Conference COLING/ACL ’98, pages 55–61, Montréal, Canada.


Srinivas Bangalore & Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

G. Edward Barton, Robert C. Berwick & Eric Sven Ristad. 1987. Computational Complexity and Natural Language. Cambridge, MA, USA: MIT Press.

Philippe Blache. 2000a. Constraints, linguistic theories and natural language processing. In D. Christodoulakis, ed., Natural Language Processing, Lecture Notes in Artificial Intelligence, number 1835. Springer, Heidelberg.

Philippe Blache. 2000b. Property grammars: a solution for parsing with constraints. In Proceed- ings of the Sixth International Workshop on Parsing Technologies (IWPT-2000), pages 295–296, Trento, Italy.

Philippe Blache. 2000c. Property grammars and the problem of constraint satisfaction. In Proceedings of the ESSLLI-2000 workshop on Linguistic Theory and Grammar Implementation.

Philippe Blache & Frank Morawietz. 2000. A non-generative constraint-based formalism. Tech- nical report, Université de Provence, Laboratoire Parole et Langage.

Ezra Black, Fred Jelinek, John Lafferty, David M. Magerman, Robert Mercer & Salim Roukos. 1992. Towards history-based grammars: Using richer models for probabilistic parsing. In Pro- ceedings of the 5th DARPA Workshop on Speech and Natural Language, New York.

Rens Bod. 1995. Enriching Linguistics with Statistics: Performance Models of Natural Lan- guage. Ph.D. thesis, University of Amsterdam.

Rens Bod. 1998. Spoken dialogue interpretation with the DOP model. In Proceedings of the 17th Intl. Conf. on Computational Linguistics and 36th Annual Meeting of the ACL (COLING/ACL- 98), pages 138–144.

Mark Boddy & Thomas L. Dean. 1994. Deliberation scheduling for problem solving in time- constrained environments. Artificial Intelligence Journal, 67:245–286.

Paul Boersma & Bruce Hayes. 2000. Empirical tests of the gradual learning algorithm. Lin- guistic Inquiry, 32(1):45–86.

Thorsten Brants. 2000. TnT - a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA, USA.

Torsten Brants, Wojciech Skut & Hans Uszkoreit. 1999. Syntactic annotation of a German newspaper corpus. In A. Abeillé, ed., Corpus annotés pour la syntaxe — , ATALA, Paris.

Eric Brill. 1993. Automatic grammar induction and parsing free text: A transformation-based approach. In Proceedings of the the 31st Annual Meeting of the ACL.

Eric Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.

Eric Brill. 1996. Transformation-based error-driven parsing. In Harry Bunt & Masaru Tomita, ed., Recent Advances in Parsing Technology. Kluwer Academic Publishers, Dordrecht.

Eric Brill. 1997. Unsupervised learning of disambiguation rules for part of speech tagging. In Natural Language Processing Using Very Large Corpora. Kluwer Academic Publishers.

Norbert Bröker. 1998. Separating surface order and syntactic relations in a dependency gram- mar. In Proceedings of the 17th Intl. Conf. on Computational Linguistics and 36th Annual Meeting of the ACL (COLING/ACL-98), pages 174–180, Montréal, Canada.

Norbert Bröker, Udo Hahn & Susanne Schacht. 1993. Ein Plädoyer für Performanzgram- matiken. In Proc. der 4. Fachtagung der Deutschen Gesellschaft für Sprachwissenschaft, Sektion Computerlinguistik, “Deklarative und prozedurale Aspekte der Sprachverarbeitung”, pages 6–11. DGfS/CL. Hamburg, Germany.

Norbert Bröker, Udo Hahn & Susanne Schacht. 1994. Concurrent lexicalized dependency pars- ing: the ParseTalk model. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), volume 1, pages 379–385. Kyoto, Japan.

Norbert Bröker, Susanne Schacht, P. Neuhaus & Udo Hahn. 1996. Performanzorientiertes Pars- ing und Grammatik-Design: das ParseTalk-System. In C. Habel, S. Kanngießer & G. Rickheit, ed., Perspektiven der Kognitiven Linguistik. Modelle und Methoden. Opladen: Westdeutscher Verlag, pages 79–126.

Norbert Bröker, Michael Strube, Susanne Schacht & Udo Hahn. 1997. Coarse-grained paral- lelism in natural language understanding: Parsing as message passing. In D. B. Jones & H. L. Somers, ed., New Methods in Language Processing. London, UCL Press, pages 301–318.

Hadumod Bußmann. 1990. Lexikon der Sprachwissenschaft. Stuttgart: Alfred Kröner Verlag.

Jean Carletta. 1992. Risk-taking and Recovery in Task-Oriented Dialogue. Ph.D. thesis, De- partment of Artificial Intelligence, University of Edinburgh.

Jean Carletta, Richard Caley & Stephen Isard. 1993. A system architecture for simulating time-constrained language production. HCRC/RP 43, HCRC, University of Edinburgh.

Bob Carpenter. 1992. The Logic of Typed Feature Structures. Cambridge University Press.

Bob Carpenter & Gerald Penn, 1994. The Attribute Logic Engine User’s Guide Version 2.0. Laboratory for Computational Linguistics, Carnegie Mellon University, Pittsburgh, PA.

Glenn Carroll & Eugene Charniak. 1992. Two experiments on learning probabilistic dependency grammars from corpora. In Proceedings of the Fall Symposium on Probabilistic Approaches to Natural Language, American Association for Artificial Intelligence (AAAI).

Lewis Carroll. 1872. Through the Looking-Glass and What Alice Found There. London: Macmil- lan & Co.

Jean-Piere Chanod, 2001. “Robust Parsing and Beyond”. Chapter 8, pages 187–204. In Junqua & van Noord [2001].

J.-C. Chappelier, M. Rajman, R. Aragües & A. Rozenknop. 1999. Lattice parsing for speech recognition. In Proceedings of the 6me conférence sur le Traitement Automatique du Langage Naturel (TALN’99), pages 95–104, Cargèse, France.

Eugene Charniak. 1993. Statistical Language Learning. Cambridge, MA: The MIT Press.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the First Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2000), Seattle, WA.

Eugene Charniak, Glenn Carroll, John Adcock, Anthony Cassandra, Yoshihiko Gotoh, Jeremy Katz, Michael Littman & John McCann. 1996. Taggers for parsers. Artificial Intelligence, 85(1-2):45–57.

Ciprian Chelba & Frederick Jelinek. 1998. Exploiting syntactic structure for language modeling. In Proceedings of the Joint Conference COLING/ACL ’98, Montréal, Canada.

Noam Chomsky. 1956. Three models for the description of language. IRE Transactions PGIT, 2:113–124.

Noam Chomsky. 1957. Syntactic Structures. The Hague: Mouton.

Noam Chomsky, 1964. “Degrees of grammaticalness”. pages 384–389. In Fodor & Katz [1964].

Noam Chomsky. 1965. Aspects of the Theory of Syntax. Cambridge, MA, USA: The MIT Press.

Noam Chomsky. 1975. The Logical Structure of Linguistic Theory. New York, NY, USA: Plenum Press.

Noam Chomsky. 1981. Lectures on Government and Binding. Dordrecht, The Netherlands: Foris.

William F. Clocksin & Christopher S. Mellish. 1987. Programming in Prolog. 3rd edition. Heidelberg: Springer-Verlag.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Michael John Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the ACL, pages 27–38, Santa Cruz, CA, USA.

Michael John Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Brown University, Providence, USA.

Jacques Courtin & Damien Genthial. 1998. Parsing with dependency relations and robust parsing. In Sylvain Kahane & Alain Polguère, ed., Proceedings of the Joint Conference COLING/ACL Workshop: Processing of Dependency-based Grammars, Montréal, Canada.

Michael A. Covington. 1990. A dependency parser for variable-word-order languages. Technical Report AI-1990-01, Artificial Intelligence Center, University of Georgia, Athens, GA, USA.

A. P. Cowie, ed.. 1989. Oxford Advanced Learner’s Dictionary of Current English. Oxford, UK: Oxford University Press.

Doug Cutting, Julian Kupiec, Jan Pedersen & Penelope Sibun. 1992. A practical part-of-speech tagger. In Proceedings of the 5th Conference on Applied Natural Language Processing, Trento, Italy.

Walter Daelemans, Jakub Zavrel, Peter Berck & Steven Gillis. 1996. MBT: A memory-based part of speech tagger-generator. In Eva Ejerhed & Ido Dagan, ed., Proceedings of the Fourth Workshop on Very Large Corpora, pages 14–27.

Randall Davis. 1994. Diagnostic reasoning based on structure and behavior. Artificial Intelligence Journal, 24(1):347–410.

Simon de Givry & Gérald Verfaillie. 1997. Optimum anytime bounding for constraint optimization problems. In Proceedings of the AAAI-97 Workshop on Building Resource-Bounded Reasoning Systems, pages 37–42, Providence, RI, USA.

Thomas Dean & Mark Boddy. 1988. An analysis of time-dependent planning. In Proceedings of the 7th National Conference on Artificial Intelligence (AAAI-88), volume 1, pages 49–54, Saint Paul, MN, USA.

Rina Dechter & Avi Dechter. 1988. Belief maintenance in dynamic constraint networks. In Proceedings of the 7th National Conference on Artificial Intelligence (AAAI-88), volume 1, pages 37–42, Saint Paul, MN, USA.

Steven J. DeRose. 1988. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14(1):31–39.

Wolfgang Domschke, Robert Klein & Armin Scholl. 1996. Taktische Tabus. c’t Magazin für Computertechnik, 12(1):326.

Toby Donaldson & Robin Cohen. 1997. Computer-generated dialogue as a resource-bounded activity. In Proceedings of the AAAI-97 Workshop on Building Resource-Bounded Reasoning Systems, pages 101–103, Providence, RI, USA.

Denys Duchier. 1999a. Axiomatizing dependency parsing using set constraints. In Proceedings of the 6th Meeting on Mathematics of Language, pages 115–126, Orlando, FL, USA.

Denys Duchier. 1999b. Set constraints in computational linguistics – solving tree descriptions. In Proceedings of the Workshop on Declarative Programming with Sets (DPS-99).

Denys Duchier. 2001. Lexicalized syntax and topology for non-projective dependency grammar. In Joint Conference on Formal Grammars and Mathematics of Language (FGMOL-01), Helsinki, Finland.

Denys Duchier & Ralph Debusmann. 2001. Topological dependency trees: A constraint-based account of linear order. In Proceedings of the 39th Meeting of the Association for Computational Linguistics (ACL-2001), pages 180–187, Toulouse, France.

Denys Duchier, Claire Gardent & Joachim Niehren, 1999. Concurrent Constraint Programming in Oz for Natural Language Processing. Computer Science Department, University of Saarbrücken, Saarbrücken, Germany.

Denys Duchier & Joachim Niehren. 2000. Dominance constraints with set operators. In Proceedings of the First International Conference on Computational Logic (CL-2000), LNCS. Springer.

Denys Duchier & Stefan Thater. 1999. Parsing with tree descriptions: a constraint-based approach. In Proceedings of the 6th International Workshop on Natural Language Understanding and Logic Programming (NLULP-99), pages 17–32, Las Cruces, NM, USA.

Jay Earley. 1970. An efficient context-free parsing algorithm. Communications of the ACM, 13:94–102.

Ute Ehrlich & Gerhard Hanrieder. 1996. Robust speech parsing. In John Caroll, ed., Proceed- ings of the Workshop on Robust Parsing at the Eighth Summer School in Logic, Language and Information (ESSLLI-96), pages 26–34.

Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th International Conference on Computational Linguistics (COLING- 96), pages 340–345, Copenhagen, Denmark.

Bernd Eppinger & Eberhard Herter. 1993. Sprachverarbeitung. München, Wien: Carl Hanser Verlag.

Gregor Erbach. 1993. Towards a theory of degrees of grammaticality. Bericht 34, Computerlin- guistik, Universität Saarbrücken.

Gisbert Fanselow & Christiane Féry, forthcominga. “Grammatik mit Widersprüchen”. In Fanselow & Féry [forthcomingb].

Gisbert Fanselow & Christiane Féry, ed.. forthcomingb. Sonderheft Kognitionswissenschaft: Konfligierende Regeln: Wege zu formalen Modellen der Kognition.

Hélène Fargier & Jérôme Lang. 1993. Uncertainty in constraint satisfaction problems: a probabilistic approach. In Proceedings of the European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty (ECSQARU-93), pages 97–104.

Stefano Federici, Simonetta Montemagni & Vito Pirrelli. 1996. Shallow parsing and text chunking: A view on underspecification in syntax. In John Carroll, ed., Proceedings of the Workshop on Robust Parsing at the Eighth Summer School in Logic, Language and Information (ESSLLI-96), pages 35–44.

Dietmar Fünning. 2001. Backmarking – Ein Suchverfahren für Constraint Satisfaction Probleme. Studienarbeit, Fachbereich Informatik, Universität Hamburg, Germany.

Jerry A. Fodor, 1987. “Modules, Frames, Fridgeons, Sleeping Dogs, and the Music of the Spheres”. pages 25–36. In Garfield [1987].

Jerry A. Fodor & Jerrold J. Katz, ed.. 1964. The Structure of Language: Readings in the Philosophy of Language. Englewood Cliffs, NJ, USA: Prentice-Hall.

Sandiway Fong. 2000. The PAPPI system: Lexical semantics and morpho-syntax. In Proceedings of the Conference of the Association for Computational Linguistics (ACL-2000), Hong Kong, China. Demonstration Abstract.

Kenneth I. Forster, 1987. “Binding, Plausibility, and Modularity”. pages 63–82. In Garfield [1987].

Kilian Foth. 1998a. CDG Hacker’s Guide. Memo HH-14/98, Projekt DAWAI, Fachbereich Informatik, Universität Hamburg.

Kilian Foth. 1998b. Disjunktive Lexikoninformation im eliminativen Parsing. Studienarbeit, Fachbereich Informatik, Universität Hamburg.

Kilian Foth. 1999. Transformationsbasiertes Constraint-Parsing. Diplomarbeit, Fachbereich Informatik, Universität Hamburg.

Kilian Foth, Stefan Hamerich, Ingo Schröder & Michael Schulz. 1999. [X]CDG Benutzerhandbuch – Version 1.1. Projekt DAWAI, Fachbereich Informatik, Universität Hamburg.

Kilian Foth, Wolfgang Menzel, Horia F. Pop & Ingo Schröder. 2000. An experiment on incremental analysis using robust parsing techniques. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), pages 1026–1030, Saarbrücken, Germany.

Kilian Foth, Wolfgang Menzel & Ingo Schröder. 2000. A transformation-based parsing technique with anytime property. In Proceedings of the Sixth International Workshop on Parsing Technologies (IWPT-2000), pages 89–100, Trento, Italy.

Kilian Foth, Wolfgang Menzel & Ingo Schröder. submitted. Robust parsing with weighted constraints. Natural Language Engineering: Special Issue on Robust Methods in Analysis of Natural Language Data.

Kilian Foth, Ingo Schröder & Michael Schulz. 1998. [X]CDG Benutzerhandbuch. Memo HH-13/98, Projekt DAWAI, Fachbereich Informatik, Universität Hamburg.

Lyn Frazier, 1987. “Theories of Sentence Processing”. pages 291–307. In Garfield [1987].

Eugene C. Freuder. 1982. A sufficient condition for backtrack-free search. Journal of the ACM, 29(1):24–32.

Eugene C. Freuder. 1985. A sufficient condition for backtrack-bounded search. Journal of the ACM, 32(4):755–761.

Eugene C. Freuder, 1988. “Backtrack-free and Backtrack-bounded Search”. Chapter 10, pages 343–369. In Kanal & Kumar [1988].

Eugene C. Freuder & Alan K. Mackworth, ed.. 1994. Constraint-based Reasoning. Cambridge, MA, USA: MIT Press.

Eugene C. Freuder & Richard J. Wallace. 1992. Partial constraint satisfaction. Artificial Intelligence Journal, 58:21–70. Reprinted [Freuder & Mackworth 1994].

H. Gaifman. 1965. Dependency systems and phrase-structure systems. Information & Control, 8:304–337.

Jean H. Gallier. 1987. Logic for computer science. New York: John Wiley & Sons, Inc.

Jay L. Garfield, ed.. 1987. Modularity in Knowledge Representation and Natural-Language Understanding. Cambridge, MA: MIT Press.

Gerald Gazdar & Chris Mellish. 1989. Natural Language Processing in Prolog. Wokingham et al.: Addison-Wesley.

Kim Gerdes & Sylvain Kahane. 2001. A formal dependency grammar using a topological hierarchy. In Proceedings of the 39th Meeting of the Association for Computational Linguistics (ACL-2001), pages 220–227, Toulouse, France.

Fred Glover & Manuel Laguna. 1997. Tabu Search. Boston: Kluwer Academic Publishers.

Fred Glover & Manuel Laguna. 2000. Tabu search. In P. M. Pardalos & M. G. C. Resende, ed., Handbook of Applied Optimization. Oxford Academic Press, Chapter X.

Sebastian Goeser. 1992. Chart parsing of robust grammars. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 120–126, Nantes, France.

Carla P. Gomes, Bart Selman & Henry Kautz. 1998. Boosting combinatorial search through randomization. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-98), Madison, WI, USA.

Jens Gottlieb. 2001. Evolutionäre Algorithmen für Constraint-Optimierungsprobleme. Künstliche Intelligenz, 1:69–71.

Günther Grewendorf, Fritz Hamm & Wolfgang Sternefeld. 1987. Sprachliches Wissen. Eine Einführung in die Theorie der grammatischen Beschreibung. Suhrkamp Taschenbuch Wissenschaft, volume 695. Frankfurt/Main: Suhrkamp.

Günther Görz & Gerhard Hanrieder. 1995. Robust parsing of spoken dialogue using contextual knowledge and recognition probabilities. ESCA Tutorial and Research Workshop on Spoken Dialogue Systems – Theories and Applications.

Günther Görz & Markus Kessler. 1994a. Anytime algorithm for speech parsing? In Proceedings of the 15th Conference on Computational Linguistics (COLING-94), Kyoto, Japan.

Günther Görz & Markus Kessler. 1994b. An anytime protocol for speech parsing. In Proceedings of the FORWISS/CRIM Workshop: Progress and Prospects of Speech Research and Technology, volume 1, pages 257–261, St. Augustin, Germany. infix.

Liliane Haegeman. 1991. Introduction to Government and Binding Theory. Cambridge, MA, USA: Blackwell, Inc.

Jochen Hagenström. 2001. Kompilation von Constraint-Grammatiken. Studienarbeit, Fachbereich Informatik, Universität Hamburg, Germany.

Jochen Hagenström. forthcoming. Integration und Evaluation eines POS-Taggers im CDG Parsing-System. Diplomarbeit, Fachbereich Informatik, Universität Hamburg, Germany.

R. M. Haralick & G. L. Elliott. 1980. Increasing tree search efficiency for constraint satisfaction problems. Artificial Intelligence Journal, 14:263–313.

Karin Harbusch. 1997. The relation between tree-adjoining grammars and constraint dependency grammars. In Proceedings of Fifth Meeting on the Mathematics of Language (MOL5), pages 38–45, Schloss Dagstuhl, Wadern near Saarbrücken, Germany. DFKI Document No. 97-02.

Mary P. Harper & Randall A. Helzerman. 1994. Managing multiple knowledge sources in constraint-based parsing of spoken language. Technical Report EE 94-16, School of Electrical Engineering, Purdue University, West Lafayette, IN, USA.

Mary P. Harper & Randall A. Helzerman. 1995. Extensions to constraint dependency parsing for spoken language processing. Computer Speech and Language, 9(3):187–234.

Mary P. Harper, Randall A. Helzerman, C. B. Zoltowski, B. L. Yeo, Y. Chan, T. Steward & B. L. Pellom. 1995. Implementation issues in the development of the PARSEC parser. Software – Practice and Experience, 25(8):831–862.

Mary P. Harper, S. A. Hockema & C. M. White. 1999. Enhanced constraint dependency grammar parsers. In Proceedings of the IASTED International Conference on Artificial Intelligence and Soft Computing, Honolulu, HI, USA.

Mary P. Harper, L. H. Jamieson, C. D. Mitchell, G. Ying, S. Potisuk, P. N. Srinivasan, R. Chen, C. B. Zoltowski, L. L. McPheters, B. Pellom & R. A. Helzerman. 1994. Integrating language models with speech recognition. In Proceedings of the AAAI-94 Workshop on the Integration of Natural Language and Speech Processing, pages 139–146.

Mary P. Harper, Leah H. Jamieson, Carla B. Zoltowski & Randall A. Helzerman. 1993. Semantics and constraint parsing of word graphs. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 63–66, Minneapolis, MN, USA.

Mary P. Harper, M. T. Johnson, L. H. Jamieson, S. A. Hockema & C. M. White. 1999a. Interfacing a CDG parser with an HMM word recognizer using word graphs. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix.

Mary P. Harper, C. M. White, R. A. Helzerman & S. A. Hockema. 1999b. Faster MUSE CSP arc consistency algorithms. In Proceedings of the IASTED International Conference on Artificial Intelligence and Soft Computing, Honolulu, HI, USA.

Mary P. Harper, C. M. White, Wen Wang, Michael T. Johnson & Randall A. Helzerman. 2000. The effectiveness of corpus-induced dependency grammars for post-processing speech. In Proceedings of the First Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2000), pages 102–109, Seattle, WA, USA.

Andreas Hauenstein & Hans Weber. 1994. An investigation of tightly coupled time synchronous speech language interfaces using a unification grammar. In Proceedings of the 12th National Conference on Artificial Intelligence: Workshop on the Integration of Natural Language and Speech Processing, pages 42–49, Seattle, Washington.

John A. Hawkins. 1994. A Performance Theory of Order and Constituency. Cambridge Studies in Linguistics. Cambridge: Cambridge University Press.

Peter A. Heeman, 2001. “Improving Robustness by Modeling Spontaneous Speech Events”. Chapter 5, pages 123–152. In Junqua & van Noord [2001].

Johannes Heinecke, Jürgen Kunze, Wolfgang Menzel & Ingo Schröder. 1998. Eliminative parsing with graded constraints. In Proceedings of the Joint Conference COLING/ACL-98, pages 526–530, Montréal, Canada.

Johannes Heinecke & Andreas Nolda. 1998. Dependenzbasiertes Parsing des Deutschen. Abdeckungsbereich und Analyse. Memo B-1/98, Projekt DAWAI, Humboldt-Universität zu Berlin.

Johannes Heinecke & Ingo Schröder. 1998. Multilevel representation of the robust analysis of language. In Bernhard Schröder, Winfried Lenders, Wolfgang Hess & Thomas Portele, ed., Proceedings of the 4th Conference on Natural Language Processing (KONVENS-98), Bonn, Germany.

Peter Hellwig. 1988. Chart parsing according to the slot and filler principle. In Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), pages 242–244, Budapest, Hungary.

Randall A. Helzerman & Mary P. Harper. 1992. Log time parsing on the MasPar MP-1. In Proceedings of the 21st International Conference on Parallel Processing, volume 2, pages 209–217, St. Charles, IL.

Randall A. Helzerman & Mary P. Harper. 1996. MUSE CSP: An extension to the constraint satisfaction problem. Journal of Artificial Intelligence Research, 5:239–288.

Alain Hertz, Eric Taillard & Dominique de Werra, 1997. “Tabu search”. Chapter 4, pages 121–136. In Aarts & Lenstra [1997b].

Joachim Hertzberg. 1995. KI-Lexikon: Anytime-Algorithmen. Künstliche Intelligenz, 1:40.

Michael Hess. 1996. Computerlinguistik und Logikprogrammierung. Künstliche Intelligenz, 3:40–45.

Harald Hesse & Andreas Küstner. 1985. Syntax der koordinativen Verknüpfung. Studia Grammatica, number XXIV. Berlin: Akademie Verlag.

Donald Hindle & Mats Rooth. 1993. Structural ambiguity and lexical relations. Computational Linguistics, 19:103–120.

John H. Holland. 1975. Adaptation in natural and artificial systems. The University of Michigan Press.

John E. Hopcroft & Jeffrey D. Ullman. 1979. Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.

Eric Horvitz. 1987. Reasoning about beliefs and actions under computational resource constraints. In Proceedings of AAAI-87 Workshop on Uncertainty in Artificial Intelligence.

Juraj Hromkovič. 1998. Algorithmics for Hard Problems. Texts in Theoretical Computer Science. Springer.

Richard Hudson. 2000. Discontinuity. International Journal Traitement automatique des langues: Les grammaires de dépendance, 41(1):15–56.

Richard A. Hudson. 1990. English Word Grammar. Oxford: Blackwell.

Willem Olaf Huijsen. 1993. Genetic grammatical inference. In Proceedings of the fourth CLIN meeting, Groningen, The Netherlands.

Anthony Jameson, Ralph Schäfer, Thomas Weis, André Berthold & Thomas Weyrath. 1999. Making systems sensitive to the user’s time and working memory constraints. In M. T. Maybury, ed., Proceedings of the International Conference on Intelligent User Interfaces (IUI-99), New York, NY, USA. ACM.

Fred Jelinek. 1990. Self-organizing language models for speech recognition. In Alex Waibel & Kai-Fu Lee, ed., Readings in Speech Recognition. Morgan Kaufmann, San Mateo, CA, pages 450–506.

Michael T. Johnson. 2000. Incorporating Prosodic Information and Language Structure into Speech Recognition Systems. Ph.D. thesis, School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA.

Aravind Joshi, Leon Levy & M. Takahashi. 1975. Tree adjunct grammars. Journal of Computer and System Sciences, 10.

Aravind Joshi & Bangalore Srinivas. 1994. Disambiguation of super parts of speech (or supertags): Almost parsing. In Proceedings of the International Conference on Computational Linguistics (COLING-94), Kyoto, Japan.

Aravind Joshi, K. Vijay-Shanker & D. Weir. 1991. The convergence of mildly context-sensitive grammar formalisms. In Peter Sells, S. M. Shieber & T. Wasow, ed., Foundational Issues in Natural Language Processing. Bradford, Cambridge, MA, USA.

Aravind K. Joshi & Y. Schabes, 1997. “Tree-Adjoining Grammars”. In G. Rozenberg & A. Salomaa, ed., Handbook of Formal Languages. Berlin, Germany: Springer.

Timo Järvinen & Pasi Tapanainen. 1997. A dependency parser for English. Technical Report TR-1, Department of General Linguistics, University of Helsinki, Finland.

Jean-Claude Junqua & Gertjan van Noord, ed.. 2001. Robustness in Language and Speech Technology. Text, Speech and Language Technology, volume 17. Dordrecht/Boston/London: Kluwer Academic Publishers.

Daniel Jurafsky, Chuck Wooters, Gary Tajchman, Jonathan Segal, Andreas Stolcke, Eric Fosler & Nelson Morgan. 1995. Using a stochastic context-free grammar as a language model for speech recognition. In Proceedings of 20th International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), pages 189–192, Detroit, MI, USA.

Sylvain Kahane, forthcoming. “The Meaning-Text Theory”. In Vilmos Ágel, Ludwig M. Eichinger, Hans Werner Eroms, Peter Hellwig, Hans Jürgen Heringer & Henning Lobin, ed., Dependency and Valency: A Handbook of Contemporary Research. Berlin, Germany: De Gruyter.

Sylvain Kahane, Alexis Nasr & Owen Rambow. 1998. Pseudo-projectivity: A polynomially parsable non-projective dependency grammar. In Proceedings of the Joint Conference COLING/ACL-98, pages 646–652, Montréal, Canada.

Hans Kamp & Uwe Reyle. 1993. From Discourse to Logic: Introduction to Model-theoretic Semantics of Natural Language. Kluwer Academic Publishers.

Laveen Kanal & Vipin Kumar, ed.. 1988. Search in Artificial Intelligence. New York, NY, USA: Springer-Verlag.

Fred Karlsson. 1990. Constraint grammar as a framework for parsing running text. In Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), pages 168–173, Helsinki, Finland.

Fred Karlsson, Atro Voutilainen, Juha Heikkilä & Arto Anttila, ed.. 1995. Constraint Grammar – A Language-Independent System for Parsing Unrestricted Text. Berlin, Germany: Mouton de Gruyter.

L. Karttunen, Tamás Gaál & André Kempe. 1997. Xerox finite-state tool.

Lauri Karttunen. 1993. Finite-state lexicon compiler. Technical Report ISTL-NLTT-1993-04-02, Xerox Palo Alto Research Center, Palo Alto, CA.

Lauri Karttunen & Kenneth R. Beesley. 1992. Two-level rule compiler. Technical Report ISTL-92-02, Xerox Palo Alto Research Center, Palo Alto, CA.

Lauri Karttunen, J.-P. Chanod, G. Grefenstette & A. Schiller. 1997. Regular expressions for language engineering. Natural Language Engineering, 2(4):1–24.

Martin Kay. 1984. Functional unification grammar: A formalism for machine translation. In Proceedings of the 22nd Annual Meeting of the ACL, pages 75–78, Stanford University, CA, USA.

Martin Kay. 1986. Algorithm schemata and data structures in syntactic processing. In Barbara J. Grosz, Karen Sparck Jones & Bonnie L. Webber, ed., Readings in Natural Language Processing. Morgan Kaufmann, Los Altos, CA, USA, pages 35–70.

Bill Keller & Rüdi Lutz. 1997. Learning stochastic context-free grammars from corpora using a genetic algorithm. Cognitive Science Research Paper 446, School of Cognitive and Computing Sciences, University of Sussex.

Frank Keller. 2000a. Evaluating competition-based models of word order. In Lila R. Gleitman & Aravind K. Joshi, ed., Proceedings of the 22nd Annual Conference of the Cognitive Science Society, pages 747–752, Mahwah, NJ, USA. Lawrence Erlbaum.

Frank Keller. 2000b. Gradience in Grammar. Ph.D. thesis, University of Edinburgh.

Frank Keller, 2001. “Experimental Evidence for Constraint Competition in Gapping Constructions”. In Gereon Müller & Wolfgang Sternefeld, ed., Competition in Syntax, pages 211–248. Berlin, Germany: Mouton de Gruyter.

Frank Keller & Ash Asudeh. 2002. Probabilistic learning algorithms and Optimality Theory. Linguistic Inquiry, 33(2).

S. Kirkpatrick, C. D. Gelatt & M. P. Vecchi. 1983. Optimization by simulated annealing. Science, 220:671–680.

Karl-Heinz Kiyek & Friedrich Schwarz. 1989. Mathematik für Informatiker 1. Stuttgart: B. G. Teubner.

Judith Klavans & Philip Resnik, ed.. 1996. The Balancing Act: Combining Symbolic and Statistical Approaches to Language. Cambridge, MA, USA: MIT Press.

Kimmo Koskenniemi. 1983. Two-level Morphology: A General Computational Model for Word-form Recognition and Production. Ph.D. thesis, Department of General Linguistics, University of Helsinki, Helsinki, Finland.

Kimmo Koskenniemi. 1990. Finite-state parsing and disambiguation. In Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), pages 229–232, Helsinki, Finland.

Kimmo Koskenniemi, Pasi Tapanainen & Atro Voutilainen. 1992. Compiling and using finite-state syntactic rules. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 156–162, Nantes, France.

Brigitte Krenn & Christer Samuelsson. 1996. The linguist’s guide to statistics. ftp://coli.uni-sb.de/pub/coli/doc/stat/stat_cl.ps.gz.

Vipin Kumar. 1992. Algorithms for constraint satisfaction problems: A survey. AI Magazine, 13(1):32–44.

Jürgen Kunze. 1972. Die Auslaßbarkeit von Satzteilen bei koordinativen Verbindungen im Deutschen. Schriften zur Phonetik, Sprachwissenschaft und Kommunikationsforschung, number 14. Berlin, Germany: Akademie Verlag.

Jürgen Kunze. 1975. Abhängigkeitsgrammatik. Studia Grammatica, number XII. Berlin, Germany: Akademie Verlag.

John Lafferty, Daniel Sleator & Davy Temperley. 1992. Grammatical trigrams: A probabilistic model of link grammar. In Proceedings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language. Also available as technical report CMU-CS-92-181.

Bernhard Lahres. 1997. A semantic level of description for constraint dependency grammars. Diplomarbeit, Fachbereich Informatik, Universität Hamburg, Germany.

Marc M. Lankhorst. 1993. A genetic algorithm for the induction of context-free grammars. In Proceedings of the fourth CLIN meeting, Groningen, The Netherlands.

Vincenzo Lombardo. 1992. Incremental dependency parsing. In Proceedings of the Annual Meeting of the ACL, Newark, Delaware, USA.

Alan K. Mackworth. 1977. Consistency in networks of relations. Artificial Intelligence Journal, 8:99–118.

Alan K. Mackworth & Eugene C. Freuder. 1985. The complexity of some polynomial network consistency algorithms for constraint satisfaction problems. Artificial Intelligence Journal, 25:65–74.

David M. Magerman. 1994. Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Stanford University.

David M. Magerman. 1995. Statistical decision-tree models for parsing. In Proceedings of the 33rd Meeting of the Association for Computational Linguistics (ACL-95).

M. Marcus, B. Santorini & M. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2).

Mitchell P. Marcus. 1987. Deterministic parsing and description theory. In Peter Whitelock, Mary McGee Wood, Harold Somers, Rod Johnson & Paul Bennett, ed., Linguistic Theory and Computer Applications. Academic Press, London, UK, pages 69–112.

William Marslen-Wilson & Lorraine Komisarjevsky Tyler, 1987. “Against Modularity”. pages 37–62. In Garfield [1987].

Hiroshi Maruyama. 1990a. Constraint dependency grammar. Technical Report RT0044, IBM Research, Tokyo Research Laboratory.

Hiroshi Maruyama. 1990b. Structural disambiguation with constraint propagation. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics (ACL-90), pages 31–38, Pittsburgh, PA.

Hiroshi Maruyama, Hideo Watanabe & Shiho Ogino. 1990. An interactive Japanese parser for machine translation. In Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), pages 257–262, Helsinki.

Beáta Megyesi. 2001. Comparing data-driven learning algorithms for PoS tagging of Swedish. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2001), pages 151–158, Carnegie Mellon University, Pittsburgh, PA, USA.

Chris S. Mellish. 1989. Some chart-based techniques for parsing ill-formed input. In Proceedings of the 27th Annual Meeting of the ACL, pages 102–109, Vancouver, Canada.

Igor A. Mel’čuk. 1988. Dependency Syntax: Theory and Practice. SUNY Series in Linguistics. New York: State University of New York Press.

Wolfgang Menzel. 1988. Error diagnosing and selection in a training system for second language learning. In Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), pages 414–419, Budapest.

Wolfgang Menzel. 1990. Anticipation-free diagnosing of structural faults. In Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), pages 422–424, Helsinki.

Wolfgang Menzel. 1994. Parsing of spoken language under time constraints. In A. Cohn, ed., Proceedings of the 11th European Conference on Artificial Intelligence (ECAI-94), pages 560–564, Amsterdam.

Wolfgang Menzel. 1995. Robust processing of natural language. In Proceedings of the 19th German Annual Conference on Artificial Intelligence (KI-95), pages 19–34, Berlin.

Wolfgang Menzel. 2000. Theory and applications in computational linguistics – Is there common ground? In Proceedings of the Symposium “Computerlinguistik: Divergenz oder Synergie”, Heidelberg, Germany.

Wolfgang Menzel, Daniel Herron, Patrizia Bonaventura & Rachel Morton. 2000. Automatic detection and correction of non-native English pronunciation. In Proceedings of the Workshop on Integrating Speech Technology in the (Language) Learning and Assistive Interface, pages 49–56, Dundee, UK.

Wolfgang Menzel, Daniel Herron, Rachel Morton, Dario Pezzotta, Patrizia Bonaventura & Peter Howarth. 2001. Interactive pronunciation training. ReCALL, 13(1):67–78.

Wolfgang Menzel & Ingo Schröder. 1998a. Constraint-based diagnosis for intelligent language tutoring systems. In José Cuena, ed., Proceedings of the IT&KNOWS Conference at the IFIP ’98 Congress, Wien/Budapest.

Wolfgang Menzel & Ingo Schröder. 1998b. Constraint-based diagnosis for intelligent language tutoring systems. Bericht FBI-HH-B-208/98, Fachbereich Informatik, Universität Hamburg.

Wolfgang Menzel & Ingo Schröder. 1998c. Decision procedures for dependency parsing using graded constraints. In Sylvain Kahane & Alain Polguère, ed., Proceedings of the Joint Conference COLING/ACL-98 Workshop: Processing of Dependency-based Grammars, pages 78–87, Montréal, Canada.

Wolfgang Menzel & Ingo Schröder. 1998d. Error diagnosis for language learning systems. In Proceedings of the 2nd International Conference on Natural Language Processing and Industrial Applications (NLPIA-98), pages 45–51, Moncton, Canada.

Wolfgang Menzel & Ingo Schröder. 1998e. Error diagnosis in a language tutoring system. In Proc. of “NLP in CALL”, UMIST Manchester.

Wolfgang Menzel & Ingo Schröder. 1998f. Model-based diagnosis under structural uncertainty. In Henri Prade, ed., Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98), pages 284–288, Brighton, UK.

Wolfgang Menzel & Ingo Schröder. 1999. Error diagnosis in a language tutoring system. Special ReCALL publication: Language Processing in CALL, pages 20–30.

Pedro Meseguer. 1989. Constraint satisfaction problems: An overview. AI Communications, 2(1):3–17.

Zbigniew Michalewicz. 1996. Genetic Algorithms + Data Structures = Evolution Programs. Berlin: Springer Verlag. Third, Revised and Extended Version.

Ian Miguel & Qiang Shen. 2001a. Solution techniques for constraint satisfaction problems: Advanced approaches. Artificial Intelligence Review, 15:269–293.

Ian Miguel & Qiang Shen. 2001b. Solution techniques for constraint satisfaction problems: Foundations. Artificial Intelligence Review, 15:243–267.

Steven Minton, Mark D. Johnson, Andrew B. Philips & Philip Laird. 1992. Minimizing conflicts: A heuristic repair method for constraint-satisfaction and scheduling problems. Artificial Intelligence Journal, 58:161–205. Reprinted [Freuder & Mackworth 1994].

Leonid Mitjushin. 1992. High-probability syntactic links. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 930–934, Nantes, France.

Stefan Müller. 1999. Deutsche Syntax deklarativ. Head-Driven Phrase Structure Grammar für das Deutsche. Linguistische Arbeiten, number 394. Tübingen, Germany: Max Niemeyer Verlag.

Roger Mohr & Thomas C. Henderson. 1986. Arc and path consistency revisited. Artificial Intelligence Journal, 28:225–233.

Bernard A. Nadel, 1988. “Tree Search and Arc Consistency in Constraint-Satisfaction Problems”. Chapter 9, pages 287–342. In Kanal & Kumar [1988].

Peter Neuhaus & Norbert Bröker. 1997. The complexity of recognition of linguistically adequate dependency grammars. In Proceedings of the 35th Annual Meeting of the ACL and 8th Conference of the EACL, pages 337–343, Madrid, Spain.

Grace Ngai & Radu Florian. 2001. Transformation based learning in the fast lane. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-2001).

H. Niemann, E. Nöth, A. Batliner, J. Buckow, F. Gallwitz, R. Huber & V. Warnke. 1998. Using prosodic cues in spoken dialog systems. In Y. Kosarev, ed., Proceedings of the International Workshop on Speech and Computer, pages 17–28, St. Petersburg, Russia.

Andreas Nolda & Johannes Heinecke. 1999. Hierarchien, Lexikoneinträge, Analyse-Ebenen, Constraints. Memo B-1/99, Projekt DAWAI, Humboldt-Universität zu Berlin.

Elmar Nöth & B. Plannerer. 1993. Schnittstellendefinition für den Worthypothesengraphen. Verbmobil Memo 2, Universität Erlangen-Nürnberg.

Bernard Nudel. 1983. Consistent-labeling problems and their algorithms: Expected complexities and theory-based heuristics. Artificial Intelligence Journal, 21:135–178.

Douglas O’Shaughnessy. 1990. Speech Communications. Reading, MA: Addison-Wesley.

Fernando Pereira. 2000. Formal grammar and information theory: Together again. Philosophical Transactions of the Royal Society, pages 1239–1253.

Fernando C. N. Pereira & David H. D. Warren. 1980. Definite clause grammars for language analysis – a survey of the formalism and a comparison with augmented transition networks. Artificial Intelligence Journal, 13:231–278.

Oliver Plaehn. 1999. Probabilistic parsing with discontinuous phrase structure grammar. Diplomarbeit, Computerlinguistik, Universität Saarbrücken.

Oliver Plaehn. 2000. Computing the most probable parse for a discontinuous phrase structure grammar. In Proceedings of the 6th International Workshop on Parsing Technologies (IWPT-2000), Trento, Italy.

Carl Pollard & Ivan Sag. 1987. Information-Based Syntax and Semantics, Volume 1: Fundamentals. Stanford University, CA, USA: Centre for the Study of Language and Information.

Carl Pollard & Ivan Sag. 1994. Head-driven Phrase Structure Grammar. Center for the Study of Language and Information Science. Chicago, IL, USA: The University of Chicago Press.

Alan Prince & Paul Smolensky. 1993. Optimality theory: Constraint interaction in generative grammar. Technical Report RuCCS #2, Rutgers University, Center for Cognitive Science, Piscataway, NJ.

Y. Qu, C. P. Rosé & B. Di Eugenio. 1996. Using discourse predictions for ambiguity resolution. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, Denmark.

M. O. Rabin. 1980. Probabilistic algorithm for primality testing. Journal of Number Theory, 12:128–138.

Lawrence R. Rabiner. 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel & Kai-Fu Lee, ed., Readings in Speech Recognition. Morgan Kaufmann, San Mateo, CA, pages 267–290. See also Errata http://www.media.mit.edu/ rahimi/rabiner/rabiner-errata/.

Owen Rambow & Aravind K. Joshi. 1994. A formal look at dependency grammars and phrase-structure grammars, with special consideration of word-order phenomena. In Leo Wanner, ed., Current Issues in Meaning-Text Theory. Pinter, London, UK.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the First Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania, PA.

Adwait Ratnaparkhi. 1997a. A linear observed time statistical parser based on maximum entropy models. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Brown University, Providence, USA.

Adwait Ratnaparkhi. 1997b. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science.

Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.

J. A. Robinson. 1965. A machine-oriented logic based on the resolution principle. Journal of the ACM, 12:23–41.

Carolyn Penstein Rosé & Alon Lavie. 1997. An efficient distribution of labor in a two stage robust interpretation process. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Brown University, Providence, USA. cmp-lg/9706021.

Carolyn Penstein Rosé & Alon Lavie, 2001. “Balancing Robustness and Efficiency in Unification-augmented Context-Free Parsers for Large Practical Applications”. Chapter 10, pages 239–269. In Junqua & van Noord [2001].

Carolyn Penstein Rosé & Alex Waibel, 1996. “Recovering From Parser Failures: A Hybrid Statistical/Symbolic Approach”. Chapter 8. In Klavans & Resnik [1996].

Stuart J. Russell & Eric Wefald. 1991. Do the right thing: studies in limited rationality. Cambridge, MA, USA: The MIT Press.

Stuart J. Russell & Shlomo Zilberstein. 1991. Composing real-time systems. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-91), pages 212–217, Sydney, Australia.

Christer Samuelsson. 1997a. Extending n-gram tagging to word graphs. In Ruslan Mitkov, Nicolas Nicolov & Nikolai Nikolov, ed., Proceedings of the Second Conference on Recent Advances in Natural Language Processing (RANLP-97), pages 21–26, Tzigov Chark, Bulgaria.

Christer Samuelsson. 1997b. A left-to-right tagger for word graphs. In Proceedings of the 5th International Workshop on Parsing Technologies (IWPT-97), Boston, MA, USA.

Ralph Schäfer & Thomas Weyrath. 1996. Einschätzung von verfügbarer Arbeitsgedächtniskapazität mit temporalen Bayesschen Netzen. In Proceedings of ABIS-96: Workshop on Adaptivity and User Modeling in Interactive Software Systems, Dortmund.

Thomas Schiex. 1992. Possibilistic constraint satisfaction problems or “how to handle soft constraints?”. In Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence, pages 269–275, Stanford, CA, USA.

Thomas Schiex, Hélène Fargier & Gérald Verfaillie. 1995. Valued constraint satisfaction problems: Hard and easy problems. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), Montréal, Canada.

Anne Schiller, Simone Teufel, Christine Stöckert & Christine Thielen. 1995. Vorläufige Guidelines für das Tagging deutscher Textcorpora mit STTS. Draft of 14th November 1995.

Matthias Schlesewsky. forthcoming. Konfliktresolution beim Sprachverstehen. In Fanselow & Féry [forthcomingb].

Helmut Schmid. 1994a. Part-of-speech tagging with neural networks. In Proceedings of the International Conference on Computational Linguistics (COLING-94), pages 172–176, Kyoto, Japan.

Helmut Schmid. 1994b. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Wolf Schneider. 1999. Sprachkritik. ZEIT, 44.

Uwe Schöning. 1992a. Logik für Informatiker. 3. revised edition. Mannheim: BI Wissenschaftsverlag.

Uwe Schöning. 1992b. Theoretische Informatik kurz gefaßt. Mannheim: BI Wissenschaftsverlag.

Ingo Schröder. 1995. Analyse natürlicher Sprache durch Beschränkungserfüllung. Studienarbeit, Fachbereich Informatik, Universität Hamburg.

Ingo Schröder. 1996. Integration statistischer Methoden in eliminative Verfahren zur Analyse von natürlicher Sprache. Diplomarbeit, Fachbereich Informatik, Universität Hamburg.

Ingo Schröder. 1997a. Benutzerhandbuch des CDG-Parsers 0.8. Memo HH-4/97, Projekt DAWAI, Fachbereich Informatik, Universität Hamburg.

Ingo Schröder. 1997b. Ergebnisse statistischen CDG-Parsings. Memo HH-6/97, Projekt DAWAI, Fachbereich Informatik, Universität Hamburg.

Ingo Schröder. 1997c. Pruning-Strategien. Memo HH-8/97, Projekt DAWAI, Fachbereich Informatik, Universität Hamburg.

Ingo Schröder. 1997d. Syntax der Eingaben in den CDG-Parser 0.8. Memo HH-3/97, Projekt DAWAI, Fachbereich Informatik, Universität Hamburg.

Ingo Schröder. 1997e. Unsicherheit vs. keine negative Evidenz. Memo HH-5/97, Projekt DAWAI, Fachbereich Informatik, Universität Hamburg.

Ingo Schröder, Wolfgang Menzel, Kilian Foth & Michael Schulz. 2000. Modeling dependency grammar with restricted constraints. International Journal Traitement automatique des langues: Les grammaires de dépendance, 41(1):113–144.

Ingo Schröder, Horia F. Pop, Wolfgang Menzel & Kilian A. Foth. 2001a. Learning grammar weights using genetic algorithms. In Proceedings of the Third Conference on Recent Advances in Natural Language Processing (RANLP-2001), Tzigov Chark, Bulgaria.

Ingo Schröder, Horia F. Pop, Wolfgang Menzel & Kilian A. Foth. 2001b. Learning weights for a natural language grammar using genetic algorithms. In Proceedings of the International Conference on Evolutionary Methods for Design, Optimisation and Control with Applications to Industrial Problems (EUROGEN-2001), Athens, Greece.

Ernst Günther Schukat-Talamazzini. 1995. Automatische Spracherkennung. Braunschweig/Wiesbaden: Verlag Vieweg.

Michael Schulz. 1999. Inkrementelle Analyse natürlicher Sprache mit Constraints. Studienarbeit, Fachbereich Informatik, Universität Hamburg.

Michael Schulz. 2000. Parsen natürlicher Sprache mit gesteuerter lokaler Suche. Diplomarbeit, Fachbereich Informatik, Universität Hamburg.

Michael Schulz & Wolfgang Menzel. submitted. Parsing natural language using guided local search. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI-2002), Lyon, France.

Camilla Schwind. 1988. Sensitive parsing: Error analysis and explanation in an intelligent language tutoring system. In Proceedings of the 12th International Conference on Computational Linguistics, pages 608–613, Budapest.

Camilla Schwind. 1990. An intelligent language tutoring system. International Journal of Man-Machine Studies, 33:557–579.

Camilla Schwind. 1995. Error analysis and explanation in knowledge based language tutoring. Computer Assisted Language Learning, 8(4):295–324.

John Self. 1992. Cognitive diagnosis for tutoring systems. In Bernd Neumann, ed., Proceedings of the 10th European Conference on Artificial Intelligence (ECAI-92), pages 699–703, Vienna, Austria. Wiley.

Peter Sells. 1987. Lectures on Contemporary Syntactic Theories. CSLI Lecture Notes Number 3.

Bart Selman, 1999. “Greedy Local Search”. In Wilson & Keil [1999].

Stuart M. Shieber. 1984. The design of a computer language for linguistic information. In Proceedings of the 10th International Conference on Computational Linguistics (COLING-84), Stanford, CA, USA.

Stuart M. Shieber. 1986. An Introduction to Unification-Based Approaches to Grammar. CSLI Lecture Notes Number 4.

Stuart M. Shieber. 1992. Constraint-Based Grammar Formalisms. Cambridge, MA, USA: Bradford.

Achim Sixtus & Stefan Ortmanns. 1999. High quality word graphs using forward-backward pruning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-99), Phoenix, AZ, USA.

Wojciech Skut, Thorsten Brants, Brigitte Krenn & Hans Uszkoreit. 1997a. Annotating unrestricted German text. In Proceedings of 6. Fachtagung der Sektion Computerlinguistik der Deutschen Gesellschaft für Sprachwissenschaft, Heidelberg, Germany.

Wojciech Skut, Brigitte Krenn, Thorsten Brants & Hans Uszkoreit. 1997b. An annotation scheme for free word order languages. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), Washington, D. C., USA.

Daniel D. Sleator & Davy Temperley. 1993. Parsing English with a link grammar. In Proceedings of the 3rd International Workshop on Parsing Technologies.

Daniel D. K. Sleator & Davy Temperley. 1991. Parsing English with a link grammar. Technical Report CMU-CS-91-196, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Gert Smolka. 1995. The Oz programming model. In Jan van Leeuwen, ed., Computer Science Today, volume 1000 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, pages 324–343.

Wolfgang Sternefeld. 1998. Suboptimale syntaktische Strukturen im Deutschen. Project Proposal, SFB 441.

Niels Stockfleth. 2000. Integration von CDG-Parsing und Fehlersimulation. Diplomarbeit, Fachbereich Informatik, Universität Hamburg.

Peter Struss. 1992. Knowledge-based diagnosis: An important challenge and touchstone for AI. In Bernd Neumann, ed., Proceedings of the 10th European Conference on Artificial Intelligence, pages 863–874, Vienna, Austria.

Thomas Stützle & Holger H. Hoos. 2001. Ameisenalgorithmen zur Lösung kombinatorischer Optimierungsprobleme. Künstliche Intelligenz, 1:45–51.

ANS T1A1. 2000. American national standard for telecommunications - telecom glossary 2000. T1A1 Technical Subcommittee on Performance and Signal Processing.

Pasi Tapanainen & Timo Järvinen. 1997. A non-projective dependency parser. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP-97), Washington, DC, USA.

Bruce Tesar. 1995. Computational Optimality Theory. Ph.D. thesis, Department of Computer Science, University of Colorado.

Bruce Tesar & Paul Smolensky. 1998. The learnability of optimality theory. Linguistic Inquiry, 29(2):229–268.

Lucien Tesnière. 1959. Eléments de syntaxe structurale. Paris: Klincksieck.

M. Thiel, 1987. “Weighted Parsing”. In Leonard Bolc, ed., Natural Language Parsing Systems, pages 137–159. Heidelberg, Germany: Springer.

John C. Trueswell, Michael K. Tanenhaus & Susan M. Garnsey. 1994. Semantic influences on parsing: Use of thematic role information in syntactic ambiguity resolution. Journal of Memory and Language, 33:285–318.

Edward Tsang. 1993. Foundations of Constraint Satisfaction. London: Academic Press.

Hans Uszkoreit. 1987. Word Order and Constituent Structure in German. CSLI Lecture Notes Number 8.

Hans Uszkoreit. 1991. Strategies for adding control information to declarative grammars. Research Report RR-91-29, DFKI GmbH.

Hans Uszkoreit, Thorsten Brants, Denys Duchier, Brigitte Krenn, Lars Konieczny, Stephan Oepen & Wojciech Skut. 1998a. Studien zur performanzorientierten Linguistik. Aspekte der Relativsatzextraposition im Deutschen. CLAUS Report 99, Universität des Saarlandes.

Hans Uszkoreit, Thorsten Brants, Denys Duchier, Brigitte Krenn, Lars Konieczny, Stephan Oepen & Wojciech Skut. 1998b. Studien zur performanzorientierten Linguistik. Aspekte der Relativsatzextraposition im Deutschen. Kognitionswissenschaft, 7(3).

Gertjan van Noord, 2001. “Robust Parsing of Word Graphs”. Chapter 9, pages 205–238. In Junqua & van Noord [2001].

Gérard Verfaillie & Thomas Schiex. 1994a. Dynamic backtracking for dynamic constraint satisfaction problems. In Proceedings of the ECAI-94 Workshop on Constraint Satisfaction Issues Raised by Practical Applications, Amsterdam, The Netherlands.

Gérard Verfaillie & Thomas Schiex. 1994b. Solution reuse in dynamic constraint satisfaction problems. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), pages 307–312, Seattle, WA, USA.

Stefan Voß, Silvano Martello, Ibrahim H. Osman & Catherine Roucairol. 1999. Meta-Heuristics: Advances and Trends in Local Search Paradigms for Optimization. Boston, MA, USA: Kluwer Academic Publishers.

Chris Voudouris. 1998. Guided local search - an illustrative example in function optimisation. BT Technology Journal, 16(3):46–50.

Christos Voudouris & Edward Tsang. 1995. Partial constraint satisfaction problems and guided local search. Technical Report CSM-250, Department of Computer Science, University of Essex, Colchester, UK.

A. Voutilainen. 1995. A syntax-based part of speech analyser. In Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL-95), pages 157–164.

Wolfgang Wahlster. 1992. Computational models of face-to-face dialogs: Multimodality, negotiation and translation. Invited lecture at COLING-92; not included in the proceedings.

Wolfgang Wahlster. 1993. Verbmobil: Translation of face-to-face dialogs. In Proceedings of the 3rd European Conference on Speech Communication and Technology, pages 29–38, Berlin, Germany.

Wolfgang Wahlster, ed.. 2000. Verbmobil: Foundations of Speech-to-Speech Translation. Berlin, Heidelberg: Springer-Verlag.

Wolfgang Wahlster, Anselm Blocher, Jörg Baus, Eva Stopp & Harry Speiser. 1998. Ressourcenadaptierende Objektlokalisation: Sprachliche Raumbeschreibung unter Zeitdruck. Bericht 141, Universität des Saarlandes, KI-Labor am Lehrstuhl für Informatik.

Wolfgang Wahlster, Anthony Jameson, Alassane Ndiaye, Ralph Schäfer & Thomas Weis. 1995. Ressourcenadaptive Dialogführung: ein interdisziplinärer Forschungsansatz. Künstliche Intelligenz, 6:17–21.

Wolfgang Wahlster & Werner Tack. 1998. SFB 378: Ressourcenadaptive kognitive Prozesse. Bericht 143, Universität des Saarlandes, KI-Labor am Lehrstuhl für Informatik.

Alex Waibel, Hagen Soltau, Tanja Schultz, Thomas Schaaf & Florian Metze, 2000. “Multilingual Speech Recognition”. pages 33–45. In Wahlster [2000].

Richard J. Wallace. 1996. Analysis of heuristic methods for partial constraint satisfaction problems. In Eugene C. Freuder, ed., Principles and Practice of Constraint Programming – CP ’96, Lecture Notes in Computer Science 1118. Springer, Berlin.

Richard J. Wallace & Eugene C. Freuder. 1995a. Anytime algorithms for constraint satisfaction and SAT problems. In Proceedings of the IJCAI Workshop on Anytime Algorithm and Deliberation Scheduling, Montréal, Quebec.

Richard J. Wallace & Eugene C. Freuder. 1995b. Heuristic methods for over-constrained constraint satisfaction problems. In Proceedings of the CP 1995 Workshop on Over-Constrained Systems.

David Waltz. 1975. Understanding line drawings of scenes with shadows. In P. Winston, ed., The Psychology of Computer Vision. McGraw-Hill, New York, pages 19–91.

Heinz Josef Weber. 1992. Dependenzgrammatik: Ein Arbeitsbuch. Tübingen, Germany: Gunter Narr Verlag.

Thomas Weis. 1997. Resource-adaptive action planning in a dialogue system for repair support. In Proceedings of KI-97: Deutsche Jahrestagung für Künstliche Intelligenz.

Ralph M. Weischedel, Wilfried M. Voge & Mark James. 1978. An artificial intelligence approach to language instruction. Artificial Intelligence Journal, 10:225–240.

Fuliang L. Weng. 1993. Handling syntactic extra-grammaticality. In Proceedings of the 3rd International Workshop on Parsing Technologies (IWPT-93), Tilburg, The Netherlands.

Thomas Weyrath. 1997. Einschätzung von Zeitdruck und Arbeitsgedächtnisbelastung in ressourcenbeschränkten Dialogen: Eine explorative empirische Untersuchung. In Proceedings der KogWis-97: 3. Fachtagung der Gesellschaft für Kognitionswissenschaft, Friedrich-Schiller-Universität Jena.

Christopher M. White. 1995. Converting context-free grammars to constraint dependency grammars. Master Thesis, Purdue University.

Christopher M. White. 2000. Rapid Grammar Development and Parsing: Constraint Dependency Grammar with Abstract Role Values. Ph.D. thesis, School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA.

Christopher M. White, Mary P. Harper, T. Lane & R. A. Helzerman. 1997. Inductive learning of abstract role values derived from a constraint dependency grammar. In Proceedings of the Automata Induction, Grammatical Inference, and Language Acquisition Workshop at the Fourteenth International Conference on Machine Learning.

Jan Wielemaker, 2000. SWI-Prolog 3.3.4 Reference Manual. Department of Social Informatics (SWI), Roeterstraat 15, 1018 WB Amsterdam, The Netherlands.

Thomas Williams & Colin Kelley. 1998. gnuplot – an interactive plotting program.

Robert A. Wilson & Frank Keil. 1999. The MIT Encyclopedia of Cognitive Science. Cambridge, MA, USA: Bradford.

Mats Wirén. 1992. Studies in Incremental Natural-Language Analysis. Ph.D. thesis, Department of Computer and Information Science, Linköping University, Linköping, Sweden.

Niklaus Wirth. 1986. Compilerbau. Stuttgart, Germany: Teubner.

Mihalis Yannakakis, 1997. “Computational complexity”. Chapter 2, pages 19–55. In Aarts & Lenstra [1997b].

Masoud Yazdani. 1986. Intelligent tutoring systems: An overview. Expert Systems, 3(3):154–162.

Steve Young, Joop Jansen, Julian Odell, Dave Ollason & Phil Woodland, ed.. 1997. The HTK Book. Cambridge, UK: Entropic.

Jakub Zavrel & Walter Daelemans. 1999. Recent advances in memory-based part-of-speech tagging. In Actas del VI Simposio Internacional de Comunicacion Social, Santiago de Cuba, Cuba.

Michael Zock. 1996. Computational linguistics and its use in real world: The case of computer assisted-language learning. In Proceedings of the 16th International Conference on Computational Linguistics, Panel on Computer Assisted-Language Learning, pages 1002–1004, Copenhagen, Denmark.

Carla B. Zoltowski, Mary P. Harper, Leah H. Jamieson & Randall A. Helzerman. 1992. PARSEC: A constraint-based framework for spoken language understanding. In Proceedings of the International Conference on Spoken Language Processing, pages 249–252, University of Alberta.