Incremental Coreference Resolution for German

Thesis presented to the Faculty of Arts and Social Sciences of the University of Zurich for the degree of Doctor of Philosophy

by

Don Tuggener

Accepted in the spring semester 2016 on the recommendation of the doctoral committee:

Prof. Dr. Martin Volk (main advisor)

PD Dr. Gerold Schneider

Zurich, 2016

“Choose a job you love, and you will never have to work a day in your life.”

Confucius

Abstract

The main contributions of this thesis are as follows:

1. We introduce a general model for coreference and explore its application to German.

• The model features an incremental processing algorithm which allows it to coherently address issues caused by the underspecification of mentions, an especially pressing problem for certain German pronouns.
• We introduce novel features relevant for the resolution of German pronouns. A subset of these features is made accessible through the incremental architecture of the discourse processing model.
• In evaluation, we show that the coreference model combined with our features provides new state-of-the-art results for coreference and pronoun resolution for German.

2. We elaborate on the evaluation of coreference and pronoun resolution.

• We discuss evaluation from the view of prospective downstream applications that benefit from coreference resolution as a preprocessing component. Addressing the shortcomings of the general evaluation framework in this regard, we introduce an alternative framework, the Application Related Coreference Scores (ARCS).
• The ARCS framework enables a thorough comparison of different system outputs and the quantification of their similarities and differences beyond the common coreference evaluation. We demonstrate how the framework is applied to state-of-the-art coreference systems. This provides a method to track specific differences in system outputs, which assists researchers in comparing their approaches to related work in detail.

3. We explore semantics for pronoun resolution.

• Within the introduced coreference model, we explore distributional approaches to estimate the compatibility of an antecedent candidate and the occurrence context of a pronoun. We compare a state-of-the-art approach for word embeddings to syntactic co-occurrence profiles to this end.
• In comparison to related work, we extend the notion of context and thereby increase the applicability of our approach. We find that a combination of both compatibility models, coupled with the coreference model, provides a large potential for improving pronoun resolution performance.

We make available all our resources, including a web demo of the system, at: http://pub.cl.uzh.ch/purl/coreference-resolution

Zusammenfassung

The main contributions of the present work are the following:

1. The thesis introduces a general model for coreference resolution and explores its application to the German language.

• The model features an incremental discourse processing algorithm which makes it possible to handle the underspecification of entity mentions in a coherent way, which is a particular problem when processing German pronouns.
• A set of novel features for the resolution of German pronouns is introduced. A subset of these features becomes accessible through the incremental architecture of the discourse processing algorithm.
• The evaluation shows that the coreference model, coupled with the new features, achieves new state-of-the-art results for coreference resolution for German.

2. The thesis addresses the evaluation of coreference and pronoun resolution.

• The common evaluation is examined from the perspective of applications that benefit from coreference resolution as a preprocessing step. An alternative evaluation approach (ARCS) is proposed which addresses deficits of the common evaluation.
• The proposed evaluation approach enables an in-depth comparison of systems and quantifies commonalities and differences between system outputs in a way that goes deeper than the common evaluation. This approach allows researchers to compare their systems with others in detail and from different perspectives.

3. The thesis investigates the incorporation of semantics into the resolution of pronouns.

• Within the introduced coreference model, distributional semantics is explored as an approach to determining the compatibility of an antecedent candidate and the context of a pronoun. To this end, a state-of-the-art approach for computing vector representations of words is compared to syntactic co-occurrence profiles of words.
• Compared to related work, the definition of the pronoun context is extended, thereby increasing the applicability of the approach. The evaluation shows that the combination of both compatibility models, coupled with the introduced coreference model, offers a large potential for improving pronoun resolution results.

The resources created in the course of this work, including a web demo of the system, are available at: http://pub.cl.uzh.ch/purl/coreference-resolution

Acknowledgements

Being able to do whatever you want must be considered a privilege. At the Institute of Computational Linguistics of the University of Zurich, I felt that I was granted this privilege while I investigated the subject of this dissertation. For entrusting me with this privilege, I am entirely grateful to Dr. Manfred Klenner, the main supervisor of this thesis, and Prof. Dr. Martin Volk, the head of the institute. I am indebted to Manfred Klenner for supporting my research from the get-go, which started with an innocent little PROLOG programming project for pronoun resolution and has led to the completion of this thesis. His input and feedback were invaluable to this thesis. Martin Volk, by seeing and connecting the dots at the right moment, provided the financial (but also moral) support for this thesis. His motion to support collaborations of students and senior members of the institute made this thesis possible.

Being able to do whatever you want also has a sort of negative connotation, and perhaps rightfully so. This thesis was mainly written detached from any major project. This comes with the advantages of freedom from project deadlines and milestones, and independence from other project members. It also comes with the disadvantages of freedom from project deadlines and milestones, and independence from other project members: it requires a considerable amount of self-discipline and self-motivation. I am grateful to my family, especially my wife Tamara and our two PhD production babies, Siro and Mona, for keeping me motivated and determined. With that in mind, I dedicate this thesis to them.

I would also like to thank everyone at the Institute of Computational Linguistics at the University of Zurich for the companionship while researching and writing this thesis. A special thanks goes to Dr. Simon Clematide, who introduced me to wapiti, and to Johannes Graën, who kept the institute servers running.

Finally, I thank PD Dr. Gerold Schneider and Prof. Dr. Manfred Stede who agreed to review this thesis and who accepted my invitation into the PhD committee.


Contents

Abstract iii

Zusammenfassung iv

Acknowledgements v

Contents vi

List of Figures xi

List of Tables xiii

1 Introduction 1
  1.1 Problem description 1
  1.2 Research questions and hypotheses 2
  1.3 Thesis outline 5
  1.4 Coreference and anaphoricity 6
    1.4.1 A brief linguistic introduction 6
      1.4.1.1 Re-occurrence of words 6
      1.4.1.2 Word substitution 7
      1.4.1.3 Pronominalization 8
    1.4.2 Significance for subsequent applications in CL and NLP 9
  1.5 Architecture of automated coreference resolution systems 10
    1.5.1 Gold standard annotation 11
    1.5.2 Preprocessing 12
    1.5.3 Markable extraction 13
    1.5.4 Markable resolution 15
  1.6 Chapter summary 16

2 Discourse processing models for coreference resolution 17
  2.1 Mention-pair model 17
    2.1.1 Issues of the mention-pair model 19
      2.1.1.1 Underspecification of antecedent candidates 19
      2.1.1.2 Redundant instances and skewed training sets 20
  2.2 Overcoming the limitations of the mention-pair model 21
    2.2.1 Mention ranking model 22
    2.2.2 Mention clustering models 23
    2.2.3 Entity-mention models 25
  2.3 Our incremental entity-mention model for German coreference resolution 29
  2.4 Chapter summary 35

3 Coreference resolution evaluation 37
  3.1 Issues in evaluation from the perspective of higher-level applications 37
  3.2 The ARCS evaluation framework for coreference resolution 39
    3.2.1 Quantification of the difference between system responses 42
    3.2.2 Assessment of feature potential 48
    3.2.3 Comparison of multiple system responses 48
  3.3 Error analysis in coreference resolution 50
  3.4 Evaluation of pronoun resolution 52
    3.4.1 Ratio-based evaluation 52
    3.4.2 Introduction of Recall and Precision 53
    3.4.3 ARCS extensions for pronoun resolution evaluation 56
  3.5 Chapter summary 57

4 Related work on coreference and pronoun resolution for German 59
  4.1 Hartrumpf (2001)'s CORUDIS 59
  4.2 Strube et al. (2002)'s adaption of the popular approach by Soon et al. (2001) 60
  4.3 Schiehlen (2004)'s exploration of algorithms and features for German pronoun resolution 63
  4.4 RAP-G and the TiMBL variant by Hinrichs et al. (2005) 64
  4.5 Klenner and Ailloud (2009)'s model of enforcing global coreference constraints using ILP 67
  4.6 Broscheit et al. (2010)'s adaption of BART to German 68
  4.7 The SemEval 2010 shared task (Recasens et al., 2010) 69
  4.8 Versley (2010)'s investigation of semantics for noun and name coreference 69
  4.9 The incremental entity-mention model by Klenner and Tuggener (2010) 70
  4.10 Rösiger and Riester (2015)'s adaption of HOTCoref to German 71
  4.11 Chapter summary 71

5 Empirical validation of our entity-mention model 73
  5.1 Data and preprocessing: The TüBa-D/Z corpus 74
    5.1.1 Other German corpora featuring coreference annotation 75
    5.1.2 Markable extraction 76
      5.1.2.1 Part-of-speech-based identification of markables 76
      5.1.2.2 Identification of markable boundaries 78
  5.2 Model implementations 80
    5.2.1 Outline of the algorithms 80
    5.2.2 Morphological agreement and distance constraints 81
  5.3 Pronoun Resolution 83
    5.3.1 Estimation of the difficulty of the antecedent selection 84
    5.3.2 A flexible and simple scheme for feature weighting 89
    5.3.3 Feature set for pronoun resolution for German 93
      5.3.3.1 Distance-based features 95
      5.3.3.2 Feature conjunctions derived from the entity grid representation of discourse 96
      5.3.3.3 Approximation of syntactic embedding 100
      5.3.3.4 Semantic and morphosyntactic features of the antecedent 100
      5.3.3.5 Features made available by the entity-mention model 102
  5.4 Evaluation of pronoun resolution performance 103
    5.4.1 A remark on statistical significance testing 103
    5.4.2 Classifier performance 105
    5.4.3 System response evaluation 110
      5.4.3.1 Functional evaluation 111
      5.4.3.2 Anchor mention evaluation 117
      5.4.3.3 Standard evaluation 119
    5.4.4 Performance impact of real preprocessing components 120
    5.4.5 Comparison of the evaluation results to related work 122
  5.5 Error analysis 125
    5.5.1 Lemma-based performance comparison 126
    5.5.2 Error examples 127
    5.5.3 Candidate ranking performance 131
  5.6 Chapter summary 132

6 Semantics for pronoun resolution 133
  6.1 Pronominalization as a discourse phenomenon 134
    6.1.1 Selectional preferences of verbs for pronoun resolution 135
    6.1.2 Issues of selectional preferences and potential solutions 136
  6.2 A graph-based representation of verb-argument tuples 137
    6.2.1 Verb selectional preferences as first order co-occurrences 138
    6.2.2 Addressing sparsity 141
      6.2.2.1 Estimating compatibility through distributional siblings 142
      6.2.2.2 Similarity to nbest arguments 145
      6.2.2.3 Compatibility of verbs 145
    6.2.3 Compatibility with additional verb arguments 146
    6.2.4 Ranking antecedent candidates based on the compatibility 147
  6.3 Word embeddings as a compatibility framework 148
    6.3.1 word2vec as a framework to derive word embeddings 148
    6.3.2 Application of word embeddings to pronoun resolution 149
  6.4 Data and preprocessing 151
    6.4.1 Construction of the co-occurrence graph 151
    6.4.2 word2vec model 152
  6.5 Evaluation of the models on a word similarity task 153
  6.6 Related work 155
    6.6.1 Selectional preferences for pronoun resolution 155
    6.6.2 Graphs in Computational Linguistics 158
  6.7 The distributional compatibility models as postfilters for nbest candidate re-ranking 159
    6.7.1 Learning when to apply the postfilter 161
  6.8 Chapter summary 163

7 Conclusions and future work 165

A Implementation details 169
  A.1 Resolution of first person pronouns to third person antecedents 169
  A.2 Resolution of nominal markables 170
    A.2.1 Constraint-based resolution of nominal markables 171
    A.2.2 Name markables 171
    A.2.3 Noun markables 172
    A.2.4 Evaluation 174
  A.3 Character ngram-based splitting of sparse compound nouns 175

Bibliography 181

List of Figures

2.1 Example of redundant pairs in the mention-pair model for pronouns 21
2.2 Modelling coreference as a graph 23
2.3 Example Bell tree for three markables 26
2.4 Differences in pair generation mechanics in the mention-pair and entity-mention model 32

5.1 Distribution of 3rd person pronouns (excluding es) in the TüBa-D/Z version 9.1 75
5.2 Average number of antecedent candidates per pronoun type in the mention-pair model 85
5.3 Average number of antecedent candidates per pronoun type in the incremental entity-mention model 85
5.4 Differences in the pair generation mechanics between the mention-pair and entity-mention model 87
5.5 Differences in pronoun resolution between the mention-pair (M-P) and entity-mention (E-M) model given the most recent candidate baseline 108
5.6 Differences in pronoun resolution between the mention-pair and entity-mention model given the most recent subject-bearing candidate baseline 109
5.7 Example responses of the models 130
5.8 Rank index frequency of the correct antecedent for third person pronouns where the entity-mention model ranks an incorrect candidate highest 131

6.1 Excerpt of the co-occurrence graph, showing subject relations for Hund and bellen 141
6.2 Excerpt of the co-occurrence graph, showing second-order co-occurrence of Hund and Fuchs in subject position 143

List of Tables

1.1 Excerpt from the TüBa-D/Z in a CoNLL-style format 12

2.1 Mention-pair algorithms for creating training instances and for resolving markables 18
2.2 Example pairs formed by the mention-pair and entity-mention models 35

3.1 F-scores of coreference systems on the CoNLL 2012 test set 45
3.2 Percentages of all mentions classified differently 45
3.3 Percentages of nominal mentions classified differently 45
3.4 Percentages of pronouns classified differently 45
3.5 Mention classification transitions from Stanford → HOTCoref 46
3.6 Mention classification transitions from Berkeley → HOTCoref 46
3.7 Pronoun resolution performance of state-of-the-art systems for English 47
3.8 Examples of easy and hard to resolve mentions identified by the comparison of multiple system responses 49
3.9 Pairs of pronouns and antecedents and their classification by Stuckardt (2001) 55

4.1 Feature set of Strube et al. (2002)'s approach to German coreference resolution in the HTC corpus 61
4.2 Type-specific evaluation of Strube et al. (2002)'s approach to German coreference resolution in the HTC corpus 63
4.3 Features and their weights in RAP-G, Hinrichs et al. (2005)'s adaption of Lappin and Leass (1994)'s approach to English third person pronoun resolution 65

5.1 PoS tags of head tokens of NPs and pronouns considered for coreference resolution in our system 77
5.2 Example of markable instances represented as feature vectors 79
5.3 Mention-pair vs. entity-mention algorithms used in our experiments 81
5.4 Accessibility of the correct antecedent in the mention-pair and entity-mention models 87
5.5 Feature sets used for pronoun resolution 94
5.6 Weights for sentence distances between antecedent and personal pronouns w.r.t. the presence of a discourse connector 95
5.7 Comparison of the candidate index and markable distance feature weights 96
5.8 Entity grid representation of entity occurrences in discourse 97
5.9 Top weighted instantiations of the entity grid-based feature conjunctions 98
5.10 Top weighted instantiations of the entity grid-based feature conjunctions for possessive pronouns 99
5.11 Feature weights for animacy 101
5.12 Weights for named entity classes of antecedent candidates 101
5.13 Top and lowest five weights for prepositions of antecedent candidates for personal pronouns (left) and possessive pronouns (right) 102
5.14 ARCS success rate of different antecedent selection methods for pronouns given the mention-pair and entity-mention architectures on the test set 107
5.15 Example of feature vector instance used by the MaxEnt classifier and instance of vector sequence used for the CRF 112
5.16 Functional evaluation of third-person pronoun resolution with the ARCS inferred antecedent metric 115
5.17 Significance test on the performance measure differences w.r.t. the functional evaluation 115
5.18 Evaluation of third person pronoun performance for linking to anchor mentions of person entities 118
5.19 Performance in the common evaluation framework on the test set 119
5.20 Functional evaluation of pronoun resolution performance using real preprocessing components 121
5.21 Approximation of pair-wise evaluation of third-person pronoun resolution with the ARCS any antecedent metric 124
5.22 Lemma-based performance analysis of third person pronouns based on the ARCS inferred antecedents metric 126
5.23 Gold and predicted coreference chains for an example segment 129

6.1 Pearson correlation of similarity/relatedness estimations by our models and human judgements 154
6.2 Evaluation of the graph and word2vec models as postfilters 160
6.3 Upper bounds in functional and pair-wise evaluation for incorporating the compatibility models into personal pronoun resolution 161

A.1 Evaluation of nominal markable resolution 175
A.2 Examples of character ngrams extracted from the word “Dosenöffner” using our approach 176
A.3 Likely and unlikely character ngrams at word beginnings (prefix), within words (infix), and word endings (suffix) 176
A.4 Compound splitting accuracy of different systems 178

Chapter 1

Introduction

This chapter introduces the coreference phenomenon and discusses our motivation and goals for this thesis. We present a linguistic account of coreference and discuss how this account is transferred into gold standard annotations for research in Computational Linguistics. We outline the common tasks that coreference resolution systems have to perform in a shared task setting, which is the standard platform for research in coreference resolution in Computational Linguistics.

1.1 Problem description

Coreference is a natural language phenomenon that occurs when different, potentially varying linguistic forms are used in a discourse to refer to the same extra-linguistic entity. For example, a discourse might introduce the person Barack Obama by name, i.e. “Barack Obama”, but for the next mention it may use a nominal description, like “the first black president in the history of the U.S.”, and subsequently simply a pronoun, i.e. “he”. That is, entities are generally introduced into discourse by a nominal description that includes a common noun or a name. But when they occur subsequently, the linguistic form that is used to mention them is often changed.

This poses a problem for many applications in Computational Linguistics (CL) and Nat- ural Language Processing (NLP). Applications that identify entity occurrences purely based on their nominal descriptions will miss occurrences that deviate from this initial form. Coreference resolution aims at identifying and mapping these different linguistic forms to unique identifiers so that the occurrences of each entity in a discourse can be tracked.


Coreference resolution is a well-established research area in Computational Linguistics. Since the Message Understanding Conferences (MUC) in the 1990s, several shared tasks have tackled the problem, and to date, all major CL conferences have featured publications on the topic. While most work in the field revolves around the English language, two shared tasks in recent years have also included other languages, such as Spanish, Arabic, Chinese, Dutch, Catalan, Italian, and German. However, these shared tasks aimed at finding coreference models that perform well on all the languages, with minimal effort put into adapting the models to the specifics of each language.

By contrast, this thesis focuses on coreference resolution for the German language and pays attention to its specifics w.r.t. pronouns. Compared to e.g. English, two important aspects in this regard are i) certain German pronouns are underspecified regarding their morphological properties, and ii) certain German pronouns can be used to refer to both animate and inanimate entities. This thesis proposes an approach that addresses the issues that arise from these differences.

1.2 Research questions and hypotheses

We outline three hypotheses to be investigated throughout this thesis.

Underspecification of German pronouns. In the German language, certain pronouns are morphologically underspecified when viewed in isolation. Their morphological properties only become evident when used in discourse, i.e. when their referents are identified. This arguably renders the task of automated resolution of German pronouns more difficult than in English, since more candidates have to be considered as potential referents. Given the morphological underspecification of these pronouns, filtering based on morphological agreement licenses a larger number of potential referents. In turn, a larger number of potential referents increases the chance of picking an incorrect one.

Furthermore, if the morphological properties of resolved (and thereby disambiguated) pronouns are not kept track of, subsequent coreference decisions involving these pronouns are bound to yield conflicting interpretations of their morphological properties. These conflicting interpretations, in turn, yield coreference chains with inconsistent morphological features.

For example, the German possessive pronoun sein is used to refer to entities with either neuter gender (corresponding to the English pronoun its) or masculine gender (corresponding to his). A coreference model that processes pairs of antecedents and pronouns in isolation might resolve an instance of sein to an entity with neuter gender, e.g. Berlin, as in the sentence “Berlin feiert sein Jubiläum” (Berlin celebrates its anniversary). The model might also resolve a subsequent masculine personal pronoun, i.e. er (he), to the possessive pronoun sein, because the morphological properties of sein are underspecified when viewed in isolation, and the two pronouns are compatible in this view. When these two independently formed pairs of antecedent and anaphor are merged to form a coreference chain, the chain features incoherent morphological properties, i.e. {Berlin - sein - er}.

Related work on German pronoun and coreference resolution (almost) exclusively features such a pair-wise model. This thesis thus sets out to devise a coreference model for German which remedies these conflicting interpretations of underspecified pronouns. Our main hypothesis states that propagating coreference decisions, once taken, to subsequent coreference decisions prevents such inconsistencies from arising within coreference chains. Furthermore, we hypothesise that ensuring consistency in coreference chains improves coreference resolution performance, and pronoun resolution for German in particular.
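To make the problem concrete, the following minimal Python sketch (not the thesis implementation; the feature values and helper names are ours) contrasts the two behaviours: merging independently classified pairs keeps the morphologically inconsistent chain {Berlin - sein - er}, whereas an incremental check against the gender the chain has already been narrowed to rejects er.

```python
# Gender values each form is compatible with when viewed in isolation
# (illustrative values only).
COMPATIBLE_GENDERS = {
    "Berlin": {"neut"},
    "sein":   {"masc", "neut"},   # underspecified possessive pronoun
    "er":     {"masc"},           # masculine personal pronoun
}

def pairwise_chain(pairs):
    """Merge independently formed antecedent-anaphor pairs into one chain
    without propagating any disambiguation (mention-pair style)."""
    chain = []
    for antecedent, anaphor in pairs:
        for mention in (antecedent, anaphor):
            if mention not in chain:
                chain.append(mention)
    return chain

def incremental_admit(chain_genders, candidate):
    """Entity-mention style check: only admit a mention whose possible
    genders intersect with what the chain has already been narrowed to."""
    possible = chain_genders & COMPATIBLE_GENDERS[candidate]
    return possible, bool(possible)

# Pairwise: (Berlin, sein) and (sein, er) are both locally licensed ...
print(pairwise_chain([("Berlin", "sein"), ("sein", "er")]))
# ['Berlin', 'sein', 'er']  -> morphologically inconsistent chain

# Incremental: resolving 'sein' to 'Berlin' fixes the entity's gender to
# neuter, so the masculine 'er' is rejected for this chain.
genders = COMPATIBLE_GENDERS["Berlin"]
genders, admitted = incremental_admit(genders, "sein")
print(genders, admitted)              # {'neut'} True
print(incremental_admit(genders, "er"))   # (set(), False)
```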

Evaluation of coreference and pronoun resolution. The commonly used framework for coreference evaluation models the linguistic mentions of entities as generic items. Coreference chains are interpreted as unordered sets of these items. Evaluation then compares the mention clustering in a system output to the clustering in a gold standard.

Since the mentions no longer have any distinguishable linguistic properties and the linear order of their appearance in discourse is lost, the metrics are hard to interpret for downstream applications that benefit from coreference resolution. For example, the metrics cannot answer the question of how well a system resolves pronouns compared to another system, or how well a system links pronouns to their nominal antecedents. Furthermore, the metrics do not explain, in terms that are interpretable for downstream applications, why one system performs better than another system with lower scores.

We hypothesise that retaining the linguistic properties of the mentions and the linear order of their appearance in discourse yields a metric that answers these questions, is interpretable for downstream applications, and can be adjusted to the different requirements that downstream applications have towards coreference resolution. We propose such a metric and show how it is applied to answer the questions above.
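As a simple illustration of the kind of question such a metric should answer (this is not the ARCS definition, which is developed in chapter 3; the data and names below are invented), one can check for each pronoun whether the antecedent a system selects is gold-coreferent with it:

```python
def pronoun_resolution_accuracy(gold_chain_of, system_antecedent_of, pronouns):
    """For each pronoun the system resolved, check whether the selected
    antecedent belongs to the pronoun's gold coreference chain."""
    resolved = [p for p in system_antecedent_of if p in pronouns]
    correct = sum(
        1 for p in resolved
        if gold_chain_of.get(system_antecedent_of[p]) is not None
        and gold_chain_of.get(system_antecedent_of[p]) == gold_chain_of.get(p)
    )
    return correct / len(resolved) if resolved else 0.0

gold = {"Obama": 1, "the president": 1, "he": 1, "Berlin": 2, "it": 2}
links = {"he": "the president", "it": "Obama"}     # system-selected antecedents
print(pronoun_resolution_accuracy(gold, links, {"he", "it"}))   # 0.5
```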

Semantics for pronoun resolution. Most approaches to pronoun resolution rely on features that encode the salience of entities to determine an antecedent for a pronoun. While these approaches are successful, one outstanding question is whether semantics can improve them. Related work so far has produced mixed answers.

In related work, incorporating semantics into pronoun resolution has typically meant considering the selectional preferences of the verb governing a pronoun w.r.t. the antecedent candidates. The underlying rationale is that an antecedent has to be compatible with the selectional preference of the verb governing the pronoun. However, the mixed results in related work suggest that selectional preferences of verbs alone are a poor means of incorporating semantics into pronoun resolution.

We hypothesise that the utility of selectional preferences can be improved if the syntactic co-arguments of the pronouns are taken into account when modeling pronoun contexts. We argue that not all verbs select their arguments narrowly enough to derive a clear preference towards a specific antecedent candidate. Including the syntactic co-arguments of the pronoun should thus help to narrow down the selectional preference of the pronoun's context.
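The following toy sketch illustrates the intuition (all counts, relations, and helper names are invented; the actual compatibility models are developed in chapter 6): an antecedent candidate is scored by its compatibility with the pronoun's governing verb, and the pronoun's syntactic co-argument can be folded in as an additional constraint on that preference.

```python
from collections import defaultdict

# (verb, role, argument lemma) -> corpus count; all numbers are invented.
counts = defaultdict(int)
counts.update({
    ("bellen", "subj", "Hund"): 120,
    ("bellen", "subj", "Mann"): 3,
    ("beißen", "subj", "Hund"): 40,
    ("beißen", "obja", "Briefträger"): 25,
})

def verb_only_score(verb, role, candidate):
    """Selectional preference of the governing verb alone."""
    total = sum(c for (v, r, _), c in counts.items() if v == verb and r == role)
    return counts[(verb, role, candidate)] / total if total else 0.0

def with_coargument_score(verb, role, candidate, co_role, co_arg):
    """Crude joint compatibility: additionally require that the pronoun's
    co-argument is attested with the same verb."""
    return verb_only_score(verb, role, candidate) if counts[(verb, co_role, co_arg)] else 0.0

# Pronoun context "Er beißt den Briefträger." with candidates Hund and Mann.
for cand in ("Hund", "Mann"):
    print(cand,
          round(verb_only_score("beißen", "subj", cand), 2),
          round(with_coargument_score("beißen", "subj", cand, "obja", "Briefträger"), 2))
```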

1.3 Thesis outline

Chapter 1 gives a brief linguistic introduction to the phenomena of coreference and anaphora and outlines the potential benefit of coreference and pronoun resolution from the view of downstream applications. The common steps in a coreference resolution pipeline and the accompanying nomenclature are introduced.

Chapter 2 introduces the discourse model commonly used in coreference resolution, the mention-pair model. We discuss its weaknesses and survey theoretical and empirical improvements. We devise our incremental entity-mention model that addresses the problem of underspecification of certain German pronouns and exemplify its theoretical advantages over the mention-pair model.

Chapter 3 discusses the evaluation of coreference and pronoun resolution. We point out issues in the commonly used evaluation framework from the perspective of downstream applications. We introduce an alternative evaluation framework that supports the view of potential downstream applications that benefit from coreference resolution as a preprocessing step.

Chapter 4 surveys related work on German coreference and pronoun resolution in different regards.

Chapter 5 empirically validates the theoretical claims about the advantages of the entity-mention model over related work. We also compare different machine learning frameworks that correspond to different antecedent selection strategies.

Chapter 6 explores the utility of distributional models to determine the compatibility of an antecedent candidate and a pronoun's context. We propose a graph-based model of syntactic co-occurrence and compare it to a state-of-the-art approach to word embeddings.

Chapter 7 summarizes our findings regarding our hypotheses and addresses issues to be investigated in future work.

1.4 Coreference and anaphoricity

We introduce coreference and anaphoricity by shedding light on their linguistic background in the following section. We do so with a focus on corpora that feature coreference annotation for research in Computational Linguistics (CL) and Natural Language Processing (NLP), in order to understand and motivate the kind of annotation in these corpora. We then explore coreference from the perspective of CL and NLP applications and discuss how they benefit from coreference resolution as a preprocessing component. Thereafter, we discuss how automatic approaches to coreference resolution tackle the problem by exploring the basic architecture of the common coreference resolution pipeline.

1.4.1 A brief linguistic introduction

Text Linguistics states that texts generally feature the properties of being cohesive and coherent (Halliday and Hasan, 1976, De Beaugrande and Dressler, 1981, inter alia). Cohesion subsumes features of a text's surface which glue together the sequence of utterances in the text. For example, the use of discourse connectors such as because and but signals on the text surface that two utterances are connected logically or rhetorically. Re-occurrence of words is another such cohesive device, because re-occurrence of words implies re-occurrence of the concepts or entities they denote. In turn, the re-occurrence of concepts and entities is an indicator of a text's semantics. If concepts and entities are shared across utterances, it can be argued that the utterances are semantically related, which is an aspect of coherence.

Re-occurrence of concepts and entities manifests on the text surface through linguistic mentions. Coreference signifies that these different linguistic mentions, with potentially varying surface forms, denote the same underlying concepts and entities. Thus, the phenomena of coreference and anaphora are tightly interwoven with cohesion (aspects of a text’s surface) and coherence (aspects of a text’s meaning).

We exemplify the main devices of cohesion and coherence related to coreference and anaphora and discuss how they are incorporated into coreferentially annotated corpora for Computational Linguistic research.

1.4.1.1 Re-occurrence of words

A simple example of re-occurrence is given by Linke et al. (2004), p. 245, translated to English as example 1.

(1) Yesterday, I watched a bird building a nest. The bird was tiny, but it managed to fetch rather large twigs. Of all places, the bird had picked the roller shutter box for its nest.

However, re-occurrence does not necessarily imply coreference, as shown in example 2 (Linke et al., 2004, p. 246).

(2) My mother is terribly anxious and always assumes the worst. Anna’s mother is more endurable: She lets her daughter go out alone in the evening. I’d rather have a mother like that.

In example 2, the word mother occurs three times. However, each occurrence denotes a different extra-linguistic referent. A coreferentially annotated corpus would thus not link the three occurrences of mother in the example above. By contrast, the occurrences of the word bird in example 1 all denote the same extra-linguistic bird and would therefore be annotated as coreferent in a gold standard corpus.

Simple re-occurrence of words to refer to the same extra-linguistic referent is often conceived as stylistically unsatisfactory (Linke et al., 2004), which gives rise to the next cohesive device.

1.4.1.2 Word substitution

Another means to repeatedly refer to discourse entities is word substitution. Substitution occurs when the words used to mention a discourse entity are replaced, compared to an earlier mention. For example, the mention of the person Barack Obama can be realized by a descriptive noun phrase like “the president of the U.S.”. Substitution generally triggers or introduces facets of meaning not yet expressed in the discourse (e.g. that Obama is the president of the U.S.). Substitution generally implies coreference, since the substituting words denote the same underlying entity as the substituted words. Therefore, substitution is generally annotated in coreference corpora.

Semantically, the relation between a substituting and a substituted expression is that of IDENTITY, an is-a relation (as in “Obama” is-a “president of the U.S.”). It is important to distinguish this relation from meronymy (the part/member-of relation). For example, “car” and “driver” can be in a meronymic relation, which is another device of coherence. However, since “car” and “driver” do not denote the same underlying entity, they would not be annotated as being coreferent in a gold standard. The phenomenon of bridging (sometimes referred to as associative anaphora or bridging anaphora), which subsumes meronymy, but excludes coreference, is a separate but closely related research area to coreference resolution. Terminology is often combined to describe the two phenomena. For example, Versley (2010) uses the term coreferent bridging to refer to substitution.

1.4.1.3 Pronominalization

Finally, pronominalization describes the phenomenon of referring to a discourse entity by a pronoun, which can be seen as a special case of substitution. An important difference from re-occurring and substituting expressions is that pronouns cannot be interpreted without an antecedent, i.e. a previous mention of an entity that clearly identifies it, such as a named entity mention (e.g. “Barack Obama”). That is, we can infer the underlying entity based on the expression “the first black president of the U.S.”. However, an isolated occurrence of “he” cannot be understood without any discourse context.

Linguistics makes a distinction between the phenomena of anaphora, which captures the property of an expression to rely on another, previous expression for interpretation, and coreference, which generally denotes that two expressions refer to the same underlying entity. However, the term “anaphoric” is commonly overloaded with the meaning of “coreferent” in the Computational Linguistics literature (Ng, 2010, Björkelund and Kuhn, 2014, inter alia). For example, when a substituting noun phrase (such as “the president”) is called anaphoric, the implied meaning is that there is a previous mention of the entity to the left of the substitution (an antecedent, i.e. “Barack Obama”).

It is noteworthy that not all pronouns are anaphoric. The third person pronoun it can be used in a non-referential (pleonastic) fashion in expressions like “It is raining” or “It is clear that [...]”. In such utterances, it does not refer to any previously mentioned entity. Obviously, such pleonastic uses of it are not annotated in coreference corpora and a challenging task for coreference and pronoun resolution systems is to differentiate pleonastic from anaphoric uses of it.

Additionally, pronouns can be anaphoric but not coreferent. Consider the following example, taken from the TüBa-D/Z coreference annotation manual (Naumann, 2007, p. 6):

(3) Nobody likes to lose their job.

The possessive pronoun “their” in the above example is anaphoric to “Nobody”, which, in turn, does not refer to any specific real-world entity. Therefore, the common definition of coreference does not apply. Corpora pursue different strategies to handle these cases of anaphoric expressions. One of them is to distinguish between anaphoric and bound pronouns, where the latter group captures the cases of anaphoric but non-coreferential pronouns.

Having outlined the basic linguistic background of coreference and anaphora, we point to the thesis of Wunsch (2010), which provides a broad overview of the linguistic literature, mainly focused on the English tradition and the salience-based view on discourse. Furthermore, the thesis of Versley (2010) excellently ties Computational Linguistics research to linguistic theory on coreference, with a focus on noun coreference and its semantic aspects. Complementary to Wunsch (2010) and Versley (2010), we now turn to exemplifying how the linguistic phenomena affect common applications in Computational Linguistics and NLP.

1.4.2 Significance for subsequent applications in CL and NLP

It is widely accepted in Computational Linguistics that coreference resolution is an important preprocessing step for many subsequent, higher-level applications. We exemplify this for four such applications and show how they benefit from coreference resolution as a preprocessing component.

• A typical area where coreference resolution is deemed useful is Information Retrieval. In Information Retrieval, the relevance of documents for a given query is weighted based on the words that occur in the documents by applying a measure such as TF-IDF. However, if words (e.g. “Obama”) occur in a document in a pronominalized form (“he”) or are referred to by a common noun description (“president”), their within-document term frequency will not be incremented. Salient discourse entities are likely to be referred to by substituting expressions such as pronouns. Measuring the term frequency of words to denote their importance for a document given a query is therefore impaired by not being able to recognize and count substituting expressions of entities. Therefore, performing coreference resolution before measuring TF-IDF can improve the relevance of the returned documents (Dalton et al., 2011, e.g.); a toy illustration of the counting effect follows this list.

• In Information Extraction, pronouns and nominal descriptions of entities can cause problems in template filling tasks. When pronouns are inserted into template slots, they leave them underspecified, as the underlying entities cannot be inferred without the proper antecedents. Similarly, substituting expressions pose a problem for relation extraction tasks, e.g. the extraction of protein-protein interactions. If the arguments of such interactions are pronominalized or are referred to by nominal descriptors, extracting them is of no value for subsequent tasks (e.g. “This inhibits the interaction with the protein”). The BioNLP 2011 shared task (Kim et al., 2012) combined coreference resolution and event extraction to address this issue.

• Pronouns can yield errors in Machine Translation when languages are involved whose nouns have conflicting grammatical gender. For example, in German Mond (moon) has masculine gender and the pronoun er is used to refer to it. The French translation lune has feminine gender and is pronominalized by elle. When translating between these two languages, the pronoun in the source language will be translated into a pronoun in the target language which is morphologically incompatible with its antecedent (Mond → sie*; lune → il*). Knowing to which antecedent a pronoun refers enables a Machine Translation system to insert the correct pronoun in the target language. The recently held DiscoMT 2015 shared task (https://www.idiap.ch/workshop/DiscoMT/shared-task) tackled this problem.

• Sentiment Analysis is affected by coreference in target-specific analysis. In target-specific Sentiment Analysis, the sentiment towards a specific target entity (e.g. “Barack Obama”) is investigated. Here, we encounter the same problem as in Information Retrieval. If coreference resolution is not applied, the sentiment system will miss any context where the entity is mentioned by a pronoun or a nominal descriptor. Therefore, coreference resolution has the potential to increase the Recall of the target-specific contexts and to help deliver a more broadly supported analysis regarding the target entity (Nicolov et al., 2008, e.g.).
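As a toy illustration of the counting effect described in the Information Retrieval item above (the document, the chain, and all positions are invented), resolving the coreference chain for the entity Obama changes its within-document term frequency:

```python
from collections import Counter

tokens = ["Obama", "visited", "Berlin", ".",
          "The", "president", "met", "Merkel", ".",
          "He", "praised", "the", "city", "."]

# A coreference system would map mention positions to entity identifiers;
# here the chain for the entity Obama is given by hand.
obama_chain_positions = {0, 5, 9}     # "Obama", "president", "He"

surface_tf = Counter(tokens)["Obama"]            # counts only the literal form
coref_tf = len(obama_chain_positions)            # counts all chain mentions

print("term frequency without coreference:", surface_tf)   # 1
print("term frequency with coreference:   ", coref_tf)     # 3
```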

There are more areas in Computational Linguistics where coreference resolution is useful, such as summarization and dialogue modeling, and we have only touched on a few. The examples show that coreference resolution plays an important role as a preprocessing component for applications in these areas. Having outlined the problem from the perspective of such applications, we turn to the basic processing pipeline shared by most approaches to coreference resolution.

1.5 Architecture of automated coreference resolution systems

In this section, we introduce the generic architecture of most coreference resolution pipelines. Since most research on coreference in Computational Linguistics revolves around shared task data sets, we outline the pipeline in accordance with the requirements and settings of these tasks. We do so with an anticipatory glimpse at the evaluation of coreference systems, as we will discuss the matter of evaluation in greater detail in chapter 3.

While coreference resolution can be viewed as the task of establishing coreference links among noun phrases and pronouns, coreference systems have to accomplish a set of intermediate tasks. Evaluation is not immune to errors occurring during these steps, which makes it arguably hard to assess system performance w.r.t. the task of actually linking the noun phrases (and pronouns) coreferentially. Therefore, we will point out the effects that each of the steps in the general coreference pipeline have on system performance.

1.5.1 Gold standard annotation

Generally, coreference systems are built on the basis of a gold standard corpus which features coreference annotation. The overall goal of a coreference system is to reproduce the manually added coreference annotation automatically. In evaluation, the difference between the gold annotation and the system annotation constitutes the basis for determining the quality of the system output.

Table 1.1 shows an excerpt of such a gold standard, i.e. the TüBa-D/Z corpus (Telljohann et al., 2004), a treebank for German that contains coreference annotation. It features a CoNLL-style format which verticalizes the text and adds token annotations horizontally as columns. The last column denotes coreference between spans of tokens, where identical numerical IDs indicate membership in the same coreference chain.

Table 1.1 shows that the coreference annotation not only spans the syntactic head of coreferring NPs, but their full projections.2 We see that in this segment, die Arbeiterwohlfahrt Bremen and ihren corefer (both having the numeric ID 0 in the coreference column). Furthermore, we see that ihren langjährigen Geschäftsführer Hans Taake (coreference ID 4) corefers with another NP outside the segment.

To achieve the automatic annotation of coreference in the documents in a test set, a coreference system has to accomplish a set of tasks, all of which influence its output and, subsequently, its performance in evaluation. Based on our example segment, we discuss these tasks and introduce the common nomenclature.

2 The example actually shows dependency parses. To extract projections in the constituency sense, all tokens depending on a noun or name token are gathered recursively.

tok. ID  lexeme             lemma              PoS      morph.  gov. ID  gram. role  NE class  coref.
1        Im                 in                 APPRART  dsm     3        pp          -         -
2        Januar             Januar             NN       dsm     1        pn          -         -
3        hat                haben%aux          VAFIN    3sis    0        root        -         -
4        die                die                ART      nsf     5        det         -         (0
5        Arbeiterwohlfahrt  Arbeiterwohlfahrt  NN       nsf     3        subj        ORG       -
6        Bremen             Bremen             NE       nsn     5        app         GPE       0)
7        ihren              ihr                PPOSAT   asm     9        det         -         (0)|(4
8        langjährigen       langjährig         ADJA     asm     9        attr        -         -
9        Geschäftsführer    Geschäftsführer    NN       asm     13       obja        -         -
10       Hans               Hans               NE       asm     9        app         PER       -
11       Taake              Taake              NE       asm     10       app         PER       4)
12       fristlos           fristlos           ADJD     –       13       adv         -         -
13       entlassen          entlassen          VVPP     –       3        aux         -         -

Table 1.1: Excerpt from the TüBa-D/Z in a CoNLL-style format for the segment Im Januar hat die Arbeiterwohlfahrt Bremen ihren langjährigen Geschäftsführer Hans Taake fristlos entlassen (In January, the Worker Welfare Association Bremen has laid off its long-term CEO Hans Taake without notice).
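The bracket notation in the coreference column can be decoded mechanically. The following sketch (ours, not part of the corpus tooling) reads the last column of the excerpt above and recovers the mention spans of chains 0 and 4:

```python
coref_column = ["-", "-", "-", "(0", "-", "0)", "(0)|(4", "-", "-", "-", "4)", "-", "-"]

def decode_coref(column):
    """Return {chain_id: [(start, end), ...]} from the CoNLL-style brackets."""
    open_spans, mentions = {}, {}
    for pos, cell in enumerate(column, start=1):           # token IDs are 1-based
        if cell == "-":
            continue
        for part in cell.split("|"):
            chain = int(part.strip("()"))
            if part.startswith("(") and part.endswith(")"):  # single-token mention
                mentions.setdefault(chain, []).append((pos, pos))
            elif part.startswith("("):                       # mention opens here
                open_spans.setdefault(chain, []).append(pos)
            else:                                            # mention closes here
                start = open_spans[chain].pop()
                mentions.setdefault(chain, []).append((start, pos))
    return mentions

print(decode_coref(coref_column))
# {0: [(4, 6), (7, 7)], 4: [(7, 11)]}
```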

1.5.2 Preprocessing

First, the input text undergoes preprocessing, i.e. it is annotated with linguistic features. This generally involves, at the very least, part-of-speech tagging and syntactic parsing, as exemplified in table 1.1. Named entity recognition, animacy detection, and semantic class labeling are further possible markup steps.

Corpora often provide gold annotations, i.e. annotation that is manually added by human experts, for the preprocessing steps. Evaluation can then be divided into several settings, i.e. a setting where the coreference systems make use of the gold preprocessing information (called the gold setting) vs. using automated tools such as dependency parsers for preprocessing (referred to as the regular setting).

In the regular setting, systems would have to automatically convert our example segment Im Januar hat die Arbeiterwohlfahrt Bremen ihren langjährigen Geschäftsführer Hans Taake fristlos entlassen into the CoNLL format (or any other suitable intermediate format) and automatically produce the annotations in the columns shown in table 1.1. Obviously, evaluation of coreference does not consider any other column than the last one. The regular setting denotes that systems are not given any other information than the texts themselves.3

3 Systems can also be given the same automated preprocessing, as in the SemEval 2010 shared task (Recasens et al., 2010).

Comparison of the regular and gold settings gives insight into the impact of using real preprocessing and indicates what performance is to be expected if systems are applied in a real-world setting. The gold setting, by contrast, is insightful when trying to establish which coreference resolution strategy is more successful than others, because system performance is not affected by preprocessing errors.4

Coreference resolution approaches differ regarding the features they apply and the preprocessing components they rely on. Therefore, shared tasks often feature a closed setting, where only information provided by the shared task data may be used as the basis for feature extraction. Conversely, the open setting allows participants to incorporate any external resources or tools into their systems. The closed setting is thus also aimed at identifying successful coreference resolution strategies by canceling out the impact of additional resources, while the open setting identifies the most successful coreference resolution system overall.5

1.5.3 Markable extraction

After token annotation in preprocessing, the noun phrases and pronouns that will be considered to take part in coreference relations (the so-called markables (Hirshman and Chinchor, 1998, Poesio, 2004)) have to be identified and extracted. This is called the markable extraction step. Most approaches rely on the syntactic analysis of the sentences to identify the relevant NPs and their boundaries. In our example segment, such an approach extracts the following four markables: [Januar], [die Arbeiterwohlfahrt Bremen], [ihren], [ihren langjährigen Geschäftsführer Hans Taake]. Note that pronouns (in our case the possessive pronoun [ihren]) are also markables, although they do not (necessarily) denote an individual NP.
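A minimal sketch of such a syntax-based extraction step, using the (abridged and simplified) dependency annotation of our example segment; the tag sets and the apposition heuristic are simplifying assumptions for illustration, not our actual extraction rules (cf. section 5.1.2):

```python
PRONOUN_TAGS = {"PPER", "PPOSAT", "PDS", "PRELS"}
NOMINAL_TAGS = {"NN", "NE"}

# (token_id, form, PoS tag, governor_id) for part of the example segment.
tokens = [
    (4, "die", "ART", 5), (5, "Arbeiterwohlfahrt", "NN", 3), (6, "Bremen", "NE", 5),
    (7, "ihren", "PPOSAT", 9), (8, "langjährigen", "ADJA", 9),
    (9, "Geschäftsführer", "NN", 13), (10, "Hans", "NE", 9), (11, "Taake", "NE", 10),
]

def projection(head_id):
    """Recursively gather the head and all tokens it (transitively) governs."""
    span, changed = {head_id}, True
    while changed:
        changed = False
        for tid, _, _, gov in tokens:
            if gov in span and tid not in span:
                span.add(tid)
                changed = True
    return span

markables = []
for tid, form, pos, gov in tokens:
    if pos in PRONOUN_TAGS:
        markables.append([form])     # pronouns are markables on their own
    elif pos in NOMINAL_TAGS and not any(t[0] == gov and t[2] in NOMINAL_TAGS for t in tokens):
        # nominal heads only: skip nouns/names that depend on another nominal (appositions)
        span = projection(tid)
        markables.append([t[1] for t in tokens if t[0] in span])

print(markables)
# [['die', 'Arbeiterwohlfahrt', 'Bremen'], ['ihren'],
#  ['ihren', 'langjährigen', 'Geschäftsführer', 'Hans', 'Taake']]
```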

This step is of major importance, as the general coreference evaluation framework requires systems to perfectly match the boundaries (i.e. the span of tokens) of coreferent NPs in order to count as correctly resolved. Providing NP boundaries not conforming with the gold annotation has the potential to impact the overall system performance in evaluation significantly.6 Performance for this step is also often plagued by inconsistent annotation of NP boundaries in the gold standard (e.g. including or excluding PP attachments and relative clauses). Commonly, annotation guidelines for coreference state that the maximal projections of the syntactic heads of NPs should be considered as markable boundaries. However, correctly marking the maximal projection of syntactic heads manually is a challenging and error-prone task.

4 However, this evaluation is still plagued by the problem of matching NP projections between system outputs and the gold standard, cf. the next section.
5 Past shared tasks have shown that not all systems will participate in all settings. It is therefore not always easy to determine a clear winner.
6 For example, Pradhan et al. (2011) conducted a post-task evaluation of the CoNLL 2011 shared task in a setting where only the syntactic heads of coreferring markables needed to be identified and found that some systems, in particular our submission (Klenner and Tuggener, 2011b), climbed up the result ranks. Our results improved from 51.77% to 55.28% average F-score, raising our position from 4th to 3rd rank in the open setting (5 participants) and from 9th to 5th rank in the closed setting (18 participants).

Given that the correct identification of the NP boundaries is of such substantial importance for performance evaluation, it remains unclear why gold standards for coreference research feature NP boundaries that extend beyond the syntactic head. We argue that it should not be considered the task of coreference systems to correctly identify NP boundaries, which is rather the aim of syntactic parsing or chunking. Thus, including the task of identifying NP boundaries arguably obscures comparison and performance calculation of coreference resolution systems.7

Some systems filter the extracted markables to weed out NPs that are unlikely to enter coreference relations. This task is referred to as anaphoricity determination or mention detection and has been shown to improve system performance in some cases (Recasens et al., 2013, inter alia). The filtered non-referring markables are called singletons. In our running example, the markable [Januar] constitutes a singleton.

Most coreference resolution approaches and corpora containing coreference annotation consider noun phrases and pronouns as markables. It is noteworthy that clauses or full sentences can also be coreferent, as in the following example:

(4) The laptop’s CPU overheated. [That] was unfortunate.

Since there is almost no training data available, only a few approaches exist that also include clauses as markables. While the most commonly used English corpus for coreference resolution, OntoNotes (Pradhan et al., 2013), contains a few coreferring clause instances, there are, to our knowledge, no German corpora containing mentions that denote clauses.

7 Also, from the perspective of downstream applications it is not necessary for a coreference resolution system to provide extended NPs, because the downstream application can itself decide what kind of extensions it prefers for the coreferent NPs (including or excluding PPs and relative clauses) and produce them freely based on the syntactic heads. We outline in section 5.1.2.2 how we address this problem.

1.5.4 Markable resolution

Finally, coreference systems establish coreference relations between the extracted markables and merge them into coreference chains in order to form the coreference partition, i.e. the set of coreference chains in a document. To do so, systems rely on a discourse processing strategy. This strategy traverses the markables and evaluates potential coreference links. The strategy usually filters markable pairs based on morphosyntactic agreement and distance constraints (e.g. markables have to match regarding their number and gender properties and cannot be more than a set number of sentences apart). A classifier then determines whether two (or more) markables should be labeled as coreferent based on a feature set.

In our running example, such a discourse processing strategy would suggest linking the markables [die Arbeiterwohlfahrt Bremen] and [ihre], since they are morphologically compatible and positioned very close to each other. It would, however, not suggest linking [Januar] and [ihre], since the two markables are morphologically incompatible.

In pair-wise discourse processing strategies, such as the mention-pair model8, whose basic workings we have just outlined, pairs of anaphors and antecedents are linked in a first step. A second step, the clustering step, is needed to merge the found pairs into coreference chains given the transitivity property of coreference: If markables A and B corefer, and B and C corefer, then A and C corefer.
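A minimal sketch of this clustering step (a small union-find; not the thesis code), merging independently found pairs into chains via transitivity:

```python
def merge_pairs(pairs):
    """Merge pairwise coreference links into chains (transitive closure)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:                      # union the two mentions of each link
        parent[find(a)] = find(b)

    chains = {}
    for mention in parent:
        chains.setdefault(find(mention), []).append(mention)
    return [sorted(chain) for chain in chains.values() if len(chain) > 1]

# A-B and B-C were classified as coreferent independently; transitivity
# puts A, B and C into one chain.
print(merge_pairs([("A", "B"), ("B", "C"), ("D", "E")]))
# [['A', 'B', 'C'], ['D', 'E']]
```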

Markables that have been put in coreference chains become mentions of the underlying discourse entities, and unresolved markables remain singletons, since they are the only mention of their entity in the discourse at hand. Mentions in the system output are called system mentions and those in the gold standard gold mentions. Markables that have been resolved by a system, but are not annotated as coreferent in the gold standard, are called spurious system mentions or twinless system mentions (Stoyanov et al., 2009). Shared tasks commonly feature a mention detection score, which measures how well systems divide markables into singletons and mentions.9

Shared tasks sometimes also offer a gold mention setting, where participants are given the boundaries of the mentions in the gold standard (i.e. the parentheses in the last column of table 1.1, but not the coreference chain IDs). The systems then only need to establish the correct links between the gold mentions, which eradicates the problem of identifying which markables should be considered for resolution. Naturally, performance figures for these evaluation runs are much higher compared to the other runs (Pradhan et al., 2011, inter alia), since an important subproblem in the coreference pipeline is assumed to be solved.

8 Cf. section 2.1.
9 Recall measures how many of the gold mentions are deemed coreferent by a system, regardless of the correctness of the produced coreference links. In turn, Precision quantifies how many of the system mentions correspond to gold mentions.
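Following the definitions in footnote 9, a minimal sketch of such a mention detection score (the mention spans below are invented):

```python
def mention_detection(gold_mentions, system_mentions):
    """Recall: gold mentions also produced by the system; Precision: system
    mentions that correspond to gold mentions. Link correctness is ignored."""
    gold, system = set(gold_mentions), set(system_mentions)
    recall = len(gold & system) / len(gold) if gold else 0.0
    precision = len(gold & system) / len(system) if system else 0.0
    return recall, precision

gold = {(4, 6), (7, 7), (7, 11)}        # token spans of gold mentions
system = {(4, 6), (7, 11), (1, 2)}      # (1, 2) is a spurious/"twinless" mention
print(mention_detection(gold, system))  # (0.666..., 0.666...)
```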

The markable resolution step is what we deem the actual core task of the automated coreference resolution process, and it constitutes what most of the research on coreference in Computational Linguistics investigates. Naturally, this is also the area where systems differ the most. However, as indicated above, other steps in the processing pipeline can heavily influence a system’s performance in the common evaluation framework.

Having developed an understanding of the coreference resolution task, we next focus on the discourse processing strategy, as it is a crucial part of the coreference pipeline and an area to which this thesis contributes.

1.6 Chapter summary

This chapter outlined the goals of this thesis and introduced the coreference phenomenon. We presented a Text Linguistics account of coreference and discussed the implications of coreference for an exemplary set of applications in Computational Linguistics.

We discussed a common pipeline for coreference resolution systems and showed how the contained steps affect evaluation. Additionally, we introduced the common nomenclature encountered in shared tasks on coreference resolution.

Chapter 2

Discourse processing models for coreference resolution

In this section, we focus on the discourse processing models that approaches to coreference resolution apply to process the markables. We first discuss the predominant model and its shortcomings, and then overview conceptual and empirical improvements. We introduce our coreference model that addresses the issue of underspecification of mentions, with a focus on German pronouns.

We note that there is a large diversity in the experimental setups in the related work discussed below. Thus, it is difficult to assess which approach performs best in general.1 Nonetheless, we will indicate performance scores in cases where a baseline is compared to an alternative model in the same experimental setup.

To facilitate our discussion, we overload the term mention to denote NPs and pronouns (potentially) partaking in coreference relations.2

2.1 Mention-pair model

The most prevalent model for establishing coreference between mentions is the so-called mention-pair model introduced by Aone and Bennett (1995) and McCarthy and Lehnert (1995), but popularized in its variant presented in Soon et al. (2001). The model's name

1 Some of the approaches discussed below resolve gold mentions only. Given the gold mentions, a system only needs to establish the correct coreference links between them. That is, the system does not need to decide which NPs it should resolve (cf. section 1.5.4). Other systems process all markables in a more realistic setting, which raises the task complexity. Furthermore, the approaches evaluate on different test sets collected from different corpora, sometimes using different evaluation metrics. Therefore, we cannot compare their scores directly.
2 For a concise explanation of the nomenclature, cf. section 1.5.4.

implies its basic workings: It creates pairs of mentions and represents them in feature vectors. A binary classifier then decides for each pair instance whether it should be labeled as coreferent.

Algorithm: Mention-pair Training
Input: Markables, gold coreference partition
Output: Pair instances Pairs
1: for mi ∈ Markables do
2:   if mi ∈ CorefPartition then
3:     for mj ∈ Markables do
4:       if j < i ∧ coref(mj, mi) then
5:         Pairs ⊕ {mj, mi, positive}
6:         break
7:       else if j < i ∧ ¬coref(mj, mi) then
8:         Pairs ⊕ {mj, mi, negative}
9: return Pairs

Algorithm: Mention-pair Testing (Closest first)
Input: Markables
Output: Coreference partition
1: for mi ∈ Markables do
2:   for mj ∈ reversed(Markables) do
3:     if j < i then
4:       class ← classify(mj, mi)
5:       if class == positive then
6:         PositivePairs ⊕ {mj, mi}
7:         break
8: CorefPartition ← trans_merge(PositivePairs)
9: return CorefPartition

Table 2.1: Mention-pair algorithms for creating training instances (top) and for resolving markables (bottom).

Table 2.1 shows the algorithms for creating training and testing instances as proposed by Soon et al. To obtain training instances (the training algorithm), a gold mention mi (line 2 determines that the mention is coreferent according to the gold standard) is paired with the immediate antecedent in its coreference chain to create a positive instance (line 5).

Negative instances are formed by pairing mi with all mentions (including singletons) of other entities on the way to the closest antecedent of mi (lines 7-8). Soon et al. trained a binary decision tree on these training instances. Subsequent work has explored other machine learning frameworks.
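As an illustration of these instance-creation mechanics, the following Python sketch pairs each gold mention with its closest antecedent as a positive instance and with every intervening markable as a negative instance. The function and variable names are hypothetical and not taken from any of the cited systems.

    def make_training_pairs(markables, gold_chain_of):
        """Soon et al.-style pair creation (sketch).

        markables:     list of markables in document order
        gold_chain_of: dict mapping a markable to its gold chain id,
                       or None if the markable is a singleton
        """
        pairs = []
        for i, m_i in enumerate(markables):
            if gold_chain_of.get(m_i) is None:
                continue  # only gold mentions trigger instance creation
            # walk backwards from the immediately preceding markable
            for m_j in reversed(markables[:i]):
                if gold_chain_of.get(m_j) == gold_chain_of[m_i]:
                    pairs.append((m_j, m_i, "positive"))  # closest antecedent
                    break                                  # stop at the closest antecedent
                pairs.append((m_j, m_i, "negative"))       # intervening markable of another entity
        return pairs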

In our running example (Die Staatsanwaltschaft [...] Im Januar hat die Arbeiterwohlfahrt Bremen ihren langjährigen Geschäftsführer Hans Taake fristlos entlassen), consider mi to be the possessive pronoun [ihre]. The algorithm would iterate the previous (morphologically compatible) mentions from right to left, i.e. [die Arbeiterwohlfahrt Bremen] and [Staatsanwaltschaft]. It would, however, stop at [die Arbeiterwohlfahrt Bremen], since it is the closest antecedent according to the gold standard. The algorithm would thus only create one positive instance, i.e. the pair [die Arbeiterwohlfahrt Bremen − ihre]. If there were any intervening mentions of other entities, these would be used to form negative instances.

When establishing coreference relations (the testing algorithm in table 2.1), the mentions are traversed from left to right and for each an antecedent (i.e. a preceding mention) is sought, again in a backward-looking manner (lines 2-7). That is, each mention is paired with preceding mentions (sorted by proximity) until a pair is classified as positive. This is called the closest-first heuristic. The best-first heuristic pairs a mention with all preceding mentions. The positive pair with the highest score then yields the antecedent for the mention at hand. Once all mentions have been traversed, the positive pairs are transitively merged (line 8) to form the desired coreference chains.
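A minimal sketch of closest-first decoding is given below, assuming a pretrained binary classify(antecedent, anaphor) function; the function names are illustrative placeholders. The positive pairs are merged transitively at the end.

    def resolve_closest_first(markables, classify):
        """Closest-first testing procedure of the mention-pair model (sketch)."""
        positive_pairs = []
        for i, m_i in enumerate(markables):
            for m_j in reversed(markables[:i]):        # nearest candidate first
                if classify(m_j, m_i) == "positive":
                    positive_pairs.append((m_j, m_i))
                    break                               # closest-first: stop at the first hit
        return transitive_merge(positive_pairs)

    def transitive_merge(pairs):
        """Merge positive pairs into chains via transitive closure (sketch)."""
        chains = []
        for a, b in pairs:
            hits = [c for c in chains if a in c or b in c]
            merged = {a, b}.union(*hits)
            chains = [c for c in chains if c not in hits] + [merged]
        return chains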

Consider again our running example and [ihre] as mi. The algorithm would again start to iterate the preceding mentions in a right-to-left manner and pair the pronoun with them to create the vector representations. Using the closest-first heuristic, if the classifier labeled the first pair [die Arbeiterwohlfahrt Bremen − ihre] as positive, it would append the pair to the list of positive pairs and stop. For the best-first heuristic, the algorithm would also consider the pair [Staatsanwaltschaft−ihre]. The pair labeled positive with the highest score would then be appended to the set of positive pairs.

Despite using what Soon et al. called a shallow feature set3, the approach yielded competitive results compared to the mainly rule-driven coreference resolution approaches participating in the MUC-6 and MUC-7 coreference shared tasks (Hirschman and Chinchor, 1998). Furthermore, the system has often been re-implemented and extended. For instance, most systems in the CoNLL shared tasks relied on a mention-pair architecture (Pradhan et al., 2011, 2012, p. 23; 22), despite its commonly known weaknesses, which we discuss next.

2.1.1 Issues of the mention-pair model

Research revolving around the mention-pair model has revealed several conceptual weaknesses. The main issue of the model lies in its local confinement regarding coreference decisions. All decisions are kept local during the iteration over the mentions, i.e. no information is propagated to subsequent decisions. This runs counter to the transitive nature of the coreference phenomenon and yields several problems.

2.1.1.1 Underspecification of antecedent candidates

During resolution, the local confinement of the mention-pair model is prone to lead to inconsistent coreference sets when the pairs of coreferring mentions found locally are merged (Klenner and Ailloud, 2008, Raghunathan et al., 2010, Ng, 2010, Klenner and Tuggener, 2011a, inter alia). The merge operation (line 8 in the right algorithm in table 2.1) exploits the transitive nature of coreference: If mention A is coreferent with mention B, and mention B is coreferent with mention C, then A and C have to be coreferent. Thus, systems implementing the mention-pair model use the transitive closure to merge the local pairs [A − B] and [B − C] into a coreference chain [A − B − C].

3 Here, we focus on the discourse models and leave out the discussion of feature sets. For an overview of a feature set like the one used in Soon et al., see section 4.2.

However, this approach suffers from underspecification of mentions in local contexts. For example, assume we have processed the following three mentions: [Bill Clinton], [Clinton], [she]. We have established the following positive pair-wise decisions: [Bill Clinton − Clinton] and [Clinton − she]. The transitive closure will construct the following coreference chain: [Bill Clinton − Clinton − she], which is obviously inconsistent, since [Bill Clinton] and [she] are exclusive. However, since the [Clinton] mention is morphologically underspecified when viewed in isolation, it is a valid local antecedent candidate for the pronoun [she].
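To make the failure mode concrete, here is a tiny, hypothetical demonstration: the greedy merge below only sees the two positive pairs and joins them, even though a gender check between [Bill Clinton] and [she] would have vetoed the merge.

    # Hypothetical pair-wise decisions made by a mention-pair classifier:
    positive_pairs = [("Bill Clinton", "Clinton"), ("Clinton", "she")]

    chains = []
    for a, b in positive_pairs:
        hits = [c for c in chains if a in c or b in c]
        merged = {a, b}.union(*hits)
        chains = [c for c in chains if c not in hits] + [merged]

    print(chains)  # [{'Bill Clinton', 'Clinton', 'she'}] -- morphologically inconsistent
    # A consistency check over the whole emerging chain (as in entity-mention or
    # clustering models) would have blocked the second merge.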

This is particularly problematic in combination with the morphological underspecification of certain German pronouns, e.g. sein (its/his). A classifier might label the following pairs as coreferent: [Berlin − sein], [sein − er] ([Berlin − its/his], [its/his − he]). This would yield the chain [Berlin − sein − er], since the incompatibility of [Berlin] and [er] is not evident to the greedy merge operation, which only considers positive pairs and has no knowledge of negative evidence. As outlined in the introduction in section 1.2, this particular problem comprises one of the main interests of this thesis.

2.1.1.2 Redundant instances and skewed training sets

A second major issue of the mention-pair model is the generally large number of instances and the imbalance of positive and negative ones. For training, the method for creating pair instances shown in table 2.1 leads to a skewed set, since the closest antecedent for a given mention can be quite far away. Collecting all intermediate mentions of other entities as negative instances yields many such negative instances. For example, Soon et al. (2001) reported that only between 4% and 7% of the instances in their training set were positive. This imbalance biases the trained classifier towards negative classification (Ng, 2010, inter alia), which can leave e.g. third person pronouns unresolved if no pair of antecedent candidate and pronoun is classified as positive (Hinrichs et al., 2005, Wunsch, 2010).

Since transitive coreference links between mentions are established after classification, i.e. during the merge step, the model yields many redundant instances. Consider that we have the two coreference chains [Bill Clinton − Clinton − President − he] and [Angela Merkel − Merkel − Chancellor − she] and we want to resolve the pronouns. Since all mentions of each coreference chain are generally accessible, the mention-pair model potentially pairs the pronouns with all of them, except for the first mentions that have clear morphological properties on their own (given the approach implements morphological agreement as a hard filter), as shown in figure 2.1.

[Figure 2.1 not reproduced: it depicts the chains [Bill Clinton − Clinton − President − he] and [Angela Merkel − Merkel − Chancellor − she] and the pairs the mention-pair model forms for the two pronouns.]

Figure 2.1: Example of redundant pairs formed by the mention-pair model for pronouns. Solid arrows denote pairs considered by the model, green ones positive instances and red ones negative instances. Dashed arrows signify coreference links invisible to the model.

The mention-pair model could thus create five pairs for each of the pronouns, i.e. ten in total. All pairs but one per entity formed by the mention-pair model can be considered (at least implicitly) redundant, since they denote the same underlying entity. The pairs that denote morphologically incompatible entities can be regarded as irrelevant, since they should not be considered for resolution.

2.2 Overcoming the limitations of the mention-pair model

A substantial body of research on coreference resolution in the past decade has focused on addressing the inherent issues of the mention-pair model outlined in the previous section. One important aspect introduced in these approaches is incremental discourse processing. Instead of separating local pair-wise decisions from the transitive merge when reaching the document end, incremental models try to combine the two steps, which makes it possible to propagate information from one decision to another. Thus, all detected previous mentions of an entity that acts as an antecedent candidate are accessible when a subsequent anaphor is resolved. This ameliorates the issue of local underspecification of antecedent candidates, since morphological (but also other) features from all previous mentions of an antecedent entity can be queried.

In the following sections, we review approaches that address the aforementioned weaknesses of the mention-pair model. We do so anticipating our incremental entity-mention model for German coreference and pronoun resolution, which combines and extends several notions of the approaches discussed below.

2.2.1 Mention ranking model

The mention ranking model (Denis and Baldridge, 2007, 2008) is an extension of the method for creating training and testing instances in the mention-pair model. The main difference is that the mention ranking approach does not classify pairs of antecedent candidates and anaphors in isolation, but rather applies a ranking of all candidates for a given anaphor simultaneously. The candidate with the highest rank is then selected as the antecedent. This eliminates the need for a best-first or closest-first selection in the mention-pair model in the case that multiple candidates are classified positively as antecedents. That is, there is always exactly one antecedent per anaphor, i.e. the highest ranked candidate.

Training instances are now comprised of an anaphor and a set of antecedent candidates with exactly one correct antecedent. Weights for features are learned based on the competition of the correct antecedent and all incorrect ones. The model for ranking candidates is expressed as a probability distribution over the candidates (Denis and Baldridge, 2007):

P(m_j, m_l) = \frac{\exp\left(\sum_{i=1}^{n} \lambda_i f_i(m_j, m_l)\right)}{\sum_{k} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(m_k, m_l)\right)} \qquad (2.1)

where ml denotes an anaphor, mj one of the candidates, and λifi(mj, ml) the weighted features of the pair. The probability of a candidate being the correct antecedent, P(mj, ml), is thus obtained by normalizing the exponentiated feature-weight sum of that candidate over the corresponding sums of all competing candidates, i.e. a softmax over the candidate set. Therefore, the model can be said to express competition between multiple antecedent candidates directly, which overcomes the isolated view on single pairs of candidates and anaphors in the mention-pair model. This extends the twin-candidate model proposed by Yang et al. (2003), where an instance is composed of an anaphor and two of the competing antecedent candidates.
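Since equation 2.1 is essentially a softmax over antecedent candidates, the ranking step can be sketched as follows; the feature functions and weights are hypothetical placeholders, not the cited system's actual parameters.

    import math

    def rank_candidates(anaphor, candidates, features, weights):
        """Mention ranking: softmax over all antecedent candidates (cf. eq. 2.1)."""
        def score(cand):
            return sum(w * f(cand, anaphor) for f, w in zip(features, weights))
        exp_scores = {cand: math.exp(score(cand)) for cand in candidates}
        z = sum(exp_scores.values())
        probs = {cand: s / z for cand, s in exp_scores.items()}
        return max(probs, key=probs.get), probs   # highest-ranked candidate wins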

Martschat and Strube (2015) recently showed that the mention ranking approach outperforms not only a mention-pair competitor, but also an approach that models coreference sets with latent tree structures and applies a structured perceptron, which has become increasingly popular recently (Björkelund and Kuhn, 2014, Fernandes et al., 2014). However, the mention ranking approach still shares with the mention-pair model the flaw of local confinement, i.e. coreference decisions are kept local, and entity-level information is not accumulated and propagated to subsequent decisions.

2.2.2 Mention clustering models

Mention clustering models recast coreference resolution as an incremental clustering task instead of modelling it as a binary classification problem. One of the first such approaches was presented in Cardie et al. (1999). The approach of Cardie et al. first put all mentions in their own coreference set, i.e. they were treated as singletons. Working in the reversed text direction, mentions were compared to preceding ones and incrementally clustered into coreference sets w.r.t. a distance metric based on features commonly used in coreference resolution.

The approach featured one crucial operation. When two clusters (each containing one or more mentions) were considered for merging, compatibility between all the mentions in both clusters was asserted. If two mentions from the two clusters did not agree in e.g. number and gender, the merge was prevented. In doing so, the problem of contradicting morphological properties (but also semantic properties, such as animacy) in coreference chains, as present in the mention-pair model, was avoided.
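This consistency check can be pictured as follows; the attribute names are illustrative and the sketch is not the cited implementation. Before two clusters are merged, every cross-cluster mention pair must agree in number and gender.

    def compatible(m1, m2):
        """Illustrative agreement check; None stands for an underspecified value."""
        for attr in ("number", "gender"):
            v1, v2 = m1.get(attr), m2.get(attr)
            if v1 is not None and v2 is not None and v1 != v2:
                return False
        return True

    def try_merge(cluster_a, cluster_b):
        """Merge two clusters only if all cross-cluster mention pairs agree."""
        if all(compatible(m1, m2) for m1 in cluster_a for m2 in cluster_b):
            return cluster_a + cluster_b
        return None  # merge blocked: it would create an inconsistent chain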

In evaluation, the approach ranked in the middle of the field compared to contemporary approaches, despite its simple feature set. However, Cardie et al. did not compare the model directly to a mention-pair approach in the same experimental setup.

Figure 2.2: Modelling coreference as a graph. Vertices denote mentions, and edges denote potential pair-wise coreference relations and their weights. The circled entity clusters indicate gold coreference clusters. Example due to Culotta et al. (2007).

Another model of mention clustering is presented in graph partitioning approaches to coreference (Nicolae and Nicolae, 2006, Culotta et al., 2007, Cai and Strube, 2010a, inter alia). Mentions are stored as vertices (i.e. nodes), and edges between them signify potential coreference relations. Initially, all mentions are connected through edges. The edges carry weights based on the binary and unary features of the connected mention pairs, as shown in figure 2.2. A graph cut algorithm then cuts edges based on their weights to extract the coreference partition from the graph. In figure 2.2, a cut algorithm would ideally cut all edges which connect the two circled clusters. The stopping criterion for the cut algorithm is determined empirically with the help of machine learning. Once the cutting algorithm has stopped, the established subgraphs or vertex clusters denote coreference sets.
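As a very simplified stand-in for the graph cut step (the cited systems learn the stopping criterion rather than fixing a threshold, and use more sophisticated cut algorithms), the following sketch cuts all edges whose weight falls below a threshold and reads the coreference sets off the remaining connected components.

    from collections import defaultdict

    def partition_by_cut(mentions, edge_weights, threshold):
        """Keep edges >= threshold, return connected components as coreference sets."""
        graph = defaultdict(set)
        for (m1, m2), w in edge_weights.items():
            if w >= threshold:          # everything below the threshold is cut
                graph[m1].add(m2)
                graph[m2].add(m1)
        seen, clusters = set(), []
        for m in mentions:
            if m in seen:
                continue
            stack, component = [m], set()
            while stack:                 # depth-first traversal of one component
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                stack.extend(graph[node] - component)
            seen |= component
            clusters.append(component)
        return clusters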

Nicolae and Nicolae (2006) applied the BestCut method to remove edges from the graph and achieved state-of-the-art results in evaluation. However, they needed to treat pronouns separately, i.e. pronouns were not included in the graph, but attached according to the highest ranked pair-wise decision. Also, Nicolae and Nicolae did not include any cluster-level features, unlike e.g. Cardie et al. (1999). The approach of Nicolae and Nicolae did, however, outperform a mention-pair baseline and the Luo et al. (2004) entity-mention system.4

Cai and Strube (2010a) extended this approach by introducing hyperedges. Hyperedges denote features spanning (possibly) multiple mentions (i.e. nodes in the graph). For example, a hyperedge denoting head string match connects all mention nodes in the graph whose heads match (e.g. [US President Barack Obama − Barack Obama − Obama]). Spectral clustering was applied to cluster the subhypergraphs formed by the hyperedges into coreference clusters. However, like Nicolae and Nicolae (2006), the approach did not feature any means to enforce consistency regarding gender etc. in the emerging coreference clusters, and Cai and Strube noted that they found such inconsistencies in the coreference chains in their system output. Despite this drawback, the system outperformed two strong mention-pair baselines and ranked among the top systems in the CoNLL 2011 shared task (Pradhan et al., 2011).

Culotta et al. (2007) applied first-order logic to capture cluster- (or subgraph-)level features of arbitrary clusters of potentially coreferring mentions. First, a feature encoded how many of the mention pairs (All, Most-True, Most-False) in an arbitrarily generated cluster shared a particular feature (such as a WordNet class or gender agreement). A second feature encoded how many of the mention pairs in the cluster were coreferent (All-True, Most-True, Most-False), as well as whether the maximum and minimum pair-wise scores were above a given threshold. Additionally, cluster size and the distribution of mention types were used as features. The latter is particularly interesting, since it tries to model regularities in mention type distributions in coreference sets, which Culotta et al. speculated would help prevent the formation of sets comprised of pronouns only. Learning and inference then aimed at finding relations between feature sharing and cluster purity (e.g. Most-True for feature sharing and coreference). This approach outperformed a mention-pair baseline in evaluation by large margins. However, Culotta et al. did not evaluate the impact of the cluster-level features separately.

4 Cf. the next section 2.2.3.

An advantage of the clustering- and graph-based approaches over the mention-pair model is that they perform coreference resolution in a single step, i.e. once the cutting or clustering algorithm has finished, the coreference partition is established. Furthermore, graph-based models consider multiple coreference links at once, instead of processing individual pairs of markables in isolation. The cut algorithm also adheres to the transitivity and exclusiveness restrictions of the coreference relations, as it places a vertex in only one subgraph or cluster. Clustering approaches also feature the benefit of having access to all mentions of the incrementally established cluster to determine compatibility with an anaphor at hand, which ameliorates underspecification of certain mentions.

A related approach to clustering which avoids inconsistencies in coreference sets enforces global constraints during the merging of classified pair instances in the mention-pair model. An additional layer, e.g. Integer Linear Programming (ILP), guides the pair merging step and ensures transitivity and exclusiveness (Finkel and Manning, 2008, Klenner and Ailloud, 2008, Denis and Baldridge, 2009, Klenner and Ailloud, 2009). The weights of the pair-wise decisions serve as input, and the ILP layer optimizes the clustering given the global coreference constraints. This has been shown to improve performance, but requires considerable engineering and computational effort. Additionally, the approach has the drawback of still relying on the pair generation mechanics of the mention-pair model.
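A rough sketch of such an ILP layer is given below, using the PuLP toolkit purely for illustration (the cited systems use their own formulations and solvers). Binary variables x_ij encode pair-wise links, the objective sums the classifier scores, and transitivity is enforced explicitly.

    import pulp

    def ilp_clustering(mentions, pair_score):
        """Sketch of ILP-based consistency enforcement over pair-wise scores.

        pair_score is assumed to return signed scores (e.g. log-odds), so that
        unlikely links are penalized rather than simply ignored."""
        prob = pulp.LpProblem("coreference", pulp.LpMaximize)
        n = len(mentions)
        x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
             for i in range(n) for j in range(i + 1, n)}
        # Objective: prefer links the pair-wise classifier is confident about.
        prob += pulp.lpSum(pair_score(mentions[i], mentions[j]) * x[i, j]
                           for (i, j) in x)
        # Transitivity: if i-j and j-k are links, i-k must be a link too.
        for i in range(n):
            for j in range(i + 1, n):
                for k in range(j + 1, n):
                    prob += x[i, j] + x[j, k] - x[i, k] <= 1
                    prob += x[i, j] + x[i, k] - x[j, k] <= 1
                    prob += x[i, k] + x[j, k] - x[i, j] <= 1
        prob.solve()
        return {(i, j) for (i, j) in x if x[i, j].value() == 1}

The cubic number of transitivity constraints hints at the computational effort mentioned above.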

2.2.3 Entity-mention models

The entity-mention model extends the mention-pair model by combining pair-wise decisions with the transitive merge of resolved mentions in a single step. That is, pair-wise decisions regarding coreference of two mentions are stored and made accessible for subsequent decisions.

Luo et al. (2004) introduced such an approach and combined it with global optimization of the coreference partition. The possible partitions of the mentions at a given point in discourse were represented in a tree, i.e. the Bell tree. Figure 2.3 depicts such a tree for three mentions mi, mj, mk. For each mention, the approach decided whether it should be attached to an existing coreference chain or if it should start its own chain. In figure 2.3, the mention mk is processed in nodes 2 and 3. Node 2 and its leaves represent the decision whether mk should be merged into the existing chain (leaf 4) or if it should start its own (leaf 5). To decide that, mk is paired for each feature with either mi or mj in node 2, depending on which pair yields the better score for each feature.

That is, to decide whether a mention should be linked to a given antecedent chain, it has access to all mentions in that chain, and the maximum pair-wise score is taken. For each feature, the highest scoring pair is determined anew by traversing all mentions in the antecedent chain. For example, for distance features, the most recent mention of the antecedent entity is accessed. For string matching features, the most similar one to the potential anaphor is used to calculate the feature value and extract the respective weight.
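The per-feature maximisation over the mentions of an antecedent chain can be sketched as follows; the feature functions and weights are placeholders, not Luo et al.'s actual model.

    def link_score(chain, anaphor, features, weights):
        """Entity-mention scoring in the style of Luo et al.: for every feature,
        take the best value over all mentions in the antecedent chain (sketch)."""
        score = 0.0
        for f, w in zip(features, weights):
            best = max(f(mention, anaphor) for mention in chain)
            score += w * best
        return score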

[Figure 2.3 not reproduced: a Bell tree whose leaf nodes correspond to the partitions 4: [mi mj mk], 5: [mi mj][mk], 6: [mi mk][mj], 7: [mi][mj mk], and 8: [mi][mj][mk].]

Figure 2.3: Example Bell tree for three markables as modeled in the entity-mention model in Luo et al. (2004).

The leaf nodes of the Bell tree each denote one of the possible partitions of the mentions. The aim is to identify the one that is most probable. Leaf node 4 shows a partition that puts all mentions in one coreference set, leaf 8 a partition that puts all mentions in their own clusters, and the leaf nodes in between (5-7) denote all possible permutations. Since the search space is large when all mention partitions are considered, Luo et al. applied pruning based on heuristics and beam search. Furthermore, they did not create branches that would attach mentions to entities with an incompatible type.5

Somewhat disappointingly, Luo et al. found that the entity-mention model did not outperform a mention-pair competitor in evaluation. However, the mention-pair model used more features. More relevant to our interest, Luo et al. (p. 6) noted that the mention-pair model created coreference sets with incompatible pronouns, e.g. putting he and she in the same chain. These errors were not found in the output of the entity-mention model.

Yang et al. (2004a) examined an entity-mention model for coreference resolution in the biomedical domain. Like in Luo et al. (2004), coreference clusters were formed incrementally in a left-to-right pass over the mentions. Each mention was compared to previous clusters6 in a pair-wise fashion, where only one pair was created per cluster.

5 The semantic entity type was provided by the ACE gold standard.
6 Singletons formed their own clusters.

For clusters containing multiple mentions, the score for merging the current mention was given by the maximal pair-wise score over all mentions in the antecedent cluster, similar to Luo et al. (2004). While Cardie et al. (1999) used morphological agreement as a hard constraint, Yang et al. turned it into cluster-level features. For example, a feature encoded whether all members of an antecedent cluster agreed in number with the mention at hand. Other cluster-level features were constructed, such as the number of mentions in the antecedent cluster.

Evaluation showed that the entity-mention model outperformed the mention-pair variant by a significant margin, raising the F-score from 78.9% to 81.7%. However, Yang et al. found that, of the cluster-level features, only the string-matching features improved performance, i.e. the compatibility-related features had little effect. Later, Yang et al. (2008a) applied Inductive Logic Programming to learn coreference rules for linking entities and mentions and applied them on top of their earlier entity-mention model, which further improved performance.

Of particular interest for us, Yang et al. (2004b) employed an entity-mention-style approach for pronoun resolution in English. The main idea of the approach was to access features of the antecedent (ante-of-candi) of an antecedent candidate (candi) of a pronoun (given the candidate (candi) was already in a coreference chain). Training and testing instances were formed like in the mention-pair model. However, in the case the antecedent candidate (candi) had itself an antecedent (ante-of-candi), features describing its antecedent (ante-of-candi) were incorporated into the mention-pair instance. Evaluation showed that pronoun resolution performance increased by almost 5 percentage points in success rate (from 70.00% to 74.40%) compared to a standard mention-pair model, highlighting that keeping track of previous decisions, a main feature of the entity-mention model, is also important for pronoun resolution.
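The idea can be pictured as extending the pair feature vector whenever the candidate already has an antecedent; the following hypothetical sketch assumes feature vectors are plain lists and is not the cited implementation.

    def build_instance(candidate, pronoun, antecedent_of, pair_features):
        """Mention-pair instance enriched with features of the candidate's own
        antecedent (ante-of-candi), if it has one (sketch)."""
        vector = pair_features(candidate, pronoun)
        ante_of_candi = antecedent_of.get(candidate)
        if ante_of_candi is not None:
            # e.g. its named entity class, grammatical role, string form, ...
            vector += pair_features(ante_of_candi, pronoun)
        return vector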

Daumé et al. (2005) proposed a system that jointly modeled mention detection7 and coreference resolution in an entity-mention approach. One advantage over other approaches was that the type of the antecedent mention under scrutiny depended on the type of the anaphoric mention. That is, if the anaphoric mention was a name mention (i.e. its head token was a named entity), the coreference chain of the antecedent candidate was queried for a name mention. If none was found, a nominal mention was sought. Failing again, the mention that produced the highest pair-wise score was selected as the representative mention of the chain. Analogously, specific criteria were applied when the anaphoric mention at hand was a noun or a pronoun. Daumé et al. argued that this has the advantage that pairs are favored which facilitate learning and inferring coreference.

7 The task of identifying which of the markables are actually in coreference sets (cf. section 1.5.3).

Daumé et al. applied an impressively large, diverse, and heavily lexicalized feature set. In particular, they explored many novel global coreference partition-level features, such as the number of entities detected so far, the entity-to-mention ratio, and the size of the potential coreference chain. These features proved to be beneficial in evaluation. However, Daumé et al. noted that their system underperformed regarding pronoun resolution.

The winning system of the CoNLL 2011 shared task was a rule-based entity-mention approach which became part of the Stanford CoreNLP pipeline.8 Raghunathan et al. (2010) introduced this system, which incrementally applies a battery of coreference sieves in a precision-first manner. Each sieve consists of rules, such as string matching of mention heads, to determine coreference between mentions. The output of a sieve serves as the input to a subsequent, less precise sieve, and so on. That is, after passing the first sieve, some coreference chains are partially formed. This partial partition of the mentions then passes the next sieve, which merges chains or adds new ones. Despite its arguable simplicity, the system won the CoNLL 2011 shared task.
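The sieve architecture can be pictured as a precision-ordered cascade; in this hypothetical sketch, each sieve is a function that takes the partial partition built so far and may merge chains or add new ones (the actual sieve rules of the cited system are not reproduced here).

    def run_sieves(mentions, sieves):
        """Precision-first cascade of deterministic coreference sieves (sketch).

        Each sieve is a function (partition, mentions) -> partition; the most
        precise sieves (e.g. exact head-string match) run first."""
        partition = [[m] for m in mentions]   # start from singletons
        for sieve in sieves:                  # ordered from high to low precision
            partition = sieve(partition, mentions)
        return [chain for chain in partition if len(chain) > 1]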

Rahman and Ng (2009) presented a combination of the mention ranking approach and the entity-mention model. It adapted the mention ranking's method of ranking all antecedent candidates for a mention at once and implemented the incremental creation of coreference sets of the entity-mention model. That is, after selecting an antecedent for a mention, the mention was appended to the antecedent's coreference chain immediately. Subsequent mentions then accessed all the mentions, like in Luo et al. (2004), to determine whether they should be merged with the chain. Given this combination, Rahman and Ng presented their approach as a cluster ranking model. In evaluation, the approach outperformed the competing models which represented the components of the cluster ranking model, i.e. a mention-pair model, a mention ranker, and an entity-mention model.

In conclusion, related work has presented a set of conceptual improvements over the mention-pair coreference model. However, these conceptual improvements did not always carry over to better performance in evaluation. Also, despite its weaknesses, the mention-pair model remains the most applied coreference model, as can be seen from the entries to the CoNLL shared tasks (Pradhan et al., 2011, 2012, p. 23; 22). One possible reason for the persistence of the mention-pair model might be its arguable simplicity. We saw that the entity-mention models, by comparison, implement complex strategies to determine antecedents which require numerous comparisons and queries of all previous mentions of an antecedent entity. In this light, we propose our own incremental entity-mention model that features a simple and efficient method of bookkeeping entity-level information.

8 http://stanfordnlp.github.io/CoreNLP/

2.3 Our incremental entity-mention model for German coreference resolution

In this section, we discuss our incremental entity-mention model for coreference resolution. Klenner and Tuggener (2010) introduced a first version of the model for German coreference resolution. In Klenner and Tuggener (2011a), we extended the system to tackle English coreference resolution. This system participated in the CoNLL 2011 shared task on English coreference (Pradhan et al., 2011).

In Tuggener and Klenner(2014), we explored the model especially for German pronoun resolution. Here, we discuss the model on the basis of Tuggener and Klenner(2014) and show how it is tailored for handling underspecification of certain German pronouns in the light of the models discussed in the previous sections.

As stated earlier, certain German pronouns are underspecified regarding their number and gender attributes, e.g. the personal pronoun sie (she/they) and the possessive pronoun ihr (her/their), or simply their gender, i.e. the possessive pronoun sein (his/its). This means that both plural and singular/feminine antecedent candidates have to be considered for sie and ihr, and both masculine and neutral candidates have to be licensed for sein, which leads to a large number of candidates.9 Additionally, if resolved instances of these pronouns are not disambiguated, they can act as antecedent candidates for subsequent pronouns which are incompatible with their antecedents. This leads to incoherent coreference chains when the pairs of antecedents and pronouns are merged, as we will demonstrate.

In the following exposition, we represent mentions in the form [lexeme]^{morph.}_{entityID}, where identical values for entityID signify coreference, and ∗ indicates underspecified values. Consider the three mentions [Berlin]^{neut.}_1, [sein]^{∗}_1, [er]^{masc.}_2 (Berlin, its/his, he). When resolving the possessive pronoun [sein]^{∗}_1, a mention-pair model would generate the pair [Berlin]^{neut.}_1 − [sein]^{∗}_1. Next, when resolving [er]^{masc.}_2, the model would not generate the pair [Berlin]^{neut.}_1 − [er]^{masc.}_2, assuming that it applies morphological agreement as a hard filter. However, it would consider the pair [sein]^{∗}_1 − [er]^{masc.}_2. If all licensed pairs were classified as positive, the transitive closure would yield the coreference chain [Berlin − sein − er], which is obviously inconsistent, since [Berlin]^{neut.}_1 and [er]^{masc.}_2 are exclusive. However, the mention-pair model has no means to propagate this exclusiveness from one pair decision to another.

To address this issue, we introduced our entity-mention model which features incremental disambiguation. While we here focus our discussion on German pronouns, it is noteworthy that the model constitutes a general coreference resolution model for all mention types and is applicable to other languages (Klenner and Tuggener, 2010, 2011a). Algorithm 1 outlines the model based on Tuggener and Klenner (2014).10

9 We quantify these numbers in section 5.3.1.

Algorithm 1 Incremental entity-mention model
Input: Markables
Output: Coreference partition
1:  for mi ∈ Markables do
2:    for ek ∈ CorefPartition do
3:      if compatible(ek_n, mi) then               ▷ ek_n is the most recent mention of entity ek
4:        Candidates ⊕ ek_n
5:    for mj ∈ BufferList do
6:      if compatible(mj, mi) then
7:        Candidates ⊕ mj
8:    ante ← get_best(Candidates)
9:    if ante ≠ ∅ then                             ▷ An antecedent has been identified
10:     ante, mi ← disambiguate(ante, mi)          ▷ Propagate animacy, NE class, morphology
11:     if ∃ ek ∈ CorefPartition : ante ∈ ek then  ▷ Antecedent is part of a coref. chain
12:       ek ⊕ mi
13:     else
14:       CorefPartition ⊕ {ante ⊕ mi}             ▷ Open new coreference chain
15:       BufferList ⊖ ante                        ▷ Remove antecedent from buffer list
16:   else
17:     BufferList ⊕ mi                            ▷ No antecedent, append mi to buffer list
18: return CorefPartition

Before walking through the algorithm, we note that the model uses it for both the training and testing mode. The difference between the two modes lies in the function get_best(Candidates) (line 8). In training mode, the function accesses the gold standard to identify the correct antecedent among the candidates and returns it. Also, it creates training instances or updates the weights of the applicable features of the candidates, depending on the machine learning framework. During testing, the function scores all candidates and returns the one with the highest score as the predicted antecedent.

The algorithm traverses the markables in a left-to-right pass (main loop lines 1-17) and establishes the coreference partition incrementally, i.e. once finished, there are no further steps required to produce or refine the partition.

For each markable mi, we gather compatible antecedent candidates from the coreference partition (lines 2-4) and the buffer list (lines 5-7).11 The buffer list contains markables that have been traversed earlier but have not been resolved to an antecedent. The list mainly contains nominal markables which can serve as antecedents, since we always resolve pronouns when there is at least one compatible candidate. Obviously, the first markable in a document will always be put in the buffer list, since there are no previous markables to refer to. If a markable from the buffer list is selected as antecedent (lines 13-15), the pair is appended to the coreference partition (line 14), i.e. it becomes one of the entities denoted by ek for the next markable mi+1 in the iteration. The denoted entity ek is then only accessible through its last mention ek_n, which is now mi. That is, the ante markable is no longer accessible for subsequent markables, which prevents the generation of redundant pairs (as ante and mi denote the same entity). All of ante's relevant features have been projected onto mi (line 10), and ante is removed from the buffer list (line 15).

10 In the previous sections, we have overloaded the term mention to subsume markables. Here, we return to the distinct terms to give a concise presentation of the algorithm. The term markable denotes all NPs and pronouns considered to be potentially partaking in coreference relations, while mention refers to markables that are actually resolved or annotated as being coreferent in a gold standard.
11 Compatibility is defined based on the markable type, i.e. the PoS tag of mi. Cf. section 5.2 for a more specific description where we implement the model. We here restrain the discussion to the conceptual aspects.

Furthermore, the coreference partition is queried for compatible antecedent candidates (lines 2-4). Only the last mention ek_n of the mentions of a specific entity ek in the coreference partition is accessible; the other mentions are hidden, as mentioned above. If an antecedent candidate from such an entity ek in the coreference partition is selected (line 11), mi is directly appended to the coreference chain of ek (line 12).

The restricted accessibility of mentions outlined above is one major difference to other entity-mention models discussed in the previous section. Related entity-mention models (but also clustering-based models) query all previous mentions of an entity to decide whether a markable at hand should be resolved to it. By contrast, we project all known features of the entity to its most recent mention. We then only need to query this last mention to decide whether to resolve a subsequent markable to the denoted entity.
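For concreteness, the following is a compact Python rendering of Algorithm 1. The helper functions compatible, score, and project_features are placeholders for the components described later, and details such as distance constraints and the training mode of get_best are omitted; it is a sketch, not the actual implementation.

    def resolve_incremental(markables, compatible, score, project_features):
        """Sketch of the incremental entity-mention model (cf. Algorithm 1)."""
        partition = []      # list of coreference chains (lists of mentions)
        buffer_list = []    # unresolved markables that may still become antecedents
        for m in markables:
            candidates = []
            for chain in partition:
                last = chain[-1]                  # only the most recent mention is visible
                if compatible(last, m):
                    candidates.append(("chain", chain, last))
            for b in buffer_list:
                if compatible(b, m):
                    candidates.append(("buffer", None, b))
            if not candidates:
                buffer_list.append(m)             # no antecedent: park m on the buffer list
                continue
            origin, chain, ante = max(candidates, key=lambda c: score(c[2], m))
            project_features(ante, m)             # propagate morphology, NE class, animacy
            if origin == "chain":
                chain.append(m)                   # m becomes the entity's new last mention
            else:
                partition.append([ante, m])       # open a new coreference chain
                buffer_list.remove(ante)          # ante is no longer directly accessible
        return partition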

To exemplify the algorithm and highlight its differences to the mention-pair model, recall our example from section 1.5, i.e.:

Im Januar hat die Arbeiterwohlfahrt Bremen_1 ihren_1 langjährigen Geschäftsführer Hans Taake fristlos entlassen.

In January, the Worker Welfare Association Bremen_1 has laid off its_1 long-term CEO Hans Taake without notice.

In this example, the possessive pronoun “ihren” (“her/their”) is underspecified. The algorithm first resolves the possessive pronoun [ihre]^{∗}_1 to the antecedent [die Arbeiterwohlfahrt Bremen]^{Sg.}_1 and projects the morphological properties of the antecedent (feminine, singular) onto the pronoun. Also, the antecedent has the named entity class ORG (organisation), and preprocessing has determined it to be an inanimate entity. These three features (number/gender, named entity class, animacy) are projected onto [ihre]^{∗}_1, which disambiguates its morphological properties and augments its semantic features (line 10 in algorithm 1). [ihre]^{Sg.}_1 is now the last mention of the [Arbeiterwohlfahrt]^{Sg.}_1 entity and thus the only accessible mention for subsequent reference, since the markable [Arbeiterwohlfahrt]^{Sg.}_1 is removed from the buffer list (line 15).

Having disambiguated this last mention, we prevent the entity from being selected as an antecedent candidate for a subsequently occurring, incompatible pronoun markable, e.g. a plural instance of [sie]^{Pl.}_∗ (they).12 Without the projection of the antecedent morphology, a plural instance of [sie]^{Pl.}_∗ would consider [ihre]^{∗}_1 as a candidate, which is what happens in a mention-pair model, since earlier decisions are not considered. If a coreference link were established between the two markables, the coreference chain would feature conflicting morphological properties.
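A minimal illustration of why the projection matters is given below; the attribute values are invented for the example and the functions are placeholders. After copying the antecedent's morphology onto [ihren], a plural [sie] no longer passes the compatibility check against it.

    ihren = {"lemma": "ihr", "number": None, "gender": None}      # underspecified
    awo = {"lemma": "Arbeiterwohlfahrt", "number": "sg", "gender": "fem"}

    def project(ante, ana):
        """Copy the antecedent's specified morphological values onto the anaphor."""
        for attr in ("number", "gender"):
            if ana[attr] is None:
                ana[attr] = ante[attr]

    def compatible(cand, pron):
        """Underspecified values (None) are compatible with anything."""
        return all(cand[a] is None or pron[a] is None or cand[a] == pron[a]
                   for a in ("number", "gender"))

    project(awo, ihren)                      # ihren is now singular/feminine
    sie_plural = {"lemma": "sie", "number": "pl", "gender": None}
    print(compatible(ihren, sie_plural))     # False: candidate correctly ruled out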

[Figure 2.4 not reproduced: both models are shown over the sequence [die Arbeiterwohlfahrt Bremen]^{Sg.}_1, [ihren]_1, [sie]^{Pl.}_2, with the links each model considers or discards.]

Figure 2.4: Differences in pair generation mechanics in the mention-pair (M-P; above) and entity-mention (E-M; below) model. Solid arcs denote established decisions. Red dotted arcs signify potential links not considered, green dotted arcs considered links.

Figure 2.4 illustrates these different pair generation mechanics in the models. Both models establish a correct link from [ihren]^{∗}_1 to [die Arbeiterwohlfahrt Bremen]^{Sg.}_1, since the markables are morphologically compatible. When processing the markable [sie]^{Pl.}_2, neither model considers a link to [die Arbeiterwohlfahrt Bremen]^{Sg.}_1, but for different reasons. In the mention-pair model, the link is discarded because the markables are morphologically incompatible. The entity-mention model does not consider the link because the markable [die Arbeiterwohlfahrt Bremen]^{Sg.}_1 is no longer a member of the buffer list after [ihren]^{∗}_1 has been resolved to it. Thereafter, only the [ihren]^{Sg.}_1 mention, the last mention of the [die Arbeiterwohlfahrt Bremen]^{Sg.}_1 entity, is accessible for subsequent reference as an antecedent candidate. However, the entity-mention model does not generate the mention [ihren]^{Sg.}_1 as an antecedent candidate for [sie]^{Pl.}_2, since [ihren]^{∗}_1 has inherited the morphological properties of its antecedent (singular/feminine) and is therefore incompatible with [sie]^{Pl.}_2. The mention-pair model, by contrast, considers the link from [sie]^{Pl.}_2 to [ihren]^{∗}_1, because [ihren]^{∗}_1 is still morphologically underspecified after its resolution to [die Arbeiterwohlfahrt Bremen]^{Sg.}_1. If the link is realized, [sie]^{Pl.}_2 becomes a member of the coreference chain, i.e. [die Arbeiterwohlfahrt Bremen − ihre − sie]^{Sg./Pl.}_1, after the transitive closure, which is morphologically incoherent.

12 Sie is itself morphologically underspecified (she/they). However, sie can be disambiguated by the morphology of the governing verb in the case that the pronoun is in the subject position.

Besides ensuring coherence of morphological and semantic properties, the entity-mention model also affects the occurrence counts of features during training and testing, since pronoun antecedent candidates now carry the semantic features of the entities they denote, e.g. named entity class and animacy. For example, the [ihre]^{∗}_1 mention above features the named entity class ORG after having been resolved to the [Arbeiterwohlfahrt]^{Sg.}_1 entity. When encountering a subsequent, compatible personal pronoun, i.e. [sie]^{Sg.}_∗, the information that [ihre]^{∗}_1 denotes an organisation entity might affect its likelihood of being selected as the antecedent of [sie]^{Sg.}_∗, given that there is another candidate which denotes e.g. a person entity. That is, propagating a feature value, like named entity class, between mentions affects the distribution of the value, which in turn affects how machine learning frameworks weight it.

There is a caveat, however. It is possible that an underspecified pronoun is resolved to an incorrect antecedent which then projects its morphological properties onto it. The pronoun then carries incorrect morphological features, which prevents it from becoming the correct antecedent for subsequent pronouns in the chain. Consider an extension and a modification to our previous example, i.e. resolving the mentions [Berlin]^{neut.}_1, [Der Mann]^{masc.}_2, [sein]^{∗}_2, [er]^{masc.}_2. If the entity-mention model incorrectly resolves [sein]^{∗}_2 to [Berlin]^{neut.}_1 instead of to [Der Mann]^{masc.}_2, the now incorrectly disambiguated mention [sein]^{neut.}_1 can no longer act as an antecedent for [er]^{masc.}_2 due to falsely produced morphological incompatibility. Furthermore, if the mention [Der Mann]^{masc.}_2 is more than three sentences away from [er]^{masc.}_2, the denoted entity, which is the correct antecedent entity, is no longer accessible for [er]^{masc.}_2 due to distance constraints. Therefore, [er]^{masc.}_2 would be resolved to some other incorrect but compatible candidate, if available. Empirical evaluation in section 5.4 will assess whether this problem occurs frequently enough to hamper the performance of the model significantly, or whether, despite the danger of such cascaded errors, the overall benefits of the entity-mention model outweigh the drawbacks of the mention-pair model.

A notable aspect of the incremental entity-mention model is that it mimics human cognitive processes of reading, arguably at least to a certain degree, and certainly to a larger degree than other models, e.g. the mention-pair model (Ng, 2010, Klenner and Tuggener, 2011a, Webster and Curran, 2014). For example, Webster and Curran (2014) rephrase the entity-mention model in the nomenclature of the shift-reduce algorithm commonly known from parsing. The stack holds entities. When a markable is processed, a classifier decides whether the markable should be appended to an existing entity (the reduce operation) or if it should denote a new entity (the shift operation).13 The stack structure can be seen as a model of human short-term memory, and entities nested deeper in the stack (or short-term memory) are arguably harder to retrieve or access for reference.

We argue that the entity-mention model at least captures two aspects of human reading behaviour: i) The coreference relations in a document are created in a single pass over the text, and ii) the model traverses text in a left-to-right manner.

The mention-pair model can be considered a two-step architecture. In a first pass, local pairs are formed and classified. The transitive merge of the positive pairs is a second pass needed to realize the transitive nature of the coreference relations. The entity-mention model, by contrast, forms the coreference partition in a single pass over the markables, which is more closely related to human cognitive processes when reading texts.

In the mention-pair model, the order in which the markables are processed during pair generation is of no significance, since all previous markables are generally considered as antecedent candidates for any markable. This is also true for graph and clustering models for coreference, because they do not take into account the original discourse ordering of the markables. The entity-mention model, by contrast, processes the markables in text direction, i.e. from left to right, which more closely represents human reading behaviour.

Another advantage of our model over the mention-pair model, but also over related entity-mention models, is that it does not consider redundant pairs of markables. To illustrate this, we revisit our example from figure 2.1. Table 2.2 juxtaposes the pair generation mechanics of the mention-pair and entity-mention model for the example. As discussed in section 2.1.1.2, the mention-pair model (left) produces many redundant pairs which all denote the same entity, i.e. ten pairs in total for the two pronouns. The entity-mention model (right) propagates morphological information (↓+masc.sg.) from mention to mention and therefore avoids creating invalid pairs. In the example, the model thus only creates two pairs, both of which are relevant. We argue that this model is a more plausible depiction of human cognitive processes, since it seems unlikely that humans access all individual previous mentions of an antecedent candidate entity when resolving an anaphor.

While it is in general not necessary to have a coreference model that is closely related to human cognitive processes, especially from an engineering perspective, we argue that it is desirable from a theoretical viewpoint; especially if the relatedness yields a model that overcomes deficiencies of another, less related one.

13 This is analogous to deciding whether a markable should be put on the buffer list or if it should be resolved to an antecedent which is already a member of a coreference chain in our model.

[Table 2.2 not reproduced in full: the left side shows the pairs the mention-pair model forms over the chains [Bill Clinton − Clinton − President − he] and [Angela Merkel − Merkel − Chancellor − she]; the right side shows the entity-mention model propagating +masc.sg. and +fem.sg. along the chains.]

Table 2.2: Example of pairs formed by the mention-pair (left) and entity-mention model (right). Red arrows denote negative pairs, green ones show positive ones. Dashed arrows signify coreference links invisible to the model, and solid black arrows designate coreference links previously established.

2.4 Chapter summary

This chapter presented and discussed the most prevalent discourse processing model for coreference resolution, namely the mention-pair model. We reiterated its commonly known weaknesses and demonstrated in particular how they affect the task of German pronoun resolution.

We surveyed related work that conceptually addresses these weaknesses, including clustering and graph-based coreference models. We discussed our entity-mention model, which is tailored to cope with the shortcomings of the mention-pair model w.r.t. the underspecification of mentions, and demonstrated its benefits for German pronoun resolution.

Chapter 3

Coreference resolution evaluation

In this chapter, we discuss how coreference resolution systems are evaluated empirically. We outline the general evaluation framework and point to commonly acknowledged problems. The pros and cons of the common evaluation framework as introduced in the CoNLL shared tasks (Pradhan et al., 2011, 2012) have been discussed extensively, and modifications of the metrics have been implemented (Cai and Strube, 2010b) and reverted again (Pradhan et al., 2014).

We argue that the general evaluation framework for coreference resolution is geared towards comparing system responses under a closed-world assumption, i.e. detached from the potential integration of the systems into CL and NLP pipelines. In Tuggener (2014), we proposed the ARCS (Application-Related Coreference Scores) evaluation framework, which addresses this issue. This chapter overviews both evaluation frameworks and proposes several extensions to ARCS.

3.1 Issues in evaluation from the perspective of higher-level applications

In order to evaluate coreference systems, their output is mapped onto a manual annotation of coreference in a gold standard. The difference between the manual annotation (the key) and the system output (the response) is quantified based on a distance metric.

The commonly used evaluation framework for coreference resolution was introduced in Denis and Baldridge (2009) and adapted by the CoNLL shared tasks (Pradhan et al., 2011, 2012). It consists of applying five acknowledged coreference metrics and then averaging the F-scores of three of them to determine the overall best performing system. We briefly overview the five metrics, but not extensively, since they all share a common principle which makes it arguably difficult to interpret them from the view of downstream applications.

• MUC: The MUC metric (Vilain et al., 1995) compares links between mentions in the key chains to links in the response chains. Recall is calculated based on the number of links that have to be inserted into the response to obtain a key chain, and Precision is determined by the number of links that have to be deleted in a response chain to obtain a key chain.

• B-CUBED: For Recall, B-CUBED (Bagga and Baldwin, 1998) evaluates each mention in the key by mapping it to one of the mentions in the response and then measuring the amount of overlapping mentions in the corresponding key and response chains. To calculate Precision, the role of key and response are inverted.

• CEAFM/CEAFE: The CEAF metrics (Luo, 2005) are based on the B-CUBED metric but impose restrictions on the alignment of the key and response chains to enable a more coherent comparison. CEAFM is a mention-centric measure, while CEAFE evaluates entities.

• BLANC: The BLANC metric (Recasens and Hovy, 2010) uses the Rand Index clustering algorithm to align key and response chains and then evaluates coreference and non-coreference links.

The common principle shared by all these metrics is that they view coreference chains as unordered sets of generic items (Chen and Ng, 2013) and approach evaluation as a clustering evaluation task. This principle is questionable, since a) coreference chains are not unordered. The sequence of the mentions in a chain is established by the occurrence sequence of the mentions in discourse and therefore follows a natural, linear structure. And b) mentions are not generic items, but linguistic objects with linguistic properties, such as PoS, syntactic label etc.

Given this arguably problematic principle, three main issues arise from the perspective of downstream applications:

• Interpretability: The meaning of the metrics cannot easily be conveyed in natural language, because of the rather complex algorithms used to determine the scores. For example, how can a 56.78% Recall in BLANC be interpreted for a Sentiment Analysis system? Additionally, averaging F-scores further obscures the meaning of the scores.

• Informativeness: As the metrics do not differentiate mention types, e.g. based on part-of-speech tags of the mentions, we cannot gain more fine-grained insight into why a system performs better than another. That is, does one system resolve pronouns better than the other, or does it simply feature a more precise markable boundary identification approach? Which system resolves noun mentions better?

• Differentiability: The metrics quantify the goodness of each system response. However, they cannot tell us how different two given system outputs are or how much they overlap. We do not know whether two systems with very similar F-scores resolve the same set of mentions or resolve completely different mentions. For example, given a set of mentions [A − B − C − D − E], one system might correctly produce the links between mentions [A − B − C], while another system might produce the links [C − D − E]. The MUC metric, for instance, will rank both responses equally, because each of them finds two out of four coreference links. However, the outputs do not overlap at all. On the contrary, they are maximally dissimilar.

Another problem is that the metrics show discrepancies regarding the ranking of systems (Holen, 2013), i.e. they have been shown to produce different rankings over the same set of system responses. To handle this problem, the CoNLL shared tasks took the average of MUC (a link-based metric), B-CUBED (mention-based), and CEAFE (entity-based)1 to determine the winner.

3.2 The ARCS evaluation framework for coreference resolution

Given the issues in coreference evaluation discussed in the previous section, we proposed the Application-Related Coreference Scores (ARCS) in Tuggener (2014). It is important to note that we do not claim to have presented a solution that solves all issues in coreference evaluation. Neither are our proposed metrics intended to replace the common evaluation framework. Our aim is, as an alternative to the common coreference evaluation, to abandon the idea of evaluating systems in a general or universal setting, to look at them from the perspective of downstream applications, and then to provide measures that are interpretable and relevant from their view.

1 The so-called MELA (Mention, Entity, and Link Average) metric originally introduced by Denis and Baldridge (2009).

The main goal of the ARCS metrics is to address the three issues of the common coreference metrics we outlined in the previous section: interpretability, informativeness, and differentiability.

In order to achieve interpretability, we keep the commonly known nomenclature used in Information Retrieval from where coreference evaluation borrows its terms. That is, we use Recall and Precision, True Positives, False Positives and so forth as the basis of our scores.

In Information Retrieval, Recall denotes how many of the relevant documents are retrieved by a system, i.e. Recall = TP / (TP + FN). Thus, the perspective is that of the gold standard. Precision, in turn, takes the perspective of the system, i.e. how many of the retrieved documents are relevant: Precision = TP / (TP + FP). To adapt the nomenclature and its meaning to coreference, we want Recall to denote how many of the coreferent mentions in the gold standard are identified and resolved correctly by a system, and Precision to signify how many of the mentions resolved by the system overall are correct. Thus, if a system resolves many of the gold mentions correctly, Recall is high, and if a system does not hallucinate or invent too many mentions2, Precision is high.

To achieve these definitions of Recall and Precision, we adapt the common classes TP, FP, TN, FN from Information Retrieval in a straightforward fashion to coreference.

• TP: True positive; a gold mention correctly resolved by a system.

• FP: False positive; a markable resolved by a system that is not annotated as coreferent in the gold standard.

• FN: False negative; a gold mention not resolved by a system.

• TN: True negative; a markable not resolved by a system that does not have coreference annotation in the key3.

However, we need a novel error class for those cases where a system correctly decides to resolve a gold mention, but attaches it to an incorrect antecedent candidate. These cases are not included in the inventory above. We name this error class Wrong Linkage (WL) and then define Recall = TP / (TP + FN + WL) and Precision = TP / (TP + FP + WL). The

2I.e. so-called spurious or twinless system mentions, cf. section 1.5.4
3Note that we do not use the TN class, which would signify singletons, i.e. NPs not partaking in coreference resolution and which a system (correctly) does not resolve. Whether to include singletons in coreference evaluation has been disputed and shown to greatly affect evaluation (Kübler and Zhekova, 2011). The argument for including singletons is that coreference systems should be rewarded for not resolving non-coreferent markables. Our view is that downstream applications have no use for singletons and therefore systems should not be rewarded for identifying them. We thus discard the TN class.

Recall denominator extends over all mentions in the key, and the Precision denominator denotes all mentions in the response. Recall now has the desired meaning of how many of the gold mentions are resolved correctly, and Precision accurately signifies how many of the response mentions are correct. We believe these definitions to be straightforward enough to serve interpretability.
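To make the adapted nomenclature concrete, the following minimal Python sketch (an illustration, not part of the original ARCS implementation) computes the ARCS scores from the four class counts; the example values are the personal pronoun counts that HOTCoref obtains later in table 3.7.

```python
def arcs_scores(tp, fp, fn, wl):
    """ARCS Recall, Precision and F1 from the mention class counts.

    tp: gold mentions resolved to a correct antecedent
    fp: spurious (non-gold) markables resolved by the system
    fn: gold mentions the system did not resolve
    wl: gold mentions resolved to an incorrect antecedent (wrong linkage)
    """
    recall = tp / (tp + fn + wl) if (tp + fn + wl) else 0.0
    precision = tp / (tp + fp + wl) if (tp + fp + wl) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1

# HOTCoref, personal pronouns (cf. table 3.7): Recall 77.70, Precision 79.96, F1 78.81
print(arcs_scores(tp=4791, fp=462, fn=636, wl=739))
```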

Note that, like Chen and Ng(2013), we have adopted a link-based view to evaluate coreference. For each mention, we examine how it is linked to an antecedent and classify it accordingly. This comes with an additional benefit. Since we investigate each mention individually, we have access to all its linguistic properties and are able to evaluate systems with regard to any of these properties. For example, we can assess system performance regarding the PoS tags of the mentions and compare systems by their pronoun and noun resolution strengths. Or we can diversify further and analyze how well different systems resolve different pronoun lemmas of a particular pronoun type, as demonstrated in Tuggener and Klenner(2014).

Another benefit of introducing the wrong linkage class is that we can analyze in more detail whether a system performs well because it features a strong mention resolution strategy (by analyzing the true positive and wrong linkage distributions) or because it is effective in determining anaphoricity of mentions (by looking at the false negative and false positive class counts).4 Given these possibilities, our evaluation provides different levels of informativeness.

Having outlined the nomenclature, the crucial task remains to determine what “resolved correctly” means regarding a gold mention. The question is synonymous with asking what a correct antecedent is. In Tuggener(2014), we clustered different applications that benefit from coreference resolution into three groups, where each application within a group has similar requirements regarding the definition of a correct antecedent. At this point, we refer to the paper for more details on the application groups. We note, however, that all our scores take into account the linear order of the occurrence of the mentions in discourse. That is, one score requires mentions to link to their immediate antecedent in the key chain (ARCS immediate antecedent), one metric requires that the closest nominal antecedent of mentions is correct (ARCS inferred antecedent), and one score requires all mentions to link to the first nominal mention in a chain (ARCS anchor mention). The bare-bones algorithm for comparing a key to a response within ARCS is outlined in the next section (algorithm 2), where we focus on the last point of interest mentioned above, namely differentiability.

4We exemplify this method of analysis in the next section 3.2.1.

3.2.1 Quantification of the difference between system responses

The CoNLL shared tasks revealed that many of the top ranked systems score similarly, despite deploying profoundly different approaches to coreference resolution. For example, the system that ranked third in 2011 only scored 0.03 points lower in average F-score compared to the second best system. It also performed best on the BLANC and B-CUBED metrics, but, since BLANC was not included in the average F-score, the system ranked lower than the first two.

Given that the top ranking systems score very similarly on the F-score average, an interesting question is whether these systems actually resolve a similar set of markables. The performance difference then would stem from one system resolving slightly more of the commonly resolved mentions correctly. Alternatively, the systems generally resolve rather different sets of mentions, and the best performing system resolves more mentions correctly. To assess this, we propose an approach to quantify the overlap of two (and potentially more) system responses, which complements the common evaluation.

Using the class inventory introduced in the previous section, we compare two system responses S1 and S2 to a key K. We classify each gold mention mi of a gold entity ek ∈ K in S1 and S2 and determine if its classification c(mi) ∈ {TP, FN, FP, WL} is the same in both responses. To classify mi here, we assess whether mi in S1 or S2 shares an antecedent with mi in K, respectively.5 Additionally, if mi denotes a false positive in S1, we check whether it is also a false positive in S2, and vice versa, because the overlap of spurious system mentions should also be considered in a similarity estimate of system responses.

We first outline how mentions are classified in the ARCS framework given a key and one system response and then extend it to two responses. Algorithm 2 presents the algorithm for classifying the mentions given a key K and a system response S. Neglecting the linguistic subtleties6, we call mentions anaphoric if they are not the first mention of a chain. Note further that we only evaluate anaphoric mentions (anaphoric as in the definition just stated), as indicated by lines 2 and 10. We use the false negative class to label key mentions that are either not in the response (line 4) or which are the first mention in the response but not in the key (line 5). Here, as stated above, we only require that the response chain of a mention contains an antecedent that is also in the key chain in order for the mention to be counted as a true positive (line 6).

5We realize that this can be argued to be a weak criterion for evaluating mentions in coreference sets. However, for our purposes of comparison here, we argue that it is sufficient because we do not aim at establishing the “goodness” of the systems, but at quantifying the differences of their outputs. Also, it is a method often adapted in evaluating pronoun resolution within coreference in a pair-wise fashion (cf. section 3.4). Note that we could require any criterion presented in the ARCS framework regarding the correctness of the antecedent.
6cf. section 1.4.1.3

Algorithm 2 ARCS mention classification algorithm

Input: Key K, System response S
Output: Mention class c_mi ∀ m ∈ K ∪ S

 1: for ek ∈ K do
 2:     for mi ∈ ek where index(mi, ek) > 0 do
 3:         es ← get_entity(mi, S)
 4:         if es == ∅ then c_mi ← FN
 5:         if index(mi, es) == 0 then c_mi ← FN
 6:         if ∃ mj ∈ es : j < i ∧ mj ∈ ek then c_mi ← TP
 7:         else c_mi ← WL
 8: return c_mi
 9: for es ∈ S do
10:     for mi ∈ es where index(mi, es) > 0 do
11:         ek ← get_entity(mi, K)
12:         if ek == ∅ then c_mi ← FP
13:         if index(mi, ek) == 0 then c_mi ← FP
14:         else c_mi ← TN
15: return c_mi

We also want to include system mentions in the comparison, i.e. markables resolved by the system that are not in the key (false positives). To do so, we iterate over all system mentions and check whether they are absent from the key (lines 11-12) or whether their index in the key chain is zero, which means that the mention is the chain starter in the key. In both cases, the system has resolved the markable to an antecedent and deemed it anaphoric, and therefore we count it as a false positive.

When comparing a key to two system responses, i.e. S1 and S2, the difference in the algorithm consists of processing lines 1-8, i.e. the key mention classification, for both responses and recording whether the classification of mi differs in S1 and S2. Furthermore, lines 9-15 are applied to both responses to identify shared and non-shared false positives.

Finally, we calculate the difference of S1 and S2 as the percentage of mentions that are classified differently in S1 and S2, where the set of considered mentions is the union of all mentions in K, S1, and S2:

    diff(S1, S2) = |{mi ∈ K ∪ S1 ∪ S2 : c_mi in S1 ≠ c_mi in S2}| / |{mi ∈ K ∪ S1 ∪ S2}|    (3.1)
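The following Python sketch illustrates the classification loop of algorithm 2 and the difference metric (3.1). It is an illustrative reconstruction under simplifying assumptions, not our actual implementation: an entity is assumed to be an ordered list of hashable mention identifiers, and a key or response is a list of such entities.

```python
def entity_of(mention, entities):
    """Return the entity (chain) that contains the mention, or None."""
    for entity in entities:
        if mention in entity:
            return entity
    return None

def classify_mentions(key, response):
    """Classify mentions into TP, WL, FN (key side) and FP, TN (response side)."""
    classes = {}
    # Lines 1-8 of algorithm 2: anaphoric key mentions.
    for e_k in key:
        for i, m in enumerate(e_k):
            if i == 0:                          # skip chain starters
                continue
            e_s = entity_of(m, response)
            if e_s is None or e_s.index(m) == 0:
                classes[m] = "FN"               # not resolved in the response
            elif any(a in e_k for a in e_s[:e_s.index(m)]):
                classes[m] = "TP"               # shares an antecedent with the key chain
            else:
                classes[m] = "WL"               # resolved to a wrong antecedent
    # Lines 9-15 of algorithm 2: anaphoric response mentions not covered above.
    for e_s in response:
        for i, m in enumerate(e_s):
            if i == 0 or m in classes:
                continue
            e_k = entity_of(m, key)
            classes[m] = "FP" if e_k is None or e_k.index(m) == 0 else "TN"
    return classes

def diff(key, s1, s2):
    """Equation (3.1): fraction of the mention pool classified differently in S1 and S2."""
    c1, c2 = classify_mentions(key, s1), classify_mentions(key, s2)
    pool = {m for entities in (key, s1, s2) for e in entities for m in e}
    # Chain starters are never classified and therefore never count as changed.
    changed = sum(1 for m in pool if c1.get(m) != c2.get(m))
    return changed / len(pool) if pool else 0.0
```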

We apply our difference metric to four publicly available coreference systems and their responses for the CoNLL 2012 shared task test set to demonstrate the use of the metric. The systems are the following:

• Stanford CoreNLP (Lee et al., 2013) is a rule-based (or rather sieve-based) coreference system that won the CoNLL shared task 2011 and which is probably the most widely used and popular coreference system for English to date.

• IMSCoref (Björkelund and Farkas, 2012) ranked 2nd in the CoNLL shared task 2012 and uses resolver stacking in a machine learning setting.

• Berkeley Coreference (Durrett and Klein, 2013) achieved a substantial performance improvement over the state-of-the-art in a CoNLL 2012 post-task evaluation. One of its novelties was the heavily lexicalized feature set.

• HOTCoref (Björkelund and Kuhn, 2014) is, to our knowledge, the best performing freely available coreference resolution system for English.7 It promotes the view of coreference sets as trees and models coreference as a structure prediction problem. Also, it makes use of latent antecedent features.8

Table 3.1 reports the performance of the systems using the reference scorer9 (Pradhan et al., 2014) for the commonly used metrics. We apply our difference metric to the systems in a pair-wise fashion, i.e. one response acts as S1 and another as S2. The overview is given in table 3.2.

The first surprising observation we make is that the system outputs actually seem to be quite different, i.e. the percentage of mentions classified differently is high. This magnitude of difference is not present in the evaluation of the systems using the common metrics in table 3.1. For example, the HOTCoref system outperforms the Berkeley system by 2.69 points in average F-score (64.31 vs. 61.62), but the systems process 28.15% of the mentions in the pool ({K ∪ S1 ∪ S2}) differently. We further see that the best and second best performing systems are most similar (BERK vs. HOTCoref; 28.15% difference), while the best and worst performing system are most different (STAN vs. HOTCoref; 37.60% difference).

We are also able to measure system differences regarding certain properties of the mentions, such as PoS tags. An interesting question is how different the systems are w.r.t. the resolution of noun markables compared to resolving pronouns. We observe that the systems differ more regarding the noun mentions (table 3.3) than regarding the pronouns (table 3.4). For example, the HOTCoref and Berkeley systems process 33.81% of the noun mentions differently, while they only differ in 22.44% of the pronoun instances.

7However, the Stanford and Berkeley systems are the only two that can be run on raw text, i.e. IMS and HOTCoref require preprocessed and preformatted input, while Stanford and Berkeley are real end-to-end coreference resolution systems.
8Given that an antecedent candidate of a mention is already part of a coreference set, other mentions of that chain can be accessed for generating features.
9http://conll.github.io/reference-coreference-scorers/ Note that the scores differ from those on the website of the CoNLL shared tasks because of the changes to the scorer.

Table 3.1: F-scores of coreference systems on the CoNLL 2012 test set.

              MUC     B3      CEAFE   Fφ
  STAN       64.81   52.74   49.51   55.69
  IMS        67.58   54.47   50.21   57.42
  BERK       71.19   58.29   55.39   61.62
  HOTCoref   73.29   61.36   58.30   64.31

Table 3.2: Percentages of all mentions classified differently.

              IMS     BERK    HOTCoref
  STAN       35.20   36.31   37.60
  IMS                36.09   33.27
  BERK                       28.15

Table 3.3: Percentages of nominal mentions classified differently.

              IMS     BERK    HOTCoref
  STAN       41.78   43.87   43.24
  IMS                45.28   41.84
  BERK                       33.81

Table 3.4: Percentages of pronouns classified differently.

              IMS     BERK    HOTCoref
  STAN       28.16   28.47   31.94
  IMS                26.10   24.14
  BERK                       22.44

Overall, we conclude that the analysed systems for coreference resolution in English differ more strongly than the averaged F-scores would suggest w.r.t. how they process the mentions.

Given our inventory of mention classifications, we are also able to understand in what regard two systems differ, and, in turn, why one system performs better than another. To do so, we analyse the composition of the difference between two systems, i.e. we investigate the mentions processed differently. We track each mention and its specific classification in S1 and S2, respectively. For example, we count how often a mention that is classified as wrong linkage (WL) in S1 is classified as true positive (TP) in S2, i.e. the count of WL → TP transitions from S1 to S2. If we encounter many of these WL → TP transitions, we conclude that S2 performs better than S1 because it features a better strategy to link mentions to the correct antecedents. On the other hand, if we encounter many false positive to true negative transitions, FP → TN, we conclude that S2 performs better because it is superior in identifying which markables to resolve, i.e. it produces fewer spurious mentions (false positives), etc.
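As a sketch, the transition counts underlying tables 3.5 and 3.6 can be derived directly from two such classification mappings (e.g. as returned by a classify_mentions() helper like the hypothetical one sketched above); the grouping into corrections and errors is the one used in the tables.

```python
from collections import Counter

# Transitions that turn an error into a correct decision (↑ in the tables).
CORRECTIONS = {"wl->tp", "fp->tn", "fn->tp"}

def transition_counts(c1, c2):
    """Count classification transitions, e.g. 'wl->tp', from response S1 to S2."""
    transitions = Counter()
    for mention in set(c1) | set(c2):
        before, after = c1.get(mention), c2.get(mention)
        if before and after and before != after:
            transitions[f"{before.lower()}->{after.lower()}"] += 1
    return transitions

def correction_ratio(transitions):
    """Share of all transitions that are corrections rather than (new) errors."""
    total = sum(transitions.values())
    ups = sum(count for trans, count in transitions.items() if trans in CORRECTIONS)
    return ups / total if total else 0.0
```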

We apply this comparison to the best and worst performing systems (HOTCoref vs. Stanford) in table 3.5 and to the best and second best performing systems (Berkeley vs. HOTCoref) in table 3.6, respectively.10 The transition column (Trans.) indicates the different types of mention classification transitions, e.g. how many mentions that are classified as wrong linkage in the Stanford response become true positives in the HOTCoref response (wl→tp) etc. The transitions are divided into corrections (↑) and errors (↓). Note that not all transitions in the error category render correct decisions incorrect. Some transitions turn one error into another, e.g. fn→wl. The last column (% Trans.) shows the percentage of a transition given all transitions for pronoun and noun mentions, respectively.

10Note that the overall counts of pronouns and nominal mentions are not the same in the two tables because the mention pool {K ∪ S1 ∪ S2} is not the same for the two comparisons, based on the different spurious system mentions.

Table 3.5: Mention classification transitions when comparing Stanford → HOTCoref.

  Pronouns: 9158 mentions; changed: 2925 (31.94%)
       Trans.    Count   % Trans.
    ↑  wl→tp      736    25.16%
    ↑  fp→tn      726    24.82%
    ↑  fn→tp      339    11.59%
       % of changes: 61.57%
    ↓  tp→fn      375    12.82%
    ↓  tp→wl      324    11.08%
    ↓  wl→fn      162     5.54%
    ↓  tn→fp      148     5.06%
    ↓  fn→wl      115     3.93%
       % of changes: 38.43%
  Nominal mentions: 9203 mentions; changed: 3979 (43.24%)
    ↑  fp→tn     1452    36.49%
    ↑  fn→tp      789    19.83%
    ↑  wl→tp      228     5.73%
       % of changes: 62.05%
    ↓  tn→fp      561    14.10%
    ↓  tp→fn      389     9.78%
    ↓  wl→fn      246     6.18%
    ↓  fn→wl      209     5.25%
    ↓  tp→wl      105     2.64%
       % of changes: 37.95%

Table 3.6: Mention classification transitions when comparing Berkeley → HOTCoref.

  Pronouns: 8842 mentions; changed: 1984 (22.44%)
       Trans.    Count   % Trans.
    ↑  fp→tn      410    20.67%
    ↑  wl→tp      355    17.89%
    ↑  fn→tp      210    10.58%
       % of changes: 49.14%
    ↓  tp→wl      360    18.15%
    ↓  tp→fn      349    17.59%
    ↓  tn→fp      114     5.75%
    ↓  wl→fn      110     5.54%
    ↓  fn→wl       76     3.83%
       % of changes: 50.86%
  Nominal mentions: 8935 mentions; changed: 3021 (33.81%)
    ↑  fp→tn     1254    41.51%
    ↑  fn→tp      372    12.31%
    ↑  wl→tp      151     5.00%
       % of changes: 58.82%
    ↓  tp→fn      434    14.37%
    ↓  tn→fp      412    13.64%
    ↓  wl→fn      140     4.63%
    ↓  tp→wl      136     4.50%
    ↓  fn→wl      122     4.04%
       % of changes: 41.18%

Table 3.5 shows that the HOTCoref response introduces more corrections than errors for pronouns (↑ 61.57% of changes; ↓ 38.43% of changes) and for nominal mentions (↑ 62.05% of changes; ↓ 37.95% of changes) given the Stanford response. That is, the ratio of introduced corrections to errors is roughly the same for nominal mentions and pronouns. However, the distributions of transitions differ between nominal mentions and pronouns. The majority of transitions that HOTCoref introduces w.r.t. pronouns turns wrongly linked pronouns into true positives (wl→tp; 25.16%), closely followed by turning false positives into true negatives (fp→tn; 24.82%). This indicates that HOTCoref performs better both in resolving pronouns to a correct antecedent and in determining the anaphoricity status of pronouns (which is especially essential for the pronoun it). Regarding the nominal mentions, we observe that HOTCoref corrects many of the false positive mentions (fp→tn; 36.49%) that Stanford produces. To a smaller degree, it is able to resolve nouns that Stanford does not recognize to be coreferent (fn→tp; 19.83%).

In table 3.6, which compares the two best performing systems, we observe that the HOTCoref response corrects about the same amount of pronoun resolutions of the Berkeley system as it renders incorrect (↑ 49.14%; ↓ 50.86%). This indicates that HOTCoref does not outperform the Berkeley system by a wide margin w.r.t. pronouns. To substantiate this observation, we score the performance of the systems in pronoun resolution using the same criterion as for estimating their differences, i.e. if the response chain of a pronoun contains an antecedent present in the key chain, it is counted as correct. Table 3.7 gives the results and the counts of the mention classifications.

Table 3.7: Pronoun resolution performance of state-of-the-art systems for English.

  Personal pronouns (PRP; 6166 mentions)
              Rec.    Prec.   F1      Acc.    TP     WL     FN    FP
  STAN       74.02   68.98   71.41   81.18   4564   1058   544   994
  IMS        77.07   76.61   76.84   86.86   4752    719   695   732
  BERK       79.52   76.24   77.84   86.20   4903    785   478   743
  HOTCoref   77.70   79.96   78.81   86.64   4791    739   636   462

  Possessive pronouns (PRP$; 1721 mentions)
              Rec.    Prec.   F1      Acc.    TP     WL     FN    FP
  STAN       72.23   71.77   72.00   77.54   1243    360   118   129
  IMS        79.78   79.27   79.53   86.73   1373    210   138   149
  BERK       82.74   82.55   82.65   87.52   1424    203    94    98
  HOTCoref   80.88   82.12   81.50   86.35   1392    220   109    83

We see that, excluding the Stanford system, performance does not vary strongly overall. The difference between HOTCoref and the Berkeley system mainly stems from the comparably low false positive counts of HOTCoref (PRP 462; PRP$ 83), which give it high Precision. On the other hand, the Berkeley system has the highest Recall for both pronoun types because of its low false negative and high true positive counts. We also see that the two systems do not vary strongly regarding the wrong linkage counts, which indicates that they perform similarly well when identifying antecedents of anaphoric pronouns. This is substantiated by Accuracy (Acc.), which is calculated as TP / (TP + WL), i.e. where we cancel out the anaphoricity detection problem to determine the resolution performance on gold pronouns.11 We see that the Accuracy of the top three systems varies very little for both pronoun types, which further indicates that the differences in F-score stem from anaphoricity detection to a substantial margin.

Going back to the comparison of HOTCoref and Berkeley in table 3.6, we see that the main difference between the two systems regarding nominal mentions again stems from the anaphoricity identification performance. That is, the difference is caused by the fp→tn transition to a large degree (41.51% of all changes regarding the nominal mentions).

11Section 3.4.3 on pronoun resolution evaluation will expand on this metric in more detail.

In conclusion, our comparisons indicate that the best performing system, HOTCoref, outperforms the others mainly due to its superior ability to determine the anaphoricity status of mentions, rather than performing particularly better in identifying correct antecedents of gold mentions.

3.2.2 Assessment of feature potential

Apart from using the difference metric to quantify the difference of two coreference systems, as presented in the previous section, the metric can also be applied to quantify the change introduced by a feature when performing the task of feature engineering for a single system.

Common metrics measure the overall improvement or degradation of performance when features are introduced into a feature set, e.g. in a feature ablation experiment. However, the common evaluation process does not quantify the amount of change that the addition of a feature induces in the output. For example, two new features might increase performance by a similar but small margin. However, one feature reverts many correct decisions from the previous feature set and at the same time corrects many faulty previous decisions. The other new feature, by contrast, only affects few previous decisions and introduces only few new correct ones. But as the ratio of correct and wrong decisions stays the same overall, the performance metrics do not reflect the different amounts of change in the system outputs. In other words, one might argue that the common evaluation framework, which consists of observing changes in Recall and Precision, only looks at the proverbial tip of the iceberg when comparing two system outputs.

Obviously, improving performance is the ultimate goal when introducing novel features. But a feature that introduces many new correct decisions while simultaneously introducing many faulty decisions indicates that the feature is worth investigating further, as opposed to a feature that only marginally changes system output or only introduces errors. The difference metric introduced above can directly be applied to a baseline response and a modified feature set response to monitor change.

3.2.3 Comparison of multiple system responses

Using our overlap approach, we are able to compare several system responses and identify easy or hard to resolve gold mentions. That is, we look for gold mentions that are classified as true positives (TP) or wrong linkages (WL) in all responses to identify mentions that are easy or hard to resolve, or mentions with false negative (FN) or false positive (FP) classifications in all responses, which indicates mentions where the anaphoricity detection proves difficult or might even indicate errors in the gold annotation (e.g. if all systems indicate that a mention should be coreferent). Table 3.8 presents a selection of such mentions from the CoNLL 2012 test set.

Table 3.8: Examples of easy and hard to resolve mentions identified by the comparison of multiple system responses.

  Relation      Antecedent - Mention                                                        Mention class
  Easy cases
  Str. match    [the stock reform] ... [this stock reform]                                  True positive
  Str. match    [the accident] ... [the accident]                                           True positive
  Easy PRP      [the friend] first sent me an SMS, Uh-huh. saying [he] would ...            True positive
  Easy PRP      [this emergency repair worker] said that [he] was there ...                 True positive
  Easy PRP$     thanking [citizens] for [their] cooperation ...                             True positive
  Hard cases
                To express [its] determination, [the Chinese securities regulatory
                department] ...                                                             False negative
  Cataphora     Thank [you] [everyone] ...                                                  False negative
  Nominal       [Focus Today] ... [our program] ...                                         False negative
  Nominal       [a road cave-in accident that happened in Beijing over the holiday] ...
                [the road caving in] ...                                                    False negative
  Date          [January 3] ... [the day of the accident] ...                               False negative
  Date          [the day of the accident] ... [yesterday] ...                               False negative
  Count         [the two honorable guests] ... [both of you] ...                            False negative
  Count         [both of you] ... [the two of you] ...                                      False negative
  Dir. speech   [Yang Yang, a host of Beijing Traffic Radio Station] ... [you] ...          Wrong linkage
  Hard PRP      [an SMS like this one] ... [it] did not give people ...                     Wrong linkage
  Orth.         [Chaoyang Road] ... [this road] ...                                         Wrong linkage
  Discourse     [the neighborhood] ... [this place] ...                                     Wrong linkage

While the examples of mentions classified as true positive (TP) and false negative (FN) are self-explanatory (TP means all systems resolve the mentions correctly; FN indicates that none of the systems resolves the mentions), the selected WL mentions at the bottom of the table need elaboration. In the direct speech example (Dir. speech), all systems link you to another preceding you pronoun and fail to catch the switch of addressee. For the it pronoun in the difficult case (Hard PRP), the systems fail to correctly determine the mention boundaries of the antecedent. HOTCoref and Berkeley chose [this one] as the antecedent span, while Stanford and IMS mark [an SMS] as antecedent. In the wrong linkages related to the Chaoyang Road mention, all systems link this road to an NP further back which string matches perfectly, overlooking the partially matching antecedent which contains an uppercase version of the NP head. The last example stems from a similar problem, i.e. this place is linked to a perfectly matching NP preceding the antecedent.

In contrast to the error analysis approach presented in Kummerfeld and Klein(2013) which categorizes and quantifies errors on the entity level, our approach is able to identify and extract specific mentions that are resolved incorrectly by a set of systems. Given these mentions, new directions in coreference resolution might be identified. Although far from being quantitatively substantiated, table 3.8 indicates that all systems seem to have difficulties with mentions denoting counts of entities, and date or time related mentions. We leave such systematic explorations to future work.

In the comparative system output analysis, we are able to calculate an upper bound for Recall under the assumption of perfect system combination. Given that the differences between systems proved to be large, we would hope that combining their outputs increases performance compared to the individual responses. To do so, we assume an oracle that picks, for each mention in the key, among the four system responses one that resolves it correctly. That is, we count each gold mention as a true positive if at least one system has resolved it correctly. Again using the antecedent criterion above (i.e. key and response chain for a mention need to share at least one antecedent), we calculate Recall for the best performing system HOTCoref by dividing the number of true positives by the number of all key mentions, which gives us 10'620/15'232, i.e. 69.72%. If we use the oracle approach, we reach a Recall of 12'540/15'232, i.e. 82.33%. The large difference of 12.61 percentage points i) suggests that system output combination seems to be a fruitful direction to pursue in future work, and ii) is another indicator of how differently each mention is processed by the systems, and that their output is complementary to a certain degree.
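A sketch of the oracle combination: given, for each system, the set of key mentions it resolves correctly under the shared-antecedent criterion, the oracle Recall is simply the coverage of their union. The function below is an illustration of that calculation, not part of our evaluation code.

```python
def oracle_recall(correct_per_system, n_key_mentions):
    """Upper-bound Recall of an oracle that picks, per key mention, any system
    that resolved it correctly (shared-antecedent criterion)."""
    resolved_by_any = set().union(*correct_per_system)
    return len(resolved_by_any) / n_key_mentions

# With the counts above: 10620 / 15232 ≈ 69.72% for HOTCoref alone,
# and 12540 / 15232 ≈ 82.33% for the oracle over all four responses.
```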

3.3 Error analysis in coreference resolution

Another view on evaluation of coreference resolution is error analysis. Instead of rewarding correctly resolved mentions, error analysis focuses on the errors a system makes and tries to group and quantify them. As Kummerfeld and Klein(2013) noted, most papers on coreference resolution that performed error analysis did so by looking at erroneous pair-wise decisions and by discussing a handful of examples. Here, we overview error classification frameworks that feature a systematic approach and discuss their implications for higher-level applications.

Uryupina(2007, 2008) conducted error analysis on a single system and proposed three main error classes:

1. Gold standard annotation errors

2. Errors propagated from preprocessing

3. Markable linking errors

Errors were classified manually in her approach and then quantified to provide an overview of the error type distribution for her system. This analysis helps to identify problems in the gold standard annotation, in the preprocessing tools, and in the mention resolution strategy of the coreference system at hand. However, due to the amount of manual labour involved, it is a rather impractical approach for comparative error analysis on a larger scale. Also, several of the issues addressed in this approach can (by now) be analyzed by other means. For evaluating the impact of the preprocessing tools, gold standard annotations can be used to assume perfect preprocessing. System performance degradation can then be directly quantified when using real preprocessing, as e.g. in Klenner et al.(2010). This approach was also chosen in the CoNLL shared tasks (Pradhan et al., 2011, 2012) (and earlier by SemEval (Recasens et al., 2010)) to quantify the impact of imperfect preprocessing information. Automatically identifying errors in the gold standard annotation is trickier. In section 3.2.3 we outlined an approach to identify them using an ensemble of coreference resolution systems. In section 3.4.3, we will show how evaluating local decisions can avoid annotation problems by only considering cases where the system is actually able to make the correct decision. Finally, for analyzing problems in the mention resolution strategy, we have proposed the ARCS evaluation framework in section 3.2, which, although conceptualized as scores, can be used for error analysis and tracking.

Holen(2013) introduced an error analysis inventory for evaluating system outputs on the entity level (i.e. coreference chains). For a small number of documents and sentences, she analyzed how often 1) key and response entities match perfectly, 2) response entities are partial or response mentions have NP boundaries that are too short, 3) response entities merge multiple key entities into one (i.e. conflate two coreference chains), and 4) no response entity is given or the entity is missing an informative mention (where e.g. a full NP containing a proper name is more informative than a pronoun). The downside of this approach is, again, that it requires manual labor by experts.

Kummerfeld and Klein(2013) addressed this issue and presented an automated framework for analyzing and comparing system responses on a large scale. They adapted the error inventory put forward by Holen(2013). More specifically, they investigated and quantified the kind of transformations (split entity; spurious mentions etc.) which the entities in a system response need to undergo in order to produce the key entities.

Finally, the work in Martschat and Strube(2014, 2015) presented coreference error analysis based on a tree-structure view on coreference sets. Entities are represented as spanning trees (i.e. a graph) and Recall and Precision errors are identified by missing and spurious links in the key and response graphs. A particular strength of this error analysis framework is that it is able to categorize Recall errors of noun resolution into e.g. name-noun or name-name pairs of antecedents and anaphors and compare system responses based on the distributions of errors regarding these different pairs. Also, the approach quantifies the number of links that one system can find in comparison to another system, and it counts e.g. how many Precision errors are shared among the systems, which paves the way to identifying mentions that prove to be especially difficult to resolve and therefore deserve special attention.

Considering the publications cited above, one might argue that systematic error analysis for coreference resolution has gained popularity in recent years. We argue that this is an important development, since error analysis complements the system rankings produced by shared tasks. Analysis of errors enables a more detailed view on the system outputs and is able to unveil strengths and weaknesses of particular systems and to identify common errors, which in turn leads to new research directions.

3.4 Evaluation of pronoun resolution

In the previous sections of this chapter, we have discussed evaluation of coreference sets and coreference partitions. In this section, we focus on how pronoun resolution is evaluated, both within coreference chains and as a separate task in a pair-wise fashion.

3.4.1 Ratio-based evaluation

In the pioneering era of research in automated approaches to pronoun resolution, coreference set-level information was non-existent in corpora as an annotation layer. Corpora and data sets first had to be created for the evaluation of pronoun resolution approaches. These annotations generally provided pairs that consisted of a pronoun and its antecedent within a local context. Evaluation then quantified how often the antecedent of such a pronoun was identified correctly by a system. Hobbs(1978), who presented one of the first automatable algorithms for third person pronoun resolution, calculated the success of his approach by the percentage of correctly resolved pronouns in the following way:

    Success = |successfully resolved pronouns| / |all attempted resolutions|

Lappin and Leass(1994), another seminal approach to third person pronoun resolution, adapted this measure, and many others followed. The downside of this ratio is that it does not account for pronouns that the system does not attempt to resolve. Furthermore, pronouns that the system resolves but which are not in the gold standard are discarded.

3.4.2 Introduction of Recall and Precision

An important extension to pronoun resolution evaluation was the introduction of Recall and Precision (and their harmonic mean, the F-score). Recall analyses a system output from the gold standard perspective and quantifies how many of the manually annotated pronouns are resolved correctly. Precision takes the view of the system output and counts how many of all the resolved pronouns are correct. This division introduced the problem of anaphoricity detection (i.e. whether a pronoun should be resolved at all), since typically not all pronouns resolved by a system are in the gold standard. Vice versa, not all manually annotated pronouns are resolved by a system.

Aone and Bennett(1995) defined Recall and Precision in the following manner:

    Recall = |correct resolutions| / |pronouns identified by the system|,    Precision = |correct resolutions| / |attempted resolutions|

which implies that their system did not attempt to resolve all pronouns that it identified. The denominator in the Recall equation thus does not necessarily extend over all manually annotated pronouns in the gold standard. Baldwin(1997) used the same Precision measure but extended Recall over all gold pronouns:

    Recall = |correctly resolved anaphors| / |anaphors in the gold standard|

Mitkov(2001) critically reflected on the evaluation practices in pronoun resolution. He proposed a more fine-grained analysis of system behaviour. His main proposition was to divide evaluation into scoring anaphora resolution algorithms and scoring anaphora resolution systems. When evaluating the former, evaluation should be freed from noise such as preprocessing errors. The task then is to determine which resolution algorithm best solves the problem of pronoun resolution in a theoretical, idealistic setting. The latter aims at determining which of a set of anaphora resolution systems performs best in a real-world setting, i.e. including all subtasks of pronoun resolution pipelines, such as PoS tagging and syntactic parsing etc. More recent shared tasks on coreference resolution (Recasens et al., 2010, Pradhan et al., 2011, 2012) tried to address the comparison of resolution strategies and full systems by running a gold mention and a gold boundary setting where systems only resolve the gold mentions provided by the gold standard or are given perfect markable boundaries, respectively.

Mitkov(2001) argued for the use of success rate which is calculated by:

    Success rate = |correctly resolved anaphors| / |all anaphors in the gold standard|

i.e. it is identical to the Recall definition in Baldwin(1997). Mitkov(2001) also introduced the critical success rate, which only evaluates instances of anaphors that are left with multiple antecedent candidates after gender-based filtering of all potential candidates. That is, resolution strategies were only rewarded for making the right choice when they actually had to make one.

Additionally, Mitkov made a strong point for evaluating resolution strategies against simple baselines, e.g. choosing the most recent candidate or the most recent subject candidate as the antecedent, to assess the improvements made by a more elaborate resolution approach.

Another important criticism of pronoun resolution evaluation, with which our view based on downstream applications neatly aligns, was presented by Stuckardt(2001) and later Müller(2008). Their main claim was that pronouns that are only resolved to other pronouns, i.e. do not (transitively) link to a correct nominal antecedent, should not be rewarded in evaluation, because they provide no functional benefit for higher-level applications. For example, assume that two systems produce the following responses for the given key:

• Key: [Noun_x - Pronoun_1 - Pronoun_2 - Pronoun_3]

• Sys1: [Pronoun_1 - Pronoun_2 - Pronoun_3] (i.e. omitting the link to the noun)

• Sys2: [Noun_y - Pronoun_1 - Pronoun_2 - Pronoun_3] (i.e. linking the pronouns to the wrong noun)

Stuckardt and Müller argued that neither system should be rewarded for correctly resolving Pronouns 2 and 3 to Pronoun 1, because they do not transitively link to a correct nominal antecedent. In Tuggener(2014), we argued that these constraints are not necessarily applicable to all kinds of downstream applications, but that the required type of antecedent depends on the type of the intended downstream application.

Müller(2008) then defined Recall and Precision identically to Aone and Bennett(1995), with the difference that “correctly resolved” and “resolvable” imply that pronouns directly or transitively link to non-pronominal antecedents. We proposed the ARCS inferred antecedent metric in order to achieve what Müller encompassed and extended it with the wrong linkage (WL) class, which enables a more fine-grained analysis of the false positive class. Stuckardt introduced an even more fine-grained distinction of errors w.r.t. a pair of a pronoun P and its non-pronominal antecedent A, given in table 3.9, accompanied by the ARCS interpretation of the pair classes.

Table 3.9: Pairs of pronouns (P) and antecedents (A) and their classification by Stuckardt(2001).

  Class   Description                                                            ARCS
  o++     P and A belong to the same key equivalence class                       TP
  o+−     P and A belong to different key equivalence classes                    WL
  o+?     P, but not A, corresponds to a key occurrence                          WL
  o+      P corresponds to a key occurrence, no anchor A determined              FN
  o?+     P does not correspond to a key occurrence                              FP
  o?      P does not correspond to a key occurrence, no anchor A determined      TN

Stuckardt then calculated Precision and Recall as:

    Precision = |o++| / (|o++| + |o+−| + |o+?|),    Recall = |o++| / (|o++| + |o+−| + |o+?| + |o+|)

In comparison, in the case of a resolved gold mention pronoun which is linked to an incorrect antecedent (WL), we do not distinguish between antecedents which are gold mentions and those which are spurious system mentions (o+− vs. o+?), because we do not see how this distinction benefits higher-level applications. Other than that, the Recall calculation for the ARCS inferred antecedents metric, which also requires non-pronominal antecedents for pronouns, is close to Stuckardt's definition12. However, for Precision, Stuckardt does not include the o?+ cases (FP), i.e. spurious system pronouns. We argue that systems should be penalized for returning nominal antecedents for spurious pronouns, and therefore ARCS includes the FP class in the Precision denominator, which is equivalent to including o?+ in the Precision denominator in Stuckardt's definition.

12However, the ARCS metrics feature additional rules for classifying gold and system mentions w.r.t. whether they are the first mentions in a coreference chain (cf. algorithm 2 and Tuggener(2014)).

3.4.3 ARCS extensions for pronoun resolution evaluation

When we introduced the ARCS framework (Tuggener, 2014), we showed how it can be used to measure performance on any mention level. Therefore, evaluating pronoun resolution with ARCS is straightforward: the mention level accessed is the PoS level, and we simply look at the performance of the respective PoS tags to assess the pronoun resolution performance of a coreference resolution system, as exemplified in table 3.7. We can also evaluate on the lemmas of the given PoS tags to achieve an even more fine-grained evaluation, as presented in Tuggener and Klenner(2014). Here, we present two additions to the ARCS evaluation framework which are applicable to pronoun resolution and which incorporate ideas presented in the related work that we discussed in the previous section.

To achieve what Mitkov(2001) deemed the comparison of resolution algorithms, we calculate a score with the mention classification inventory defined in the ARCS framework. We define ARCS Accuracy as the ratio of correctly resolved gold mentions divided by all resolved gold mentions, i.e.:

    ARCS accuracy = TP / (TP + WL)    (3.2)

Doing so, we cancel out the anaphoricity or mention detection problem and only evaluate on the subset of gold mentions that a system actually resolved. However, systems cannot be directly compared to each other using this metric, because the sum of true positives and wrong linkages is usually not the same for every system. Therefore, evaluation does not necessarily extend over the same set of mentions for all systems and does not allow a direct comparison. The metric simply indicates how well a single system performs considering the gold mentions it resolves.

The Accuracy metric is still prone to involve noise from preprocessing, because even if the system has correctly determined a pronoun to be anaphoric, it might fail to produce the correct antecedent NP as a candidate due to issues in markable extraction or preprocessing. Therefore, when evaluating classifiers that select an antecedent among a set of candidates, we propose to only evaluate cases where the classifier has access to the correct antecedent. Otherwise, tracking the performance changes of a classifier, e.g. during feature engineering, is obscured by a constant count of cases where it does not have the possibility to make the right choice. In other words, the sole purpose of an antecedent selection strategy is to pick the correct antecedent among a set of candidates. The selection strategy (i.e. a classifier) will never be able to correct errors in preprocessing. Therefore, it makes little sense to punish it for these errors in an evaluation that specifically aims at measuring antecedent classification. To address this, we propose, in Mitkov's spirit, the following simple ratio to assess the performance of a selectional strategy:

    ARCS success rate = |correctly resolved anaphors| / |resolvable anaphors|
                      = TP / (TP + WL), where correct antecedent ∈ candidates    (3.3)

“Resolvable” in our case means that the anaphor is annotated as anaphoric in the gold standard and that the classifier has the correct antecedent among the candidates. Clearly, one drawback of this measure is that it cannot be applied to system response files, since TP and WL need to be counted only where correct antecedent ∈ candidates, and we cannot tell from the final system response to which mentions this condition applied. Also, like Accuracy, the metric cannot be used to compare the performance of different systems, for the reasons given above. Despite these drawbacks, this measure is an effective and accurate method for evaluating a resolution strategy's performance on a clean basis, i.e. without noise from preprocessing. We make use of this metric in section 5.4.2.
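A minimal sketch of how the success rate (3.3) could be computed during classifier development; the instance format and the select_antecedent strategy are hypothetical stand-ins, not the interface of our system.

```python
def arcs_success_rate(instances, select_antecedent):
    """ARCS success rate (3.3) of an antecedent selection strategy.

    instances: (candidates, gold_antecedents) pairs for gold anaphoric pronouns,
               where candidates is the list the classifier actually sees.
    select_antecedent: the strategy under evaluation; returns one candidate.
    """
    tp = wl = 0
    for candidates, gold_antecedents in instances:
        if not gold_antecedents & set(candidates):
            continue        # unresolvable: preprocessing lost the correct antecedent
        if select_antecedent(candidates) in gold_antecedents:
            tp += 1
        else:
            wl += 1
    return tp / (tp + wl) if (tp + wl) else 0.0
```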

3.5 Chapter summary

This chapter discussed the empirical evaluation of coreference and pronoun resolution. We adopted the view of downstream applications which benefit from coreference resolution as a preprocessing component. From this view, we argued that the common coreference evaluation is suboptimal in terms of interpretability, informativeness, and differentiability. We presented an alternative evaluation framework, ARCS, to address these issues. The framework adapts to the potential requirements of higher-level applications by implementing different requirements regarding the definition of the correctness of the antecedent. We additionally introduced two extensions to ARCS to evaluate pronoun resolution in settings freed from preprocessing noise. With the extensions, ARCS presents a framework that enables a thorough evaluation of systems and which we will use for evaluating our approaches to German pronoun resolution in chapter 5.

We showed that our framework is suitable for quantifying the differences of system outputs and compared four systems for coreference resolution for English. The F-score differences provided by the evaluation based on the common performance metrics suggested that the analysed system outputs do not differ strongly. By contrast, we found that the analysed systems processed up to almost 40% of the mentions differently. Furthermore, we demonstrated how the framework is applied to investigate what drives the difference between two systems, which, in turn, provides insight into the reasons behind the different performances in F-score in the common evaluation. An oracle-based combination of the system responses showed that they are often complementary, indicating that specific weaknesses of one system can be overcome by the specific strengths of another. It remains a task for future work to assess whether such a system combination can be successfully realized in an automated setting.

Chapter 4

Related work on coreference and pronoun resolution for German

In this chapter, we chronologically review related work on coreference and pronoun resolution for German. We examine related work along three axes: i) the applied discourse model1, ii) the filters that license potentially coreferring markables, and iii) the feature set used to classify these instances of potentially coreferring markables. We also keep track of what the approaches report on the specifics of German w.r.t. the adaption of methods primarily developed for English coreference and pronoun resolution. As in our discussion of discourse processing models in chapter 2, it is difficult to identify a system that performs best overall, since the approaches apply different evaluation protocols and use different corpora as test sets. We thus report F-scores and compare systems where possible.

4.1 Hartrumpf(2001)’s CORUDIS

Hartrumpf(2001) presented a hybrid approach for German coreference resolution called CORUDIS (COreference RUles with DIsambiguation Statistics) which stands out regarding its architecture. A set of manually crafted rules licensed coreference links between markables. The rules captured e.g. possible coreference of a pronoun and a noun antecedent given morphological (in)compatibility, or potential coreference between two noun markables based on their semantic class properties.

Foreseeing the entity-mention model later employed for English in Luo et al.(2004)2, the approach incrementally built partial coreference partitions. Given a markable, a separate partition was created for each rule-licensed antecedent candidate.

1The strategy used to link the markables; cf. chapter 2.
2Cf. section 2.2.3

The creation of partitions was prevented when distance constraints were not met (i.e. an antecedent candidate was too far away from the markable at hand) and whenever coreference sets emerged that featured mentions with incompatible semantic class properties. To further narrow down the emerging partitions, beam search was applied to only keep those partitions with promising scores. The score of a new possible partition was calculated by summing the pair-wise probabilities of linking the markable with each of the antecedent mentions in a partially established coreference chain. Also, a weight for the coreference rule that licensed the formation of the partition was included. When reaching the end of a document, the highest weighted partition served as the system response.
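The following is a schematic reconstruction of one incremental step of such a search, intended only to illustrate the mechanics described above; the scoring functions link_prob and rule_weight are hypothetical placeholders, and the actual CORUDIS rules and constraints are not modelled.

```python
def extend_partitions(partitions, markable, licensed_candidates,
                      link_prob, rule_weight, beam_size=10):
    """One incremental step: extend each kept partition with the new markable.

    partitions: list of (score, partition) pairs; a partition maps each markable
                seen so far to a chain id.
    licensed_candidates: (antecedent, rule) pairs licensed by the coreference rules
                (candidates violating distance or semantic class constraints are
                assumed to have been filtered out beforehand).
    """
    extended = []
    for score, partition in partitions:
        # The markable may start its own chain ...
        extended.append((score, {**partition, markable: markable}))
        # ... or join the chain of any licensed antecedent candidate.
        for antecedent, rule in licensed_candidates:
            chain_id = partition[antecedent]
            chain = [m for m, c in partition.items() if c == chain_id]
            # Score: summed pairwise link probabilities plus the rule weight.
            gain = sum(link_prob(markable, m) for m in chain) + rule_weight(rule)
            extended.append((score + gain, {**partition, markable: chain_id}))
    # Beam search: keep only the most promising partitions.
    extended.sort(key=lambda pair: pair[0], reverse=True)
    return extended[:beam_size]
```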

Besides common features such as morphosyntactic properties (gender, number, part-of-speech), Hartrumpf queried a set of features based on a syntactic-semantic parser. The parser incrementally processed words and their syntactic-semantic representation in an HPSG-like fashion. Features for coreference captured different semantic aspects of antecedent candidates and anaphors.

Hartrumpf reported impressive results of 66% F-score in a MUC-like evaluation setting on a corpus derived from the German newspaper ‘Süddeutsche Tageszeitung’. However, he did not give performance results for pronouns.

Hartrumpf’s approach featured an entity-mention-like perspective on coreference and a global optimization function, which was arguably ahead of its time.

4.2 Strube et al.(2002)’s adaption of the popular approach by Soon et al.(2001)

Strube et al.(2002) investigated the adaption of the popular mention-pair model for coreference resolution (Soon et al., 2001)3 to German. Strube et al. used features commonly applied in coreference resolution for English (Cardie et al., 1999, Soon et al., 2001, inter alia) and manually added semantic class labels (human, concrete objects, abstract objects) to NPs in their corpus (the Heidelberg Text Corpus4), which were used as features. Additionally, they introduced the minimum edit distance (MED) feature for string matching nominal markables. Similar to the Levenshtein distance, MED quantifies the string manipulations (substitutions, insertions, deletions) required to transform one string into another and thus provides a string similarity metric.

3Cf. section 2.1
4Cf. section 5.1.1

Strube et al. used MED to measure string similarity of potentially coreferring nominal markables, which proved to be helpful in increasing Recall.5
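MED is essentially the Levenshtein edit distance applied to the markable strings; a minimal sketch (our reconstruction, not Strube et al.'s code) is:

```python
def minimum_edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Inflected German NP pairs that exact string matching misses stay close under MED:
print(minimum_edit_distance("der Bericht", "den Bericht"))  # -> 1
```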

  Document level features
   1. doc id             document number (1 ... 250)
  NP-level features
   2. ante gram func     grammatical function of antecedent (subject, object, other)
   3. ante npform        form of antecedent (definite NP, indefinite NP, personal pronoun,
                         demonstrative pronoun, possessive pronoun, proper name)
   4. ante agree         agreement in person, gender, number
   5. ante semanticclass semantic class of antecedent (human, concrete object, abstract object)
   6. ana gram func      grammatical function of anaphor (subject, object, other)
   7. ana npform         form of anaphor (definite NP, indefinite NP, personal pronoun,
                         demonstrative pronoun, possessive pronoun, proper name)
   8. ana agree          agreement in person, gender, number
   9. ana semanticclass  semantic class of anaphor (human, concrete object, abstract object)
  Coreference-level features
  10. wdist              distance between anaphor and antecedent in words (1 ... n)
  11. ddist              distance between anaphor and antecedent in sentences (0, 1, >1)
  12. mdist              distance between anaphor and antecedent in markables (1 ... n)
  13. syn par            anaphor and antecedent have the same grammatical function (yes, no)
  14. string ident       anaphor and antecedent consist of identical strings (yes, no)
  15. substring match    one string contains the other (yes, no)
  New coreference-level features
  16. ante med           minimum edit distance to anaphor
  17. ana med            minimum edit distance to antecedent

Table 4.1: Feature set of Strube et al.(2002)'s approach to German coreference resolution in the HTC corpus.

Table 4.1 presents Strube et al.'s final feature set. We list these features explicitly because they provide a good overview of the kind of features that coreference resolution systems still use at their core to date. NP-level features lists features that describe either properties of the antecedent or the anaphor. The NP-level features capture syntactic salience (2, 6), surface forms (3, 7), morphosyntactic compatibility (8), and semantic class membership (5, 9). (New) coreference-level features lists features that concern the relation of antecedent and anaphor. They quantify the distance between antecedent and anaphor (10-12), capture parallelism of the grammatical roles (13), and measure surface string similarity (14-17).

5We assume that the features were especially helpful because Strube et al.'s approach did not use lemmatization of the NPs before string matching, but used lexemes. We infer this by looking at Strube et al.'s examples in their table 5. Since German features inflection, exact string matching of lexemes misses many coreferent definite NP pairs because of mismatches caused by inflection. The MED feature helps alleviate such mismatches by allowing a certain degree of fuzzy matching, which approximates string matching based on lemmas.
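To make the feature inventory of table 4.1 concrete, the sketch below assembles a comparable feature dictionary for one antecedent-anaphor pair. The Mention attributes are hypothetical, and the set is deliberately not identical to Strube et al.'s (the document-level and MED features, for instance, are omitted; the MED sketch above could be added analogously).

```python
from dataclasses import dataclass

@dataclass
class Mention:
    """Hypothetical markable representation; attribute names are illustrative."""
    text: str
    gram_func: str        # 'subject', 'object', 'other'
    np_form: str          # e.g. 'defNP', 'indefNP', 'PPER', 'PDS', 'PPOS', 'NE'
    agreement: tuple      # (person, gender, number)
    sem_class: str        # 'human', 'concrete object', 'abstract object'
    word_index: int
    sent_index: int

def pair_features(ante: Mention, ana: Mention) -> dict:
    """Mention-pair features in the spirit of table 4.1."""
    return {
        "ante_gram_func": ante.gram_func,   "ana_gram_func": ana.gram_func,
        "ante_npform": ante.np_form,        "ana_npform": ana.np_form,
        "ante_semclass": ante.sem_class,    "ana_semclass": ana.sem_class,
        "agree": ante.agreement == ana.agreement,
        "wdist": ana.word_index - ante.word_index,
        "ddist": min(ana.sent_index - ante.sent_index, 2),   # 0, 1, >1
        "syn_par": ante.gram_func == ana.gram_func,
        "string_ident": ante.text == ana.text,
        "substring_match": ante.text in ana.text or ana.text in ante.text,
    }
```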

The approach of Strube et al. featured a battery of filters that prevented the creation of pair instances if

• the anaphor was an indefinite NP

• antecedent and anaphor shared the same syntactic head

• antecedent and anaphor had different values in their semantic class attributes (based on the manually added class labels; only applicable if antecedent and anaphor were not pronouns)

• either antecedent or anaphor was not a third person NP or pronoun

• antecedent and anaphor did not match in their morphological properties (only applicable if the anaphor was a pronoun)

The last filter is of major importance for German pronoun resolution, as the number of candidates is commonly large. However, Strube et al. did not elaborate on how they treated underspecification of German pronouns in the process of matching morphological properties. Also, they did not report how they assessed the morphological properties of possessive pronouns.6 Applying these filters, Strube et al. reported an overall reduction of negative pair instances by 50%.7
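The following sketch illustrates one way such a morphological compatibility filter could treat underspecification; it is an illustration of the issue, not the filter used by Strube et al. or later in this thesis. Morphological values are represented as sets, and an attribute only filters a candidate if both sides have attested values that never overlap.

```python
def morph_compatible(ante_morph, pron_morph):
    """Gender/number/person compatibility check with underspecification.

    Values are sets, e.g. {'masc', 'neut'} for a German possessive like 'sein-',
    which is underspecified between masculine and neuter antecedents.
    Empty or missing sets mean 'unknown' and never filter a candidate.
    """
    for attribute in ("gender", "number", "person"):
        a = ante_morph.get(attribute, set())
        p = pron_morph.get(attribute, set())
        if a and p and not (a & p):
            return False    # attested values on both sides, but no overlap
    return True

# 'seine' (referent: masc/neut, sg) cannot corefer with the feminine 'Frau',
# even though corpora often annotate the possessive with its head's features.
print(morph_compatible({"gender": {"fem"}, "number": {"sg"}},
                       {"gender": {"masc", "neut"}, "number": {"sg"}}))  # -> False
```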

Strube et al. reported performance for individual types of anaphors, i.e. definite NPs, named entities, and demonstrative, possessive, and personal pronouns. However, the paper does not explicitly state what evaluation approach was used. We assume that the authors reported the evaluation output of their decision tree classifier which classified positive and negative instances of potential antecedent-anaphor pairs. Relevant for coreference is the performance on the positive instances, since they determine which coreference relations the system establishes. Strube et al.’s results (on what we presume to be the positive instances) are given in table 4.2.

The table shows that the approach achieved solid performance, with demonstrative pronouns (PDS) being the exception. Demonstratives are problematic to resolve, since it is difficult to determine their anaphoricity status. That is, demonstratives can refer to whole clauses or verbs, which are not annotated in the corpora. Resolving them to nominal antecedents thus often yields incorrect resolutions.

6In most corpora, possessive pronouns are marked with the morphological properties of their head. For example, in the NP “seine Frau (his wife)”, “seine” will be annotated with the number and gender of the head “Frau”, i.e. singular and feminine. However, “seine” cannot refer to feminine antecedents, but only to masculine or neuter ones. Therefore, one needs to manually address morphological properties of possessive pronouns and account for the underspecification in the matching process.
7Cf. section 2.1.1.2 for a discussion of the problem of imbalanced training data in the mention-pair model.

Prec. Rec. F1 defNP 69.26% 22.47% 33.94% NE 90.77% 65.68% 76.22% PDS 25.00% 11.11% 15.38% PPER 85.81% 77.78% 81.60% PPOS 82.11% 87.31% 84.63% all 84.96% 56.65% 67.98%

Table 4.2: Anaphora type specific evaluation of Strube et al.(2002)’s approach to German coreference resolution in the HTC corpus. performance for pronouns is higher than that reported in Hinrichs et al.(2005) (discussed below), although we cannot directly compare the scores since different corpora were used. Furthermore, the F-score is similar to that reported by Hartrumpf(2001). However, Hartrumpf(2001) reported MUC F-scores, i.e. performed evaluation of full coreference chain, while Strube et al. evaluated pair instances and it is not clear how performance on pair instances carries over to coreference chain-level performance (Ng and Cardie, 2002b, e.g.). Also, Hartrumpf and Hinrichs et al. evaluated on newspaper texts, while Strube et al. used texts evolving around historical aspects of the city Heidelberg.

4.3 Schiehlen(2004)’s exploration of algorithms and features for German pronoun resolution

Schiehlen(2004) explored several features from the English pronoun resolution literature for the resolution of German pronouns. He applied the features either as hard constraints for antecedent filtering or as soft constraints for weighting antecedent candidates.

He quantified the impact on the availability of the correct antecedent and the average number of antecedents per pronoun when applying several features as hard constraints. Doing so, he was able to give i) upper bounds for applying different filters and ii) an estimate of the resulting difficulty of resolving the pronouns (indicated by the remaining average number of candidates - the lower, the easier).

He found that removing antecedent candidates based on morphological disagreement cuts the number of candidates by more than half while only marginally lowering the upper bound, which aligns with the findings of Strube et al.(2002)8 and substantiates the necessity of this filter, especially for German pronoun resolution. Schiehlen further examined how other features, such as parallelism of the grammatical roles of antecedent and pronoun, distance, and constraints, affect the upper bounds when used as filters. However, none of them proved to be as efficient as morphological agreement in removing irrelevant candidates while keeping a high upper bound.

8Cf. the previous section 4.2

Schiehlen then compared different rule- and machine learning-based approaches from the English literature to combine the features as weights to rank antecedent candidates. He further evaluated using gold annotation for parsing vs. relying on automatically parsed texts and reported the observed performance drops.

He found that the popular decision tree approach of Soon et al.(2001) fared poorly when applied to German pronoun resolution. He achieved the best results, both for automated and gold parses, using a decision tree variant with an adapted feature set. On a test set derived from the NEGRA corpus (Skut et al., 1997), Schiehlen achieved an overall F-score of 71.10% on gold parses and an F-score of 51.70% on automated parses. It is worth noting that the adaption of Lappin and Leass(1994)’s rule-based approach9 performed second best on the automatically parsed variants, outperforming other machine learning approaches such as the Soon et al.(2001) approach.

The work of Schiehlen provided important insights into filtering techniques and the impact of features for German pronoun resolution. He demonstrated that over 97% of the pronouns have an antecedent within a window of the three preceding sentences. Thus, antecedent candidate generation for pronouns can safely be restricted to this window. He further found that cataphora is a rare phenomenon, i.e. only 1.6% of the pronouns were cataphoric, indicating that cataphora can be discarded without much performance loss. This facilitates the pronoun resolution process by reducing the number of candidates that have to be considered, since only antecedent candidates are accessed and potential postcedents (i.e. NPs following the pronoun) are dismissed.

4.4 RAP-G and the TiMBL variant by Hinrichs et al. (2005)

Lappin and Leass(1994)’s Resolution of Anaphora Procedure (RAP) was one of the first automated approaches to English pronoun resolution that used salience features with manually assigned weights to rank antecedent candidates. Like Schiehlen(2004), Hinrichs et al.(2005) re-implemented this approach for German pronouns, but put effort into adapting it to German pronoun resolution.

Lappin and Leass’s approach mainly relies on a hierarchical ordering of the grammatical roles the antecedent candidates occur with. For example, candidates bearing the role subject are ranked higher than candidates with the role object, etc. Hinrichs et al. revamped the inventory of grammatical roles for German and empirically fine-tuned the assigned weights. Interestingly, Hinrichs et al. found that the subject emphasis worked optimally for German if set more than twice as high as in the original English approach. Also, they reported that lowering the penalty for short distance cataphora improved their results for German. Table 4.3 shows the adapted feature set and weights.

9Cf. the next section.

Feature                             Weight
Subject emphasis                    170
Accusative object emphasis          70
Dative object emphasis              50
Genitive object emphasis            50
Head noun emphasis                  80
Short distance cataphora penalty    -80
Long distance cataphora penalty     -175
Parallelism reward                  35
Current sentence reward             20

Table 4.3: Features and their weights in RAP-G, Hinrichs et al.(2005)’s adaption of Lappin and Leass(1994)’s approach to English third person pronoun resolution.

The salience of an antecedent candidate is calculated by summing the applicable feature weights. The weight is lowered by a salience decay function to reflect the importance of the distance between a pronoun and its antecedent. The salience value sv' of an antecedent candidate given a pronoun is computed from its prior salience sv as sv' = sv / 2^sd, where sd denotes the sentence distance between the candidate and the pronoun.

Much like in the entity-mention model10, the established antecedent-pronoun pair is appended to the discourse entity denoted by the antecedent. This discourse referent has a salience value equal to the sum of the salience values of all its mentions, which reflects that discourse referents occurring multiple times in discourse have a higher salience.
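To make the scoring scheme concrete, the following sketch is our own minimal reconstruction of such a salience computation, not Hinrichs et al.'s implementation; the weights follow table 4.3, and all function and variable names are ours.

```python
# Minimal sketch of RAP-G-style salience scoring (our reconstruction, not the
# original implementation). Weights follow table 4.3; all names are ours.

WEIGHTS = {
    "subject": 170, "accusative_object": 70, "dative_object": 50,
    "genitive_object": 50, "head_noun": 80, "parallelism": 35,
    "current_sentence": 20, "short_cataphora": -80, "long_cataphora": -175,
}

def mention_salience(features, sentence_distance):
    """Sum the applicable feature weights and halve them per sentence of distance."""
    raw = sum(WEIGHTS[f] for f in features)
    return raw / (2 ** sentence_distance)

def entity_salience(mentions, pronoun_sentence):
    """An entity's salience is the sum of the (decayed) salience of its mentions."""
    return sum(
        mention_salience(feats, pronoun_sentence - sent)
        for sent, feats in mentions
    )

# Example: an entity mentioned as subject two sentences ago and as accusative
# object in the current sentence.
mentions = [(3, ["subject", "head_noun"]),
            (5, ["accusative_object", "current_sentence"])]
print(entity_salience(mentions, pronoun_sentence=5))  # (170+80)/4 + (70+20)/1 = 152.5
```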

Hinrichs et al.(2005) evaluated their approach on the 5540 third person pronouns (possessive, reflexive, and personal) in the TüBa-D/Z corpus. A pronoun was deemed correctly resolved when it selected an antecedent that was in the same coreference chain as the pronoun. Furthermore, Hinrichs et al. allowed for partial mention boundary matching as long as the head noun of the NP in the gold standard was contained in the found antecedent. Doing so, they achieved an F-score of 76.56% with almost identical Recall and Precision scores.

Hinrichs et al.(2005) compared their re-implementation of RAP to a variant using a memory-based kNN classifier (TiMBL). First, potential antecedent candidates were filtered based on sentence distance (not more than three sentences away; in the same sentence if the pronoun was a reflexive), binding constraints, and morphological agreement. The features of the antecedent candidates used in RAP-G were mapped to vectors which were labeled as positive or negative instances for the memory-based classifier. This approach yielded an F-score of 70.4%, thus underperforming compared to the rule-based RAP-G.

10Cf. section 2.2.3

Hinrichs et al.(2005, 2007) further explored the TiMBL variant. They focused on the syntax-based features and included a notion of discourse history. They modelled the history of grammatical roles that an antecedent entity had occurred with before appearing in the context of a given pronoun. The idea was to capture the global salience of antecedent candidates w.r.t. the pronoun at hand. One version implemented this idea by encoding how often an antecedent entity had occurred with a grammatical role (i.e. two times as subject and one time as object etc.), the other version captured the distance to the last mention of the entity with a given grammatical role (i.e. occurred as object one sentence before and as subject two sentences before). Hinrichs et al. showed that both these approaches significantly outperformed the variant which did not model the global salience of antecedent candidates. Using ten-fold cross-validation on the TüBa-D/Z, they achieved an F-score of 74.3%, compared to the 70.4% F-score of the version which was uninformed of the discourse history of the antecedent entities. The features based on the discourse history of the antecedent candidates showed that keeping track of earlier mentions of antecedent entities is important for pronoun resolution, which aligns with the findings of Yang et al.(2004b) on English data.11 This provided a significant improvement in performance, but still remained below the 76.56% F-score mark of the RAP-G approach.

Finally, Hinrichs et al.(2005) implemented a heuristic to resolve pronouns for which the TiMBL classifier did not find any positive pairs of antecedent and pronoun, i.e. which the classifier left unresolved. For such unresolved pronouns, Hinrichs et al. selected the closest compatible subject-bearing candidate as antecedent. This improved Recall by almost 10 percentage points at the cost of roughly 4 points in Precision and yielded an F-score of 77%.

Three more notable publications revolved around the RAP-G approach and the hybrid TiMBL variant. Wunsch(2006) investigated the features and their manually set weights in RAP-G. Wunsch manually increased or decreased the weights for each feature and tracked the changes in performance. He found that positional features such as distance play a less important role in German than in English pronoun resolution. He reported that the syntax-based features, especially the subject emphasis, are more important for German pronoun resolution than for the English variant. Wunsch hypothesized that this difference stems from the relatively free word order in German, which English does not exhibit to the same degree.

11Cf. section 2.2.3

Wunsch et al.(2009) investigated the effect of instance sampling for the TiMBL hybrid. As discussed in section 2.1.1.2, the mention-pair model in the spirit of Soon et al.(2001) produces a high number of negative instances due to its pair generation mechanics, and the TiMBL hybrid featured a similar architecture. This yields classifiers with a strong bias towards negative classification, which in turn often leads to none of the candidate pairs being classified as positive, even for clearly anaphoric markables such as third person pronouns like er. Using random sampling of the negative instances, Wunsch et al. lowered the data imbalance and raised the performance of their decision tree-based approach from 56.10% to 61.10% F-score.

Finally, the thesis of Wunsch(2010) wrapped up the experiments revolving around the adaption of RAP to German and the TiMBL variant.

4.5 Klenner and Ailloud(2009)’s model of enforcing global coreference constraints using ILP

As discussed in section 2.1.1, the mention-pair model suffers from a narrow perspective on discourse when processing coreference. That is, decisions about pairs of antecedent candidates and anaphors are kept local and are not propagated to further decisions. When the pairs classified as positive are merged to form coreference chains, inconsistencies can arise.

To propagate local classification decisions to the global merge step and enforce consistency, Klenner and Ailloud(2009) proposed an approach based on Integer Linear Programming (ILP), a framework for finding optimal local decisions given global constraints. In this approach, the classifications and their weights of local antecedent-anaphor pairs12 served as the input for the ILP constraints. These then enforced the transitivity and exclusiveness properties of coreference during the pair merging step. The transitivity property states that if mention A and mention B are coreferent, and mention B and C are coreferent, then mention A and C have to be coreferent as well. Exclusiveness operates analogously, i.e. if mention A and B are coreferent, and mention B and C are exclusive, then mention A and C are exclusive as well.
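For illustration, a common generic ILP encoding of these two properties (our sketch, not necessarily the exact formulation used by Klenner and Ailloud) introduces binary variables x_ij indicating whether mentions i and j end up in the same chain, and maximizes the classifier-weighted links subject to the constraints:

```latex
% Generic ILP sketch (our illustration): maximize classifier-weighted links
% subject to transitivity and exclusiveness constraints.
\begin{aligned}
\max_{x}\;\; & \sum_{i<j} w_{ij}\, x_{ij}, \qquad x_{ij} \in \{0,1\}\\
\text{s.t.}\;\; & x_{ij} + x_{jk} - x_{ik} \le 1 && \text{(transitivity, in all three rotations per triple)}\\
& x_{ij} + x_{jk} \le 1 && \text{whenever } i \text{ and } k \text{ are known to be exclusive}
\end{aligned}
```

Here w_ij would correspond to the weight the pair classifier assigns to the link between i and j, and the exclusiveness constraint is instantiated only for pairs known to be incompatible.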

Klenner and Ailloud used largely the same feature set as Strube et al.(2002), with limited string matching features (i.e. only one feature that captured whether the syntactic heads of an antecedent candidate and an anaphor matched). Klenner and Ailloud showed that their approach using the global ILP constraints during pair merging outperformed a simple transitive merge of the corresponding positive pairs. The authors obtained an F-score of 64.27% measured in CEAFM (Luo, 2005) with the ILP model, compared to 62.83% using the baseline merge operation.

12This approach also used TiMBL as a classifier. The weight of a TiMBL decision was calculated as the ratio of the positive and negative neighbors of the test instance.

4.6 Broscheit et al.(2010)’s adaption of BART to German

Broscheit et al.(2010) ported the English coreference system BART, a mention-pair system along the lines of Soon et al.(2001), to German using the feature set introduced in Klenner and Ailloud(2009). Broscheit et al. explored five additional features:

• 1/2 PERSON: for each antecedent and anaphor in turn, TRUE if it is first or second person, FALSE otherwise.

• SPEECH: for each antecedent and anaphor in turn, TRUE if it is inside quoted speech, FALSE otherwise.

• NODE DIST: the number of clause nodes and prepositional phrase nodes along the path between anaphor and antecedent in the parse tree.

• PARTIAL MATCH: TRUE if the head of the anaphor is contained in the head of the antecedent or vice versa, FALSE otherwise.

• GERMANET RELATEDNESS: the semantic relatedness between antecedent and anaphor, as found in GermaNet.

Of the added features, the PARTIAL MATCH feature in particular improved performance. Broscheit et al. also compared a Decision Tree and a Maximum Entropy classifier and found that the latter outperformed the former. An additional performance increase was obtained when separate Maximum Entropy classifiers were trained for pronominal and non-pronominal anaphors.

Unfortunately, the authors only reported performance in the true mention setting. That is, the system only had to establish coreference relations between the coreferent NPs and did not have to decide which markables it should attempt to resolve. Klenner and Ailloud(2009), for example, had demonstrated the difference between evaluating on gold mentions vs. evaluating on all markables. Broscheit et al.(2010) compared their system to Klenner and Ailloud(2009)’s ILP-extended approach and achieved an F-score of 65.00%. Klenner and Ailloud had achieved an F-score of 71.50% on the gold mentions.

4.7 The SemEval 2010 shared task (Recasens et al., 2010)

The SemEval 2010 shared task on coreference resolution in multiple languages (Recasens et al., 2010)13 featured German as one of its languages. A subset of the TüBa-D/Z was taken as data, and four out of the six participating systems submitted responses for the TüBa-D/Z test set. A major problem with the evaluation in the SemEval 2010 shared task was that it included and scored singletons (i.e. NPs not in coreference chains) in the coreference annotation. The reason for doing so was the suggestion that systems should be rewarded for not resolving singletons. However, as shown by e.g. Kübler and Zhekova (2011), singletons tend to artificially boost coreference resolution system scores and therefore give a biased impression of the actual system performance. This becomes obvious when the results of the SemEval shared task are analysed based on the MUC metric, which only scores coreference links and is, therefore, not affected by singletons.

One system that achieved competitive MUC F-scores (i.e. above 40%) for German was the BART system, which we have described in the previous section. The other system was SUCRE (Kobdani and Schütze, 2010). It also implemented a mention-pair model and pair generation mechanics derived from Soon et al.(2001). Notably, the system featured elaborate string matching features to determine whether two markables should corefer. This allowed it to recognize, e.g., that the two markables the university student from Germany and the university student from France should not be linked, although their syntactic head and a large portion of the other words contained in the NPs match. Apart from these features, the system made use of features commonly used in coreference resolution and relied on a Decision Tree classifier to classify potential pairs of antecedents and anaphors. Regarding the MUC F-scores, SUCRE outperformed the BART system in the gold setting (where perfect preprocessing, such as dependency parsing, was provided14) with 58.40% vs. 51.1%, but was outperformed by BART in the regular setting (where preprocessing was done automatically) with 45.50% vs. 40.90%. The significant performance drop for SUCRE indicates that the system had problems identifying the markables in automatically preprocessed documents.

4.8 Versley(2010)’s investigation of semantics for noun and name coreference

Versley(2010) focused on coreference of noun and name mentions. His special interest was the resolution of definite noun anaphors (NPs which have a definite article) which cannot be resolved to an antecedent using string matching techniques, as e.g. [Monsanto - the company]. He called these cases coreferent bridging. However, he also investigated coreference between non-pronominal mentions that can be identified through string matching.

13http://stel.ub.edu/semeval2010-coref/
14Cf. section 1.5.4

In his first series of experiments, Versley excluded the problem of anaphoricity detection and evaluated his approaches only on definite NPs that are annotated as coreferent in the gold standard. Versley used the first 125 documents in the TüBa-D/Z version 5 as a test set. He found that roughly 50% of the definite coreferent NPs can be resolved using head string matching, as in e.g. [A company - the company].

For the remaining 50% of the definite NPs, Versley explored a batch of features which captured the relatedness of definite anaphoric NPs and their antecedents. He found that a hyperonymy look-up in GermaNet achieved the best Precision (67%) for the non-matching anaphors, and the best F-score (68%) for all definite anaphoric NPs. Apart from GermaNet, Versley applied an impressive number of diverse features based on different distributional similarity models and pattern-based approaches to derive noun relatedness measures from corpora. Combining these features and applying a batch of filters, Versley achieved an F-score of 73% for all definite anaphoric NPs in his test set.

For the second series of experiments, Versley included the anaphoricity detection problem of deciding whether a definite NP should be resolved or not. In this more realistic setting, he found that trying to resolve non-matching definite NP anaphors is more difficult and less beneficial for definite NP resolution.

4.9 The incremental entity-mention model by Klenner and Tuggener(2010)

Beside Hartrumpf(2001), Klenner and Tuggener(2010) was, to the best of our knowledge, the only approach to German coreference resolution which departed from the mention-pair paradigm. Since this system and the later publications surrounding it (Klenner and Tuggener, 2011a,b, Tuggener and Klenner, 2014) serve as a basis for this thesis, we will not discuss it in more detail at this point.

4.10 Rösiger and Riester(2015)’s adaption of HOTCoref to German

Recently, Rösiger and Riester(2015) presented an adaption of the HOTCoref system (Björkelund and Kuhn, 2014) to German. HOTCoref itself is currently the best-performing combined system for multilingual coreference resolution (i.e. for English, Arabic, and Chinese). HOTCoref promotes the idea of modeling coreference sets as trees, following Fernandes et al.(2012), which in some cases yields more plausible antecedents for learning than the pair generation mechanics presented in Soon et al.(2001), i.e. the mention-pair model. Since coreference sets are modeled as trees, it is possible to apply structured prediction algorithms, such as the structured perceptron. Rösiger and Riester’s adaption of HOTCoref to German was geared towards exploring the use of prosodic features in coreference resolution based on the DIRNDL corpus (Björkelund and Kuhn, 2014). Therefore, they did not focus on improving the baseline system performance on the TüBa-D/Z corpus or pay particular attention to pronoun resolution. Nevertheless, Rösiger and Riester showed that their adaption achieved competitive results on the TüBa-D/Z corpus version 9 with an average F-score of 53.63%, as well as on the SemEval shared task data with average F-scores of 60.35% for the evaluation including singletons and 48.61% in the run without singletons.

4.11 Chapter summary

In this chapter, we provided an overview of related work on German pronoun and coreference resolution. We found that most approaches straightforwardly adapt methods from approaches to coreference resolution for English and do not account explicitly for the problem of underspecification of certain German pronouns. That is, none of the related work accounts for the fact that German pronouns can refer to both animate and inanimate entities and that certain German pronouns are morphologically underspecified.

Only Hartrumpf(2001) and Strube et al.(2002) included semantic features which denoted the entity type of an antecedent candidate and which implicitly modeled animacy. However, in the approach of Strube et al., these features were added manually and were therefore not applicable in an end-to-end coreference system. Also, it is not clear how animacy and semantic class membership of antecedent candidates that are pronouns are handled in these approaches. In our incremental entity-mention model, which we will discuss in the next chapter, we encode animacy and named entity class as features to capture prior probabilities of linking pronouns to animate or inanimate entities and to different named entity types. Also, our entity-mention model is able to label resolved pronouns with these semantic class properties by projecting the properties of the selected antecedents onto them. Therefore, pronominal antecedents then carry the semantic load of the entity they denote and are no longer semantically empty.

We note further that most of the related work employs a mention-pair model. The rule-based RAP-G approach by Hinrichs et al.(2005) featured an equivalency class representation which kept track of previous decisions. However, this class was not used to propagate entity-level features or to ensure consistency within coreference chains. Thus, most of the related approaches are susceptible to the commonly known weaknesses of the mention-pair model. A clear exception is the CORUDIS system (Hartrumpf, 2001) which employed an incremental architecture that enabled it to enforce (at least semantic) consistency in the coreference chains.

One technique shared by almost all related approaches is the filtering of morphologically incompatible antecedent candidates. We argued that German pronoun resolution is difficult since animacy is not a hard constraint for filtering candidates. Therefore, the number of candidates that need to be considered is potentially large. Filtering based on morphological constraints (i.e. gender and number agreement) is thus a reliable technique to reduce the number of potential candidates and has been shown to effectively do so, despite the underspecification of certain pronouns, which increases the number of candidates that have to be considered even more. Our incremental entity-mention model also adopts this kind of filtering. One advantage of our model over related work is that resolved pronouns are disambiguated with the properties of the selected antecedent. Thus, morphological compatibility is established based on the entity-level morphological properties, which alleviates the problem of local underspecification of e.g. pronominal antecedents. Once disambiguated, such pronouns can only act as antecedents for specific subsequent pronouns.

Chapter 5

Empirical validation of our entity-mention model

In this chapter, we empirically evaluate the models introduced in chapter 2. The evaluation serves two main purposes:

1. Empirically validate the claims made in section 2.3 about the theoretical advantages of the entity-mention model over the mention-pair model w.r.t. German pronoun resolution.

2. Compare different antecedent selection strategies outlined in chapter 2, namely the best-first heuristic, the twin candidate model, and mention ranking. We implement three machine learning frameworks for antecedent selection which correspond to the three strategies.

We first give an overview of the data and then discuss relevant implementation details regarding the models. Finally, we empirically explore the pronoun resolution performance of our approaches in this chapter. Throughout our discussion here, we focus on third person pronouns, because this is the area where we believe our approach makes a major contribution. All our models use the same state-of-the-art strategy to resolve noun markables, providing full coreference resolution for German. We outline the details of our approach for coreference resolution between nominal markables in section A.2. Details on the resolution of first person pronouns to nominal antecedents are given in section A.1.


5.1 Data and preprocessing: The TüBa-D/Z corpus

We make use of the TüBa-D/Z treebank (Telljohann et al., 2004) version 9.1 for training and testing our system. The TüBa-D/Z consists of articles gathered from the German newspaper ‘die Tageszeitung’ (taz).1 TüBa-D/Z version 9.1 features 3644 articles (95’595 sentences; 1’787’801 tokens). The corpus provides gold annotation layers for morphology, syntax, named entity types, and coreference, among others, and has been used extensively for German coreference resolution in related work (Hinrichs et al., 2005, 2007, Wunsch et al., 2009, Wunsch, 2010, Versley, 2010, Màrquez et al., 2012, our work).

Versley(2006) investigated interannotator agreement for coreference on 60 documents and found an agreement MUC F-score of 85% after resolving mention boundary issues.2 Versley(2006) concluded that this was slightly higher than the figures for a comparable corpus for English coreference resolution (the MUC-6 corpus). The TüBa-D/Z has undergone several refinements since. Still, one can never expect perfect gold annotation, and thus the 85% MUC score can be viewed as an upper bound for automated coreference resolution.

Figure 5.1 provides a quantitative overview of the third person pronouns in the TüBa-D/Z 9.1 that are relevant to our investigation. The figure shows both the count of anaphoric (upper black bar) and non-anaphoric (lower gray bar) pronouns per PoS type as given by the gold standard annotation.3 Since our approach makes the naive assumption that all considered pronouns are anaphoric, this figure gives an estimate of how many false positives we will encounter, which is bound to lower the performance of our approach in terms of Precision.

The non-anaphoricity of third person pronouns in the gold annotation has several sources. We manually investigated some of these pronoun instances and found that many of them simply lack annotation in the gold standard. This occurred particularly often for relative pronouns (PRELS, PRELAT). Demonstrative pronouns (PDS) frequently refer to whole clauses and therefore do not have an NP antecedent. In such cases, a demonstrative pronoun is not annotated with coreference. Finally, there are cases where pronouns are indeed anaphoric and do have a linguistic antecedent, but the antecedent does not denote an actual extralinguistic entity. Such cases of bound anaphora are not always annotated in the corpus.4 This accounts for the lacking annotation regarding some personal (PPER) and possessive (PPOSAT) pronouns.

1http://www.taz.de/
2Given two annotators, the agreement MUC F-score was calculated by taking one annotator’s annotation as the key and the other annotator’s annotation as the response. Then, key and response were inverted to get the second F-score. The average F-score then served as the agreement score.
3Note that we exclude pleonastic uses of es (it), see below.
4Cf. section 1.4.1.3

Figure 5.1: Distribution of 3rd person pronouns (excluding es) in the TüBa-D/Z version 9.1. Counts of anaphoric / non-anaphoric instances per PoS type: PPER 17,682 / 805; PPOSAT 10,929 / 694; PRELS 10,630 / 1,282; PDS 1,546 / 5,150; PRELAT 459 / 23.

We perform a 20%-20%-60% split of the TüBa-D/Z to obtain the test, development, and training set, respectively. The test set consists of the first 690 documents, the development set of the following 690, and the training set of the remaining 2264 documents. For comparison, Wunsch(2010) performed ten-fold cross-validation on an earlier version of the TüBa-D/Z which contained 27’125 sentences (473’747 tokens), i.e. which was about 30% of the size of the current one. Versley(2010) used the first 125 TüBa-D/Z articles of an earlier version as a test set in his experiments on noun and name coreference. The SemEval shared task on coreference resolution for multiple languages (Màrquez et al., 2012) also featured a subset of an earlier TüBa-D/Z version. Its test set contained 136 documents. We thus deem our test set to be of reasonable size.

5.1.1 Other German corpora featuring coreference annotation

There are three other German corpora we are aware of which feature coreference annotation. The Potsdam Commentary Corpus (Stede, 2004) provides an annotated data set comprised of 176 articles from the newspaper ‘Märkische Allgemeine Zeitung’. Besides coreference, the corpus features annotation of discourse structure, as well as connectives and their arguments. The coreference annotation is available in a CoNLL-style format. We evaluate our models on this corpus in section 5.4.4.

The DIRNDL corpus (Björkelund et al., 2014) contains 618 documents gathered from German radio broadcasts and is also coreferentially annotated and available in a CoNLL format. However, possessive and relative pronouns are not annotated with coreference, and we only counted 218 instances of annotated personal pronouns (excluding es), which is a low count compared to the TüBa-D/Z corpus.

Finally, the Heidelberg Text Corpus (HTC) explored in Strube et al.(2002) and Kouchnir (2004) consists of 242 short texts about sights, historic events, and persons in Heidelberg. Unfortunately, the corpus is only available in the MMAX format (Müller and Strube, 2001).5 Without the CoNLL format, the official coreference scorer and our ARCS scorer are not applicable, and thus we are not able to evaluate our system on this corpus.

5.1.2 Markable extraction

In this section, we describe how we extract the markables, i.e. the noun phrases and pronouns that we consider for coreference resolution. Recall that we view coreference resolution through the lens of downstream applications in Computational Linguistics and NLP. In this perspective, we identify those types of markables whose resolution we deem useful for such subsequent applications.

5.1.2.1 Part-of-speech-based identification of markables

We define the types of markables we aim to extract by their part-of-speech (PoS) tags. The TüBa-D/Z employs the Stuttgart/Tübingen Tag Set (Schiller et al., 1999, STTS) for the part-of-speech annotation of tokens. From the STTS, we select the subset shown in table 5.1, which features coreference annotation in the corpora investigated.

The TüBa-D/Z contains coreference annotation for additional pronouns, but these have few instances in the data compared to the ones listed above. Additionally, related work on German pronoun resolution has mainly focused on personal and possessive pronouns (Strube et al., 2002, Schiehlen, 2004, Hinrichs et al., 2005, Wunsch, 2010, inter alia).

5Initial efforts to convert the MMAX format to CoNLL proved to be cumbersome and were not fruitful due to idiosyncrasies in the MMAX format. Due to time constraints, we were unfortunately not able to pursue the effort further.

PoS        Description                 Example (DE)       (EN)
Nouns
NN         Common noun                 Anwalt             (attorney)
NE         Named entity                Danzig             (Danzig)
Pronouns
PPER       Personal pronoun            sie                (she / they)
PPOSAT     Possessive pronoun          ihr                (her / their)
PRELS      Relative pronoun            Leute, die         (people who)
PRELAT     Attributing rel. pronoun    Menschen, deren    (people whose)
PDS        Demonstrative pronoun       dies               (this)

Table 5.1: PoS tags of head tokens of NPs and pronouns considered for coreference resolution in our system.

Regarding reflexive pronouns (PRF: sich, mich, dich, euch), the TüBa-D/Z coreference annotation guideline (Naumann, 2007) employs two categories. The first category consists of uses of the reflexive pronoun sich as part of an inherently reflexive verb, e.g. sich ereignen (to happen itself*). The reflexive in these cases is bound to the verb, and the verb cannot be used correctly without it. Here, the reflexive pronoun arguably does not refer to an antecedent and is therefore not annotated as anaphoric. The second category covers the use of reflexives in combination with verbs which do not depend on them, e.g. sich waschen (to wash oneself) vs. etwas waschen (to wash something). These cases of sich are annotated as anaphoric.

The TüBa-D/Z 9.1 contains 10689 instances of the reflexive sich, of which 9284 are annotated as being non-anaphoric, i.e. as part of an inherently reflexive verb. On the one hand, this yields a baseline of 86.86% accuracy of deciding not to resolve sich. On the other hand, when the reflexive is annotated as anaphoric, it refers to the grammatical subject of the governing verb in virtually all cases. That is, there is little or no ambiguity involved and, if desired, sich can safely be resolved to the aforementioned subject. Here, again, we take the view of higher-level NLP applications and argue that the annotation guideline is too fine-grained. Therefore, we exclude reflexives from our approach. However, it is noteworthy that when we use the common coreference evaluation metrics, which are ignorant of mention PoS types, we lower the upper bound for Recall of our experiments.

Apart from excluding reflexives and certain rare pronouns, we make an additional and arguably big cut: the exclusion of the notorious (because potentially pleonastic) 3rd person pronoun es (it). Related work on English and German pronouns has introduced several approaches to determine the (non-)anaphoricity of it and es, ranging from simple filters (lists of verbs which subcategorize a pleonastic it as the grammatical subject, e.g. it rains (Lappin and Leass, 1994)) to elaborate machine learning approaches (Bergsma et al., 2008b).

The TüBa-D/Z 9.1 contains 8891 instances of es, of which only 706 (7.94%) are anaphoric. This yields a baseline accuracy of 92.06% for not resolving es.6 Furthermore, our initial approaches to address the problem were disappointing. We found that identifying pleonastic uses of es based on heuristics achieves relatively solid Precision accompanied by poor Recall for the anaphoric uses of es. That is, identifying anaphoric uses of es has proven to be more difficult, mainly because of the large imbalance in the training data. Thus, finding indicators for the anaphoric use of es has proven futile in our initial approaches.

Given the reasoning above, we decided not to resolve es in our experiments, like Klenner and Ailloud(2008). For comparison, Wunsch(2010, p. 151) stated that he relied on the gold annotation to decide whether to resolve es, i.e. he excluded the anaphoricity detection problem. Schiehlen(2004) also relied on the gold standard for anaphoricity detection of third person pronouns. That is, he did not restrict the anaphoricity check to es, but applied it to all third person pronouns.

Beside the restrictions listed above, we process all NPs and pronouns of the PoS subset listed in table 5.1. More specifically, we consider all encountered pronouns to be anaphoric, and if we find compatible antecedent candidates (which meet several filter criteria), we resolve them.

5.1.2.2 Identification of markable boundaries

As outlined in section 1.5.3, correctly identifying markable boundaries (i.e. the span of tokens that denote an NP) is of crucial importance, since evaluation only deems system mentions to be resolved correctly if their boundaries perfectly align with those of the gold mentions. An easy way of ensuring correct mention boundaries would be to only extract gold mentions from the data. However, we aim to develop an end-to-end coreference resolution approach which can be applied on raw text where gold mention boundaries are not known, which is a more realistic setting.

We apply the following heuristics to identify markable boundaries. Traversing the CoNLL-style format of the TüBa-D/Z (e.g. table 1.1), we look at the PoS tag of each token. If we encounter a pronoun PoS tag from the set listed in table 5.1, we assume that the pronoun features a single-word markable extension. That is, we extract the pronoun as a markable denoted by a single token.

For nouns and named entity tokens, we find the maximal NP projection by gathering all tokens that (recursively) link to the noun or name token, as indicated by the dependency parse. We then cut relative clauses, punctuation etc. at the end of the projection with heuristics developed over the development set.

6By comparison, in the state-of-the-art corpus for coreference resolution for English, OntoNotes (Pradhan et al., 2013), 771 out of 1318 (58.5%) of the occurrences of the pronoun it are anaphoric.

To cope with mention boundary problems, we check whether all gold mentions are represented as markables once the end of a document is reached. If a gold mention lacks a corresponding markable, we traverse the markables and heuristically determine7 the closest matching one and adjust its boundaries to those of the gold mention.
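To make the extraction heuristics concrete, the following sketch is our own illustration of the two rules described above; the CoNLL row layout and all names are assumptions, not the actual implementation.

```python
# Sketch of the markable extraction heuristics (our illustration; the row
# format and names are assumptions, not the actual implementation).

PRONOUN_TAGS = {"PPER", "PPOSAT", "PRELS", "PRELAT", "PDS"}
NOMINAL_TAGS = {"NN", "NE"}

def extract_markables(sentence):
    """sentence: list of dicts with 'id', 'form', 'pos', 'head' (dependency head id)."""
    markables = []
    # Map each head id to its direct dependents for the recursive projection.
    children = {}
    for tok in sentence:
        children.setdefault(tok["head"], []).append(tok["id"])

    def projection(tok_id):
        # Recursively collect the token and everything attached below it.
        span = {tok_id}
        for child in children.get(tok_id, []):
            span |= projection(child)
        return span

    for tok in sentence:
        if tok["pos"] in PRONOUN_TAGS:
            markables.append((tok["id"], tok["id"]))   # single-token markable
        elif tok["pos"] in NOMINAL_TAGS:
            span = projection(tok["id"])               # maximal NP projection
            # Relative clauses, trailing punctuation etc. would be cut here.
            markables.append((min(span), max(span)))
    return markables
```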

By comparison, Hinrichs et al.(2005) allowed system mentions to contain the gold mentions and vice versa in order to be counted as correct in their evaluation. Wunsch(2010), p. 123, counted system antecedents as correct if they featured the same head token as the corresponding gold antecedent. Versley(2010), p. 213, collected all projections of an NP’s head and counted system antecedents as correct if any of the corresponding projections denoted the gold antecedent.

Once markable spans are identified, we create feature vectors that describe the markables. Table 5.2 shows the initial vector representation we create for the possessive pronoun and its antecedent in our running example, i.e. [ihren] and [Arbeiterwohlfahrt] (abbreviated as AWO). This representation, with several modifications and additions, constitutes the input for our coreference models.

Feature          [Arbeiterwohlfahrt]    [ihren]
markable ID      11                     12
sent. number     4                      4
start token      4                      7
end token        6                      7
PoS              NN                     PPOSAT
person           3                      3
gender           FEM                    *
number           SG                     *
gram. role       SUBJ                   DET
animacy          -                      *
gov. token ID    13                     13
gov. verb        entlassen              entlassen
NE type          ORG                    -
connector        -                      -
lemma            AWO                    ihr

Table 5.2: Example of markable instances represented as feature vectors. ’*’ denotes underspecified feature values.

All features shown in table 5.2 are directly extracted from the TüBa-D/Z data, except the animacy feature and the gender of person entities. To determine the gender of person entities, e.g. “Bill Clinton”, we use a list of first names divided into male and female names which was gathered from the Internet. We determine the animacy of common noun markables by a look-up in GermaNet (Hamp and Feldweg, 1997), a German wordnet. We deem a markable animate if its head noun is a hyponym of the synset Mensch (human).
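A minimal sketch of these two look-ups is given below (our illustration; the name lists and the GermaNet query function are placeholders for the gazetteer and the GermaNet interface actually used, which we do not reproduce here).

```python
# Sketch of the animacy and person-gender heuristics (our illustration).
# FIRST_NAMES_* and is_hyponym_of_mensch() are placeholders for the name
# gazetteer and the GermaNet hyponymy query used in the actual system.

FIRST_NAMES_MALE = {"Bill", "Peter"}
FIRST_NAMES_FEMALE = {"Angela", "Petra"}

def is_hyponym_of_mensch(lemma):
    """Placeholder for the GermaNet look-up: is the lemma a hyponym of 'Mensch'?"""
    return lemma in {"Anwalt", "Frau", "Mann"}

def animacy(markable):
    # Person named entities are animate; common nouns are checked in GermaNet.
    if markable["ne_type"] == "PER":          # 'PER' is an assumed label
        return "animate"
    return "animate" if is_hyponym_of_mensch(markable["head_lemma"]) else "inanimate"

def person_gender(markable):
    # Gender of person entities via the first-name lists.
    first = markable["tokens"][0]
    if first in FIRST_NAMES_MALE:
        return "MASC"
    if first in FIRST_NAMES_FEMALE:
        return "FEM"
    return "*"  # underspecified
```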

Having outlined the data and our markable extraction procedure, we next discuss the implementation of the coreference models.

7For example by removing or adding PP attachments, adverbs, or parentheses.

5.2 Model implementations

In this section, we specify how we implement the incremental entity-mention and the mention-pair models which we evaluate later in this chapter. In chapter 2, we outlined the conceptual differences between the two models and argued for the advantages of our incremental entity-mention model. Here, we focus on how certain functions of the algorithms are realized in their implementations. This includes, e.g., choosing appropriate filter settings for certain parameters, such as the sentence distance between two potentially coreferring markables.

5.2.1 Outline of the algorithms

In section 2.1, we have outlined the standard algorithms for training and testing in the Soon et al.(2001) implementation of the mention-pair model. In the previous chapter, we have discussed how German coreference resolution approaches adapted this model (Strube et al., 2002, Hinrichs et al., 2005, 2007, Klenner and Ailloud, 2009, Wunsch, 2010). Since related work used different evaluation protocols and different test sets, we implement a related work baseline which is based on the mention-pair model. This enables us to more directly compare our incremental entity-mention model to previous work than contrasting F-scores obtained in different evaluation settings.

Table 5.3 juxtaposes the two algorithms we investigate in our experiments. Note that we depart slightly from the original mention-pair algorithm presented in Soon et al. (2001) and use an adaption in line with Hinrichs et al.(2005), Klenner and Ailloud(2009), Wunsch(2010). That is, for personal and possessive pronouns, we query the preceding three sentences for antecedent candidates, both during the creation of training instances and during the automatic resolution of the pronouns. During testing, we choose the highest weighted candidate within the three sentence window as antecedent, i.e. we adopt a best-first heuristic.8

Comparing the two algorithms, we see that the mention-pair model lacks three main features compared to our incremental entity-mention model:

• The incremental formation of the coreference partition (lines 11-13), which alleviates the need for a final clustering step

• The division of antecedent candidates into those stemming from the coreference partition and those coming from the buffer list (lines 2-7)

• The disambiguation of resolved anaphors (line 10)

Algorithm: Mention-pair model

Input: Markables
Output: Coreference partition
1: for mi ∈ Markables do
2:   for mj ∈ BufferList do
3:     if compatible(mj, mi) then
4:       Candidates ⊕ mj
5:   ante ← get_best(Candidates)
6:   if ante ≠ ∅ then
7:     Pairs ⊕ {ante ⊕ mi}
8:   BufferList ⊕ mi
9: CorefPartition ← trans_merge(Pairs)
10: return CorefPartition

Algorithm: Entity-mention model

Input: Markables
Output: Coreference partition
1: for mi ∈ Markables do
2:   for ek ∈ CorefPartition do
3:     if compatible(ek_n, mi) then
4:       Candidates ⊕ ek_n
5:   for mj ∈ BufferList do
6:     if compatible(mj, mi) then
7:       Candidates ⊕ mj
8:   ante ← get_best(Candidates)
9:   if ante ≠ ∅ then
10:     ante, mi ← disambiguate(ante, mi)
11:     if ∃ ek ∈ CorefPartition : ante ∈ ek then ek ⊕ mi
12:     else
13:       CorefPartition ⊕ {ante ⊕ mi}
14:       BufferList ⊖ ante
15:   else
16:     BufferList ⊕ mi
17: return CorefPartition

Table 5.3: Mention-pair vs. entity-mention algorithms used in our experiments.

We have outlined in section 2.3 that the function disambiguate(·, ·) propagates all semantic and morphological information from the antecedent onto the anaphor, effectively disambiguating its morphological properties in the case of underspecification. Additionally, we project the grammatical role of an antecedent to a possessive pronoun in the case that both are in the same sentence. The salience of an entity depends to a large degree on the grammatical role of its last mention in our entity-mention model. In the case that an entity occurs e.g. first as a subject (high salience) and then as a possessive pronoun (lower salience) in a sentence, we want to keep the high salience evoked by the subject mention. In other words, we do not want the salience of the entity to degrade in a sentence only because there is also a possessive pronoun mention of that entity. We found in Tuggener and Klenner(2014) that this technique improves performance overall.
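The following sketch is our own rendering of the entity-mention algorithm from table 5.3 in code; compatible(), get_best(), and disambiguate() are stand-ins for the filtering, candidate ranking, and feature projection components described in this chapter.

```python
# Sketch of the incremental entity-mention algorithm of table 5.3 (our
# rendering; compatible, get_best and disambiguate are stand-ins for the
# components described in the text).

def resolve(markables, compatible, get_best, disambiguate):
    coref_partition = []   # list of entities, each a list of mentions
    buffer_list = []       # mentions not (yet) in any entity

    for m_i in markables:
        candidates = []
        # Candidates from established entities: only the most recent mention
        # of each entity is accessible.
        for entity in coref_partition:
            last_mention = entity[-1]
            if compatible(last_mention, m_i):
                candidates.append(last_mention)
        # Candidates from the buffer of so far unresolved mentions.
        for m_j in buffer_list:
            if compatible(m_j, m_i):
                candidates.append(m_j)

        ante = get_best(candidates, m_i)
        if ante is not None:
            disambiguate(ante, m_i)   # project morphology/semantics onto m_i
            for entity in coref_partition:
                if ante in entity:
                    entity.append(m_i)
                    break
            else:
                # ante came from the buffer: open a new entity.
                coref_partition.append([ante, m_i])
                buffer_list.remove(ante)
        else:
            buffer_list.append(m_i)

    return coref_partition
```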

5.2.2 Morphological agreement and distance constraints

One main function in both algorithms is compatible(·, ·). This function determines the morphological compatibility of a pronoun and a potential antecedent candidate. As we have seen in the previous chapter, filtering based on morphological agreement is a crucial step to reduce the number of potential antecedent candidates for German pronouns, removing up to 50% of the incorrect potential candidates. However, the underspecification of certain German pronouns complicates this step, as these underspecified pronouns allow for specific subsets of all possible morphological attributes. That is, these pronouns are not fully underspecified, which would license any candidate antecedent. Therefore, testing morphological compatibility is not a simple unification process. To account for the specific possible combinations of morphological properties, we implement a PoS- and lemma-based filtering scheme for matching number and gender properties, similar to Wunsch(2010). Note that all antecedent candidates and pronouns have to match regarding their person feature.9 Also, a personal pronoun cannot link to an antecedent governed by the same verb as the pronoun. For example, in the sentence “Peter likes him”, “him” cannot refer to “Peter” due to binding constraints. Our filtering scheme works as follows (a code sketch of the scheme is given after the list):

8Cf. section 2.1 for an explanation of the best-first vs. the closest-first heuristic.

• Lemma-based filtering: We first ensure the exclusiveness of gender incompatible pronouns by lemma-based matching. That is, pairs are discarded if the pronoun lemma is either sie (she/they) or ihr (her/their) and the antecedent candidate is singular and masculine or neuter. Conversely, if the pronoun lemma is either er (he) or sein (his/its) and the pronoun lemma of the antecedent candidate is sie (she/they) or ihr (her/their), the pair is discarded.

• Gender and number matching for non-possessive pronouns: Pairs of personal, relative, and demonstrative pronouns and potential antecedent candidates are licensed if they match in number and gender, i.e. if they share the same respective values. This constraint is relaxed in the following manner. If the pronoun or the antecedent candidate is underspecified in both number and gender, the pair is licensed. If either the pronoun or the antecedent is underspecified in number, but gender matches, the pair is licensed. Conversely, if either the pronoun or the antecedent is underspecified in gender but number matches, the pair is licensed.

• Possessive pronouns: For possessive pronouns, we license antecedent candidates based on the lemma of the possessive pronoun. That is, for sein (his/its), the candidate has to be masculine or neuter, but not feminine and singular. For ihr (her/their), the candidate has to be either plural, or singular and feminine.
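The following sketch is our own simplified rendering of the filtering scheme above; attribute names and the ‘*’ convention follow table 5.2, and the relaxation for underspecified values is approximated by treating ‘*’ as a wildcard.

```python
# Sketch of the PoS- and lemma-based compatibility filter (our illustration,
# simplified; '*' marks underspecified values as in table 5.2).

def morph_compatible(ante, pron):
    """ante/pron: dicts with 'lemma', 'pos', 'gender', 'number' ('*' = unknown)."""
    # Lemma-based exclusiveness: sie/ihr cannot take singular masculine or
    # neuter antecedents; er/sein cannot take sie/ihr pronoun antecedents.
    if pron["lemma"] in {"sie", "ihr"}:
        if ante["number"] == "SG" and ante["gender"] in {"MASC", "NEUT"}:
            return False
    if pron["lemma"] in {"er", "sein"} and ante["lemma"] in {"sie", "ihr"}:
        return False

    if pron["pos"] == "PPOSAT":
        # Possessives are licensed via their lemma.
        if pron["lemma"] == "sein":
            return not (ante["gender"] == "FEM" and ante["number"] == "SG")
        if pron["lemma"] == "ihr":
            return ante["number"] == "PL" or (
                ante["number"] == "SG" and ante["gender"] in {"FEM", "*"})
        return True

    # Personal, relative, demonstrative pronouns: gender/number matching,
    # relaxed by treating underspecified values as wildcards.
    def agree(a, b):
        return a == b or a == "*" or b == "*"
    return agree(ante["gender"], pron["gender"]) and agree(ante["number"], pron["number"])
```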

Both the mention-pair and our entity-mention model use the same filtering scheme. The difference between the two models w.r.t. morphological filtering becomes evident when a pronoun which is per se underspecified serves as an antecedent candidate for another pronoun. For example, assume we resolve an instance of the personal pronoun [sie]Sg. (she) and the antecedent candidate is a possessive pronoun [ihr]* (her/their). Let us assume we have already resolved [ihr]* to [Frauen]Pl. in both models. The mention-pair model would generate the pair [ihr]* - [sie]Sg., although they are exclusive. By contrast, the entity-mention model would have projected the morphological properties onto the possessive pronoun (line 10 of the entity-mention algorithm in table 5.3) and thus would not create the pair [ihr]Pl. - [sie]Sg..

9Cf. section A.1 for the heuristics we apply to resolve first person pronouns to their third person antecedents.

As stated in the previous section, we allow for a window of three preceding sentences when looking for antecedent candidates for personal and possessive pronouns. Obviously, relative pronouns can only bind to antecedents in the same sentence. Demonstrative pronouns present a difficult class, since our assumption that all pronouns are anaphoric leads to many false positives here, as can be seen in figure 5.1. The difficulty of resolving demonstrative pronouns is also documented in Schiehlen(2004) and Strube et al.(2002). Schiehlen reported an overall pronoun F-score of 65.4%, but demonstratives only achieved a 16.6% F-score. Similarly, Strube et al. reported an F-score of 82.79% for personal pronouns, but only a 15.38% F-score for demonstratives. To cope with the anaphoricity detection problem, we limit the search for antecedent candidates to the current and previous sentence. This restriction can be rooted in linguistic theory on entity accessibility in short-term memory. Such theories (Ariel, 1988, Gundel et al., 1993, inter alia) state that demonstratives can be used to refer to entities that are activated in the hearer’s short-term memory, but are not necessarily in the discourse focus at the time of the occurrence of the demonstrative. That is, demonstratives are generally used to refer to entities which have been mentioned very recently but are not currently the most salient ones. We thus argue that limiting the sentence distance more strictly for antecedent candidates of demonstrative pronouns is a reasonable approach to tackle the anaphoricity problem. That is, if no candidates are found in the current or previous sentence, we do not resolve a demonstrative pronoun.

5.3 Pronoun Resolution

In this section, we evaluate the performance of our incremental entity-mention model and compare it to various baselines. We first assess the difficulty of the resolution problem based on the average count of antecedent candidates per pronoun type. We then discuss the issue of the imbalance of positive and negative examples in the training data and present a feature weighting scheme which accounts for the imbalance. We evaluate this weighting scheme by comparing it to more elaborate machine learning frameworks. We overview feature sets used in our approach and in related work and compare them empirically. Finally, we perform an error analysis and discuss pronoun instances that are difficult to resolve.

5.3.1 Estimation of the difficulty of the antecedent selection

We have argued that pronoun resolution in German is more difficult than e.g. in English, since i) certain German pronouns are morphologically underspecified, and ii) German pronouns can refer both to animate and inanimate entities. Therefore, the number of candidates that have to be considered for a given pronoun is potentially higher than in English. Recall, however, that both the filtering based on morphosyntactic constraints and distance, and the pair generation mechanics in the entity-mention model reduce the initial set of candidates.

To assess the difficulty of the pronoun resolution task for German, we count how many candidates there are on average per pronoun type. We compare the numbers for the mention-pair and the incremental entity-mention model, thereby quantifying the reduction of candidates when moving to an entity-mention architecture.
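The statistics reported below can be computed with a simple procedure along the following lines (our illustration; the candidate generation function stands for whichever model is being analysed).

```python
# Sketch of how the per-PoS candidate statistics are computed (our
# illustration); generate_candidates() is whichever model is being analysed.
from statistics import mean, stdev

def candidate_stats(pronouns, generate_candidates, multiple_only=False):
    counts = {}
    for pron in pronouns:
        n = len(generate_candidates(pron))
        if n == 0 or (multiple_only and n < 2):
            continue
        counts.setdefault(pron["pos"], []).append(n)
    # Per PoS type: number of instances, mean and standard deviation of counts.
    return {pos: (len(ns), mean(ns), stdev(ns) if len(ns) > 1 else 0.0)
            for pos, ns in counts.items()}
```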

We discussed in section 3.4 that Mitkov(2001) defined the critical success rate to measure pronoun performance in only those cases where there are competing antecedent candidates (i.e. more than one) for a pronoun. In this spirit, we further divide the analysis into cases where there are multiple candidates and all cases, which includes cases with only one candidate. We plot the average number of candidates in figures 5.2 and 5.3. The numbers are obtained from our test set, i.e. on the first 690 documents of the TüBa-D/Z v9.1. The plots indicate the mean and standard deviation of the numbers of candidates per pronoun type.

The left plots in figures 5.2 and 5.3 show the average number of antecedent candidates for pronoun instances with multiple candidates. The plots to the right show the numbers for all pronouns, i.e. including those with only one candidate. For example, in the mention-pair model (figure 5.2), we generate 6.13 candidates on average for personal pronouns (PPER) when there are multiple candidates (left plot). By contrast, we only create 4.89 candidates on average for personal pronouns in the entity-mention model (figure 5.3) when there are several candidates (left plot). Furthermore, it occurs more frequently that there are multiple candidates in the mention-pair model (i.e. 2649 times; indicated below PPER) than in the entity-mention model (2261 times). We can interpret this in the following way: the problem of selecting the correct antecedent in the mention-pair model is more difficult than in the entity-mention model, because there are on average more candidates to choose from, and it is more often the case that there are multiple candidates to choose from.

Figure 5.2: Average number of antecedent candidates per pronoun type in the mention-pair model (left panel: cases with multiple antecedent candidates; right panel: all cases). Instance counts per PoS type: multiple-candidate cases — PPOSAT 1787, PPER 2649, PDS 167, PRELAT 23, PRELS 569; all cases — PPOSAT 1863, PPER 3066, PDS 220, PRELAT 61, PRELS 1509.

Figure 5.3: Average number of antecedent candidates per pronoun type in the incremental entity-mention model (left panel: cases with multiple antecedent candidates; right panel: all cases). Instance counts per PoS type: multiple-candidate cases — PPOSAT 1684, PPER 2261, PDS 156, PRELAT 22, PRELS 555; all cases — PPOSAT 1863, PPER 3064, PDS 218, PRELAT 61, PRELS 1509.

Note that the plots only show the count of pronoun instances that actually have compatible antecedent candidates. Looking at the plots to the right in both figures, we see that the entity-mention model, compared to the mention-pair model, only lacks compatible candidates for 2 instances of personal pronouns (3066 vs. 3064) and demonstrative pronouns (220 vs. 218).

Furthermore, the means in the right plots are lower, since the single candidate cases are included, lowering the average number. However, they are only marginally lower, indicating that for most pronoun instances there are multiple antecedents to choose from. In the mention-pair model, for the personal pronouns (PPER), we create multiple candidates in 86% of the cases (2649/3066) and for possessive pronouns (PPOSAT) in 96% of the cases (1787/1864). In comparison, in the entity-mention model, we generate multiple candidates in 74% of the cases (2261/3064) for personal pronouns, and for possessive pronouns there are multiple candidates in 90% (1684/1863) of the cases.

The relative pronouns (PRELS and PRELAT) and the demonstratives (PDS) show different effects compared to the personal and possessive pronouns. When all cases are considered, the mean of the relative pronouns is below 2, and when only cases with multiple candidates are considered their mean is near 2, the minimum number of candidates in this view. In the entity-mention model, we need to choose a candidate only in 37% of the cases (555/1509; PRELS), or in 36% (22/61; PRELAT), respectively. The numbers are similarly low for the mention-pair model. Obviously, relative and demonstrative pronouns have few antecedent candidates, because we only search the current sentence for antecedents given relative pronouns and extend the window to the previous sentence for demonstrative pronouns.

The highest number of candidates is generated for the possessive pronouns (PPOSAT). Like for the personal pronouns (PPER), we search a window of three sentences to the left of possessive pronouns for compatible candidates. However, since the most frequent possessive pronouns (sein (his/its), ihr (her/their)) are underspecified, we generate on average twice as many candidates (mean 7.63 for all cases; entity-mention model) for the possessive pronouns as for the personal pronouns (mean 3.87 for all cases; entity-mention model) within the three sentence window.

Finally, we note that the standard deviations are rather high for all pronoun types and models, indicating that the number of candidates varies quite strongly w.r.t. the mean. To the best of our knowledge, we are the first to make visible the heterogeneous distribution of the candidate counts. This potentially opens up interesting opportunities for future research. Inspecting cases with a high or low count of candidates and devising different processing strategies for them might be a fruitful endeavour. In our current approach, we do not distinguish between cases with many or few candidates.
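The per-type statistics underlying figures 5.2 and 5.3 are straightforward to compute. The following Python sketch is illustrative only: it assumes the candidate generation step yields, for every pronoun occurrence with at least one compatible candidate, a pair of PoS tag and candidate count; all names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def candidate_statistics(instances, multiple_only=False):
    """Mean and standard deviation of candidate counts per pronoun type.

    `instances` is an iterable of (pos_tag, n_candidates) pairs, one per
    pronoun occurrence that has at least one compatible candidate.
    With multiple_only=True, single-candidate cases are excluded,
    mirroring the left-hand plots in figures 5.2 and 5.3.
    """
    counts = defaultdict(list)
    for pos_tag, n_candidates in instances:
        if multiple_only and n_candidates < 2:
            continue
        counts[pos_tag].append(n_candidates)
    return {tag: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0, len(vals))
            for tag, vals in counts.items()}

# Hypothetical toy input; the real numbers come from the candidate generation step.
toy = [("PPER", 6), ("PPER", 1), ("PPOSAT", 9), ("PRELS", 2), ("PRELS", 1)]
print(candidate_statistics(toy))        # all cases
print(candidate_statistics(toy, True))  # multiple-candidate cases only
```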

Overall, we saw that the entity-mention model reduces the number of candidates and the pronoun instances that have multiple candidates. To exemplify how the entity-mention model reduces candidates, consider the sentence in figure 5.4.

To resolve the relative pronoun "der2", the mention-pair model would pair the pronoun with the three morphologically compatible markables "Mann1", "seinem1", and "Hund2", thus creating three pairs (arcs with question marks above the example sentence). In the incremental entity-mention model, we would have resolved "seinem1" to "Mann1" (indicated by the dashed green arc below the example sentence) before encountering the relative pronoun "der2". Therefore, the model would only pair "der2" with "seinem1", which is the most recent and thus only accessible mention of the "Mann1" entity, but not with the "Mann1" mention (indicated by the dashed red arc), and "der2" with "Hund2".

[Figure: example sentence with the mention-pair model's pair proposals drawn above the text and the entity-mention model's links drawn below it.]

Der Mann1 gibt seinem1 Hund2, der2 bellt, einen Knochen.
The man1 gives his1 dog2, who2 barks, a bone.

Figure 5.4: Differences in the pair generation mechanics between the mention-pair and entity-mention model. Identical numbers in the subscripts denote coreference between words. Question marks (?) indicate proposed pairs, check marks (✓) pairs classified as positive, and crosses (✗) pairs not proposed by the pair generation mechanics.

Obviously, the reduction of the number of candidates in the entity-mention model compared to the mention-pair model is only beneficial when the correct antecedents are not removed. To assess whether and how frequently the entity-mention model discards the correct antecedent, we count how often the correct antecedent is accessible in both models. We do so for all pronoun instances, i.e. not only for those that have multiple candidates, in order to directly compare the models over the same set of pronouns. Table 5.4 shows the results.

        PPER     PPOSAT   PRELS    PDS      PRELAT   ALL
        (2921)   (1707)   (1402)   (167)    (59)     (6256)
M-P     87.50    94.26    89.44    84.43    84.75    89.67
E-M     85.93    93.32    89.37    82.63    84.75    88.62

Table 5.4: Accessibility of the correct antecedent in the mention-pair (M-P) and entity-mention (E-M) models (percentages) in the test set. The numbers under the PoS tags denote absolute counts.

The header row in table 5.4 indicates the PoS tags of the pronouns, and the number of anaphoric gold mentions underneath, i.e. pronouns that are annotated as anaphoric in the gold standard.10 The table cells show the percentage of instances where the correct antecedent is available in the respective models. Overall, we see that the entity-mention model removes the correct antecedent in only 1% of all pronoun instances compared to the mention-pair model.11 Also, we see that availability of the correct antecedent is limited to a significant degree in both models. Missing antecedents are caused by

problems in the markable extraction step and distance filters. The markable extraction step sometimes misses the correct NP boundary of the correct antecedent, and therefore it cannot be identified among the markables. For personal and possessive pronouns, we filter candidates more than three sentences away, which sometimes excludes the correct antecedent from the candidate set.

10 Note that we exclude cataphors, since our system does not handle them.
11 Note that there is a difference in the instance counts of the pronoun types compared to the plots in figure 5.3. For example, personal pronouns have a count of 3064 in the plot, while in table 5.4 there are only 2921 instances. The reason for the lower number in the table is that we only considered the gold mention pronouns for measuring the antecedent availability. However, to calculate the average antecedent count, we used all pronoun instances the systems resolve.

To compare our average count of antecedent candidates to related work, i.e. Wunsch (2010)12, we calculate the average over all pronouns. Wunsch (2010), p. 194, reported an average ratio of positive and negative instances of 1:4.29 over all pronouns. That is, on average, 4.29 negative instances of an antecedent candidate and a pronoun are created (and one positive instance, obviously). Wunsch investigated personal, possessive, and reflexive pronouns. By contrast, we examine personal, possessive, relative, and demonstrative pronouns. On average, we create 4.31 antecedent candidates per pronoun. For personal pronouns, Wunsch reports a ratio of 1:3.27 and for possessive pronouns a ratio of 1:6.35. Our average antecedent count for personal pronouns is 3.87, and 7.63 for possessive pronouns. Thus, we can infer that our strategies for filtering antecedent candidates perform similarly to those of Wunsch. Also, the difficulty of the problem is roughly the same in Wunsch's approach. To address the imbalance problem, Wunsch (2010) explored instance sampling, i.e. down-sampling the negative instances, to achieve a ratio of 1:2, which gave best results in his experiments. By contrast, we include the class imbalance in our feature weighting scheme, which we present in section 5.3.2.

Our analysis complements the intuition that personal and possessive pronouns are more difficult to resolve than relative pronouns. Relative pronouns always have their antecedents within the same sentence. Therefore, we need to consider fewer candidates and are less likely to make a mistake when selecting an antecedent. However, we need to take into account that we have not made sure that the correct antecedent is among the candidates in the analysis of the average antecedent candidate counts. That is, an additional problem arises when the correct antecedent is not present; a problem we cannot tackle with feature engineering. Therefore, the problem of selecting the correct candidate is further complicated by the (non-)availability of the correct antecedent. To tackle this problem, we have proposed the ARCS success rate, i.e. the ratio of correctly resolved anaphors and resolvable anaphors, in section 3.4.3. This measure helps us to weed out the cases irrelevant for our investigation of classifier performance. Before exploring performance of our classifiers with this measure, we outline a feature weighting scheme and the feature sets relevant to our experiments.

12 Schiehlen (2004) also reported average antecedent candidate counts, but only for individual filtering techniques, i.e. he gave no cumulative averages.

5.3.2 A flexible and simple scheme for feature weighting

In this section, we introduce a simple scheme for calculating feature weights and ranking antecedent candidates which combines two ideas we developed in previous work.

In Klenner and Tuggener (2010), we presented a simple maximum likelihood-based measure to rank antecedent candidates according to their grammatical functions. The salience of a grammatical function gf was calculated as the number of gold mentions bearing that function divided by the total number of gold mentions, i.e. |gold mentions with gf| / |gold mentions|, which yields a preference ranking of grammatical functions. The candidate with the highest ranking grammatical function was then selected as antecedent. If several candidates shared the same function, the most recent one was selected. We found that this simple candidate ranking approach performed surprisingly well compared to a kNN classifier with an extensive feature set.

Instead of deriving a ranking of a single feature, e.g. grammatical functions, we interpret the count ratio as a probabilistic weight and use this approach to weight all features. The candidate with the highest weight product over all features is then selected as the antecedent, much like in Naive Bayes classification. A similar strategy was introduced in Ge et al. (1998). Ge et al. calculated conditional probabilities for seeing a pronoun type p given a word w with feature a, i.e. wa. For example, to obtain a weight for the animacy feature, Ge et al. calculated the following conditional probability:

P(p | wa) = |wa is antecedent of p| / |wa|    (5.1)

This conditional probability can be interpreted as the likelihood of animate words (a = animate) to emit a certain pronoun, e.g. he. The conditional probability is calculated by dividing the count of animate words that are antecedents of he by the count of all animate words.

Our salience measure for grammatical functions can also be interpreted as such a conditional probability:

P(gf | gold mentions) = |gold mentions with gf| / |gold mentions|    (5.2)

which estimates the likelihood of seeing a grammatical function given the gold mentions, which is closely related to the formulation in Ge et al. (1998). Based on these similarities, we apply the following general formula to weight a value i of a feature x, i.e. xi, in the domain of pronoun resolution. We calculate how likely it is for a feature value to occur with the correct antecedent, or, more concisely, how likely it is to see the correct antecedent given the feature value. This is given by the following formula:

w(xi) = P(ypos | xi) ≈ |xi ∩ ypos| / |xi|    (5.3)

where |xi| = |xi ∩ ypos| + |xi ∩ yneg|, i.e. the overall occurrence count of the feature value, and where ypos indicates a correct antecedent and yneg an incorrect antecedent. To score an antecedent candidate ak for a pronoun p, we simply multiply all applicable feature weights w(xi), i.e.:

score(ak) = ∏_{xi ∈ f(ak, p)} w(xi)    (5.4)

where f(ak, p) is a feature mapping function that maps all applicable features to an antecedent candidate and a pronoun. We then select the candidate with the highest score as the antecedent.
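The weighting and scoring scheme of equations 5.3 and 5.4 can be sketched in a few lines of Python. The sketch below is illustrative only: it assumes training instances are given as lists of feature values per antecedent candidate, labelled with whether the candidate is the correct antecedent, and it omits the per-pronoun-type separation and the bias factor introduced below; all names are hypothetical.

```python
from collections import Counter
from math import prod

def estimate_weights(training_instances):
    """Equation 5.3: w(x_i) = |x_i occurs with correct antecedent| / |x_i|.

    `training_instances` is an iterable of (features, is_correct) pairs,
    where `features` is the list of feature values of one candidate.
    """
    pos, total = Counter(), Counter()
    for features, is_correct in training_instances:
        for value in features:
            total[value] += 1
            if is_correct:
                pos[value] += 1
    return {value: pos[value] / total[value] for value in total}

def score(candidate_features, weights, default=1.0):
    """Equation 5.4: product of the weights of all applicable feature values."""
    return prod(weights.get(value, default) for value in candidate_features)

# Toy usage with two feature values: grammatical function and sentence distance.
train = [(["gf=SUBJ", "dist=0"], True), (["gf=OBJA", "dist=0"], False),
         (["gf=SUBJ", "dist=1"], False)]
w = estimate_weights(train)
print(score(["gf=SUBJ", "dist=0"], w))
```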

However, there is a complication. In Tuggener and Klenner (2014), we introduced features that only apply to certain contexts, i.e. certain combinations of antecedent candidates and pronouns. For example, we introduced the named entity class of the antecedent as a feature. This feature only applies if the antecedent at hand actually denotes a named entity (NE). The feature does not apply to common noun entities, because they have no named entity class. In Tuggener and Klenner, we showed how the first-order predicate logic formulas in Markov Logic Networks (MLN) can be used to constrain features to such contexts and that it is beneficial to do so. Therefore, we aim to port the expressiveness of MLN formulas to our simpler weighting scheme.

To do so, we have to address the problem that not all candidates have the same number of applicable features. Using equation 5.3, all weights are in the range 0 < w(xi) < 1, since we take a conditional probability as the feature weight. When multiplying the weights, as in equation 5.4, each additional feature lowers a candidate's score, i.e. a candidate is always punished for having more applicable features than another candidate.

Consider again the named entity (NE) class feature. Let us assume we want to resolve an instance of the pronoun er (he) and we see two candidates, Obama and Pfannkuchen (pancake), both of which occur as subjects with weight 0.5. Now, we add the weight for the NE class, which only applies to the Obama candidate. Obama has the NE class PER (person), and we have calculated a high weight for this class, i.e. 0.8. However, since we multiply the weights of a candidate to calculate its score, the overall score for the Obama candidate is lowered by the high feature weight of PER, as depicted below.

              Grammatical function   Named entity class   Product
Obama         0.5                    0.8                  0.4
Pfannkuchen   0.5                                         0.5

This is the opposite of the desired outcome when introducing specific features for certain candidates.

Alternatively, we could take the sum of the weights to score the candidates. However, having additional features in that case always increases the score of a candidate, but we want to capture the impact of additional features w.r.t. their ability to identify the correct antecedent. For example, the NE class LOC (location) should have a low weight for personal pronouns. However low this weight is in the range 0 < w(xi) < 1, it always increases the overall score of a candidate when adding it to the weight sum. Thus, summing the weights is not a viable alternative.

Our solution to handling weights that only apply to certain contexts lies in adjusting the weight range so that ‘good’ feature values (e.g. the NE class PER for antecedent candidates for personal pronouns) increase the weight product, and ‘bad’ features (e.g. the NE class LOC for candidates for personal pronouns) lower the product. We achieve this by incorporating a bias factor into formula 5.3 that allows weights to be larger than 1. Ideally, ‘good’ feature values obtain weights larger than 1 and thereby increase the weight product, while the weights of ‘bad’ feature values range below 1 and therefore decrease the weight product. Features with no discriminatory power w.r.t. antecedent selection should have weight 1 and thus not affect the weight product. To achieve this, we need to scale up the weights by a constant term, and for this purpose we make use of the class imbalance of the training data.

For simplicity of exposition, let us assume that our incremental entity-mention model generates 5 antecedent candidates per pronoun on average. Recall that there is always only one correct antecedent among the candidates. Thus, the prior probability of a feature occurring with the correct antecedent is the count of correct antecedents divided by all antecedents, i.e. |Ypos| / (|Ypos| + |Yneg|) = 1/5 on average.13 We can use the inverse of this imbalance as the bias factor to shift the weights to the desired range outlined above. To do so, we multiply the inverse of the imbalance with the weight of a feature obtained in equation 5.3:

13 Note that Ypos and Yneg denote all correct and incorrect antecedents, respectively, while ypos and yneg denote correct and incorrect antecedents that occur with feature xi.

w(xi) = (|xi ∩ ypos| / |xi|) ∗ ((|Ypos| + |Yneg|) / |Ypos|)    (5.5)

A non-informative feature now obtains a weight of 1. Consider a feature that occurs with all antecedent candidates in every training instance. The conditional probability of seeing the correct antecedent given the feature as calculated by formula 5.3 equals the prior probability of seeing the correct antecedent, i.e. 1/5. If we take this probability as the weight and multiply it with the inverse prior probability of seeing the correct antecedent, i.e. 5/1, the weight becomes 1 and thus does not affect the feature weight product, i.e. the score of a candidate. All features that score a conditional probability above 1/5 w.r.t. emitting the correct antecedent will obtain weights higher than 1, and all features with a conditional probability below 1/5 will yield weights below 1 after multiplication with the bias factor. Thus, the inverse of the class imbalance provides us with a bias term that produces the weight range we desire.

Returning to our previous examples, the weights and their products now look as follows. We have introduced the class bias of 5, i.e. all weights are multiplied by 5. Incorporating the named entity class feature now has the desired effect, i.e. increasing the weight of the NE candidate with class PER:

              Grammatical function   Named entity class   Product
Obama         0.5 ∗ 5 = 2.5          0.8 ∗ 5 = 4.0        10
Pfannkuchen   0.5 ∗ 5 = 2.5                               2.5

Obviously, multiplying weights with a constant like the inverted prior probability does not affect comparison of two candidates that have values for an identical set of features. However, it allows us to include specific features for certain individual candidates, such as named entities, and assign weights that reflect their benefit in identifying the correct antecedent.
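A minimal sketch of the bias factor from equation 5.5, reproducing the Obama/Pfannkuchen example above under the assumed average of one correct and four incorrect candidates per pronoun (i.e. a bias factor of 5); the helper names are illustrative.

```python
def biased_weight(raw_weight, n_pos, n_neg):
    """Equation 5.5: scale the conditional probability (equation 5.3) by the
    inverse prior probability of the correct-antecedent class."""
    return raw_weight * (n_pos + n_neg) / n_pos

# Worked example from the text: 1 correct vs. 4 incorrect candidates on average.
bias = lambda w: biased_weight(w, 1, 4)

obama = bias(0.5) * bias(0.8)   # grammatical function * NE class PER
pfannkuchen = bias(0.5)         # grammatical function only
print(obama, pfannkuchen)       # 10.0 vs. 2.5, as in the table above
```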

Note that Naive Bayes classification also multiplies conditional probabilities as feature weights, which yields a weight range of 0 ≤ w(xi) ≤ 1. Thus, it is not possible to have specific feature sets for certain candidates for the reason outlined above (additional weights always decrease the weight product). Also, unlike in Naive Bayes, our feature weights do not represent a probability distribution over the classes, since our weight range is 0 ≤ w(xi) ≤ (|Ypos| + |Yneg|) / |Ypos|. Furthermore, Naive Bayes classification also uses the prior class probability in the weight product (in multinomial classification, at least). By contrast, we include the inverse of the prior class probability of only one class (the correct antecedents), which we use to uniformly shift the weight range of all features in a binary classification setting (incorrect vs. correct antecedent).

In summary, our weighting scheme first assigns weights to feature values based on the probability that they occur with the correct antecedent. The weights are then scaled in order to obtain a weight range that allows us to incorporate arbitrary weights for individual candidates. We calculate weights in this manner for each pronoun type (personal, possessive, etc.) and use the weighting scheme as a baseline for other machine learning classifiers in the evaluation throughout the following sections.

5.3.3 Feature set for pronoun resolution for German

In chapter 4, we have discussed related work on German coreference and pronoun resolution. We argued that none of the approaches introduced features to address the specifics of German pronouns, namely morphological underspecification of certain pronouns (sie/ihr, sein, die) and the circumstance that certain German pronouns (er/sein, sie/ihr) can refer to both animate and inanimate entities in comparison to e.g. English, where singular pronouns with gender generally refer to persons (he/his, she/her).

We have shown in section 2.3 how the incremental entity-mention architecture disambiguates morphological underspecification. In Tuggener and Klenner (2014), we have explored a feature set that incorporates animacy and named entity classes to address the animacy ambiguity and evaluated the impact of these features. Here, we present an extended feature set which featurizes morphology to address underspecification and introduces feature conjunctions that further improve classifier performance. Table 5.5 gives an overview of the two feature sets used in our experiments.

The standard feature set is aimed at collecting features commonly used in mention-pair-based related work on German pronoun resolution (Strube et al., 2002, Kouchnir, 2004, Hinrichs et al., 2005, Wunsch, 2010).14

We discuss the extended set in detail in the upcoming sections. It presents a mixture of features and feature conjunctions based on linguistic theory and features designed while investigating errors the classifiers made on the development set.

We also show example weights calculated by our weighting scheme outlined in the previous section. The weight of a feature is calculated as the ratio of the feature occurring with the correct antecedent divided by the total count of the feature's occurrences, multiplied by the inverse prior probability of seeing a correct antecedent candidate. That is, if we divide a weight by this inverse prior probability, the result denotes the ratio of the feature occurring with a correct antecedent. A benefit of our weighting scheme is thus that we can inspect feature weights to straightforwardly evaluate linguistic intuitions. In this light, we give example weights for several instantiations of features we add to the standard set.

14 Note that we do not add the PoS tag of the pronoun to the set, since we calculate the feature weights for each pronoun type separately. Also, we do not include string matching features in the set, since they are mainly aimed at capturing coreference between nominal mentions.

Standard feature set
  Sentence distance            Sentence distance between antecedent and pronoun
  Markable distance            Distance in markables between antecedent and pronoun
  Gram. funct. antecedent      Grammatical function of the antecedent
  Gram. funct. parallel        Grammatical function parallelism between the antecedent and the pronoun (yes/no)
  PoS antecedent               PoS tag of the antecedent
  Def. antecedent              Definiteness of the antecedent (induced from determiner)

Extended feature set
  Sentence distance w.r.t. connector   Sentence distance w.r.t. presence of discourse connector
  Markable distance                    Markable distance, only if antecedent and pronoun are in the same sentence
  Candidate index                      Index of the antecedent in the candidate list
  Gram. funct. antecedent              Grammatical function of the antecedent
  Sequence gram. funct.                Sequence of grammatical function of antecedent to pronoun, including sentence distance and PoS of antecedent
  Sequence gram. funct. PPOSAT head1   Sequence of grammatical function of antecedent to possessive pronoun's head, including sentence distance and PoS of antecedent
  Sequence gram. funct. PPOSAT head2   Same as PPOSAT head1, only applies when the possessive pronoun's head is governed by the same verb as the antecedent
  Preposition                          Preposition if the antecedent is in a PP
  Clause type                          Clause type of the antecedent, determined by the grammatical role of the verb governing the antecedent
  Clause type sequence                 Sequence of clause type of antecedent to clause type of pronoun, conditioned on whether antecedent and pronoun are in the same sentence
  Number                               Number of the antecedent
  Gender                               Gender of the antecedent
  Animacy                              Animacy of the antecedent conjoined with its number and gender
  NE class                             NE class, only if the antecedent is an NE

Entity-mention features
  Entity age                           If the antecedent is in a coreference chain, distance of the entity-introducing sentence to the document beginning
  Discourse status                     Whether the antecedent is in a coreference chain or not

Table 5.5: Feature sets used for pronoun resolution.

5.3.3.1 Distance-based features

The first modification made to the standard feature set is to relate sentence distance to the presence of a discourse connector (Sentence distance w.r.t. connector). We observed that pronouns preceded by a discourse connector such as weil (because), aber (but) etc. tend to bind to intra-sentential antecedents. Discourse connectors in sentences contribute to cohesion by logically or rhetorically connecting two clauses of a sentence through a relation like explanation or contrast. Intuitively, pronominalized entities on the right hand side of the connector are therefore likely to refer to antecedents on the left side of the connector to which the right hand side is bound by the connector. To verify this intuition, we inspect the weights calculated for sentence distance values w.r.t. the presence of a connector for personal pronouns. Table 5.6 shows the weights.

Sentence distance   -connector   +connector
0                   2.41         2.42
1                   1.42         0.56
2                   0.26         0.10
3                   0.13         0.05

Table 5.6: Weights for sentence distances between antecedent and personal pronouns with (+) and without (-) the presence of a discourse connector.

We see that both features favor intra-sentential candidates (distance 0) equally. However, the weight for candidates from the previous sentence (distance 1) is already below 1 for the connector-conditioned weight (+connector; 0.56). This diminishes the overall weight product of candidates outside the sentence of the pronoun. Candidates in the preceding sentence for pronouns without a connector still get a weight boost (-connector; 1.42). Thus, the weight analysis seems to support our intuition about relating sentence distance to the presence of a discourse connector.

The next standard feature we modify is the markable distance between an antecedent and a pronoun (Markable distance). We only apply this feature to candidates within the same sentence as the pronoun. We empirically found that allowing the feature to trigger for inter-sentential candidates lowers performance. Allowing the feature for inter-sentential candidates produces many sparse features, which, when triggered, give a high weight to their candidates. This deteriorated performance to a small degree. However, limiting the feature to intra-sentential candidates resolved the issue.

The candidate index feature (Candidate index) is a novel feature which captures a yet unexplored notion of markable distance. It enumerates the antecedent candidates for a pronoun from right to left, i.e. the closest candidate has index 0, the second closest 1, etc. The intuition behind this feature is that distance between a pronoun and its candidates should be measured in terms of compatible markables and thus relevant markables, instead of based on all intermediate markables. Doing so, the number of different feature values becomes lower than for the markable-based distance feature, as it does not count intervening, incompatible and therefore irrelevant markables between a pronoun and its candidates. Table 5.7 compares the weights for the candidate index and markable distance feature values for personal (PPER) and possessive pronouns (PPOSAT).

          Distance   Candidate index weight   Markable distance weight
PPER      0          2.93                     2.79
          1          1.08                     2.93
          2          0.39                     2.59
          3          0.17                     2.20
PPOSAT    0          5.06                     4.95
          1          2.01                     4.21
          2          0.82                     3.06
          3          0.35                     2.00

Table 5.7: Comparison of the candidate index and markable distance feature weights.

As indicated by table 5.7, the weight for markable distance for personal pronouns (last column) only slowly decreases when distance increases, and the weights do not vary strongly. By contrast, the candidate index weights react more directly to increasing distance, i.e. a candidate with index 2 already receives a weight below 1 which decreases its score. The same applies for possessive pronouns. Interestingly, the closest candidate receives a much higher weight than for personal pronouns (5.06 vs. 2.93). Also, the weights for markable distance decrease more quickly for increasing distance for the possessive pronouns. This suggests that proximity is a more relevant factor for the resolution of possessive pronouns than for personal pronouns. We investigate this point again when we measure performance of the baseline which selects the most recent candidate as antecedent in section 5.4.2.
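The candidate index can be derived directly from the list of compatible candidates. The sketch below is a hypothetical illustration: it assumes each candidate record carries an identifier and a surface position (e.g. a markable index); both field names are assumptions.

```python
def candidate_index_feature(candidates, pronoun_position):
    """Enumerate compatible antecedent candidates from right to left:
    the candidate closest to the pronoun receives index 0, the second
    closest index 1, and so on. Only candidates preceding the pronoun
    are considered.
    """
    preceding = [c for c in candidates if c["position"] < pronoun_position]
    preceding.sort(key=lambda c: c["position"], reverse=True)
    return {c["id"]: idx for idx, c in enumerate(preceding)}

# Toy usage with hypothetical markable positions.
cands = [{"id": "Mann", "position": 2}, {"id": "Hund", "position": 5}]
print(candidate_index_feature(cands, pronoun_position=7))  # {'Hund': 0, 'Mann': 1}
```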

5.3.3.2 Feature conjunctions derived from the entity grid representation of discourse

The next family of features is derived from the entity grid representation of discourse (Barzilay and Lapata, 2008) which models local coherence in texts. The entity grid representation itself is based on Centering theory (Grosz et al., 1995) which models local coherence by tracking sequences of grammatical functions that entities subsequently occur with in coherent discourse. Table 5.8 depicts the example grid and text given by Barzilay and Lapata (2008), p. 6. Note that the inventory of grammatical functions is reduced in the entity grid representation (S=subject, O=direct object, X=all other, -=entity not present).

1 [The Justice Department]S is conducting an [anti-trust trial]O against [Microsoft Corp.]X with [evidence]X that [the company]S is increasingly attempting to crush [competitors]O.
2 [Microsoft]O is accused of trying to forcefully buy into [markets]X where [its own products]S are not competitive enough to unseat [established brands]O.
3 [The case]S revolves around [evidence]O of [Microsoft]S aggressively pressuring [Netscape]O into merging [browser software]O.

    Department  Trial  Microsoft  Evidence  Competitors  Markets  Products  Brands  Case  Netscape  Software
1   S           O      S          X         O            -        -         -       -     -         -
2   -           -      O          -         -            X        S         O       -     -         -
3   -           -      S          O         -            -        -         -       S     O         O

Table 5.8: Entity grid representation of entity occurrences in discourse.

The basic assumption of the entity grid model is that there exist certain regularities in these sequences (i.e. the columns of the grid in table 5.8) which interplay with coherence. Regularities in the sequences can be learned from coherent texts and then be used to score coherence in other texts, such as summaries.

The entity grid can also be interpreted as a representation of the coreference partition of a document, since it depicts all occurrences of the entities in the document (singletons and coreferent ones). Based on this view, we create feature conjunctions containing the grammatical roles of the antecedent candidates and the pronoun. That is, we model how likely the different transitions of the grammatical functions from the antecedent candidates to the pronouns are. Since we resolve pronouns in their local context and do not track all individual mentions of coreferent entities, we create bigram sequences, i.e. we do not query grammatical functions of antecedent entities that have occurred multiple times before the pronoun. However, we do not simplify the grammatical roles as Barzilay and Lapata (2008) do, but use the roles as encountered in the data.

We make two additions to the feature conjunction (Sequence gram. funct.). First, we add the sentence distance between antecedent candidate and pronoun. The example grid in table 5.8 shows that certain transitions only occur in relation to a certain sentence distance, e.g. we only see the transition [X,O] for the “Evidence” entity with an intervening sentence, i.e. the original transition is [X,-,O]. Since the coreference partition does not track non-occurrence of entities, as the entity grid does, this information is lost. In other words, if we learned transitions from the coreference partitions in the training data without relating them to sentence distance, we would extract the pattern [X,O] for the “Evidence” entity. However, the pattern occurs here with an intervening sentence. Therefore, we add sentence distance to the conjunction of the grammatical roles. In the case of the “Evidence” entity, the feature conjunction is then [2,X,O].

Second, we add the PoS tag of the antecedent candidate to the feature conjunction. The PoS tag of an entity mention reflects the givenness of an entity. The theory on antecedent accessibility by Ariel (1988) reflects this notion by relating the surface form of an entity mention to its degree of being familiar at a given point in discourse. This yields a hierarchy of surface manifestation forms, which can be coarsely adapted to PoS tags (Martschat and Strube, 2014, e.g.): named entity, common noun, and pronouns.15 The PoS tag can be thought of as a relativization for the other features in the conjunction. For example, a transition of grammatical roles given a high sentence distance might be more likely to indicate coreference if the antecedent candidate is a named entity, rather than a pronoun.

15 For example, if a discourse refers to Barack Obama, it uses a named entity mention if Obama has not been recently mentioned. If Obama has been mentioned very recently, a pronoun can be used for reference. It could use “the president” if the entity is not completely out of focus, etc.

We experimented with different representations of the aspect of familiarity (e.g. definiteness and coreferential status of the antecedent) and distance (sentences, markables) and found that the conjunction of sentence distance, PoS tag, and the transition of grammatical roles performs best. The final form of the feature conjunction for the “Evidence” entity from the above example would thus be [2,X,O,NN] when viewed from sentence three: It is two sentences away, it occurred with the grammatical role PN (X), now it is mentioned as an object (O), and its last occurrence was a nominal mention (NN).

Table 5.9 lists the 5 highest weighted features derived from the conjunctions for personal pronouns. As the feature conjunction yields many sparse features, we only list those that are seen at least 50 times during training. We see that the feature also captures parallelism of grammatical roles, a feature in the standard set, but calculates weights for parallelism of specific roles and sentence distances.

Weight   SD   GF A   GF P   PoS A
3.92     0    SUBJ   OBJD   PPER
3.91     0    SUBJ   SUBJ   PPER
3.87     0    OBJA   OBJA   PPER
3.87     0    PN     SUBJ   PPER
3.86     0    SUBJ   SUBJ   PPOSAT

Table 5.9: Top weighted instantiations of the feature conjunction of sentence distance (SD), grammatical role transition from antecedent (GF A) to pronoun (GF P), and PoS tag of the antecedent (PoS A) for personal pronouns.
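Constructing such a conjunction amounts to concatenating the relevant attributes of an antecedent candidate and a pronoun. The sketch below is illustrative only; it assumes mentions are available as dictionaries with sentence number, grammatical function, and PoS tag, and the field names are hypothetical.

```python
def grid_conjunction(antecedent, pronoun):
    """Conjunction of sentence distance, grammatical-role transition and the
    antecedent's PoS tag, e.g. (2, 'X', 'O', 'NN') for the "Evidence" entity
    viewed from sentence three.
    """
    distance = pronoun["sentence"] - antecedent["sentence"]
    return (distance, antecedent["gf"], pronoun["gf"], antecedent["pos"])

# Hypothetical mention records for the "Evidence" example.
ante = {"sentence": 1, "gf": "X", "pos": "NN"}
pron = {"sentence": 3, "gf": "O"}
print(grid_conjunction(ante, pron))  # (2, 'X', 'O', 'NN')
```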

For possessive pronouns, we model two specific configurations. Since the grammatical role of possessive pronouns is DET (determiner), we are not able to further diversify weight calculation on the syntactic context of possessive pronouns. Therefore, we additionally calculate a weight for the grammatical transition of the antecedent candidate to the syntactic head of the possessive pronoun (Sequence gram. funct. PPOSAT head1). Furthermore, we calculate weights for the specific cases where the head of the possessive pronoun is governed by the same verb as the antecedent candidate (Sequence gram. funct. PPOSAT head2). For example, consider the following segment from the test set:

(5) Landesvorsitzende Wedemeier: Ein Buchungsfehler. Im Januar hat die Arbeiterwohlfahrt1 Bremen ihren1 langjährigen Geschäftsführer Hans Taake fristlos entlassen [...]

Regional chairwoman Wedemeier: An accounting error. In January, the Worker Welfare Association1 Bremen has laid off its1 long-term CEO Hans Taake without notice [...]

For the possessive pronoun ihren, we create the feature for the two compatible antecedent candidates Landesvorsitzende Wedemeier and Arbeiterwohlfahrt Bremen. The feature for the former is [1,ROOT,OBJA,NE], since the candidate is in the previous sentence, its grammatical role is ROOT, the possessive pronoun's head role is direct object (OBJA) and the candidate is a named entity. For the second candidate, Arbeiterwohlfahrt Bremen, we also instantiate the feature, i.e. [0,SUBJ,OBJA,NN]. Additionally, we create this feature without the sentence distance, i.e. [SUBJ,OBJA,NN], since the Arbeiterwohlfahrt Bremen candidate is governed by the same verb as the possessive pronoun's head langjährigen Geschäftsführer Hans Taake, i.e. entlassen. Doing so, we intend to capture specific transitions from an antecedent to the possessive pronoun's head, given that both are governed by the same head. Table 5.10 lists weights for the most frequent instantiations of the feature.

Weight   GF A   GF P   PoS A
7.97     SUBJ   OBJA   PPER
7.86     SUBJ   PN     PPER
7.69     SUBJ   OBJA   NN
6.07     SUBJ   PN     NN
0.54     PN     PN     NN

Table 5.10: Weights for the most frequent instantiations of the feature conjunction of grammatical function transition from antecedent (GF A) to pronoun (GF P) and PoS tag of antecedent (PoS A) for possessive pronouns, specifically for cases where the possessive pronoun’s head is governed by the same verb as the antecedent candidate.

The table shows that the transition evoked by the second candidate in our example (Arbeiterwohlfahrt Bremen) is very frequent and receives a high weight. The weight suggests that when possessive pronouns are determiners of direct objects, the likelihood of these pronouns referring to the subject of the verb governing the direct objects is high. By contrast, the likelihood of possessive pronouns whose heads are in prepositional noun phrases to refer to antecedents in prepositional noun phrases governed by the same verb as the pronouns' head is rather low (last row).

5.3.3.3 Approximation of syntactic embedding

Related work has often used depth of embedding of the antecedent candidate as a feature. The feature denotes numerically the depth of embedding of the clause that the candidate occurs in, i.e. the main clause has depth 0 etc. Since we work with dependency parses and do not transform them into constituents, we approximate the feature by taking the grammatical function of the verb governing the candidate as a feature. This also captures how antecedent candidates and pronouns are embedded in the sentences on the clause level. The feature value for Clause type is thus the grammatical function of the verb governing the antecedent candidate. The intuition behind the feature is that antecedents governed by verbs in subclauses, such as relative clauses, are less salient and therefore less likely to be pronominalized than candidates governed by root verbs. The weights seem to support this intuition. Candidates in main clauses (root: 1.07) are more likely to be pronominalized compared to candidates in relative clauses (rel: 0.69).

Additionally, we create a feature conjunction of the clause type of the antecedent candidate and the clause type of the pronouns and relate it to whether the candidate and the pronoun are in the same sentence (Clause type sequence). We assume that there are typical patterns of antecedents in e.g. root clauses that are pronominalized in subclauses in the same sentence. The weights indicate that there are such regularities, e.g. the weight for arguments of root verbs acting as antecedents of arguments in subclauses in the same sentence is relatively high (sentence distance=0, root→neb: weight=2.37), whereas arguments of verbs in object clauses in the previous sentence are unlikely to be antecedents of pronoun arguments of verbs in subclauses in the current sentence (sentence distance=1, objc→neb: weight=0.18) etc.

5.3.3.4 Semantic and morphosyntactic features of the antecedent

German pronouns can be used to refer to inanimate entities, as opposed to e.g. English, where pronouns with gender refer to animate entities (he, she, his, her). This introduces an additional layer of ambiguity, and the set of candidates that needs to be considered for German pronouns becomes larger.

In Tuggener and Klenner (2014), we introduced two features to account for this ambiguity. We added a feature conjunction of animacy and gender, and the feature named entity class to the set. Here, we slightly modify the animacy feature by adding number to the conjunction. As the conjunction yields sparse instantiations, we show the weights for those instantiations for personal pronouns which occur at least 50 times in table 5.11.

Weight   Animacy   Gender   Number
1.95     ANIM      FEM      SG
1.50     ANIM      MASC     SG
1.19     ANIM      *        PL
0.97     INANIM    *        *
0.87     ANIM      *        SG
0.80     INANIM    *        PL
0.43     INANIM    FEM      SG
0.28     INANIM    MASC     SG

Table 5.11: Weights for the most frequent instantiations of the feature conjunction on animacy, gender, and number for antecedents of personal pronouns. ’*’ indicates underspecified values.

The weight distribution shows that candidates with singular number and specific gender which are animate ([ANIM, FEM, SG]: 1.95, [ANIM, MASC, SG]: 1.50) are more likely to be pronominalized than their inanimate counterparts ([INANIM, FEM, SG]: 0.43, [INANIM, MASC, SG]: 0.28). This weight distribution can be seen as a reflection of the topics of the underlying discourse in the training data. Newspaper articles often report on person entities, such as political figures, which therefore are likely to be pronominalized. We exploit this bias by featurizing it and thereby addressing the animacy ambiguity of German pronouns.

The second feature introduced in Tuggener and Klenner (2014) is the named entity class of the antecedent candidate. We add this feature without any modification, but only count and apply it when the candidate is actually a named entity. Table 5.12 shows named entity class weights for all pronouns.

                    PPER   PPOSAT   PRELS   PDS    PRELAT
PERSON              1.84   2.93     0.90    1.24   1.09
ORGANIZATION        0.80   1.09     0.83    1.00   1.14
GEO-POL. ENTITY     0.45   0.43     0.58    0.30   0.69
LOCATION            0.15   0.17     0.56    0.65   0.07
OTHER               0.48   0.52     0.71    0.12   0.07

Table 5.12: Weights for named entity classes (rows) of antecedent candidates per pronoun type (columns).

The table indicates that personal and possessive pronouns tend to bind to person entities (PER), which corresponds with the weights of the animacy features. Except for organizations (ORG), the other named entity types (GPE=geopolitical entity, LOC=location, OTH=other) are weighted low, meaning their overall weight is decayed by the named entity class feature.

While working on the development set, we found that calculating prior weights for number and gender features also helps performance. Therefore, we add number and gender of the antecedent candidates as single features to the set.

Finally, we add the preposition of antecedent candidates in prepositional phrases as a feature. Candidates in PPs generally receive a low weight. However, we observed that different prepositions tend to affect salience to different amounts. Table 5.13 shows the five top and lowest weighted prepositions for personal and possessive pronouns.

Personal pronouns          Possessive pronouns
Weight   Preposition       Weight   Preposition
0.69     für               0.60     gegenüber
0.58     neben             0.52     zwischen
0.46     gegen             0.47     neben
0.42     von               0.44     für
0.33     bei               0.38     gegen
0.08     aus               0.01     trotz
0.04     nach              0.01     während
0.04     in                0.01     ohne
0.01     während           0.00     seit
0.00     seit              0.00     vor

Table 5.13: Top and lowest five weights for prepositions of antecedent candidates for personal pronouns (left) and possessive pronouns (right).

The table shows that all preposition weights are below 1, which diminishes the candidates' overall weight products. However, there is a large difference between the top and lower weights. Amongst the lower weights, we see that “während” (while) and “seit” (since), two prepositions denoting PPs related to periods of time, are shared by the personal and possessive pronouns.

5.3.3.5 Features made available by the entity-mention model

The last two additions in the extended feature set are enabled by the incremental architecture of the entity-mention model. The Discourse status feature indicates whether a candidate is already in a coreference chain (discourse-old) or stems from the buffer list of non-anaphoric markables (discourse-new). The Entity age feature measures how “old” the discourse-old entities are that appear as antecedent candidates of pronouns. That is, the feature only triggers when the antecedent candidate is part of a coreference chain. The value for the Entity age feature is calculated by subtracting the sentence number of the first mention of the candidate's coreference chain from the sentence number of the first markable in the document. The value of the Discourse status feature is a binary one (discourse-old, discourse-new). The intuition behind these features is that entities introduced early in the discourse (e.g. in headlines) are likely to appear frequently throughout the discourse and are, therefore, likely to be pronominalized (Mitkov, 1998, Uryupina, 2007, inter alia). Furthermore, Strube and Hahn (1999) showed that in the Centering framework, determining salience of antecedent candidate entities based on information status (hearer new vs. hearer old) instead of grammatical functions improved pronoun resolution in their evaluation.
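As a hedged sketch of how the two entity-mention features could be computed: the code below assumes the incremental model exposes, for each candidate, the id of the coreference chain it currently belongs to (if any) and the sentence numbers of that chain's mentions; the age is the distance of the chain's introductory sentence from the document's first markable, following the description in table 5.5. All field and function names are hypothetical.

```python
def entity_mention_features(candidate, chains, first_markable_sentence):
    """Discourse status and entity age of an antecedent candidate.

    `chains` maps a chain id to the list of sentence numbers of its mentions
    so far; candidates without a chain id are discourse-new.
    """
    chain_id = candidate.get("chain_id")
    if chain_id is None:
        return {"discourse_status": "new"}
    introduced = min(chains[chain_id])
    return {"discourse_status": "old",
            "entity_age": introduced - first_markable_sentence}

# Toy usage: a candidate whose chain was introduced in the headline (sentence 0).
chains = {"e1": [0, 2, 4]}
print(entity_mention_features({"chain_id": "e1"}, chains, first_markable_sentence=0))
```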

In Klenner and Tuggener (2010), we reported that other features derived from the entity-mention model, such as the length of the chain that a discourse-old candidate belongs to, had ambivalent impact on performance. During our experiments on the development set, we found that only the two features reported above increased performance overall.

5.4 Evaluation of pronoun resolution performance

In this section, we compare performance of the mention-pair and the entity-mention architecture and evaluate the impact of different feature sets discussed in section 5.3.3 for both models.16 Thereafter, we validate our approach for weight calculation introduced in section 5.3.2 by comparing it to three machine learning approaches with different expressiveness. At the same time, these machine learning approaches serve as implementations of the different antecedent selection strategies discussed in chapter 2.

We first compare the models and apply several measures in order to determine the strengths of the classifiers and system performance overall. We start by comparing the classifiers in the mention-pair and the entity-mention models. Beforehand, we discuss our take on statistical significance testing of differences in evaluation scores.

5.4.1 A remark on statistical significance testing

In the coreference resolution literature, researchers often perform statistical significance tests if improvements over a baseline are small or seem marginal. A significance test then serves as an instrument to substantiate the validity of the presented work. However, there are several problems involved.

A fundamental issue is how samples are defined. The tests are usually applied to Recall and Precision figures. For example, to calculate the statistical significance of the differences in Recall scores of two system responses, the document-wise Recall scores of the responses are compared in a contingency table. That is, the statistical significance of the difference regarding the overall Recall between two systems is assessed by comparing their document-wise Recall scores in a pair-wise manner. This setup is problematic, since the overall Recall is not simply the average of the document-wise Recall figures. With regard to pronouns, for example, a short document might only contain one pronoun and one system resolves it correctly. This yields a 100% Recall. Another document might contain 100 pronouns and the system correctly resolves half of them, i.e. Recall is 50%. Averaging the document-wise Recall figures would yield an overall Recall of 75%, although the system has only resolved 51 of the 101 pronouns in the test set. The second system under scrutiny would incorrectly resolve the pronoun in the short document, but also resolve 50% of the pronouns in the longer document. This would yield Recall figures of 0% for the first and 50% for the second document. In a document-wise comparison, the systems would then be quite different, i.e. 100% vs. 0% Recall for the first document and 50% vs. 50% for the second document.17 Therefore, there is no direct correspondence between document-wise and overall Recall figures. The common way of establishing statistical significance of differences in system responses is therefore problematic and not applicable for our purposes.

16 We here evaluate full sets of features. For an evaluation of the impact of individual features, see Tuggener and Klenner (2014).
17 Naturally, the overall Recall is calculated on the global statistics in evaluation, i.e. in this example 51/101 = 50.5% and 50/101 = 49.5%, respectively.

An alternative is to establish the system differences on the level of the mentions instead of the documents. This is exactly what the ARCS system difference metric does. The metric takes as input a key and two system responses. It then iterates over all mentions, i.e. the gold mentions and the mentions in the system responses, and compares the classifications in the system responses. The metric counts how often a mention that is classified as e.g. a wrong linkage (WL) in one response is counted as a true positive (TP) in the other response, etc.18 The percentage of mentions classified differently in one response w.r.t. the other then serves as the difference estimate.

18 Cf. 3.2.1

The question now is how we should use this approach to calculate statistical significance of measured improvements, i.e. the better performance of a system compared to a baseline. When we compare system responses w.r.t. resolution performance, we neglect the fine-grained mention class distinction of the ARCS framework. That is, we are not interested in what kind of error a system makes regarding a mention (i.e. WL, FN, FP) but rather whether it makes an error or not. Thus, we map the fine-grained mention classification to a binary one, which is 1 if the mention is classified as correctly processed (TP, TN) and 0 otherwise (WL, FN, FP). We can then construct an n×2 contingency table where the first column lists the binary classification of the n mentions in the first system response and the second column their binary classification in the second response:

mention ID   class in sys1   class in sys2   transition
mention 1    1               1               tp→tp
mention 2    0               1               fn→tp
mention 3    1               0               tp→wl
mention 4    0               0               fp→fp
mention 5    0               1               fp→tn
...          ...             ...             ...

Now, we can straightforwardly apply the t-test for paired samples over the two class columns. As we measure performance on the mention level, we can test significance on any measured performance change, i.e. on a specific PoS level or for a certain lemma.
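As a sketch, the test over the aligned binary outcomes can be run with an off-the-shelf paired t-test, e.g. from SciPy; the mention-level outcome vectors are assumed to have been produced by the ARCS comparison described above, and the toy values below are purely illustrative.

```python
from scipy.stats import ttest_rel

def significance_of_difference(outcomes_sys1, outcomes_sys2):
    """Paired t-test over aligned mention-level binary outcomes
    (1 = correctly processed, i.e. TP/TN; 0 = erroneous, i.e. WL/FN/FP)."""
    result = ttest_rel(outcomes_sys1, outcomes_sys2)
    return result.statistic, result.pvalue

# Toy usage with hypothetical, aligned mention classifications.
sys1 = [1, 0, 1, 0, 0, 1, 1, 0]
sys2 = [1, 1, 0, 0, 1, 1, 1, 1]
print(significance_of_difference(sys1, sys2))
```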

However, it is noteworthy that the test cannot make statements specific to Recall and Precision changes because the error types are obscured. The test rather assesses the statistical significance of the changes in the binary classification scheme. We will indicate whether two responses vary significantly w.r.t. the test in the following section.

5.4.2 Classifier performance

In this section, we compare the performance of the classifiers in the entity-mention and the mention-pair models. The comparison is based on classifiers relying on weights calculated with our approach described in section 5.3.2. The classifiers score each candidate of a given pronoun, and we count how often the correct candidate is scored highest (i.e. is chosen as the antecedent) using the ARCS success rate19, i.e. |correctly resolved pronouns| / |resolvable pronouns|. That is, we only evaluate on pronoun instances where the correct antecedent is among the candidates, since punishing the classifiers for making faulty decisions when they are not able to make the correct one introduces noise into the analysis. We saw in table 5.4 that the count of these instances where the correct antecedent is reachable is roughly the same for the mention-pair and the entity-mention models.

To put the performance of the classifiers into perspective, we establish the following five baselines commonly used in related work:

1. Random candidate: This is, of course, a rather crude baseline, but it transforms the analysis of the antecedent counts into performance scores. For example, given that the average number of candidates for personal pronouns is roughly 4, we can expect a resolution performance of 25% accuracy from the random baseline.

19 Cf. section 3.4.3

2. Most recent candidate: This baseline selects the closest compatible candidate to the left of the pronoun as antecedent. Proximity has always been deemed an important factor in pronoun resolution which makes this a reasonable baseline often used in related work.

3. Most recent subject candidate: Hinrichs et al. (2005) and Wunsch (2006) found that the grammatical role of the antecedent candidates is one of the major features for identifying salient entities which are likely antecedents for pronouns, especially for German. That is, Hinrichs et al. (2005) and Wunsch (2006) reported that giving more weight to the subject emphasis in their G-RAP system compared to the English version improved their results. Furthermore, Wunsch (2010) used a postfilter which selected the most recent subject candidate as antecedent if the kNN classifier did not produce one. Doing so, Wunsch drastically improved Recall of his hybrid approach. Therefore, we consider selecting the most recent subject candidate as antecedent a strong baseline. The baseline checks whether there are subject candidates and returns the closest one to the pronoun. If no subject-bearing candidates are present, the most recent candidate is selected (a sketch of this selection rule follows after this list).

4. Related work baseline: For this baseline, we combine features found in related work.20 That is, we revert the incremental entity-mention system to a mention-pair system by i) removing the disambiguation step applied after resolution, and ii) postponing the creation of the coreference partition until the end of a document is reached and then transitively merging all found pairs. Doing so, we implement algorithm 1 presented in section 5.2, which enables a direct comparison of both models in the same setting. We use a set of features generally found in other approaches to German coreference and pronoun resolution, as presented in table 5.5.

5. Related work extended features baseline: This baseline uses the extended feature set we implement in the incremental entity-mention model in a mention-pair architecture. Comparison to this baseline quantifies the effect of exchanging the mention-pair model with the incremental entity-mention model. Using approximately the same feature set largely rules out the feature set as a source of performance differences. Also, this comparison will show if and how much our extended feature set for German pronoun resolution improves the common mention-pair model.
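To make the most recent (subject) candidate baselines concrete, the following sketch implements the selection rule of baselines 2 and 3. The candidate records and the SUBJ label are assumptions for illustration, and candidates are assumed to be ordered from closest to most distant relative to the pronoun.

```python
def most_recent_subject_baseline(candidates):
    """Baseline 3: pick the closest candidate bearing the subject role;
    if no subject candidate exists, fall back to the most recent candidate
    (baseline 2). `candidates` is ordered from closest to most distant.
    """
    if not candidates:
        return None
    subjects = [c for c in candidates if c.get("gf") == "SUBJ"]
    return subjects[0] if subjects else candidates[0]

# Toy usage with hypothetical candidate records.
cands = [{"id": "Hund", "gf": "OBJA"}, {"id": "Mann", "gf": "SUBJ"}]
print(most_recent_subject_baseline(cands))  # {'id': 'Mann', 'gf': 'SUBJ'}
```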

The performance results for the various baselines and the mention-pair and entity- mention architecture on the development and test set are given in table 5.14. We discuss the results from top to bottom of both tables simultaneously.

20 Cf. section 5.3.3

DEVELOPMENT SET
                PPER    PPOSAT  PRELS   PDS     PRELAT  ALL
random candidate baseline
  M-P           29.46   18.02   77.58   51.09   79.31   38.29
  E-M           43.62   24.04   78.95   56.93   77.19   47.00
most recent candidate baseline
  M-P           68.43   62.69   92.66   83.21   91.38   72.95
  E-M           69.94   62.79   92.64   84.44   91.38   73.72
most recent subject candidate baseline
  M-P           65.63   70.97   84.57   59.12   81.03   71.44
  E-M           82.13   80.18   85.02   63.24   81.03   81.84
standard feature set
  M-P           83.40   85.40   92.73   81.02   96.55   86.16
  E-M           87.98   85.18   92.65   83.70   96.55   88.28
extended feature set
  M-P           84.93   85.81   92.80   81.02   91.38   86.95
  E-M           91.59   90.32   92.94   79.41   91.38   91.28

TEST SET
                PPER    PPOSAT  PRELS   PDS     PRELAT  ALL
random candidate baseline
  M-P           28.33   16.90   77.67   53.90   88.00   37.84
  E-M           39.18   22.83   79.49   54.35   84.00   44.34
most recent candidate baseline
  M-P           67.65   62.42   92.90   74.47   92.00   72.16
  E-M           68.28   62.94   93.06   76.09   92.00   72.74
most recent subject candidate baseline
  M-P           66.39   71.60   81.42   60.28   88.00   71.28
  E-M           81.29   80.33   82.06   65.47   88.00   80.85
standard feature set
  M-P           83.69   85.33   93.30   73.05   94.00   86.13
  E-M           88.53   85.47   93.54   76.09   94.00   88.52
extended feature set
  M-P           85.05   87.26   92.90   73.05   92.00   87.20
  E-M           90.42   89.14   93.14   75.35   92.00   90.31

Table 5.14: Success rate (percentages) of different antecedent selection methods for pronouns given the mention-pair (M-P) and entity-mention (E-M) architectures.

The first important observation we make is that the performance of the random candidate baseline is higher in the entity-mention than in the mention-pair model (ALL; dev set: 47.00 vs. 38.29, test set: 44.34 vs. 37.84). The chances of picking an incorrect candidate as the antecedent are lower in the entity-mention model, since there are fewer candidates to choose from on average.21 We also see that the random baseline performs surprisingly well for relative pronouns (PRELS and PRELAT). This is again explained by the low candidate count for these pronoun types.

21 Cf. section 5.3.1

[Figure 5.5 shows the following segment; the pair-wise links of the mention-pair (M-P) model are drawn above the text, the links of the entity-mention (E-M) model below it, with the E-M link from "sie" to "Musik" marked as inaccessible.]

[...] als Ercettin[sic.]1 endlich selbst über die Musik2 erzählen darf, die2 sie1 macht.
[...] when Ercettin[sic.]1 finally herself about the music2 talk can, that2 she1 makes.

Figure 5.5: Differences in pronoun resolution between the mention-pair (M-P) and entity-mention (E-M) model given the most recent candidate baseline.

Regarding the most recent candidate baseline, we note that the entity-mention model performs slightly better. Consider the segment from the test set shown in figure 5.5, where the pair-wise links established by the mention-pair model are shown above the text, and the links by the entity-mention model underneath the text. Note that neither the mention-pair nor the entity-mention model pair the personal pronoun "sie1" with the relative pronoun "die2" because the pronouns are exclusive due to binding constraints (i.e. c-command). However, the mention-pair model resolves both the relative pronoun "die2" and "sie1" to "Musik2", because "Musik2" is the most recent compatible antecedent candidate for both pronouns. After the transitive closure of the pairs into coreference chains, both pronouns end up in the same coreference chain.

The entity-mention model correctly resolves "die2" to "Musik2" and "sie1" to "Ercettin1", because it first links "die2" to "Musik2". Thereafter, "Musik2" is no longer accessible for the "sie1" pronoun (indicated by the dashed arc in figure 5.5), because the exclusiveness between "die2" and "sie1" imposed by the c-command is transitively shared between "die2" and "Musik2". Therefore, the closest compatible candidate for "sie1" is the correct one, i.e. "Ercettin1". Effects like these, caused by the different pair generation mechanics of the two models, explain why the entity-mention model slightly outperforms the mention-pair variant for the most recent candidate baseline.
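A minimal sketch of the accessibility check implied by this example, under the assumption that binding exclusiveness is recorded per mention and propagated to the whole entity; the containers and the placeholder binding_exclusive test are illustrative only, not our actual implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Mention:
    text: str
    idx: int

@dataclass
class Entity:
    mentions: List[Mention] = field(default_factory=list)

def binding_exclusive(mention: Mention, pronoun: Mention) -> bool:
    # Placeholder: a real system would check c-command / binding constraints
    # between the two mentions in the parse tree.
    return False

def accessible(entity: Entity, pronoun: Mention) -> bool:
    # An entity is inaccessible if ANY of its mentions is exclusive with the
    # pronoun, i.e. exclusiveness is shared transitively among all mentions of
    # the entity (as for "die"/"Musik" blocking "sie" in figure 5.5).
    return not any(binding_exclusive(m, pronoun) for m in entity.mentions)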

The most recent subject baseline extends the most recent baseline by checking whether there are candidates bearing the grammatical role subject. If so, the most recent one of them is selected as antecedent; otherwise the most recent candidate is chosen. Table 5.14 shows that in the mention-pair model, only the performance of possessive pronouns increases (PPOSAT; dev set: 62.69 vs. 70.97, test set: 62.42 vs. 71.60). The scores of the relative (PRELS, PRELAT) and demonstrative pronouns (PDS) decrease substantially, indicating that recency is a more important factor than subjecthood for these pronouns.

Furthermore, the performance gap between the mention-pair and entity-mention model for the most recent subject baseline is large. Performance of the entity-mention model drastically increases compared to the most recent baseline and the model surpasses the mention-pair variant by large margins for the personal (PPER) and possessive pronouns (PPOSAT). The main reason for the performance gap is that the entity-mention model less frequently selects an incorrect subject candidate and falls back more often to selecting the most recent candidate. Consider the example in figure 5.6.

[Figure 5.6 shows the following segment; the mention-pair model links both "Ihrer1" and "Sie2" to "Sie1" as the most recent subject, while the entity-mention model links "Ihrer1" to "Sie1" and "Sie2" to "Liebsten2" as the most recent candidate, since "Sie1" is inaccessible and not the most recent mention of its entity.]

Sie1 wollte schon immer heiraten. Ihrer1 Liebsten2 [...]. Sie2 denkt [...].
She1 wanted always to marry. Her1 loved one2 [...]. She2 thinks [...].

Figure 5.6: Differences in pronoun resolution between the mention-pair and entity-mention model given the most recent subject-bearing candidate baseline.

Both models first correctly select "Sie1" as the antecedent for "Ihrer1" since "Sie1" is the closest subject candidate. For "Sie2", the mention-pair model incorrectly selects "Sie1" again as antecedent, because it is still the closest subject candidate. In the entity-mention model, however, "Sie1" is no longer accessible when resolving "Sie2", since the entity denoted by "Sie1" is represented by its last mention, in this case "Ihrer1", which is not a subject-bearing candidate. Since there are no other subject-bearing candidates in the context, the entity-mention model falls back to selecting the most recent compatible candidate, "Liebsten2", which in this case is the correct one.
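The following one-liner sketches the candidate generation difference that produces this behaviour: in the entity-mention model every discourse entity contributes only its most recent mention, so an earlier subject mention can be hidden behind a later non-subject mention of the same entity. The list-of-chains representation is assumed purely for illustration.

def entity_mention_candidates(entities):
    """entities: list of coreference chains, each a list of mentions in
    document order; returns one candidate (the last mention) per entity."""
    return [chain[-1] for chain in entities if chain]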

To quantify this effect, we counted how often the selected antecedent is subject-bearing in the most recent subject-bearing candidate baseline for the two models. For personal pronouns, the mention-pair model selects a subject candidate in 91.00% of the cases and for possessive pronouns in 96.83% of the cases, while the entity-mention model selects a subject candidate for only 83.30% of the personal pronouns and for 93.27% of the possessive pronouns. That is, the entity-mention model less frequently has access to incorrect subject-bearing candidates because they are often hidden behind more recent mentions of the denoted entity.

Finally, the results of the feature set-based classification show that i) the entity-mention model outperforms the mention-pair variant for both the standard and the extended feature set, and ii) the extended feature set improves performance in both models.22 Furthermore, both models strongly improve over all baselines for personal (PPER) and possessive (PPOSAT) pronouns. For relative pronouns (PRELS), performance in both models is only affected to a negligible degree by using a feature-based resolution approach. The most recent candidate baseline performs on par. For demonstratives (PDS) and attributing relative pronouns (PRELAT), the results are less conclusive. Notably, the sample sizes are considerably smaller than for the personal, possessive, and relative pronouns. Both demonstrative and attributing relative pronouns seem to perform well under the most recent candidate baseline, and we see no improvement, or even performance degradation, given the feature set-based resolution. Attributing relative pronouns (PRELAT) perform better using the standard feature set than when using the extended feature set. Performance of the extended feature set is equivalent to the most recent candidate baseline.

Overall, the performance of the classifiers in the entity-mention model seems satisfactory, with an overall ARCS success rate of 91.28 on the development set and 90.31 on the test set. Recall that we measure performance on pronoun instances that are resolvable, i.e. where the correct antecedent is among the candidates. That is, if the classifiers are able to make the correct choice, they do so in 91.28% and 90.31% of the cases in the development and test set, respectively. To assess the performance of our approach as a full pronoun resolution system, and not only of its classifiers, we will next evaluate its output w.r.t. all mentions in the gold standard and the system output.

5.4.3 System response evaluation

The evaluation of classifier performance in the previous section was conducted for pronouns where the correct antecedent was among the candidates. To measure performance on all pronouns in the test set, we apply our ARCS metric (Tuggener, 2014).23 ARCS evaluates coreference system outputs, i.e. it compares system response files to a gold key file, like the official reference scorer for coreference resolution evaluation (Pradhan et al., 2014) does for the commonly used metrics.

22 Note that we cannot model the two features discourse status and entity age in the mention-pair model, since the model does not provide access to them (cf. section 5.3.3). Besides these two features, both the entity-mention and mention-pair model share the same features in the extended feature set.
23 Cf. section 3.2

5.4.3.1 Functional evaluation

As in Tuggener and Klenner (2014), we choose the ARCS inferred antecedents metric for evaluating pronoun resolution performance. This metric is related to the functional evaluation proposed in Müller (2008) and the evaluation of non-pronominal anchors in Stuckardt (2001) in requiring pronouns to (transitively) link to a correct nominal antecedent.24 More precisely, the closest nominal antecedent to the left of the pronoun must be a member of the gold coreference chain of the pronoun. This requirement improves the estimation of performance from the perspective of downstream applications, since systems are not rewarded for linking pronouns to other pronouns, because this does not help in inferring the underlying entity.
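A hedged sketch of this correctness criterion; the mention attribute (position) and the is_nominal test are assumptions made for illustration only.

def inferred_antecedent_correct(pronoun, response_chain, gold_chain, is_nominal):
    # Mentions preceding the pronoun in the response chain, closest first.
    preceding = sorted(
        (m for m in response_chain if m.position < pronoun.position),
        key=lambda m: m.position,
        reverse=True,
    )
    closest_nominal = next((m for m in preceding if is_nominal(m)), None)
    # Correct iff that closest nominal antecedent is in the pronoun's gold chain.
    return closest_nominal is not None and closest_nominal in gold_chain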

Our analysis of classifier performance in table 5.14 showed that relative and demonstrative pronouns do not benefit from a feature-based resolution strategy, i.e. their performance was on par with simply selecting the most recent candidate as the antecedent. Therefore, we limit our detailed investigation to personal and possessive pronouns. Performance of all pronouns, including relative and demonstrative pronouns, is still subsumed in the ALL scores.

To relate the weight calculation introduced in section 5.3.2 to other machine learning-based methods, we include three additional approaches. These approaches also reflect different antecedent selection strategies outlined in chapter 2.

• Maximum Entropy (MaxEnt) / Best-first heuristic: We apply a maximum entropy approach to learn weights and score antecedent candidates. To do so, we create a feature vector for each candidate for a pronoun and add a feature indicating whether the vector denotes a correct antecedent or not. Since we cannot directly express constraints like "only trigger the markable distance feature if antecedent candidate and pronoun are in the same sentence" in this representation, we create feature conjunctions to e.g. condition the markable distance weight on the sentence distance. That is, we create a feature conjunction of markable distance and sentence distance to make markable distance dependent on sentence distance. Table 5.15 shows such a vector, where feature values with slashes indicate feature conjunctions. The MaxEnt classifier learns from these individual feature instances. During testing, the classifier assigns each of the candidate vectors a weight, and the highest ranked one is selected as antecedent. This corresponds to the best-first heuristic for selecting antecedents. Our weighting scheme also uses this heuristic. The MaxEnt approach thus serves as a direct competitor for our feature weighting approach. (A small sketch of this candidate scoring is given after this list.)

24 Cf. section 3.4.2

                       MaxEnt instance    CRF instance (sequence of candidate vectors)
markable ID            42                 46               54              55
conn./sent. dist.      -/1                -/3              -/1             -/1
sent./mark. dist.      1/5                3/10             1/2             1/1
cand. index            1                  2                1               0
GF ante                subj               pn               pn              subj
GF seq.                1/SUBJ/OBJD/NE     3/PN/SUBJ/NN     1/PN/SUBJ/NN    1/SUBJ/SUBJ/PPER
discourse status       old                new              new             old
anim./num./gen. ante   ANIM/M/S           */M/S            */M/S           ANIM/M/S
PoS/NE ante            NE/PER             NN/-             NN/-            PPER/PER
Gender ante            M                  M                M               M
Number ante            S                  S                S               S
Intro. sent. ante      0                  140              142             0
clause type ante       aux                aux              s               root
clause type seq.       1/aux/root         1/aux/cj         1/s/cj          1/root/cj
class                  ante               -                -               ante

Table 5.15: Example of feature vector instance used by the MaxEnt classifier and instance of vector sequence used for the CRF.

• Conditional Random Fields (CRF): The linear-chain CRF25 we apply uses the same vector representation as the MaxEnt classifier. However, training and testing instances are no longer isolated vectors, but sequences of vectors. An instance consists of all candidate vectors for a pronoun in the linear order of their appearance in the document (i.e. the candidate the furthest away from the pronoun denotes the first vector etc.), with only one of them signifying the correct antecedent. Table 5.15 shows such a sequence.

CRFs are usually applied to sequence labeling problems, like PoS tagging. What kind of sequence properties can we hope to learn from our instance representation? The sequence we learn is the class labels of the candidates (i.e. the last dimension of the vectors). There is always exactly one correct antecedent among the candidates. Compared to the MaxEnt classifier, we can now learn that the ‘ante’ label always follows either a ‘non-ante’ label or the sequence start. Also, the ‘ante’ label can only be followed by either a ‘non-ante’ label or the sequence end. This can be seen as a weak approximation of enforcing a global constraint, i.e. that only one of the candidates can be the antecedent.

Perhaps more importantly, we can access features of the current, previous, and next candidate vector while traversing the sequence. Hence, we can learn feature weights for assigning the ‘ante’ label to the current candidate vector given features from the current, the previous, and the next candidate vector, and combinations thereof. For example, for the grammatical role feature, we learn weights that denote:

– What is the weight for labeling the current candidate as ‘ante’ given that its grammatical role is e.g. subject? (This is what we learn in the MaxEnt model.)

25 For both the MaxEnt and the CRF classifier, we use wapiti: https://wapiti.limsi.fr/

– What is the weight for labeling the current candidate as ‘ante’ given that its grammatical role is e.g. subject and that of the previous candidate is direct object? – What is the weight for labeling the current candidate as ‘ante’ given that the grammatical role of the previous candidate is subject? – What is the weight for labeling the current candidate as ‘ante’ given that the grammatical role of the next candidate is direct object?

Since features and vector instances now compete directly in a pair-wise fashion, the CRF approach can be seen as a variant of the twin candidate model (Yang et al., 2008b).26 The twin candidate model has been shown to outperform a single candidate mention-pair model (Yang et al., 2008b). Also, it can be considered an intermediate stage when moving from a mention-pair model to a mention ranking model. That is, antecedent candidates compete in a pair-wise fashion, unlike in the mention-pair model where each candidate is considered in isolation. Still, not all candidates are considered at once, as in the mention ranking model.

• Markov Logic Networks (MLN): Our approach of using MLNs for pronoun resolution is detailed in Tuggener and Klenner (2014). The main idea is to convert the features outlined in section 5.3.3 to first-order logic predicates and combine them in formulas. For example, the feature for sentence distance given the presence of a discourse connector yields the following formula:

w(sd, conn, pos): in_sent(a, s1) ∧ in_sent(p, s2) ∧ sd = s2 − s1 ∧ has_pos(p, pos) ∧ has_connector(p, conn) ⇒ anaphoric(a, p)

where a denotes an antecedent candidate and p a pronoun. The formula is assigned weights for each instantiation combination of sd, conn, pos during training.

A training instance in the MLN approach consists of a pronoun and all its antecedent candidates, including the correct one. Weights for the formulas are inferred globally over all candidates in such an instance and over all instances. During testing, all candidates for a given pronoun are considered competitively,

26 Cf. section 2.2.1. The twin candidate model actually pairs candidates in a round-robin tournament, and the candidate with the most wins is selected as antecedent. We do not implement such a tournament mode, but adapt from the twin candidate model the notion that weights for features should be learned in direct competition of the features. During testing, the highest scored candidate is selected as antecedent.

and the highest ranking one is selected as antecedent. Thus, the MLN approach directly implements a mention ranking model.27
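To make the feature-conjunction representation and the best-first heuristic concrete, the following sketch builds string features for a candidate (including a conjunction of sentence and markable distance) and picks the highest-scored candidate. Attribute and feature names are placeholders and do not reproduce the exact features of table 5.15.

def candidate_features(cand, pronoun):
    return [
        f"gf_ante={cand.gram_role}",
        f"discourse_status={cand.discourse_status}",
        # Conjoined feature: markable distance is only interpreted relative to
        # the sentence distance, mirroring the constraint described above.
        f"sent_dist/mark_dist={pronoun.sent - cand.sent}/{pronoun.idx - cand.idx}",
    ]

def best_first(candidates, pronoun, score):
    # 'score' maps a feature list to a real value, e.g. a trained MaxEnt model;
    # the best-first heuristic simply returns the top-scored candidate.
    return max(candidates, key=lambda c: score(candidate_features(c, pronoun)))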

The aim of the following evaluation is three-fold: i) to juxtapose the mention-pair and the entity-mention model, ii) to compare different weight inference schemes which reflect concepts inherent in different coreference models, and iii) to assess system performance w.r.t. upper bounds. Also, we include results for using the standard feature set. Note that all the competing machine learning frameworks use our extended feature set. Furthermore, in each machine learning framework, we learn a separate classifier for each pronoun type.28

We saw in section 5.3.1, table 5.4, that the correct antecedent is not accessible for a portion of the pronoun instances, which affects the upper bound of the system. To identify the upper bounds, we use the gold annotation to identify the correct antecedent among the candidates, if present. If not, the classifiers are used to determine a (necessarily incorrect) antecedent. The absence of the correct antecedent mainly stems from two issues. First, preprocessing might not have extracted the correct markable that represents the correct antecedent. Second, the pronoun at hand is not anaphoric, i.e. there is no correct antecedent in the gold standard.
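A small sketch of how such an upper bound can be computed, assuming hypothetical helpers in_gold_chain and classify:

def upper_bound_antecedent(pronoun, candidates, in_gold_chain, classify):
    # If a correct antecedent is accessible, the oracle selects it; otherwise
    # the regular classifier decides and the instance is necessarily an error.
    gold_candidates = [c for c in candidates if in_gold_chain(pronoun, c)]
    if gold_candidates:
        return gold_candidates[0]
    return classify(pronoun, candidates)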

Table 5.16 shows the upper bounds and actual system performance as measured by the ARCS inferred antecedent metric on the development and test set, respectively. We see that the entity-mention model outperforms the mention-pair model in all regards. That is, the better performance of the classifiers evaluated in section 5.4.2 carries over to the functional evaluation of the system responses. Over all pronouns (ALL), the entity-mention model outperforms the mention-pair model by 3.7 F1 points on the development set (73.68 vs. 69.98) and by 2.71 F1 points on the test set (75.20 vs. 72.49). The performance differences are larger when we focus on personal (PPER) and possessive pronouns (PPOSAT), respectively. On the development set, the best entity-mention response outperforms the best mention-pair response by 5.4 F1 points for personal pronouns (E-M vs. M-P std.feat.set; 70.15 vs. 64.75) and by 5.35 F1 points for possessive pronouns (E-M MLN vs. M-P; 73.88 vs. 68.53). On the test set, the differences in performance are smaller, but still present. Also, the performance scores of all system responses are higher on the test set.29

27 Cf. section 2.2.1
28 For the MLN, we include the PoS tag of the pronoun in the formula to learn pronoun-specific formula weights (cf. Tuggener and Klenner (2014)).
29 In the lemma-based error analysis in section 5.5.1, we show that our resolution performance for feminine and plural pronouns is significantly lower than our performance on masculine pronouns. While our test and development sets contain roughly the same number of personal and possessive pronouns, it turns out that the test set contains fewer feminine and plural pronouns and more masculine pronouns. Thus, our systems achieve higher performance on the test set than on the development set.

DEVELOPMENT SET        PPER                    PPOSAT                  ALL
                       R      P      F1        R      P      F1        R      P      F1
E-M Upper bound        86.72  79.82  83.13     90.63  82.01  86.11     88.15  81.23  84.55
E-M MLN                72.89  66.79  69.71     77.92  70.24  73.88     76.94  70.68  73.68
E-M                    73.35  67.21  70.15     76.86  69.35  72.91     76.66  70.44  73.42
E-M CRF                71.36  65.41  68.25     75.35  67.91  71.44     75.64  69.50  72.44
E-M MaxEnt             69.12  63.33  66.10     74.97  67.50  71.04     74.41  68.34  71.25
E-M std.feat.set       68.41  63.02  65.61     70.88  64.14  67.34     73.12  67.39  70.14
M-P Upper bound        87.96  80.60  84.12     91.32  82.31  86.58     88.88  81.60  85.08
M-P                    66.13  60.55  63.21     72.33  65.12  68.53     72.55  66.56  69.43
M-P std.feat.set       67.70  62.04  64.75     71.70  64.63  67.98     73.11  67.11  69.98

TEST SET               PPER                    PPOSAT                  ALL
                       R      P      F1        R      P      F1        R      P      F1
E-M Upper bound        89.38  83.59  86.39     92.63  82.54  87.30     89.47  82.26  85.71
E-M MLN                77.10  72.05  74.49     79.02  70.33  74.42     78.53  72.13  75.20
E-M                    75.93  70.97  73.39     78.69  69.99  74.09     77.89  71.53  74.57
E-M CRF                75.57  70.65  73.03     77.76  69.21  73.23     77.47  71.19  74.20
E-M MaxEnt             73.05  68.29  70.59     77.95  69.47  73.47     76.42  70.23  73.19
E-M std.feat.set       73.28  68.88  71.08     71.98  64.45  68.01     75.02  69.20  71.99
M-P Upper bound        90.15  84.17  87.05     92.83  82.67  87.46     89.87  82.54  86.05
M-P                    71.21  66.46  68.75     77.56  69.07  73.07     75.57  69.38  72.34
M-P std.feat.set       73.41  69.04  71.16     73.64  65.82  69.51     75.57  69.65  72.49

Table 5.16: Functional evaluation of third-person pronoun resolution with the ARCS inferred antecedent metric on the development set (top) and the test set (bottom).

                     Development set (left)     Test set (right)
                     1  2  3  4  5  6  7        1  2  3  4  5  6  7
1 E-M MLN               -  -  +  +  +  +           -  +  +  +  +  +
2 E-M                +     +  +  +  +  +        -     -  +  +  +  +
3 E-M CRF            +  -     +  +  +  +        -  -     +  +  +  -
4 E-M MaxEnt         +  +  -     -  +  -        +  -  -     -  +  -
5 E-M std.feat.set   +  +  +  +     +  +        +  +  +  +     +  -
6 M-P                +  +  +  +  -     +        +  -  -  -  +     +
7 M-P std.feat.set   +  +  +  +  -  -           +  +  +  +  +  +
(Columns 1-7 correspond to the responses in the same order as the rows.)

Table 5.17: Significance test on the performance measure differences w.r.t. the functional evaluation for the development set (left) and test set (right). + indicates significant differences (p < 0.05), - signifies insignificant differences (p > 0.05)

Looking at the upper bounds, we see that both approaches have similar ones, with the mention-pair model having a slightly higher upper bound. This corresponds to the analysis of the availability of the correct antecedent in section 5.3.1, where we found that the availability of the correct antecedent is slightly lower in the entity-mention model. The slightly lower availability here translates into a marginally lower upper bound. That is, for approximately 88-89% of the pronouns in the gold standards, the models have the possibility to identify a correct nominal antecedent (Recall of the upper bounds). And for all pronouns that the systems resolve, around 81-83% have an identifiable nominal antecedent (Precision of the upper bounds). Given these upper bounds, we see that the entity-mention model lags roughly 10 F1 points behind (e.g. E-M MLN 75.20 vs. E-M Upper bound 85.71 on the test set). This again corresponds to the classifier performance of around 90% ARCS success rate evaluated in section 5.4.2.

Overall, we see that Recall is higher than Precision. This indicates that our approach resolves too many pronouns, i.e. pronouns not annotated as anaphoric in the gold standard. The error analysis in section 5.5 will elaborate on this point.

Furthermore, we observe that our weight calculation approach outperforms the CRF and MaxEnt competitors and only marginally under-performs compared to the best performing classifier, the MLN. However, the significance tests suggest that the smaller differences among the top scoring responses are not significant.

Table 5.17 shows the result of the statistical significance test introduced in section 5.4.1. The tables contain both the results for personal and possessive pronouns, separated by a diagonal line of empty cells. The results for personal pronouns are listed above the diagonal line of empty cells and results for possessive pronouns are listed below the diagonal. The responses are listed along the X and Y axis according to the evaluation scores, i.e. starting with the best performing system responses in the top left corner and ending with the responses performing lowest on the bottom right corner.

To check e.g. whether the improvements of our own weighting scheme are significant on the development set (left table in table 5.17), we check the corresponding row (E-M). Our weighting scheme outperforms five other responses. Reading the E-M row from left to right, we see that all improvements are significant. To check the significance of the improvements on the development set regarding possessive pronouns, we read the column of our weighting scheme (again E-M) from top to bottom. Here, we see that the improvement of our weighting scheme over the CRF approach (E-M vs. E-M CRF) is not statistically significant, but that all other improvements of E-M are.

We see that for both personal and possessive pronouns, the differences among the top two performing responses are not statistically significant (E-M MLN vs. E-M), except for possessive pronouns on the development set. This suggests that the performances of our own weighting scheme and the MLN-based approach are very similar.

Furthermore, for personal pronouns, we observe that all improvements over the mention-pair model with the extended feature set (M-P) are statistically significant (second last column in both tables). However, on the test set (right table), compared to the mention-pair model using the standard feature set (M-P std.feat.set, last column), only the top two performing responses are significantly better. This has to do with the observation that the extended feature set does not improve the performance of the mention-pair model for personal pronouns. That is, on the test set, the MaxEnt variant and our weighting approach using the standard feature set perform worse than the mention-pair model employing the standard feature set. However, these lower performances are not statistically significant.

Concerning possessive pronouns, all improvements over the mention-pair model using the standard feature set are statistically significant on the test set (last row in right table). Also, all improvements over the entity-mention model with the standard feature set are significant (third last row), which is not surprising, since the response performs worse than the mention-pair responses w.r.t. possessive pronouns.

Overall we see that the magnitude of the differences in the performance evaluation correlates with the statistical significance of the results. That is, larger differences in the evaluation are significant, while smaller ones are not. While not all smaller individual differences are statistically significant, the overall improvements of the entity-mention model over the mention-pair model are (bottom left and top right of the tables).

5.4.3.2 Anchor mention evaluation

Our ARCS evaluation framework features another metric that is relevant for assessing coreference resolution performance from the perspective of downstream applications, namely the ARCS anchor mention evaluation. This metric is an extension of the ARCS inferred antecedent metric. Instead of requiring mentions to link to correct nominal antecedents, this metric requires mentions to link to a so-called anchor mention of the entity. This anchor mention is deemed to fully describe the entity that is denoted by the coreference chain. As an approximation, we select the first nominal mention in a coreference chain as the anchor mention. First nominal mentions of entities tend to introduce the entities into discourse and are likely to contain strings that higher-level applications query. An ideal anchor mention would thus be e.g. “Barack Obama, the president of the United States”.

This metric is especially relevant for downstream applications that perform queries targeted at finding contexts in which a specific entity is mentioned, e.g. Sentiment Analysis. Thus, linking a pronoun to a nominal antecedent like "the president" would not be useful to a Sentiment Analysis query targeted at "Barack Obama", because it is not clear which president is denoted by the mention "the president".

The ARCS anchor mention metric first links key coreference chains to system response chains by aligning the anchor mentions. Once aligned, the metric measures how many of the key mentions are contained in the response chain to determine Recall, and how many response mentions are in the key chain to determine Precision.30
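A rough sketch of this scoring procedure; for the precise algorithm cf. Tuggener (2014). The anchor approximation (first nominal mention) follows the description above; the is_nominal helper and the chain representation are assumptions for illustration.

def anchor(chain, is_nominal):
    # Approximate the anchor by the first nominal mention of the chain.
    return next((m for m in chain if is_nominal(m)), None)

def align_and_score(key_chains, response_chains, is_nominal):
    scores = []
    for key in key_chains:
        key_anchor = anchor(key, is_nominal)
        for resp in response_chains:
            if key_anchor is not None and key_anchor == anchor(resp, is_nominal):
                overlap = len(set(key) & set(resp))
                recall = overlap / len(key)      # key mentions found in response
                precision = overlap / len(resp)  # response mentions found in key
                scores.append((recall, precision))
    return scores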

Since we measure performance on the mention level, we are able to measure performance specifically for different named entity types. We apply the ARCS anchor mention metric to the entity-mention and mention-pair model and assess performance for coreference chains that denote person entities, i.e. whose anchor mention is a named entity of class PER. The other named entity classes, like ORG and GPE, are not pronominalized often enough to provide a solid basis for evaluation.

                 DEV SET                    TEST SET
                 R      P      F1           R      P      F1
E-M  ED          77.38  86.77  81.81        76.93  88.01  82.10
     PPER        46.00  81.00  58.68        57.00  82.00  67.25
     PPOSAT      59.00  84.00  69.31        64.00  79.00  70.71
M-P  ED          77.50  86.33  81.67        76.83  88.32  82.17
     PPER        38.00  78.00  51.10        50.00  80.00  61.54
     PPOSAT      53.00  82.00  64.39        60.00  78.00  67.83

Table 5.18: Evaluation of third person pronoun performance for linking to anchor mentions of person entities.

Table 5.18 provides the performance details. Both models achieve almost identical performance w.r.t. entity detection (ED), i.e. aligning anchor mentions in the coreference chains in the key to coreference chains in the response. We see that the entity-mention model outperforms the mention-pair competitor by a large margin regarding personal pronouns (PPER) and surpasses it w.r.t. possessive pronoun (PPOSAT) performance.

As argued before, we believe this analysis is important since downstream applications interested in specific target entities rely on the identification of these anchor mentions and require a coreference system to link other mentions to these anchors. In this view, the results read as follows. Both models identify almost 77% of the key anchor mentions for person entities (ED Recall). Of the anchor mentions that the systems produce, 88% are relevant (ED Precision). Regarding the found and aligned anchors, the entity-mention model finds 57% of the pronoun mentions (PPER Recall), the mention-pair model 50%. 82% of the identified pronoun mentions are correct in the entity-mention model, 80% in the mention-pair model (PPER Precision). The entity-mention model identifies 64% of the possessive pronoun mentions of person entities correctly, the mention-pair model 60%, and 79% and 78% of the identified possessive pronoun mentions are correct in the entity-mention and mention-pair model, respectively.

30 For a concise and formal presentation of the respective algorithms cf. Tuggener (2014).

The overall range of the scores indicates that correctly resolving pronouns to such anchor antecedents, which we deem relevant for at least a subset of higher-level applications, is a difficult task. Performance is lower than in the functional evaluation given in table 5.16 and much lower than the pair-wise performance scores usually reported in related work.31 That is, while we can achieve classifier accuracies of 90% and upwards concerning the identification of local antecedents of pronouns, as e.g. reported in table 5.14, this does not mean that a system will be useful to the same extent for higher-level applications which might require global antecedents, such as anchor mentions.

5.4.3.3 Standard evaluation

Finally, we evaluate our entity-mention approach using the standard coreference evaluation scheme.32 This evaluation also compares a key to a response file, but does not distinguish between mention types and includes nominal mentions (common nouns and named entities)33 and all other mentions contained in the key, i.e. also pronoun types that our system does not resolve (e.g. reflexive pronouns). Table 5.19 gives the results. We show results for mention detection (MD) and the three metrics used to calculate the average MUC+B3+CEAF_E F-score (the MELA metric, i.e. Ø = (MUC + B3 + CEAF_E) / 3).

           MD                     MUC                    B3                     CEAF_E                 Ø
           R      P      F1       R      P      F1       R      P      F1       R      P      F1      F1
E-M UPB    79.07  77.27  78.16    70.32  73.45  71.85    61.91  69.45  65.46    71.34  60.31  65.37   67.56
E-M MLN    77.95  76.26  77.10    67.26  70.26  68.73    59.28  66.42  62.65    68.64  58.21  62.99   64.79
E-M        77.83  76.20  77.01    67.11  70.09  68.57    59.08  66.26  62.46    68.42  58.15  62.87   64.63
M-P        77.98  75.90  76.92    66.64  69.68  68.13    58.30  66.06  61.94    68.73  57.30  62.50   64.19

Table 5.19: Performance in the common evaluation framework on the test set.
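To make the averaging explicit, the MELA score of e.g. the E-M MLN response in table 5.19 is obtained from its three F-scores as

Ø = (F1_MUC + F1_B3 + F1_CEAF_E) / 3 = (68.73 + 62.65 + 62.99) / 3 = 64.79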

Since a large portion of the mentions in the gold standard are nominal mentions and all our approaches employ the same strategy to resolve nominal mentions, the differences between the different pronoun resolution approaches are marginal when compared based on the standard evaluation framework. Still, the results show that the MLN approach performs best in all regards.

The difference between the MLN approach and the upper bound here is much smaller than in our functional evaluation of pronoun resolution. This is again due to the presence of the large number of nominal mentions which are not affected by the pronoun resolution strategies. The upper bound (E-M UPB) indicates the system performance when gold standard information is used to resolve pronouns, given that the correct candidate is accessible. That is, it shows how well our system would fare if the classifiers used for pronoun resolution always chose the correct antecedent when it is available.

31 Cf. the next section 5.4.5
32 Cf. chapter 2
33 For the details of our approach to resolve nominal mentions, cf. section A.2.

While we cannot compare our results to those of English coreference resolution research, it is still noteworthy that current state-of-the-art approaches to English coreference resolution achieve scores very similar to ours. For example, Björkelund and Kuhn (2014) report a MELA F-score of 61.63, Martschat and Strube (2015) achieve a MELA F-score of 62.47, and Fernandes et al. (2014) report an F-score of 63.37.

The only recent German coreference resolution paper that reports standard evaluation scores is Rösiger and Riester (2015).34 Rösiger and Riester implemented the HOTCoref system by Björkelund and Kuhn (2014) for German and investigated the utility of prosodic features for coreference resolution.35 They used the same test set as we did, i.e. the first 690 documents of the TüBa-D/Z corpus, to evaluate their baseline system, which achieved a MELA F-score of 51.61. However, they did not apply a gold mention boundary match. If we remove the gold mention boundary alignment from our approach, we achieve a MELA F-score of 61.65.

5.4.4 Performance impact of real preprocessing components

As mentioned in section 1.5.2, preprocessing is an integral part of any coreference resolution system. It has been shown that the performance of coreference resolution, including pronoun resolution, depends heavily on the quality of preprocessing components such as PoS tagging, syntactic parsing, morphological analysis etc. (Schiehlen, 2004, Klenner et al., 2010, inter alia). So far, we have assumed perfect preprocessing information. While doing so eliminates noise when investigating coreference resolution performance, it presents an unrealistic setting for real-world applications of a coreference system. Therefore, we report the performance of the entity-mention and the mention-pair model w.r.t. pronoun resolution when real preprocessing components are used to extract the markables and their features. However, we keep our method of aligning gold mention boundaries to the boundaries of the corresponding extracted markables, because we argue that the precise identification of the boundaries (i.e. including or excluding a PP or a relative clause) is irrelevant for higher-level applications.

34 The Semeval 2010 shared task on coreference resolution for multiple languages (Recasens et al., 2010) also featured a German data set compiled from an earlier version of the TüBa-D/Z corpus. However, the participating systems did not fare particularly well regarding German pronoun resolution, as reported in Tuggener and Klenner (2014). This is not surprising, since German pronoun resolution was not the focus of the task. Therefore, we refrain from re-running our system on the Semeval data and from comparing it to the participating systems.
35 Cf. section 4.10

We apply the ParZu parser (Sennrich et al., 2013), which provides PoS tagging and morphological analysis besides dependency parsing. The parser is an adaption of the Pro3Gres dependency parser for English (Schneider, 2008) to German. For named entity recognition, we use the Stanford Named Entity Recognizer36 with the model for German provided by Faruqui and Padó (2010). We keep the tokenization given by the gold standard to avoid token alignment problems in evaluation. Apart from that, all preprocessing is fully automated. The evaluation thus represents the performance of our system in a real-world setting. We compare the entity-mention and mention-pair models and use our antecedent selection strategy introduced in section 5.3.2. Furthermore, we found that training the weights on the training set preprocessed with the real components gives slightly better results than using the weights obtained over gold preprocessed training data.

Apart from the TüBa-D/Z development and test set, we also evaluate the systems on the Potsdam Commentary Corpus (PCC).37 The corpus does not feature annotation of relative pronouns, but does annotate personal and possessive pronouns.

                      PPER                   PPOSAT                 ALL
                      R      P      F1       R      P      F1       R      P      F1
TüBa-D/Z Development set
E-M                   64.80  62.87  63.82    68.68  62.19  65.27    65.52  64.00  64.75
M-P                   59.19  57.66  58.42    64.78  58.82  61.66    62.24  60.73  61.48
TüBa-D/Z Test set
E-M                   64.37  62.87  63.61    68.13  60.64  64.17    65.15  63.45  64.29
M-P                   59.33  58.03  58.67    67.60  60.20  63.68    63.13  61.44  62.27
Potsdam Commentary Corpus
E-M                   70.55  66.67  68.55    70.62  60.68  65.27    -      -      -
M-P                   66.34  62.88  64.57    66.10  56.80  61.10    -      -      -

Table 5.20: Functional evaluation of pronoun resolution performance using real preprocessing components.

Table 5.20 shows the results of the functional evaluation that requires pronouns to (transitively) link to nominal antecedents (i.e. the ARCS inferred antecedent metric). On the TüBa-D/Z data, we see that performance is lowered by roughly 6 to 10 percentage points in F-score when using real preprocessing compared to using gold preprocessing. In table 5.16, we saw that the entity-mention model (E-M) achieved F-scores of 70.15 and 73.39 on the TüBa-D/Z development and test set, respectively, for personal pronouns (PPER). Given real preprocessing, the F-scores drop to 63.82 and 63.61, respectively. The same magnitude of loss is observed w.r.t. possessive pronouns and the performance over all pronouns. For possessive pronouns, the entity-mention model reached F-scores of 72.91 and 74.09 on the development and test set, respectively. Here, performance is lowered to F-scores of 65.27 and 64.17.

36 http://nlp.stanford.edu/software/CRF-NER.shtml
37 Cf. section 5.1.1. The corpus currently only provides coreference annotation in the CoNLL format. Thus, we were not able to evaluate our approaches based on gold preprocessing on this corpus.

The evaluation of our systems on the Potsdam Commentary Corpus (PCC) shows that the performance does not change significantly regarding possessive pronouns compared to the TüBa-D/Z data sets. For personal pronouns, performance is higher on the PCC. However, the PCC is significantly smaller than the TüBa-D/Z test set.38 Still, the results suggest that our system does not overfit the TüBa-D/Z data and can be applied to other corpora.

Furthermore, the table shows that the entity-mention model outperforms the mention-pair model on all data sets. Thus, the improvements achieved on gold data carry over to an evaluation in a real-world setting.

This evaluation shows that pronoun resolution in a real-world setting, where pronouns have to meet the requirements of downstream applications (i.e. identify nominal antecedents) and where the systems have to rely on automated preprocessing, remains a challenging task. Pair-wise evaluation and the use of gold preprocessing do not adequately reflect these requirements.

5.4.5 Comparison of the evaluation results to related work

Related work on German pronoun resolution (Hinrichs et al., 2005, Wunsch, 2010) performed evaluation by scoring instance vector representations of pairs of antecedent candidates and pronouns as presented to a classifier. The labels assigned by the classifier were compared to the gold labels obtained from the corpus. Each instance vector was then evaluated in the following way.

• TP: The classifier labeled the instance as positive, and the gold label is positive. The antecedent and the pronoun denoted by the instance are in the same coreference chain.

• FP: The classifier labeled the instance as positive, but the gold label is negative. This subsumes both the case where the pronoun is not anaphoric, and the case where the selected antecedent and the pronoun are members of different coreference chains.

• FN: The classifier labeled the instance as negative, but the gold label is positive.

38 The TüBa-D/Z test set contains 2223 personal and 1506 possessive pronouns with an identifiable nominal antecedent. The PCC features 309 personal and 177 possessive pronouns with nominal antecedents.

Two main issues arise from this evaluation method, both stemming from the fact that the output is not directly compared to a gold key, but to an intermediate representation of the gold standard, i.e. the pair instances. First, gold pronouns not represented in any instance vector are omitted, as vectors are only created for instances that have candidates. That is, Recall does not extend over all annotated pronouns in the gold standard, but only over the pronouns that the system extracts and creates vector instances for. Second, multiple positive instance vectors denoting the same pronoun get scored multiple times. It is possible that a pronoun is represented by multiple positive pair vectors of different correct antecedents. If all these instances are labeled correctly, the pronoun is scored multiple times as TP. Likewise, the classifier might label one of the instances that denote the same pronoun as positive and others as negative. This yields problems when merging the pairs to obtain the coreference chains, since the transitivity property of coreference is violated. That is, a classifier might label mentions A and B as coreferent, and B and C as coreferent, but negatively classify the pair A and C. It is left unclear how such contradictions affect the formation of coreference chains in a pairwise evaluation, which is undesirable from the perspective of higher-level applications. Therefore, it is not guaranteed that improvements made on pairwise decisions carry over to the goodness of the coreference chains formed in the pair merging step (Ng and Cardie, 2002a, Ng, 2010).
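The merging step itself can be illustrated with a simple union-find over the positive pairs; negative decisions are simply ignored, which is exactly why a contradictory negative label for (A, C) cannot keep A and C apart once (A, B) and (B, C) are positive. This is a generic sketch, not the exact implementation used in related work.

def merge_pairs(mentions, positive_pairs):
    parent = {m: m for m in mentions}

    def find(x):
        # Path-halving union-find lookup.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in positive_pairs:
        union(a, b)

    chains = {}
    for m in mentions:
        chains.setdefault(find(m), []).append(m)
    return list(chains.values())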

Given the differences in the evaluation methods, it is difficult to compare the results of previous work to ours. However, we have tried to approximate related work by reverting the incremental entity-mention model to a pair-wise mention-pair model and compiling a set of commonly used features. Furthermore, we can loosen the requirements for the correctness of a pronoun antecedent in the ARCS evaluation. We approximate the evaluation of pair-wise instances in related work by simply requiring that a pronoun has an antecedent in its coreference chain that is also in the key chain which the pronoun is part of, i.e. any antecedent from the pronoun's key chain counts as correct. We call this the ARCS any antecedent metric and use it to score our system responses. Table 5.21 shows the results.

We see that the ranking of the responses does not change compared to the evaluation using the ARCS inferred antecedent metric, which requires pronouns to (transitively) link to nominal antecedents. However, as expected, the scores are higher in general and the differences between the responses become slightly smaller.

DEV SET                PPER                    PPOSAT                  ALL
                       R      P      F1        R      P      F1        R      P      F1
E-M Upper bound        92.08  86.21  89.05     95.79  88.36  91.93     91.50  85.57  88.44
E-M                    83.35  78.04  80.61     86.34  79.64  82.86     83.12  77.74  80.34
E-M MLN                82.78  77.51  80.06     86.22  79.53  82.74     82.99  77.61  80.21
E-M CRF                81.90  76.71  79.22     85.18  78.57  81.74     82.40  77.07  79.65
E-M MaxEnt             80.69  75.58  78.05     84.94  78.35  81.51     81.64  76.37  78.92
E-M std.feat.set       80.45  75.33  77.81     81.71  75.37  78.41     80.77  75.55  78.07
M-P Upper bound        92.92  86.94  89.83     95.85  88.41  91.98     91.90  85.85  88.77
M-P                    79.97  74.82  77.31     83.84  77.33  80.46     81.10  75.76  78.34
M-P std.feat.set       81.13  75.91  78.44     82.87  76.43  79.52     81.41  76.05  78.64

TEST SET               PPER                    PPOSAT                  ALL
                       R      P      F1        R      P      F1        R      P      F1
E-M Upper bound        92.81  88.10  90.40     95.29  86.74  90.81     92.16  85.47  88.69
E-M MLN                84.40  80.15  82.22     85.24  77.58  81.23     84.23  78.12  81.06
E-M                    83.65  79.41  81.48     85.24  77.58  81.23     84.08  77.97  80.91
E-M CRF                83.52  79.28  81.35     84.66  77.05  80.68     83.84  77.75  80.68
E-M MaxEnt             82.47  78.32  80.34     85.24  77.58  81.23     83.49  77.44  80.35
E-M std.feat.set       82.38  78.20  80.24     81.95  74.59  78.10     82.75  76.74  79.63
M-P Upper bound        93.78  88.94  91.30     95.74  87.15  91.24     92.67  85.91  89.17
M-P                    83.13  77.07  79.98     84.91  77.29  80.92     83.13  77.07  79.98
M-P std.feat.set       82.65  78.39  80.46     82.53  75.12  78.65     83.00  76.95  79.86

Table 5.21: Approximation of pair-wise evaluation of third-person pronoun resolution with the ARCS any antecedent metric.

Using this analysis, we can cautiously compare our results to related work in a more direct manner. Wunsch (2010), p. 201, reported an F-score of 61.7% for personal pronouns and an F-score of 52.2% for possessive pronouns for the TiMBL classifiers trained with random sampling (best results) in a ten-fold cross-validation experiment on an earlier version of the TüBa-D/Z corpus. Wunsch, p. 185, further reports an average F-score of 82.5% for all pronoun types after applying a postfilter which links pronouns that the TiMBL classifier did not resolve to the most recent compatible subject markable. The postfilter resolved over 30% of all pronouns processed by the system and doubled the Recall score. However, the 82.5% figure includes reflexive pronouns, which are comparatively trivial to resolve and which we do not process. Unfortunately, Wunsch did not give separate evaluation scores for personal and possessive pronouns after applying the postfilter. The adaption of the rule-based RAP system (Lappin and Leass, 1994) to German yielded an overall F-score of 76.6% for personal, possessive, and reflexive pronouns in the experiments of Wunsch.

Schiehlen (2004) performed experiments on the Negra corpus (Skut et al., 1997), which is compiled from articles from the German newspaper Frankfurter Rundschau. Schiehlen reported F-scores of 78.2% and 79.0% for personal and possessive pronouns as his best results. However, he did not disclose how he calculated Recall and Precision. The use of a different corpus makes it difficult to compare his results to ours. Still, the Negra corpus stems from the same domain, i.e. newspaper text.

Strube et al. (2002) and Kouchnir (2004) reported results for their experiments on the Heidelberg Text Corpus (HTC), a collection of short texts containing information about sights, historic events, and persons in Heidelberg. Strube et al. presented a full coreference resolution system, but also provided performance results for pronouns, evaluated in the same way as in Hinrichs et al. (2005) and Wunsch (2010). Strube et al. reported F-scores of 82.79% and 84.94% for personal and possessive pronouns. Kouchnir (2004) improved over these results on the same data by applying boosting. She reported F-scores of 87.4% and 86.9% for personal and possessive pronouns. Without the semantic features and using real preprocessing, performance dropped to an average of 67.2% accuracy39 on a held-out HTC test set. Furthermore, Kouchnir evaluated her system on a 40-article sample from the Spiegel magazine using real preprocessing and found that performance dropped to 34.4% accuracy (best results), which only marginally improved over a most recent candidate baseline (31.1%).

Finally, Hartrumpf (2001) presented the only entity-mention or clustering-based system for coreference resolution for German. He evaluated his approach on roughly 500 anaphoric mentions from the German newspaper Süddeutsche Zeitung and reported a MUC F-score of 66.00%. However, Hartrumpf did not give separate evaluation results for pronouns.

In summary, related work on German pronoun resolution investigated pronoun resolution in a pair-wise manner and evaluated resolution performance from that perspective. We have argued that results obtained in this manner are hard to interpret regarding the potential benefit for downstream applications and have provided an alternative evaluation framework that addresses this issue. Due to different test sets and different evaluation methods, it is difficult to directly compare our results to related work. However, we have implemented an approximation of related work and showed that our incremental entity-mention model outperforms this baseline w.r.t. different evaluation settings.

5.5 Error analysis

In this section, we identify and classify errors regarding pronoun resolution in our approaches. As discussed earlier40, error analysis in coreference resolution has recently gained attention, because it is a useful complement to performance assessment based on the common evaluation framework.

Kummerfeld and Klein (2013) and Martschat and Strube (2015) presented approaches that quantify errors on the coreference chain level. These approaches are not applicable to our scenario, since we are interested in analyzing pronoun resolution specifically. The tool presented by Martschat and Strube (2015) also allows mention-type specific error analysis, i.e. of pronoun resolution errors. However, there is no fine-grained analysis beyond Precision and Recall errors and no other error categorization.

39 Kouchnir (2004) did not elaborate on the measure. We assume she used the terms F-score and accuracy interchangeably.
40 Cf. section 3.3

We have so far assessed performance of our approaches in different regards, i.e. classifier performance when selecting an antecedent (table 5.14) and coreference model and feature set comparisons using different criteria for the correctness of an antecedent (tables 5.16, 5.18, 5.21). Here, we try to identify common error sources of our systems.

5.5.1 Lemma-based performance comparison

In Tuggener and Klenner (2014), we performed a lemma-based evaluation of pronoun resolution and found that there is a large performance gap between the masculine and feminine/plural pronouns. We repeat this analysis here.

                 PoS     lem.   Rec.   Prec.  F1     Av.    TP   WL   FN  FP   GM
Entity-Mention   PPER    er     82.59  80.10  81.33  91.95  930  189  7   42   1126
                 PPER    sie    69.19  62.26  65.54  85.00  759  316  22  144  1097
                 PPOSAT  sein   84.18  75.83  79.78  87.53  665  119  6   93   790
                 PPOSAT  ihr    72.63  63.73  67.89  84.98  520  194  2   102  716
Mention-Pair     PPER    er     76.02  73.79  74.89  93.73  856  262  8   42   1126
                 PPER    sie    66.27  59.79  62.86  86.63  727  345  25  144  1097
                 PPOSAT  sein   82.15  74.17  77.96  88.43  649  133  8   93   790
                 PPOSAT  ihr    72.63  63.88  67.97  85.98  520  192  4   102  716

Table 5.22: Lemma-based performance analysis of third person pronouns based on the ARCS inferred antecedents metric.

Table 5.22 provides a detailed lemma-based performance analysis on the test set, as given by the ARCS inferred antecedents metric. This metric requires pronouns to (transitively) link to nominal antecedents. The table gives the usual measures to the left for the entity-mention (E-M) and the mention-pair (M-P) model, but also indicates the percentage of resolved pronouns for which the correct antecedent is available (Av.), the ARCS mention class distribution (TP, ..., FP), and the gold mention count (GM).

Looking at the F-score column, we make two interesting and related observations. First, as in Tuggener and Klenner (2014), we see a large performance gap between the masculine pronouns and their feminine/plural counterparts. For example, the entity-mention model achieves an F-score of 81.33% for er, but only 65.54% for sie. A similar difference can be observed for the possessive pronouns. Also, these differences are reflected in the F-scores for the mention-pair model. The right side of the table gives insight into the reason for these performance gaps. We see that there is a large difference in the percentages of the availability of the correct antecedent (Av.). For er, the entity-mention model has access to the correct antecedent in 91.95% of the resolved instances. For sie, it can only resolve to the correct antecedent in 85% of the cases. Since all pronoun instances are always resolved if there are compatible candidates in our approach, we more frequently select an incorrect candidate for sie than for er. This is also reflected in the differences in the wrong linkage counts (WL), i.e. the entity-mention model selects an incorrect antecedent for sie in 316 cases, while it only chooses the wrong candidate in 189 cases for er.

The unavailability of the antecedent is mainly caused by issues in the markable extraction step. Since sie can refer to plural entities, it can also refer to coordinated NPs, like "Paul and Mary". Such coordinated markables are much harder to extract than singular entities, since retrieving them from parse trees can be cumbersome and error-prone. We manually inspected several cases where sie lacked access to the correct antecedent and found that incorrectly extracted or non-extracted coordinated NPs were often the source of the problem. Other problems are non-matching markable boundaries and antecedents that have a PoS tag that we do not consider during markable extraction, e.g. substituting indefinite pronouns (PIS).
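To illustrate why coordinations are error-prone markables, the sketch below collects conjuncts by walking coordination edges in a dependency tree; the edge labels ('kon', 'cj') and the tree interface are assumptions made for illustration, and the actual labels depend on the parser's label set.

def collect_conjuncts(head, children_of):
    """head: a token node; children_of(node) yields (label, child) pairs."""
    conjuncts = [head]
    node = head
    while True:
        # Follow a coordination edge to the conjunction token, then to the
        # next conjunct; stop when the chain ends.
        kon = next((c for lbl, c in children_of(node) if lbl == "kon"), None)
        if kon is None:
            break
        cj = next((c for lbl, c in children_of(kon) if lbl == "cj"), None)
        if cj is None:
            break
        conjuncts.append(cj)
        node = cj
    return conjuncts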

In table 5.22 we see furthermore that the false positive counts (FP) are much higher for sie than for er. That is, the system produces more errors due to unannotated instances for sie. A manual inspection of some of these cases revealed that often the pronoun simply lacked gold annotation.

Secondly, the entity-mention model mainly outperforms the mention-pair model w.r.t. the masculine pronouns. For personal pronouns, the entity-mention model surpasses the mention-pair model by 6.44 F-score points (81.33% vs. 74.89%). For possessive pronouns, the difference is present, but smaller. Also, we see from the false negative and false positive counts that the entity-mention model does not perform better due to anaphoricity detection. That is, both models have identical or very similar counts for these error classes. The performance difference between the entity-mention and the mention-pair model thus stems from the different counts for true positives and wrong linkages, which substantiates that the entity-mention model is the better performing resolution strategy.

5.5.2 Error examples

In this section, we investigate a set of hand-selected examples of pronoun resolution errors that the models produce. Although we do not provide an extensive error analysis and we do not present a systematic error categorization, we pick examples that subsume several similar errors we encountered during manual error tracking. We limit our analysis to the wrong linkage class, since false positive and false negative errors are due to preprocessing errors, which we do not cover here.

The first example shows an instance of a possessive pronoun that the entity-mention model resolves correctly but the mention-pair model handles incorrectly.

(6) Sie1 denkt an den Tag danach, wenn die KollegInnen womöglich ihr1 Bild in der Zeitung sehen.

She thinks about the day after, when the colleagues perhaps see her/their1 picture in the newspaper.

In this example, the instance of the possessive pronoun [ihr]1 is genuinely ambiguous. That is, without discourse context, the pronoun can be linked to either [Sie]1 or [KollegInnen], and both resolutions yield semantically valid utterances. However, the previous sentences in this context have only mentioned the entity denoted by [Sie]1, but not [KollegInnen], i.e. [KollegInnen] is a discourse-new entity, while [Sie]1 denotes a discourse-old entity in this context. Due to its local confinement, the mention-pair model has no notion of the information status of entities and the NPs denoting them. Therefore, the mention-pair model incorrectly attaches [ihr]1 to [KollegInnen] based on proximity preferences. The entity-mention model, which incorporates the discourse-new and discourse-old distinction, correctly links it to [Sie]1 based on its bias towards favoring discourse-old entities as antecedents.

The next example shows a relative pronoun which both models incorrectly resolve.

(7) Hier werden Beiträge1 kleiner Leute veraast, die1 von ehrenamtlichen Kassierern fünf Mark weise gesammelt werden.

Donations1 of ordinary people are being wasted, which1 are collected by volunteers, 5 Mark per donation.

Here, the strategy of selecting the most recent compatible antecedent candidate for relative pronouns fails. Both models resolve [die1] to [Leute]. To correct this error, the models would need a notion of semantic compatibility to infer that volunteers generally collect donations rather than ordinary people. However, a simple model of verb selectional preferences might not suffice, since people might be a relevant direct object of “sammeln” in such a model. The discriminatory factor here is the adverbial phrase “5 Mark apiece” which modifies the verb, since it is unlikely that one collects or gathers people at a rate of 5 Mark apiece.

In the next example, the models fail because the pronoun occurs in direct speech and refers to an antecedent in a previous direct speech segment.

(8) Endlich kommt einer und fragt, warum der Mann1 nicht sein zweites Bein benutzt. “Warum wohl?”, fragt Kilian das Publikum und gibt sich selbst die Antwort. “Er1 hatte es einfach nicht bemerkt.”

Finally, someone approaches and asks why the man1 does not use his second leg. “Why?”, Kilian asks the audience and gives the answer himself. “He1 simply had not noticed it.”

Both models here resolve [Er1] to [Kilian] because, given our feature set, it is a more probable antecedent when direct speech is disregarded. Clearly, our approach needs an elaborate strategy to disentangle direct speech segments from other parts of the discourse in order to resolve such pronoun instances.

Finally, the following example in figure 5.7 illustrates the major differences between the models by looking at a longer discourse segment, i.e. beyond pair-wise decisions.41 We use subscripts to indicate entity IDs and superscripts to enumerate mentions, i.e. lexeme^{mentionID}_{entityID}. Thus, lexemes with identical entity IDs are coreferent. ’*’ in the responses denotes that mentions or entities are invented or overlooked. Table 5.23 lists the coreference chains of the key and the two responses.

Key:  [Jusef^1, Er^2, er^3, seine^4, seinen^5, Jusef^9, seine^10, ich^11, meiner^12, meiner^13, Jusef^14]_1
      [Vater^6, ihn^7, er^8]_2

E-M:  [Jusef^1, Er^2, er^3, seine^4, seinen^5, ihn^7, er^8, Jusef^9, seine^10, Jusef^14]_1
      [ich^11, meiner^12, meiner^13]_2

M-P:  [Jusef^1, Er^2, er^3, seine^4, Jusef^9, seine^10, Jusef^14]_1
      [Regime^*, seinen^5, ihn^7, er^8]_2
      [ich^11, meiner^12, meiner^13]_3

Table 5.23: Gold and predicted coreference chains for the segment.

We see that the entity-mention model makes two errors. First, it links ihn^7 and er^8 to the Jusef_1 entity instead of to Vater_2. Here, the weight for favoring discourse-old entities misguides resolution. Secondly, the model is unable to attach the first person pronouns in the direct speech segment to the Jusef_1 entity, since we only allow matching first person pronouns to antecedents governed by a communication verb at most one sentence away. The right mention here would be Jusef^14_1. However, it occurs after the first person pronouns, and we do not allow pronouns to link to postcedents.

41English translation: Jusef gets up. He moves his eyes away from the wall and starts to talk. Of Afghanistan, where he and his family led a happy life. Until the new regime came. One day, the Taliban stood at the door, took his father and shot him, because he was a communist, allegedly. The mother thereafter had Jusef and his sister taken out of the country by their uncle. “Now I am here and I have no contact with my mother and my family”, Jusef says.

(9) Key:
Da steht Jusef^1_1 auf. Er^2_1 wendet den Blick von der Wand und fängt an zu erzählen. Von Afghanistan, wo er^3_1 und seine^4_1 Familie ein glückliches Leben führten. Bis das neue Regime kam. Eines Tages standen die Taliban vor der Tür, nahmen seinen^5_1 Vater^6_2 mit und erschossen ihn^7_2, weil er^8_2 angeblich Kommunist war. Die Mutter ließ daraufhin Jusef^9_1 und seine^10_1 Schwester Abeda vom Onkel außer Landes schaffen. "Jetzt bin ich^11_1 hier und habe keinen Kontakt zu meiner^12_1 Mutter, auch nicht zu meiner^13_1 Familie", sagt Jusef^14_1.

Entity-mention response:
Da steht Jusef^1_1 auf. Er^2_1 wendet den Blick von der Wand und fängt an zu erzählen. Von Afghanistan, wo er^3_1 und seine^4_1 Familie ein glückliches Leben führten. Bis das neue Regime kam. Eines Tages standen die Taliban vor der Tür, nahmen seinen^5_1 Vater^6_* mit und erschossen ihn^7_1*, weil er^8_1 angeblich Kommunist war. Die Mutter ließ daraufhin Jusef^9_1 und seine^10_1 Schwester Abeda vom Onkel außer Landes schaffen. "Jetzt bin ich^11_2* hier und habe keinen Kontakt zu meiner^12_2 Mutter, auch nicht zu meiner^13_2 Familie", sagt Jusef^14_1.

Mention-pair response:
Da steht Jusef^1_1 auf. Er^2_1 wendet den Blick von der Wand und fängt an zu erzählen. Von Afghanistan, wo er^3_1 und seine^4_1 Familie ein glückliches Leben führten. Bis das neue Regime^*_2 kam. Eines Tages standen die Taliban vor der Tür, nahmen seinen^5_2* Vater^6_* mit und erschossen ihn^7_2, weil er^8_2 angeblich Kommunist war. Die Mutter ließ daraufhin Jusef^9_1 und seine^10_1 Schwester Abeda vom Onkel außer Landes schaffen. "Jetzt bin ich^11_3* hier und habe keinen Kontakt zu meiner^12_3 Mutter, auch nicht zu meiner^13_3 Familie", sagt Jusef^14_1.

Figure 5.7: Example responses of the models. Incorrectly resolved pronouns are marked by a star (*).

The mention-pair model, on the other hand, makes three resolution errors. It also incorrectly resolves the two personal pronouns (ihn^7, er^8) that denote the Vater_2 entity. However, it identifies Regime^* as the antecedent entity. This happens because the model forms the pairs [Regime^* − seinen^5], [seinen^5 − ihn^7], and [ihn^7 − er^8]. The transitive merge then yields a coreference chain with inconsistent morphology (i.e. Regime^* and ihn^7, as well as Regime^* and er^8, are morphologically incompatible). Finally, the mention-pair model also fails to link the first person pronouns to the Jusef_1 entity, since we apply the same heuristics as in the entity-mention model.

5.5.3 Candidate ranking performance

Finally, we analyze the quality of the ranking of the antecedent candidates in those cases where the entity-mention model selects an incorrect one. This analysis provides additional insight into the ranking strength of our approach.

We analyse the rank index of the correct antecedent in those cases where our approach selects an incorrect antecedent candidate, i.e. we track how frequently the correct antecedent occurs at each rank index for incorrectly resolved pronouns.

[Figure 5.8 appears here: bar charts of the rank index frequencies, with panels for personal and possessive pronouns on the development set and on the test set.]

Figure 5.8: Rank index frequency of the correct antecedent for third person pronouns where the entity-mention model ranks an incorrect candidate highest. A lower rank means a higher score for a candidate w.r.t. denoting the antecedent.

Figure 5.8 shows the rank frequency of the correct antecedent for such incorrectly resolved personal and possessive pronouns. Note that the ranking is inverted, i.e. the lower the rank, the better the candidate; rank 1 denotes the selected antecedent. We see that in the cases where the entity-mention model selects an incorrect antecedent, the correct one is ranked 2nd in 79.83% (development set; i.e. 190 of 238 total cases) and 79.24% (test set) of the personal pronoun instances, and in 76.92% (development set) and 76.82% (test set) of the possessive pronoun cases. That is, selecting the second best candidate as antecedent in the cases where the first one is incorrect would eliminate almost 80% of the errors made by the classifiers in the entity-mention model. Thus, re-ranking the top two candidates seems a fruitful strategy. We explore such a re-ranking attempt based on distributional semantics in the next chapter.

5.6 Chapter summary

In this chapter, we empirically validated our incremental entity-mention approach for coreference resolution for German. We evaluated pronoun resolution in detail and applied different evaluation measures and strategies that assess performance on different levels.

Throughout our evaluation, we substantiated the proclaimed theoretical advantages of our incremental entity-mention model over the commonly used mention-pair model with empirical evidence. In all our experiments, the entity-mention model outperformed the mention-pair competitor.

We introduced an extended feature set for German pronoun resolution and compared it to a standard feature set typically encountered in related work. Our extended set improved the performance of the classifiers in both the mention-pair and the entity-mention model. In the subsequent resolution evaluation, the extended set improved the performance of the entity-mention model. For the mention-pair model, however, the extended features only improved resolution performance for possessive pronouns, while lowering the performance on personal pronouns.

Furthermore, we investigated and compared different machine learning frameworks that correspond to different antecedent selection methods. We introduced our own simple feature weighting scheme which performed on par with the top ranking machine learning approaches. Overall, we found that the different machine learning approaches and the respective antecedent selection strategies did not show substantial performance differences in the top ranks. We found larger performance differences between the mention-pair and the entity-mention model.

In error analysis, we demonstrated that the mention-pair model does indeed produce coreference chains with inconsistent morphological properties due to underspecification of certain German pronouns. In our entity-mention approach, such errors are avoided.

Chapter 6

Semantics for pronoun resolution

In this chapter, we explore distributional semantics as a device to determine the degree of compatibility between an antecedent candidate and a pronoun’s context. As we saw in the analysis of error examples in section 5.5.2, our approach to ranking candidates, which is primarily based on the discourse salience of the candidates, sometimes selects an antecedent that is either incompatible or less compatible with the pronoun’s context than the correct antecedent.

In one of our previous examples, the salience-based approach selected the antecedent “people” for the pronoun “them” in the verb-argument tuple “collect them”, although the correct antecedent “donations” was accessible. Obviously, “donations” is a more likely candidate for the direct object slot of the verb “to collect” than “people”. Thus, selectional preferences of verbs are potentially beneficial to resolving pronouns. However, successfully incorporating these preferences into a real-world pronoun resolution system has proven to be notoriously difficult (Kehler et al., 2004, Wunsch, 2010, inter alia).

We explore two frameworks to model the compatibility of antecedent candidates and a pronoun's context, a co-occurrence graph and word embeddings. The co-occurrence graph estimates compatibility by traversing weighted co-occurrence paths between nodes that denote nouns and verbs. The word embedding model represents words as vectors and estimates compatibility of words based on the cosine of their vectors.

We expand on previous work by estimating not only the compatibility between the verb argument slot of a pronoun and the antecedent candidate, but by also taking into account the additional verb argument of the verb governing the pronoun, i.e. the syntactic co-argument of the pronoun in cases of (di-)transitive verbs. We first outline the rationale for exploring verb semantics for pronoun resolution and then present our models for doing so.


6.1 Pronominalization as a discourse phenomenon

From a discourse perspective, pronoun resolution can be viewed as the task of determining which entity is most likely to be mentioned at a given point in discourse, i.e. when a pronoun is encountered. In other words, the task is to assess which of the previously mentioned entities in the discourse is likely to be discussed at the very point of encountering a pronoun.

In general, pronoun resolution approaches, including our own, work by modeling the salience of discourse entities based on a set of features that captures relevant aspects of their occurrences, such as grammatical functions etc. If a pronoun is encountered, the most salient entity is chosen as antecedent.

However, these models do not exclusively model pronominalization, but provide a general model of entity salience in discourse.1 For example, given we have established the salience record of entities in a particular discourse, we could point to any position within the discourse and blank out the subsequent sentence. Querying the salience model, we could then determine which entities are likely to be mentioned in the subsequent sentence, based on the salience configuration at the point we have chosen, without any hint regarding the specifics of the subsequent sentence. If we knew that there is a pronoun in the next sentence, we could determine the entity in the current sentence which is most likely to be pronominalized in the next sentence. Neglecting many important details, this can be argued to be the general model for pronoun resolution in the majority of approaches.

From the perspective of pronoun resolution, however, this model overlooks one aspect, namely the local context in which a pronoun occurs. The majority of our features are concerned with tracking the salience of the discourse entities, regardless of particular pronouns and their respective, specific contexts. If a pronoun is encountered at a certain point in discourse, the most salient entity at that point is looked up in the salience record and chosen as antecedent, ignoring the specific context that surrounds the pronoun.

This is somewhat surprising, given that the antecedent of a pronoun has to be compatible with the context surrounding the pronoun, because, compared to its nominal antecedent, the pronoun is simply an altered linguistic manifestation of the same underlying entity. Therefore, it can be argued that the pronoun’s context itself emits certain expectations regarding the antecedent. Consider the following example:

(10) Er bellt. He barks.

1Cf. Centering Theory (Grosz et al., 1995), for example.

Although we are not given any discourse history of entities and their salience, we automatically infer, based on the selectional preference of the verb bellen, that Er is likely to refer to something canine-like. This likelihood is ignored in the purely salience-based model.

Thus, there exists a line of research dating back to the pioneering era of automated pronoun resolution that has attempted to incorporate the context surrounding the pronoun in the antecedent selection process.2 The main road that researchers have taken in this direction is to represent the context of a pronoun by the verb that governs the pronoun. The selectional preferences of that verb are then taken to rank the antecedent candidates.

6.1.1 Selectional preferences of verbs for pronoun resolution

The underlying idea of approaches that incorporate selectional preferences into pronoun resolution is straight-forward. As the pronoun is simply an altered surface representation of the underlying entity, its nominal antecedent has to be compatible with the selectional preference of the verb governing the pronoun. A typical approach would rank a set of antecedent candidates according to their degree of preference for the verb argument slot of the pronoun and suggest the most preferred one as the antecedent.

A famous example where such a model has clear benefits over a salience-based approach is the following:

(11) If the baby doesn’t thrive on raw milk, boil it.

Although baby is in the subject position, which increases its salience, we know that it cannot be the antecedent of the pronoun it, because it violates the selectional preference of to boil.

Formally, for applying verb selectional preference to the example above, we formulate that the probability of choosing baby as the antecedent is lower than choosing milk, based on the verb selectional preference. That is, we want to express that p(baby|boil, obja) < p(milk|boil, obja). The conditional probabilities can be harvested from large corpora in a distributional approach.
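
Such conditional preferences can be estimated as simple relative frequencies over (verb, grammatical function, argument) tuples extracted from a parsed corpus. The following Python sketch illustrates this with toy counts; the tuples and numbers are illustrative only and do not stem from our actual resources.

    from collections import Counter

    # Toy (verb, grammatical function, argument head) tuples; in practice these
    # are extracted from millions of dependency-parsed sentences.
    tuples = [
        ("boil", "obja", "milk"), ("boil", "obja", "milk"),
        ("boil", "obja", "water"), ("boil", "obja", "egg"),
        ("thrive", "subj", "baby"),
    ]

    triple_counts = Counter(tuples)
    slot_counts = Counter((verb, gf) for verb, gf, _ in tuples)

    def p_arg(noun, verb, gf):
        """Relative frequency estimate of p(noun | verb, gf)."""
        total = slot_counts[(verb, gf)]
        return triple_counts[(verb, gf, noun)] / total if total else 0.0

    # p(baby | boil, obja) < p(milk | boil, obja), as argued above.
    print(p_arg("baby", "boil", "obja"), p_arg("milk", "boil", "obja"))  # 0.0 0.5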

2We discuss these in more detail in section 6.6.

6.1.2 Issues of selectional preferences and potential solutions

Deriving an adequate model of selectional preferences of verbs is a challenging task in itself (Erk et al., 2010, Schulte im Walde, 2010, inter alia). The crucial issue w.r.t. pronoun resolution is whether the preference ranking of antecedent candidates given the verb governing a pronoun is always the right choice. In other words, if we assume a perfect model of selectional preferences of verbs, is the most preferred candidate always the best one, or are there contexts in which other factors override the preferences and favor a less preferable candidate? The tentatively negative results in related work that incorporates verb selectional preferences into pronoun resolution suggest the latter.

However, we argue that there are certain aspects that related work has so far not taken into account and that the (non-)impact of modelling selectional preferences for pronoun resolution is closely related to the three following issues.

1. Selectional preferences alone are not (always) a sufficient representation of a pronoun’s context: While there are verbs that have an intuitively “narrow” selection regarding their arguments, like bellen (to bark) towards its subjects, there are verbs that do not feature a selection narrow enough to give a clear preference ranking to a set of antecedent candidates. For example, we cannot expect to derive distinctive preferences for machen (to make) and its subjects that would allow us to prefer a specific candidate. To address this issue, we incorporate the additional verb arguments of the verb governing the pronoun into the compatibility model of candidates and pronoun contexts. By doing so, we aim to narrow down the selectional preferences of verbs that have an otherwise wide array of permissible arguments in the pronoun’s argument slot.

2. Sparseness of co-occurrence counts: Although large corpora are available, we cannot expect to see every permissible argument for a given grammatical slot (i.e. we might never see the tuple (boil, milk, obja) even in large corpora). Related approaches have tried to alleviate this problem by either clustering verbs based on their arguments (Bergsma et al., 2008a) or by performing class abstraction on the arguments using a word net (Wunsch, 2010).3

An uncertainty in clustering the verbs is the number of clusters to choose. Also, infrequent verbs are likely to be merged into clusters that feature low similarity between the verbs.

When performing class abstraction of arguments, the arguments of a verb that are found in a corpus are mapped to their hypernyms in a word net. Selectional preferences are then learned for verbs and word net hypernyms. During pronoun resolution, antecedent candidates are also mapped to their hypernyms, and the compatibility of the candidates is approximated by the compatibility of their hypernyms with those selected by the verbs governing the pronouns. This approach has the downside, common to manually crafted resources, that not all candidates and arguments can be mapped to word net senses. Perhaps more importantly, the level of abstraction has to be determined, i.e. how far up the hypernym hierarchy arguments should be projected. If a low level is selected, sparsity might not be reduced sufficiently. If the level of abstraction is too high, the selectional preferences might become very coarse-grained and cannot distinguish antecedent candidates anymore.

3Cf. section 6.6.

Thus, we present methods to alleviate the sparseness problem that do not rely on class abstraction and clustering.

3. Pronominal antecedents: A major issue in related work is how pronominal antecedents are processed. Since the selectional preferences of a verb governing a pronoun can only be applied to nominal antecedents, it is unclear how pronominal antecedents (antecedents that are pronouns themselves) are handled in related work. Here, our incremental entity-mention model has a clear advantage, since pronominal antecedents are likely to have been resolved before becoming antecedents. Thus, we can query the incrementally established coreference chain of a pronominal antecedent to retrieve a nominal mention and use this nominal mention to assess compatibility of the underlying entity and the pronoun’s context.

With these issues in mind, we next present our two models that estimate compatibility of antecedent candidates and a pronoun’s context.

6.2 A graph-based representation of verb-argument tuples

The first distributional compatibility model we investigate is a graph-based representation of verb-argument tuples. This model will enable us to estimate the selectional strength of a verb and an argument based on their first-order co-occurrence,4 which we use to estimate the compatibility of an antecedent candidate and the verb governing a pronoun. Furthermore, the graph representation allows us to model compatibility between arguments, i.e. an antecedent candidate and an additional argument of the verb governing the pronoun, based on their second-order co-occurrence. To determine the degree of compatibility between an antecedent candidate and a pronoun’s context, we combine the selectional preference of the verb governing the pronoun w.r.t. the candidate, and the candidate’s compatibility with the additional arguments of the pronoun’s verb.

4First-order co-occurrence denotes that e.g. two words occur together in a specified context. This context can be defined in different ways, e.g. documents, sentences, or word windows. In our case, we focus on syntactic structures and relations. A verb and an argument show a first-order co-occurrence if they occur together given a specified grammatical relation, e.g. a verb and its subject.

A straight-forward way of measuring the compatibility of e.g. a noun w.r.t. the subject position of a transitive verb that governs a specific direct object would be to extend a common association measure, such as Pointwise Mutual Information (PMI), to three variables, as in Van de Cruys (2011):

PMI(subj, verb, object) = \log \frac{P(subj, verb, object)}{P(subj)\,P(verb)\,P(object)}    (6.1)

While such an approach is suited for identifying association strengths of seen triples, it is bound to be sparse for many possible but unseen combinations. That is, we cannot expect to find all triple combinations (subj, verb, object) for cases where e.g. a pronoun is in subject position and where we want to calculate the PMI for each antecedent candidate in the pronoun’s slot, i.e. PMI(ante_noun_subj, verb, object). Therefore, we aim for a compositional approach that combines the pair-wise compatibilities of the verb and the candidate, comp(ante_noun_subj, verb), and the candidate and the additional argument of the verb, comp(ante_noun_subj, add_arg_object), into an overall compatibility score.

We next outline how we calculate these compatibility scores.

6.2.1 Verb selectional preferences as first-order co-occurrences

In its simplest form, a compatibility function comp(·, ·, ·) of an antecedent candidate noun a_i and the verb v_j governing a pronoun in the grammatical slot gf_k is their first-order co-occurrence count:

comp(v_j, a_i, gf_k) = |(v_j, a_i, gf_k)|    (6.2)

Selecting the most suited antecedent from a set of candidates simply involves determining the antecedent candidate a_i with the highest first-order co-occurrence count with the verb governing the pronoun among all candidates a:

ante = \arg\max_{a_i \in a} comp(v_j, a_i, gf_k)    (6.3)

However, the count does not measure the associativity of the verb and its argument w.r.t. the overall occurrence distribution of each. That is, we might see a verb-argument tuple with a high count and conclude that the two are strongly associated. However, the high count can stem from the overall high occurrence of both argument and verb. Thus, the raw count of a verb-argument tuple needs to be normalized in relation to the individual, overall occurrence counts of verb and argument.

There are association measures that perform such (point-wise) normalizations, e.g. the DICE coefficient, TF-IDF, the Jaccard coefficient, and Pointwise Mutual Information (PMI), among others.5 In the realm of vector space models, the Positive Pointwise Mutual Information measure (PPMI) is a popular choice (Turney et al., 2010). PMI of two words x and y is defined as the logarithm of the joint probability divided by the product of the marginal probabilities. It thus takes into account the co-occurrences of two words and the individual occurrences of the words.

PMI(x; y) = \log \frac{p(x, y)}{p(x)\,p(y)}    (6.4)

PPMI replaces all negative values in a word co-occurrence matrix with zeros and has been shown to outperform PMI and other association measures on several tasks (Bullinaria and Levy, 2007).

The main issue for our purposes is that PMI can have negative values for seen, but loosely associated word pairs. One option would be to use PPMI and replace all negative values with zero. However, we want to reward seen verb-argument pairs that have a negative PMI value over non-seen pairs. Non-seen pairs are not represented by PMI, and we would have to assign a dummy value for these verb-argument combinations, i.e. zero. However, zeroing out negative values and assigning zero to unseen pairs would obscure whether a zero-valued pair had a negative PMI or was not seen at all. Thus, we would not be able to favor an antecedent candidate with a negative PMI over a candidate that has never been seen as an argument of the verb governing the pronoun. Therefore, we aim for a compatibility measure with a range of 0 < comp(·, ·, ·) ≤ 1 for all seen pairs and zero for unseen pairs.

All the association measures mentioned above share the characteristic that they normalize the first-order co-occurrence count of the word pair by the individual occurrence counts of the words. We aim for a simple association measure and a simple scaling procedure along this line. To do so, we model selectional preferences in a bidirectional manner. We model how strong the verb’s selection of the argument is and how strong the argument’s selection of the verb is given a specific grammatical relation. Note that these two associations are asymmetrical, since a verb and an argument typically have a different overall count of first-order co-occurrences.

5Cf. e.g. Evert (2004)

We formulate this bidirectional association in relation to conditional probabilities, i.e. the likelihood of seeing the argument a_i given the verb v_j and the grammatical slot gf_k, and the likelihood of seeing the verb given the argument and the grammatical slot, but do not actually model a probability distribution. To derive a compatibility score, we simply take the arithmetic mean of the bidirectional scores:

comp(a_i, v_j, gf_k) = \frac{1}{2} \cdot \left( score(a_i|v_j, gf_k) + score(v_j|a_i, gf_k) \right)    (6.5)

where we calculate the scores by:

score(a_i|v_j, gf_k) = \frac{|(a_i, v_j, gf_k)|}{|\max(a, v_j, gf_k)|} \qquad score(v_j|a_i, gf_k) = \frac{|(a_i, v_j, gf_k)|}{|\max(a_i, v, gf_k)|}    (6.6)

That is, we normalize the first-order co-occurrence counts, i.e. |(a_i, v_j, gf_k)|, by dividing by the highest count, i.e. |\max(a, v_j, gf_k)|, instead of taking the sum in the denominator (which would yield a probability distribution). This is a common replacement in e.g. TF-IDF calculation, where the in-document term frequency of each word is divided by the count of the most frequent word instead of the overall word count. Doing so decreases the denominator and thus yields higher scores overall.

Consider the following example. We want to score the triple (Hund, bellen, subj), i.e. the compatibility between Hund (dog) and bellen (bark) given the grammatical relation subject. Figure 6.1 shows an excerpt of the co-occurrence graph that depicts first-order co-occurrence for Hund and bellen given the grammatical relation subject. The nodes in the graph denote words, i.e. nouns and verbs, and edges signify grammatical relations of seen co-occurrences and their counts.

The first-order co-occurrence count is |(Hund, bellen, subj)| = 261. The maximal count of Hund as a subject is |\max(Hund, v, subj)| = 440 (for v = kommen (to come)),

[Figure 6.1 appears here.]

Figure 6.1: Excerpt of the co-occurrence graph, showing first-order co-occurrence subject relations for the noun Hund and the verb bellen. Numbers on edges denote absolute counts.

and the count of the most frequent subject of bellen is |\max(a, bellen, subj)| = 261 (i.e. a = Hund). We then get a compatibility score of \frac{1}{2} \cdot (\frac{261}{261} + \frac{261}{440}) = 0.8.^6

This compatibility score has the advantage over (P)PMI that it ranges from 0 to 1, i.e. low counts for seen combinations are still above 0 and the maximum score is 1 (for a hypothetical combination of a noun and a verb that exclusively occur together), and we can assign zero to unseen verb-argument combinations.
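
The bidirectional score in equations 6.5 and 6.6 can be computed directly from stored co-occurrence counts. The following Python sketch reproduces the Hund/bellen example; only the counts explicitly mentioned in the running text are filled in, and the dictionary layout is an assumption made for illustration rather than our actual data structure.

    # First-order co-occurrence counts |(noun, verb, gf)|, stored per relation.
    counts = {"subj": {"Hund": {"bellen": 261, "kommen": 440}}}

    def comp(noun, verb, gf):
        """Bidirectional compatibility of a noun and a verb in slot gf (eq. 6.5/6.6)."""
        by_noun = counts.get(gf, {})
        pair = by_noun.get(noun, {}).get(verb, 0)
        if pair == 0:
            return 0.0  # unseen combinations receive a zero score
        max_for_noun = max(by_noun[noun].values())                      # |max(a_i, v, gf_k)|
        max_for_verb = max(vs.get(verb, 0) for vs in by_noun.values())  # |max(a, v_j, gf_k)|
        return 0.5 * (pair / max_for_verb + pair / max_for_noun)

    print(round(comp("Hund", "bellen", "subj"), 2))  # 0.8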

6.2.2 Addressing sparsity

We still face the sparsity problem, since we cannot expect to see all permissible verb-argument combinations even in a large corpus. In Tuggener and Klenner (2012), we proposed an approach that uses non-negative matrix factorization to estimate counts of unseen verb-argument combinations. Here, we explore three simpler approaches that are expressed naturally in the graph representation of word co-occurrences.

6Taking the sum of all counts in the denominators would yield \frac{1}{2} \cdot (\frac{261}{300} + \frac{261}{14738}) = 0.4, which is rather low. We want combinations that are intuitively strong, like dog being the subject of bark, to have a high compatibility, i.e. close to 1. Taking the max count in the denominator gets us closer than taking the sum. Also, the Jaccard and DICE coefficients yield rather low scores for the (dog, bark, subj) combination, i.e. Jacc = \frac{|A \cap B|}{|A \cup B|} = \frac{261}{300 + 14738} = 0.0174 and DICE = \frac{2 \cdot |A \cap B|}{|A| + |B|} = \frac{2 \cdot 261}{300 + 14738} = 0.0347.

6.2.2.1 Estimating compatibility through distributional siblings

First, we compare the seen argument fillers (i.e. the nouns args in a grammatical slot gf_k for a verb v_j) with the antecedent candidate noun under scrutiny, a_i, and select the compatibility of the argument arg ∈ args that is most similar to a_i as the compatibility rating of a_i and the pronoun verb v_j in slot gf_k.

comp(v_j, a_i, gf_k) ≈ comp(v_j, arg, gf_k)    (6.7)

where arg = \arg\max_{arg \in args} sim(a_i, arg, gf_k)

To determine the verb argument most similar to the antecedent candidate noun at hand, we need a similarity measure of nouns w.r.t. a specific grammatical relation, i.e. sim(·, ·, ·). For our purposes, we want to define noun similarity in relation to the verbs they occur with, since we want nouns to be of high similarity if they frequently occur with the same verbs w.r.t. the grammatical function of interest.

For example, suppose we have not seen “banana” as the direct object of “to eat”. We then want to find the direct object of “to eat” which is most similar to “banana”, given the grammatical relation direct object, e.g. “apple”. To achieve this, we model noun similarity based on second-order co-occurrence with verbs. As mentioned, we want nouns to be of high similarity if they frequently occur in the same grammatical argument slots of the same verbs.

A straight-forward approach for this purpose is to consider the number of verbs that the two nouns co-occur with in the given grammatical relation. This second-order co-occurrence then serves as the basis for calculating the distributional similarity of the nouns.

We calculate the similarity of two nouns n_i, n_j, given a set of verbs V and an argument slot gf_k, as the number of verbs that they share as first-order co-occurrences divided by all their individual first-order co-occurrences w.r.t. gf_k, i.e.:

sim(n_i, n_j, gf_k) = \frac{|\{v \in V : |(v, n_i, gf_k)| > 0 \wedge |(v, n_j, gf_k)| > 0\}|}{|\{v \in V : |(v, n_i, gf_k)| > 0\}| + |\{v \in V : |(v, n_j, gf_k)| > 0\}|}    (6.8)

where the numerator counts the verbs that the two nouns share as first-order co-occurrences and the denominator counts all first-order co-occurrences of the two nouns.

In the graph representation, this corresponds to the count of verb nodes that the nouns share as neighbors, divided by the total number of neighboring verb nodes the two nouns connect to, given a specific grammatical relation. Figure 6.2 shows this overlap of first-order co-occurrences in the graph. Here, we determine the similarity of Hund (dog) and Fuchs (fox) given the grammatical relation subject. The similarity of the two nouns is depicted by the ratio of nodes that both nouns connect to (in the center of the figure) to the total number of verb nodes they connect to, given the subject relation. That is, the more verb nodes they share as neighbors, the more similar the two nouns are.

[Figure 6.2 appears here.]

Figure 6.2: Excerpt of the co-occurrence graph, showing second-order co-occurrence of Hund and Fuchs in subject position. Numbers on edges denote absolute counts.

However, there are nodes (verbs) that the nouns associate with more strongly than with others. In the previous section, we have defined an association measure for nouns and verbs, i.e. comp(noun, verb, gram. funct.). We can use this measure in our similarity measure to weight the importance of the nodes that two nouns share. That is, a shared node that both nouns strongly associate with should have more impact on the similarity measure than a node with lesser association strengths w.r.t. the nouns. Thus, we replace the counts in equation 6.8 by the sum of the compatibility scores:

sim(n_i, n_j, gf_k) = \frac{\sum_{v \in V : |(v, n_i, gf_k)| > 0 \wedge |(v, n_j, gf_k)| > 0} \left( comp(v, n_i, gf_k) + comp(v, n_j, gf_k) \right)}{\sum_{v \in V : |(v, n_i, gf_k)| > 0} comp(v, n_i, gf_k) + \sum_{v \in V : |(v, n_j, gf_k)| > 0} comp(v, n_j, gf_k)}    (6.9)

where the numerator simply sums all edge weights (where the weights are calculated by comp(·, ·, ·)) from the nouns to the shared verbs, and the denominator sums all edge weights that the two nouns are connected to.
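
A compact way to compute this weighted similarity is to store, for each grammatical relation, the comp(·, ·, ·) edge weights of every noun to the verbs it has been seen with. The Python sketch below assumes such a nested dictionary; the weights in the usage example are invented for illustration.

    def sim(noun_i, noun_j, gf, weights):
        """Weighted second-order similarity of two nouns in slot gf (eq. 6.9).
        weights[gf][noun][verb] holds the comp(.,.,.) edge weight of seen pairs."""
        verbs_i = weights.get(gf, {}).get(noun_i, {})
        verbs_j = weights.get(gf, {}).get(noun_j, {})
        shared = set(verbs_i) & set(verbs_j)
        numerator = sum(verbs_i[v] + verbs_j[v] for v in shared)
        denominator = sum(verbs_i.values()) + sum(verbs_j.values())
        return numerator / denominator if denominator else 0.0

    # Invented edge weights: Hund and Fuchs share 'beißen' and 'kommen' as subjects.
    w = {"subj": {"Hund": {"bellen": 0.8, "beißen": 0.5, "kommen": 0.6},
                  "Fuchs": {"beißen": 0.4, "kommen": 0.2, "schleichen": 0.7}}}
    print(round(sim("Hund", "Fuchs", "subj", w), 2))  # 0.53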

For our sparsity problem, where we have not seen a specific noun-verb combination, we can now identify the seen noun argument of the verb that is most similar to our target noun, based on equation 6.7.

Returning to our previous example, where we would like to estimate the compatibility of “banana” as the direct object (obja) of the verb “to eat” and where we have not seen the combination, we first identify all first-order co-occurrences of the verb “to eat” with the grammatical relation direct object, i.e. args. Then, we calculate sim(banana, arg, obja) for all these first-order co-occurrences arg ∈ args. The arg most similar to “banana”, say “apple” with a similarity of 0.45, then serves as the distributional sibling of “banana”, and we can take the compatibility score comp(apple, eat, obja), which is 0.55, as the score for “banana”. However, since we have not seen “banana” as the direct object of “to eat”, we want to exercise caution in taking over the score of “apple”. While in our example the similarity between the target noun and its distributional sibling is obvious, we might identify siblings that are less similar to the target noun. Therefore, we multiply the compatibility score of “apple” as the direct object of “to eat” with the similarity between “apple” and “banana”. This product then serves as the final compatibility score of the unseen pair:

comp(v_j, a_i, gf_k) ≈ comp(v_j, arg, gf_k) \cdot sim(a_i, arg, gf_k)    (6.10)

where arg = \arg\max_{arg \in args} sim(a_i, arg, gf_k)

That is, if the target noun and its sibling are very similar, the compatibility score of the sibling will not be lowered significantly, but it will decrease with increasing dissimilarity between the target noun and its sibling.
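
The back-off in equation 6.10 then amounts to picking the seen argument with the highest similarity to the unseen candidate and damping its compatibility by that similarity. A sketch follows, with the comp and sim functions passed in as parameters; the stand-in values mirror the banana/apple example above but are, of course, only illustrative.

    def backoff_comp(verb, candidate, gf, seen_args, comp_fn, sim_fn):
        """Approximate comp(verb, candidate, gf) for an unseen pair (eq. 6.10):
        use the most similar seen argument (the 'distributional sibling') and
        multiply its compatibility by the similarity to the candidate."""
        if not seen_args:
            return 0.0
        sibling = max(seen_args, key=lambda arg: sim_fn(candidate, arg, gf))
        return comp_fn(verb, sibling, gf) * sim_fn(candidate, sibling, gf)

    # Stand-in scoring functions with the values from the running example.
    comp_fn = lambda v, n, gf: {"apple": 0.55, "bread": 0.40}.get(n, 0.0)
    sim_fn = lambda a, b, gf: {("banana", "apple"): 0.45, ("banana", "bread"): 0.20}.get((a, b), 0.0)

    # 'banana' was never seen as direct object of 'to eat'; back off via 'apple'.
    print(round(backoff_comp("eat", "banana", "obja", ["apple", "bread"], comp_fn, sim_fn), 4))  # 0.2475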

6.2.2.2 Similarity to nbest arguments

As a second measure for the compatibility between an antecedent candidate noun and the verb governing the pronoun at hand, we measure the similarity of the candidate noun to the n most strongly associated arguments of the verb in the grammatical function slot of the pronoun. We set n to the 10 most strongly associated arguments if there are more than 100 seen arguments in the specific grammatical slot of the verb. If there are fewer than 100, we take the top 10% of the arguments (at least three).

In our example, we would measure the similarity of “banana” to the n most strongly associated direct objects of “to eat”, i.e. Fleisch (meat), Brot (bread), Obst (fruit), Gemüse (vegetables), Eis (ice), Kleinigkeit (snack), Pizza, Schokolade (chocolate), Mittag (lunch), Salat (salad). The average similarity then serves as the compatibility score, in this case 0.41.
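
Computationally, this is a straightforward average over the top-ranked arguments. The sketch below encodes the thresholds described above (top 10 if more than 100 seen arguments, otherwise the top 10% with a minimum of three); the argument-to-score mapping passed in is a hypothetical structure for illustration.

    def nbest_similarity(candidate, verb_args, gf, sim_fn):
        """Average similarity of the candidate to the n most strongly associated
        seen arguments of the verb; verb_args maps argument -> comp score."""
        ranked = sorted(verb_args, key=verb_args.get, reverse=True)
        n = 10 if len(ranked) > 100 else max(3, len(ranked) // 10)
        top = ranked[:n]
        return sum(sim_fn(candidate, arg, gf) for arg in top) / len(top) if top else 0.0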

6.2.2.3 Compatibility of verbs

Thirdly, we measure the similarity of the verb governing the antecedent candidate and the verb governing the pronoun w.r.t. their grammatical functions, i.e. sim(v_i, gf_k, v_j, gf_l). Since our similarity score is based on shared arguments, it can be interpreted as a measure of how likely it is to see an argument of the antecedent verb as an argument of the pronoun verb. Assume for example the following sentences, where we want to resolve the last pronoun, i.e. sie^4_1:

(12) Sie^1 schenkt ihm eine Banane^2. Er schält die Banane^3_1 und isst sie^4_1.

She presents him with a banana. He peels the banana1 and eats it1.

Our similarity measure between verbs is geared at estimating how likely it is for a noun that is peeled (Banane^3_1 is the direct object of “to peel”) to be eaten (sie^4_1 is the direct object of “to eat”) versus how likely it is for someone that presents someone with something (Sie^1 is the subject of “to present”) to be eaten etc. In this case sim(peel, obja, eat, obja) = 0.14. By contrast, subjects of “to present” are less likely to be eaten, i.e. sim(present, subj, eat, obja) = 0.04. Also, nouns that are presented to someone are less likely to be eaten than nouns that are peeled, i.e. sim(present, obja, eat, obja) = 0.10. Thus, the compatibility of the verb governing the antecedent candidate and the verb governing the pronoun can be an additional cue to identify the correct antecedent.

6.2.3 Compatibility with additional verb arguments

Finally, we estimate the compatibility of an antecedent candidate noun and the pronoun’s context by taking into account additional arguments of the verb governing the pronoun. We do so not only to address sparsity, but to render the pronoun’s context more specific, in order to address the problem that certain verbs do not feature a narrow selection w.r.t. the grammatical slot of the pronoun. Given intransitive verbs, there are no other content words (i.e. non-stop words) in the pronoun’s context that can be accessed in order to measure an antecedent candidate’s compatibility with the pronoun’s context.7 However, in the case of (di-)transitive verbs, there is always at least one additional verb argument in the context of the pronoun, namely the other complement or adjunct of the pronoun verb.

Here, we again use our similarity metric to determine the compatibility of the antecedent n_i in the pronoun’s grammatical function gf_k and the additional argument n_j of the verb governing the pronoun in its grammatical role gf_l, i.e. sim(n_i, gf_k, n_j, gf_l). That is, we take into account the number of verb nodes in the graph to which the antecedent noun n_i connects with grammatical relation gf_k and to which the additional verb argument n_j connects with the grammatical role gf_l.

As said, we deem this approach useful in those cases where the selection of the pronoun’s verb is broad, i.e. where it allows a diverse set of nouns as argument. For example, consider the verb machen (to make). Almost any noun can be the subject of machen. However, if there are direct objects involved, e.g. Kuchen (cake) and Lärm (noise), the selection of subjects is narrowed down. For example, Bäcker (baker) is an obvious subject for Kuchen machen (make cake), and Motor is a likely subject of Lärm machen (make noise). However, Motor is a very unlikely subject given Kuchen machen. In such cases, we would obtain low selectional preference scores regarding the verb, but high and distinctive compatibility scores between the subjects and the direct objects. In our example, we calculate the following compatibilities:

• Verb selectional preferences comp(noun, machen, subject):

– comp(Motor, machen, subject) = 0.1
– comp(Bäcker, machen, subject) = 0, since we have not seen the combination. We identify the seen subject argument of machen most similar to Bäcker, which is Hausfrau with similarity 0.58.

7Another category of content words would be adjectives, but adjectives are never used to modify anaphoric pronouns.

We take comp(Hausfrau, machen, subject) = 0.48 and multiply it with the similarity between the two nouns to get the approximated compatibility score:

comp(Bäcker, machen, subject) ≈ comp(Hausfrau, machen, subject) ∗ sim(Bäcker, Hausfrau, subject) = 0.58 ∗ 0.48 = 0.29

• Compatibility of the nouns with the direct object, sim(noun, gf_k, arg, gf_l):

– sim(Bäcker, subject, Kuchen, object) = 0.36
– sim(Motor, subject, Kuchen, object) = 0.06
– sim(Bäcker, subject, Lärm, object) = 0.21
– sim(Motor, subject, Lärm, object) = 0.11

To calculate the score of Bäcker as the subject of Kuchen machen, we thus take the average (∅) of the verb selectional preference and the similarity to the additional argument, which is ∅(0.29, 0.36) = 0.32. To calculate the score of Motor as subject of Kuchen machen, we obtain ∅(0.1, 0.06) = 0.08. Thus, Bäcker is a much more likely subject in this context. For the scores of Bäcker and Motor as subjects of Lärm machen, we get ∅(0.29, 0.21) = 0.25 and ∅(0.1, 0.11) = 0.105. That is, Bäcker is deemed the more likely subject in this case as well. This does not correspond to our initial intuition, but, on the other hand, it is not an implausible outcome. However, Motor as the subject of Kuchen machen is highly counter-intuitive, and our compatibility scores reflect this.

6.2.4 Ranking antecedent candidates based on the compatibility

In summary, when we compare an antecedent candidate noun to a pronoun’s context, we construct the following features based on the graph representation:

• Compatibility of the verb governing the pronoun and the antecedent candidate noun w.r.t. the grammatical function of the pronoun.

• Three features, including the similarity of the verb governing the antecedent candidate and the pronoun verb, that address the sparsity problem of the above feature.

• Similarity between the antecedent candidate noun and the additional verb arguments in cases where the pronoun is governed by a (di-)transitive verb.

To score the antecedent candidates, we simply take the arithmetic mean of the features and select the candidate with the highest mean as antecedent.
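
Candidate ranking thus reduces to averaging whatever subset of the above features is applicable to a given pronoun instance and sorting the candidates by that mean. A minimal Python sketch, with the feature functions passed in as callables returning scores in [0, 1] and the usage values taken from the Bäcker/Motor example above:

    def rank_candidates(candidates, feature_fns):
        """Sort antecedent candidates by the arithmetic mean of the graph-based
        feature scores, best candidate first."""
        def mean_score(candidate):
            scores = [fn(candidate) for fn in feature_fns]
            return sum(scores) / len(scores) if scores else 0.0
        return sorted(candidates, key=mean_score, reverse=True)

    # Hypothetical usage: verb preference and co-argument similarity for two candidates.
    features = [lambda c: {"Bäcker": 0.29, "Motor": 0.10}[c],
                lambda c: {"Bäcker": 0.36, "Motor": 0.06}[c]]
    print(rank_candidates(["Motor", "Bäcker"], features))  # ['Bäcker', 'Motor']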

Before we evaluate this approach empirically, we introduce an alternative way of calculating the compatibility of an antecedent noun and a pronoun’s context in the realm of vector space models.

6.3 Word embeddings as a compatibility framework

In the previous sections, we have outlined an approach to determine an antecedent’s compatibility with a pronoun’s context that relies directly on the co-occurrence counts of the respective nouns and verbs. Here, we will explore an approach that transforms word co-occurrences into a vector representation in the domain of vector space models. Cosine similarity between the word vectors that represent the antecedent candidate noun and the words in the pronoun’s context then serves as a means to determine the compatibility between the antecedent and the pronoun context.

6.3.1 word2vec as a framework to derive word embeddings

While there are numerous approaches to construct vector representations of words based on first-order co-occurrences, we make use of word2vec8 (Mikolov et al., 2013c,b,a), a state-of-the-art tool for this task. word2vec learns vector representations of words in a vector space model, much like Latent Semantic Analysis (LSA) (Landauer et al., 1998) and other related approaches. In related vector space models, word vector representations are obtained by constructing vectors with n dimensions, where n denotes, for example, the n most frequent nouns in a corpus. The co-occurrence count of each word with these n most frequent words then comprises the vector representation of any given word. Cosine distance between these vector representations can then be used to estimate the similarity of words. In LSA, singular value decomposition factorizes the co-occurrence matrix (constructed using the words in the vocabulary as rows and the n most frequent words as columns) to retrieve compressed representations with fewer dimensions. Three smaller matrices are constructed, where one represents a clustering of the rows, and one denotes a clustering of the columns. The third matrix describes how these compressed matrices can be combined to retrieve an approximation of the original matrix.

8To learn the vectors, we use word2vec itself (https://code.google.com/p/word2vec/). To integrate the vectors into our Python code, we use gensim (https://radimrehurek.com/gensim/models/word2vec.html).

The main difference of word2vec (and related neural word embeddings) to traditional models, such as LSA, is that the meaning of the vector dimensions is latent to begin with. Unlike in LSA, the dimensions cannot be thought of as clusterings of the rows and columns of a co-occurrence matrix. The dimensions and their values are mathematical artifacts of the optimization process during training. Instead of taking the n most frequent words, word2vec takes n latent dimensions and learns values for them in a deep learning/neural network-inspired fashion. However, compared to related work, the approach removes the need for hidden layers, which makes it fast to train. The basic idea is to learn vector representations of words based on positive and negative examples of context words. Co-occurrence contexts of words are mined from large corpora. Positive examples of context words for a given word w are drawn from a window whose size is defined as one of the model parameters. Negative examples are drawn from outside the co-occurrence window of the word w. Given the positive and negative evidence, the objective is then to learn a set of parameters so that the vector \vec{w} denoting the word w has a high similarity with the vectors \vec{c} of the context words within the window, and a low similarity with context words outside the window.9 The word embeddings produced by word2vec out of the box have been shown to outperform several traditional word vector models on a variety of tasks (Baroni et al., 2014, inter alia).

6.3.2 Application of word embeddings to pronoun resolution

For our purpose of assessing the compatibility of a given antecedent candidate a_i and a pronoun’s context, we use the vector representation \vec{a_i} of the antecedent head word and the vectors of the (relevant) words in the pronoun’s context, and calculate the average cosine-based similarity cos_sim(·, ·) of the antecedent word vector and the word vectors in the pronoun’s context, i.e. the vector \vec{v_j} of the verb governing the pronoun and the vector \vec{arg_k} of its additional argument:

comp(a_i, v_j, arg_k) = ∅( cos_sim(\vec{a_i}, \vec{v_j}), cos_sim(\vec{a_i}, \vec{arg_k}) )
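
In terms of implementation, this amounts to averaging the cosine similarities that gensim exposes for a loaded vector model. A sketch, assuming a saved KeyedVectors file (the path is hypothetical) and falling back to zero for out-of-vocabulary words:

    from gensim.models import KeyedVectors

    kv = KeyedVectors.load("sdewac_vectors.kv")  # hypothetical path to trained vectors

    def cosine_compat(antecedent, pron_verb, co_arg=None):
        """Average cosine similarity of the antecedent head word with the verb
        governing the pronoun and, if present, the verb's additional argument."""
        sims = []
        for word in (pron_verb, co_arg):
            if word is not None and antecedent in kv and word in kv:
                sims.append(kv.similarity(antecedent, word))
        return sum(sims) / len(sims) if sims else 0.0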

However, as stated above, our interest is not to model general word similarity, the typical test task for word embeddings. Our aim is to determine the compatibility of a given antecedent candidate with a given pronoun context. Therefore, we aim to specifically model the compatibility of an antecedent candidate’s head noun and the position of the pronoun in its context. To do so, we must take into account the grammatical role of the pronoun in its context.

9This describes the workings of the skipgram approach using negative sampling on a very basic conceptual level. Discussing the learning algorithm in detail is beyond the scope here. We refer to the original papers cited above and recommend Goldberg and Levy (2014) for an approachable mathematical exposition. Furthermore, Levy and Goldberg (2014b) showed that word2vec seems to factorize a PMI matrix internally. Also, Levy et al. (2015) have shown that given the right parameter settings, traditional approaches to word vector estimation, such as PMI and singular value decomposition of the PMI matrices, perform on par with word2vec. However, word2vec achieves state-of-the-art results right out of the box and does not need extensive parameter tuning.

For example, consider a compatibility model that simply takes into account the selectional preferences of the verb governing a pronoun to determine compatibility with the given antecedent candidates. Let us assume two test instances:

(13) Er bereitet den Braten zu.
He prepares the roast.

(14) Der Koch bereitet ihn zu.
The cook prepares him* (it).

We want to resolve each pronoun in turn, and we have the antecedent candidate Koch for both examples. In the first example (13), Koch is a very likely antecedent for Er in the subject argument slot of the verb zubereiten, because cooks normally prepare food. Thus, we can assume that the vector-based similarity of Koch and zubereiten would be high and we would select Koch as antecedent for Er.

For the second example (14), however, where we want to resolve ihn, which is the direct object of the verb zubereiten, we would perform exactly the same query as in the first example given our word vector representation. That is, we would look up the similarity of the Koch vector and the zubereiten vector, which would be the same as in the first example. We would thus assume that we can select Koch as the antecedent of ihn. However, Koch is a very unlikely direct object of zubereiten.

That is, we cannot straight-forwardly apply the word2vec approach to assess the compatibility of antecedent candidates with a pronoun’s context, since the model lacks any notion of grammatical functions.10 To alleviate this, we perform a simple transformation of the input that is fed into word2vec. We concatenate the words (i.e. their lemmas) in a sentence with their grammatical functions, as in the following example:

(15) Original sentence: Der Koch bereitet den Braten zu.
Input sentence: Der_det Koch_subj zubereiten_root Braten_obja

10Note that there are approaches that incorporate syntax into word2vec, notably Levy and Goldberg (2014a), or into vector representations in general, e.g. Rothenhäusler and Schütze (2009). However, these approaches make use of syntax to identify relevant context words, i.e. syntactic co-arguments for target words, but do not derive vectors for individual grammatical functions that a word occurs with.

Now, we will learn separate vector representations for the subject (e.g. “Koch_subj”) and direct object (“Koch_obja”) instantiations of words, which will enable us to more specifically determine the compatibility of antecedent candidates and pronoun contexts.
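
The preprocessing for this is a one-line transformation per sentence once the parse is available. A sketch, assuming (lemma, grammatical function) pairs from the parser and using an underscore to join lemma and function (the exact token format shown here is our own rendering, cf. example (15)):

    def to_training_tokens(parsed_sentence):
        """Turn one parsed sentence into word2vec training tokens by concatenating
        each lemma with its grammatical function."""
        return [f"{lemma}_{gf}" for lemma, gf in parsed_sentence]

    # Hypothetical (lemma, grammatical function) pairs for example (15):
    parse = [("Der", "det"), ("Koch", "subj"), ("zubereiten", "root"), ("Braten", "obja")]
    print(" ".join(to_training_tokens(parse)))  # Der_det Koch_subj zubereiten_root Braten_obja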

In our examples above (13, 14), a vector representation that is ignorant of grammatical functions would yield maximal similarity for the antecedent candidate “Koch”, since the antecedent word itself occurs in the pronoun context as an additional verb argument. Clearly, this high similarity would boost the “Koch” candidate as antecedent. Selecting it would, however, produce a nonsensical sentence, i.e. “Der Koch bereitet den Koch zu.” Including the grammatical roles in the vector representations lowers the similarity from 1 (for “Koch” and “Koch”) to 0.55 (for “Koch_obja” and “Koch_subj”).

In comparison to the graph-based approach, the word2vec model has the advantage that it is able to calculate similarity between any two words in its vocabulary, since it does not rely on first and second-order co-occurrences directly. Thus, we expect the word2vec model to have a broader coverage and applicability. On the other hand, the graph-based model more directly implements the notion of compatibility of verbs and their arguments based on co-occurrence, because it explicitly models the co-occurrences in designated syntactic contexts. Therefore, we expect the graph-based approach to provide high precision.

6.4 Data and preprocessing

To gather verb-argument tuples, we make use of the SdeWaC corpus.11 The corpus is a cleaned version of the deWaC corpus, which is part of the web-as-corpus initiative12 (Baroni et al., 2009). The (S)deWaC corpora contain sentences crawled from various German websites. In the SdeWaC corpus, duplicate sentences from the same domain (URL) have been removed, and only sentences were selected that can be parsed with an automatic parser. The SdeWaC contains roughly 45 million sentences. We apply the ParZu parser13 (Sennrich et al., 2013) to obtain dependency parses of the sentences.

6.4.1 Construction of the co-occurrence graph

Similar to Scheible et al.(2013), we apply a set of heuristics to extract verb-argument tuples from the parses. For example, we use a rule that recognizes passive voice, and heuristics that identify the main verb in auxiliary constructions etc.

11http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/sdewac.en.html
12http://wacky.sslmit.unibo.it/
13https://github.com/rsennrich/parzu

While the SdeWaC has undergone cleaning, it still contains noise. Thus, we only extract tuples from verbs that occur at least ten times with an identifiable subject. Also, like Wunsch (2010), we exclude the highly frequent verbs sein and haben. Since their selection of arguments is very broad, we would require a more sophisticated model to represent them adequately. Furthermore, we only extract arguments that are nouns, i.e. that have the PoS tag NN, because named entities are often sparse. Additionally, we discard any arguments that we have seen fewer than three times.

Given these tuples, we construct the graph.14 We iterate over all verb lemmas and their arguments and add nodes and edges for the grammatical relations subject, direct object, and indirect object. Nodes signify words, i.e. nouns and verbs, and edges between them denote grammatical relations. In addition to the grammatical relation, we add the first- order co-occurrence counts to the edges, which are used to calculate our compatibility and similarity scores. Since there can be multiple edges between a noun and a verb node, denoting different grammatical relations, the graph constitutes an undirected multigraph (without loops, as no node is connected to itself).
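
The construction itself is straightforward with networkx (cf. footnote 14). The following sketch shows the core loop over counted tuples; the tuple list in the usage line is a toy stand-in for the extracted SdeWaC tuples.

    import networkx as nx
    from collections import Counter

    def build_graph(tuples):
        """Build the co-occurrence multigraph from (verb, gf, noun) tuples:
        nodes are lemmas, and each grammatical relation between a noun and a
        verb becomes one edge carrying the absolute count."""
        graph = nx.MultiGraph()
        for (verb, gf, noun), count in Counter(tuples).items():
            # the relation serves as the edge key, so a noun-verb pair can be
            # connected by several edges (subj, obja, objd)
            graph.add_edge(noun, verb, key=gf, count=count)
        return graph

    g = build_graph([("bellen", "subj", "Hund")] * 261 + [("kommen", "subj", "Hund")] * 440)
    print(g["Hund"]["bellen"]["subj"]["count"])  # 261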

After its instantiation, the graph contains a total of 71’460 nodes, of which 8299 are verbs and 63’161 are nouns. It features 2’222’500 edges, of which 1’068’795 denote subject relations, 1’022’623 direct object relations, and 131’082 indirect object relations. Given the initial 45 million sentences, this seems rather small, but recall that we use frequency thresholds for verbs and arguments to be integrated into the graph.

Since we cannot expect to map all encountered nouns in the test set into our resources, we apply compound splitting in order to check whether their heads are represented in the resources.15 Given an unseen noun, we use this head for the compatibility estimation if it is represented in our resources.

6.4.2 word2vec model

For the word2vec model, we use the same data basis, i.e. the 45 million sentences from the SdeWaC parsed with ParZu. As shown in example 15, we transform the initial raw texts by lemmatizing the words and concatenating them with their grammatical function. We run word2vec with vanilla settings and train a skipgram model using negative sampling over three iterations. We set the minimal word count to 50, i.e. words encountered less than 50 times are omitted. This yields a total of 378'511 vectors. 10'784 of these denote verbs, 48'548 signify subjects, 28'608 direct objects, and 6744 indirect objects.
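The following sketch shows a comparable training setup, assuming the gensim implementation of word2vec (version 4.x parameter names); the input file name and its preprocessing into role-suffixed lemmas are assumptions for illustration.

    # Sketch: training a skipgram model with negative sampling on lemmas that
    # are concatenated with their grammatical function (e.g. "Kochsubj").
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # hypothetical file: one preprocessed sentence of role-suffixed lemmas per line
    corpus = LineSentence("sdewac_role_lemmas.txt")

    model = Word2Vec(
        corpus,
        sg=1,          # skipgram
        negative=5,    # negative sampling
        epochs=3,      # three iterations over the corpus
        min_count=50,  # drop words seen fewer than 50 times
    )
    model.wv.save("role_specific_w2v.kv")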

14 We use networkx (https://networkx.github.io/) for implementing the graph in Python.
15 Our procedure for compound splitting and its evaluation is outlined in appendix A.3.

6.5 Evaluation of the models on a word similarity task

In this section, we evaluate our graph-based model of syntactic compatibility and our word2vec model of similarity on a word association test set in order to get an estimate of the quality of our models.

For this experiment, we use the German Relatedness Dataset16 (Gurevych, 2005, Zesch and Gurevych, 2006). The data set provides three different test sets. The first one (Gur65) consists of 65 word pairs. Each pair was assigned a similarity rank between 0 and 4 by human subjects, where 0 denotes fully dissimilar and 4 fully similar. The second (Gur350) and third set (ZG222) contain 350 and 222 such pairs, respectively. The latter two sets are aimed at semantic relatedness, rather than (direct) similarity. The pairs do not only consist of nouns, but also of words from other parts of speech. We here limit our investigation to pairs of nouns.17 To assess the performance of a system, the Pearson correlation of the system scores and the human judgements is measured.

Our graph-based model of compatibility is aimed at rewarding pairs of nouns with a high score if they display high second-order co-occurrence with verbs given a specific argument slot, i.e. a specific grammatical relation. Test sets of noun associativity and similarity, by contrast, provide gold standards of noun similarity where the semantics of the similarity is generally underspecified, i.e. they are not geared towards a specific semantic relation like hyponymy, meronymy, or synonymy, but subsume different such relations. Our graph model is therefore not necessarily suited to capture these relations and is not aimed at outperforming the state-of-the-art for this task. Still, it is interesting to see how well the model fares, especially compared to the word2vec model.

Since our models are specific to grammatical relations, we test them using the subject and direct object relation. That is, we calculate how similar a pair of nouns is regarding the subject and the direct object role in our models, respectively.
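A sketch of this evaluation protocol is given below, assuming a generic similarity function that returns None for nouns not covered by a model; the function and variable names are illustrative.

    # Sketch: correlate model similarity scores with human judgements,
    # restricted to the noun pairs that the model covers.
    from scipy.stats import pearsonr

    def evaluate(pairs, similarity, relation="subj"):
        """pairs: list of (noun1, noun2, human_score) tuples."""
        system, human = [], []
        for noun1, noun2, gold in pairs:
            score = similarity(noun1, noun2, relation)  # None if a noun is uncovered
            if score is not None:
                system.append(score)
                human.append(gold)
        correlation, _ = pearsonr(system, human)
        return correlation, len(system)  # correlation and applicable pair count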

Table 6.1 gives the results. The top part of the table shows the Pearson correlations of the similarity estimations of our models with human judgements given the grammatical role subject; the lower part indicates correlations given the grammatical relation direct object. The first column indicates the data set, along with the inter-annotator correlation.18 We also indicate the correlation of our models (Graph-W2V) to see how similarly they judge the pairs. The right part of the table gives the counts of the pairs where both our models are applicable (Appl.), i.e. for which both models have representations for the nouns, the count of noun pairs (NN), and the total count of pairs (Total).

16 https://www.ukp.tu-darmstadt.de/data/semantic-relatedness/german-relatedness-datasets/
17 We identify noun pairs by checking whether both words in a test instance start with an uppercase character.
18 This can be read as the average pair-wise correlation between the human judgements, although the actual calculation is more complicated, cf. Gurevych(2005).

Data set        Graph   W2V    Graph-W2V   Appl.   NN    Total
Subject relation
Gur65 (0.81)    0.38    0.76   0.64         53      63     65
Gur350 (0.69)   0.31    0.75   0.48        108     168    350
ZG222 (0.49)    0.23    0.54   0.22         68     118    222
Direct object relation
Gur65 (0.81)    0.56    0.77   0.69         49      63     65
Gur350 (0.69)   0.46    0.74   0.63        100     168    350
ZG222 (0.49)    0.24    0.53   0.29         65     118    222

Table 6.1: Pearson correlation of similarity/relatedness estimations by our models and human judgements.

We see that for both grammatical relations, the word2vec model achieves a higher correlation with the human judgements than our graph model. Still, the graph model yields a positive, moderate correlation with the human judgements. Also, we have to take into account that the inter-annotator correlation for two of the three data sets is rather low, especially for ZG222.

Interestingly, the correlation of the graph-based model increases drastically when the direct object relation is used to determine noun similarity, i.e. from 0.38 to 0.56 for the Gur65 test set. This relates to e.g. Wunsch(2010), who excluded subject-verb relations from his selectional preference model, since he found that verbs hardly feature designated preferences towards their subjects.19 That is, the results suggest that the direct object relation is a more precise relation than the subject relation when it comes to determining semantic similarity of nouns based on their distributions as arguments of verbs.

The word2vec model is not affected by the choice of the grammatical relation, however. This is not too surprising, since the model learns the vector representations based on all context words within a given window, while the graph model only relies on the specific co-occurrences of nouns and verbs given a specific grammatical relation. Since the grammatical relations yield different verb-argument pairs, the similarity judgements differ as well.

Concerning applicability (Appl.), we see that coverage for similarity judgements relying on the subject relation is slightly higher than for the direct object relation. Note that the applicable count only counts pairs where both models have representations of both nouns in the pair. Thus, the applicability of the individual models is potentially higher.20 Given the subject relation, our models cover 84% of the pairs in the Gur65 set, 64% of the pairs in the Gur350 set, and 58% of the pairs in the ZG222 set.

19 Cf. section 6.6.
20 Note further that we have not applied compound splitting in this experiment.

Overall, we conclude from this evaluation that both our models are capable of producing similarity judgements that correlate with human judgements. We also see that the models’ judgements correlate with each other (Graph-W2V), but not perfectly so. Thus, there is potential for complementary use of the models. We next investigate how well the models fare w.r.t. the task they were designed for, i.e. identifying antecedents of pronouns. Before doing so, we discuss related work on selectional preferences of verbs w.r.t. pronoun resolution.

6.6 Related work

In this section, we give an overview of related work on pronoun resolution that incorporated verb selectional preferences and distributional semantics. Also, we briefly discuss related work that employs graph representations for word co-occurrence modeling.

6.6.1 Selectional preferences for pronoun resolution

Using selectional preferences of verbs in pronoun resolution dates back to the pioneering era of automated pronoun resolution, and there is a long-standing debate among researchers whether incorporating them is beneficial.

Dagan and Itai(1990) showed that simple co-occurrence frequency can potentially aid pronoun resolution systems. Based on the work of Dagan and Itai, Lappin and Leass (1994) and Dagan et al.(1995) reported an experiment that incorporated a statistical model of selectional preferences of verbs. The selectional preference of a verb v towards an argument noun n was modelled as the conditional probability of seeing the verb given the noun, i.e. P(v|n) = |n, v| / |n|. The experiment used the salience-based ranking of the antecedent candidates as calculated by the RAP algorithm and re-ranked the top two candidates. If the selectional preference of the second best candidate exceeded a threshold, it was promoted to first rank and selected as antecedent. This improved the accuracy of the salience-based approach from 86% to 89% in evaluation. In an evaluation setting where a pronoun was deemed resolved correctly if either the salience-based or the verb selection-based resolution was correct, accuracy improved to 93%. Thus, the approach empirically demonstrated the benefit of incorporating selectional preferences of verbs. However, the test set used in the experiments stemmed from a restricted domain, i.e. computer manuals. In such a domain, there is a specific nomenclature, and verb selectional preferences are arguably more easily captured and applied to pronoun resolution than in more open domains, such as newspaper texts. An example given by Lappin and Leass concerned selecting either message or display as the direct object of the verb to send. It is not clear whether this approach would have benefited the salience-based approach in a more open domain, like newspaper texts. However, Lappin and Leass introduced an experimental protocol which we follow in our own experiments.

Kehler et al.(2004) implemented the approach presented in Dagan et al.(1995) in a pronoun resolution system for newspaper texts. Their findings were less optimistic regarding the utility of incorporating selectional preferences into a state-of-the-art pronoun resolution system. They used a MaxEnt classifier trained with a standard feature set and applied verb selectional preferences either in a postfilter setting, where the top two ranking candidates were re-evaluated, like Dagan et al.(1995), or incorporated the selectional preferences as features for the MaxEnt classifier. In evaluation, Kehler et al. found that neither the postfilter nor the additional features had a substantial impact on performance. In the best configuration, verb selectional preferences improved performance from 76.16% to 76.63% accuracy. On a second test set, the selectional preferences improved performance from 75.72% to 76.77% accuracy when used as a postfilter. Kehler et al. thus concluded that selectional preferences have little to offer given a state-of-the-art pronoun resolution system. They argued that while selectional preferences obviously seem applicable to crisp textbook examples, such cases are rarely found in real-world texts. Also, they acknowledged the difficulty of determining when to “trust” the selectional preferences if used as a postfilter. In summary, Kehler et al. raised doubts about the utility of incorporating selectional preferences into real pronoun resolution systems.

A more optimistic view was presented in Bergsma et al.(2008a). They trained a binary support vector machine classifier using seen verb-argument pairs as positive examples and unseen combinations as negative examples. A separate classifier was trained for each verb that decided whether a given noun was a permissible argument for the verb at hand. Verbs with sparse occurrence were clustered into smaller groups which then received a cluster-level classifier. For each verb classifier, Bergsma et al. constructed a large feature vector that described properties of their arguments, i.e. upper- and lowercasing, digits, person names, semantic class based on a preprocessed clustering etc. Using this large feature set and the verb-specific binary classifiers, Bergsma et al. outperformed previous work in a pseudo-disambiguation task, where a system has to choose one of two possible nouns as the correct argument of a given verb, i.e. scoring article and fashion as direct objects of to read. They implemented a baseline using only Mutual Information (MI) which also outperformed related work and which was not outperformed by the SVM-based approach by a large margin. They also performed an evaluation w.r.t. pronoun resolution. Using the MUC corpus, they identified 39 pronouns in direct object position that had an antecedent in the preceding or same sentence as the pronoun. Using their classifiers to identify the correct antecedent, they achieved an accuracy of 38.5%, which was higher than the baseline that selected the most recent noun as antecedent (17.9%). However, this baseline often failed because the faulty antecedent was simply the subject of the verb governing the pronoun in direct object position. Usually, such candidates are not considered as antecedents because they violate binding constraints. Bergsma et al. left it to future work to determine a means to incorporate their approach into a pronoun resolution system. Also, they limited their investigation to pronouns in direct object position.

Wunsch(2010) explored the integration of selectional preferences of verbs into his hybrid pronoun resolution system for German using the kNN classifier TiMBL.21 Verb selectional preferences were gathered from the TüPP-D/Z corpus, a larger variant of the TüBa-D/Z corpus that contains roughly 12 million sentences from newspaper texts. The TüPP-D/Z was parsed with a dependency parser and verb-argument tuples were extracted using a set of heuristics that identify passive voice etc. Wunsch argued that the extracted subject-verb tuples did not seem to provide strong selectional preferences and thus focused on verb-object tuples. Wunsch performed class abstraction of the extracted direct objects by mapping them to the 22 unique beginners (i.e. the top synsets) of GermaNet, a German word net. For example, from the sentence “Ich fragte die Stewardess (I asked the stewardess)”, the tuple (fragen, Mensch) was extracted, since “Stewardess” is a hyponym of the unique beginner Mensch. Wunsch then added binary features capturing the compatibility of an antecedent candidate and the verb governing the pronoun to the feature vector used by TiMBL. For example, if the pronoun at hand was governed by fragen, as in “Ich fragte sie (I asked her)”, and the antecedent candidate was Stewardess, the binary feature would be 1, since fragen selects nouns that are hyponyms of the unique beginner Mensch and Stewardess is such a noun. In evaluation, Wunsch found that the inclusion of verb selectional preferences had no impact on resolution performance. However, he did not evaluate the performance for personal pronouns in direct object position separately, at which the preferences were aimed, but measured the overall effect on pronoun resolution.

In our view, the class abstraction of the seen direct objects to the unique beginners in GermaNet is problematic. Class abstraction obscures fine-grained distinctions of preferences for several verbs. Thus, the approach loses its discriminatory power if two antecedent candidates belong to the same GermaNet unique beginner. Consider the Stewardess example. For a verb like fragen, the class abstraction seems unproblematic. Let us add another noun, e.g. Räuber (robber), which is also part of the Mensch synset. It is obvious that the two will have very dissimilar distributions w.r.t. the verbs that select them as direct objects. Given the verb verhaften (arrest), we would like to model that robbers are more likely to be arrested than stewardesses etc. However, mapping arguments to broad semantic classes like word net unique beginners will obfuscate such differences.

21 Cf. section 4.4.

Versley(2010) presented an approach to noun coreference resolution for German that made use of selectional preferences of verbs. He calculated PMI statistics of subject-verb and verb-direct object co-occurrences and used them as a feature to determine coreferent bridging between non-string matching nouns (e.g. the enterprise - the company).22 His basic idea was that swapping the verbs that govern the antecedent and the anaphor should not yield large differences in the PMI values. That is, the PMI value for the antecedent and the verb governing the anaphor should be similar to the PMI value of the anaphor and the verb governing it. In turn, the PMI value for the anaphor in the antecedent’s verb slot should be roughly the same as that of the antecedent in that slot. This provided an intuitive model of context compatibility and Versley found that it slightly improves Recall from 69.7% to 70.00% in noun coreference resolution. Unfortunately, this approach is not applicable to pronoun resolution, since we cannot meaningfully calculate association strengths of pronouns, which are semantically empty, and verbs. However, we have introduced in the previous sections a compatibility measure that scores pairs of verbs w.r.t. the arguments they share, which is closely related to the approach of Versley.

Our approach using word2vec in pronoun resolution is most similar to Klebanov and Peter(2002) who used Latent Semantic Analysis (LSA) to derive word vectors. Klebanov and Peter also compared an antecedent candidate noun to a pronoun’s context, i.e. the verb that governs it, and the additional arguments of that verb. In contrast to Klebanov and Peter, we learn vector representations for nouns given grammatical roles, which allows us to distinguish the use of a noun w.r.t. a specific grammatical role, which we argue is important for the task of measuring compatibility of the antecedent and the pronoun’s context, as outlined in section 6.3.

6.6.2 Graphs in Computational Linguistics

Graph representations have been used in a variety of tasks in Computational Linguistics, ranging from sentence compression for summarization (Filippova, 2010, Olariu, 2014) to bilingual lexicon extraction (Dorow et al., 2009, Laws et al., 2010) (cf. Lahiri(2014) for an overview), because they provide an intuitive formalism to represent co-occurrences of words. To our knowledge, we are the first to explore graphs w.r.t. selectional preferences of verbs and the distributional compatibility of their arguments.

22 Cf. section 4.8.

Our approach is most closely related to Dorow(2006) who explored semantic relatedness of nouns in a graph. Dorow used Hearst-style patterns (Hearst, 1992), i.e. conjunctions and listings etc., to derive first-order co-occurrence statistics of nouns. By contrast, we use second-order syntactic co-occurrence of nouns to determine their distributional similarity as verb arguments.

6.7 The distributional compatibility models as postfilters for nbest candidate re-ranking

In this section, we evaluate the utility of our two models of compatibility of verbs and arguments for the task of pronoun resolution. We follow the protocol introduced in Lappin and Leass(1994) and apply our models as postfilters that re-rank the top two antecedent candidates identified by the salience-based entity-mention model. We saw earlier23 that in the cases where the entity-mention model fails to rank the correct candidate highest, it ranks the correct candidate as the second best in roughly 80% of the cases for personal pronouns. Thus, focusing the re-ranking on the top two candidates has the potential of substantially reducing errors while limiting the error margin as compared to considering e.g. three or all candidates.

As in our evaluation of classifier performance24, we first limit the analysis to pronoun instances where the correct antecedent is among the candidates. Furthermore, we restrict evaluation to those cases where the postfilters produce compatibility scores for both candidates and where these scores differ. Otherwise, the postfilters could select the correct candidate simply because the other candidate was not mapped to our resources. Also, it forces the models to assign different scores, i.e. cases where the top two candidates are scored equally are not rewarded.
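The re-ranking protocol can be summarized by the following sketch; the function name and the score convention (None for candidates not covered by a model) are illustrative assumptions.

    # Sketch: postfilter re-ranking of the top two salience-ranked candidates
    # (following the protocol of Lappin and Leass, 1994). The compatibility model
    # only overrules the salience ranking if it scores both candidates and the
    # scores differ.
    def rerank_top_two(candidates, compatibility):
        """candidates: salience-ranked antecedent candidates, best first;
        compatibility: function mapping a candidate to a score or None."""
        first, second = candidates[0], candidates[1]
        s1, s2 = compatibility(first), compatibility(second)
        if s1 is None or s2 is None or s1 == s2:
            return first  # postfilter not applicable: keep the salience choice
        return first if s1 > s2 else second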

Since, as we will see, the incremental entity-mention model achieves a higher accuracy than the postfilters, we always choose the candidate it ranks best as antecedent and do not resolve the pronouns to the antecedents proposed by the postfilters. This is relevant for those cases where a resolved pronoun serves as an antecedent candidate for a subsequent pronoun. In such a case, we try to retrieve a nominal mention of the entity that the pronominal antecedent denotes. That is, if the pronominal antecedent is incorrectly resolved, we cannot expect to retrieve an adequate nominal mention.

23 Cf. section 5.5.3.
24 Cf. section 5.4.2.

An interesting question is whether the two models correct the same set of mistakes made by the salience-based resolution or if their corrections are complementary. To estimate their overlap, we measure the potential improvement by deeming pronouns correctly resolved if either one of the three resolution approaches is correct.

Model   Postfilter   Salience   Either   Coverage
Development set
Graph   67.61        88.66      93.78    42.96
W2V     60.52        88.89      93.65    59.43
Any     74.32        88.71      94.80    61.99
Test set
Graph   63.58        88.42      93.33    45.97
W2V     62.05        88.37      93.37    59.74
Any     75.41        88.18      94.33    62.38

Table 6.2: Evaluation of the graph and word2vec models as postfilters.

Table 6.2 shows the results of the re-ranking. Selecting an antecedent randomly achieves a 50% accuracy in this experiment, since we only re-rank the top two candidates. We see that the postfilters based on our models both clearly outperform the 50% margin (Postfilter column). Also, postfilter accuracy increases substantially when both models are considered for selecting the correct antecedent (Any row, Postfilter column). When we combine the salience-based classification and the postfilters and count the cases where either the postfilter or the salience-based antecedent choice is correct, we achieve an accuracy of almost 94% (Either column). Furthermore, when we consider all proposed antecedents by the resolution strategies and always select the correct one (if among them), we gain an additional percentage point of performance (Any row, Either column). These results suggest that our models have the potential of increasing performance of the salience-based antecedent selection by roughly 5 percentage points (Any row, moving from Salience to Either column), which corresponds to an error reduction of 55%.

Obviously, this only applies to the cases where our models are actually applicable. The coverage column shows that the word2vec model has better coverage than the graph-based approach. It applies to almost 60% of the pronouns in both data sets, while the graph model covers roughly 45% of the cases. However, the graph model achieves a better accuracy, as indicated in the postfilter column.

The evaluation above only measures the potential of the models w.r.t. the cases where they are applicable. To quantify their potential impact when used in a more realistic setting, we measure the upper bounds in performance given the functional (using the ARCS inferred metric)25 and pair-wise evaluation of all personal pronoun instances in the respective data sets. Table 6.3 gives the results. E-M denotes the entity-mention model and the salience-based antecedent selection, +Graph and +W2V the respective postfilters, and +Any either of them. Up. Bd. indicates the upper bounds, i.e. performance when the gold antecedent is selected whenever present.

25 Cf. section 3.2.

Functional evaluation
Model     Rec.    Prec.   F1
Development set
E-M       73.35   67.21   70.15
+Graph    75.84   69.49   72.53
+W2V      76.92   70.54   73.59
+Any      77.79   71.34   74.42
Up. Bd.   86.72   79.82   83.13
Test set
E-M       75.93   70.97   73.39
+Graph    78.50   73.32   75.82
+W2V      79.22   74.05   76.55
+Any      80.48   75.23   77.77
Up. Bd.   89.38   83.59   86.39

Pair-wise evaluation
Model     Rec.    Prec.   F1
Development set
E-M       83.35   78.04   80.61
+Graph    85.44   80.00   82.63
+W2V      86.04   80.56   83.21
+Any      86.93   81.39   84.07
Up. Bd.   92.08   86.21   89.05
Test set
E-M       83.13   77.07   79.98
+Graph    85.67   81.32   83.44
+W2V      86.59   82.20   84.34
+Any      87.38   82.95   85.10
Up. Bd.   92.81   88.10   90.40

Table 6.3: Upper bounds in functional and pair-wise evaluation for incorporating the compatibility models into personal pronoun resolution.

The table shows that the models, in their combination, have the potential to improve performance by 3 to 4 percentage points in all settings. This improvement reduces the distance to the upper bounds (Up. Bd.) significantly. In the pair-wise evaluation (right table), this distance is roughly halved w.r.t. F-score. On the test set, the entity-mention model with the salience-based antecedent selection achieves an F-score of 79.98. The verb models show the potential to raise performance to 85.10 F-score, while the upper bound is at 90.40 F-score. In the functional evaluation (left table), the performance difference of the salience-based resolution (E-M) to the upper bound is larger than in the pair-wise evaluation. Therefore, the potential improvements by the verb-based models cover less distance w.r.t. the upper bounds. Still, the potential improvements are a substantial step towards reaching the upper bounds.

6.7.1 Learning when to apply the postfilter

The evaluation in the previous section shows that our distributional models of compatibility have the potential to substantially improve the salience-based resolution approach. However, the evaluation setting is unrealistic, since in a real-world application, a strategy is needed to decide which of the approaches to apply for a given pronoun instance and the constellation of its antecedent candidates. We saw that the resolution performance of our compatibility models is below that of the salience-based approach. That is, always selecting the antecedents that these models identify during the re-ranking will harm system performance overall.

A strategy is needed to decide which of the antecedents that are suggested by the models should be picked in the cases where they indicate different ones. In other words, we need a formal criterion to select the appropriate method for each pronoun. Our initial efforts that derived features from the compatibility models and incorporated them into the salience model to rank all candidates did not affect performance significantly. In this regard, our findings align with Kehler et al.(2004) and Wunsch(2010). Therefore, similar to Lappin and Leass(1994), we have opted for the strategy of nbest re-ranking, i.e. re-ranking the top two antecedent candidates as identified by the salience-based approach.

We have conducted initial experiments with a classifier that is aimed at identifying which model to choose given the verb governing the pronoun and its two top-ranked antecedents. Our main idea is to combine different features that indicate the applicability of the distributional models in the cases that they disagree with the salience-based antecedent selection. Obvious features are the confidence of the models regarding their decisions. Our entity-mention model calculates scores for each candidate and we can access these scores and their differences to assess the model's confidence by comparing them. The compatibility models also produce scores which we can access and compare. The task is then to learn thresholds for the confidence measures and their differences in order to decide whether the salience-based antecedent selection should be revised in cases where the verb models disagree with it. This is the approach that Lappin and Leass(1994) and Dagan et al.(1995) applied, although they manually set the confidence thresholds. One problem with this approach is that it assumes that small differences in the salience-based antecedent scoring (indicating weak confidence in the antecedent selection) coincide with large differences in the scores assigned by the verb models (indicating high confidence). Since there is no clear motivation for this assumption, we argue that it is more beneficial to focus on the verb-based models and neglect the scores assigned by the salience model.

One of the features we envision in this direction is aimed at capturing the selectional narrowness of the verb governing the pronoun w.r.t. the grammatical function of the pronoun. For example, we expect the verb bellen (to bark) to have a more strict selection regarding its subjects than e.g. machen (to make). The main motivation for investigating this feature is that a narrow selection should correlate with the trustworthiness of the antecedent selection of the compatibility models.

Another feature that we deem helpful in deciding whether to trust the distributional models' antecedents is to determine the similarity of the two antecedent candidates.

We assume that the more dissimilar the two candidates are, the more trustworthy the decisions are that the models make.

Since these intuitions require further investigation, we leave it to future work to explore and parametrize them empirically. Given the long-standing debate among researchers about whether incorporating verb semantics into pronoun resolution is a fruitful en- deavor, we subscribe to the camp cheering in favor of doing so.

6.8 Chapter summary

This chapter explored the use of the distributional hypothesis to model compatibility of antecedent candidates and a pronoun’s context.

We have presented a graph representation of first-order co-occurrence of verbs and arguments, and second-order co-occurrence among arguments. Within this representation, we have defined compatibility metrics and similarity scores that enabled us to address the sparsity problem. We have contrasted the graph model with a state-of-the-art approach to word similarity modeling within distributional semantics, i.e. word2vec. We found that the word2vec model provides better coverage, i.e. it applies to more pronoun instances, while the graph-based model achieves slightly higher Precision. A combination of both models in an oracle setting further increased performance.

In contrast to related work that used selectional preferences of verbs as a means to semantically represent a pronoun's context, we have included additional verb arguments of the verb governing the pronoun to determine compatibility with an antecedent candidate. We have argued that verb selectional preferences are not always narrow enough to favor one candidate over the other. The inclusion of the additional verb arguments helps to narrow down the selection in cases of (di-)transitive verbs.

A clear benefit of our framework over related work is that the entity-mention model can provide nominal antecedents for pronouns that are themselves antecedents for subsequent pronouns, since the antecedents of resolved pronouns are accessible during the traversal of a document. Related work that processes markables (including pronouns) in a pair-wise fashion does not have access to these antecedents. Thus, selectional preferences can only be applied to pronoun instances where the relevant antecedent candidates are all nouns.

Apart from Lappin and Leass(1994), who reported a small accuracy improvement, related work has so far reported mixed or negative results on incorporating verb semantics into pronoun resolution. Kehler et al.(2004) and Wunsch(2010) reported no performance impact when incorporating features denoting selectional preferences into their classifier. Klebanov and Peter(2002) and Bergsma et al.(2008a) showed that their models of verb semantics were able to outperform simple baselines, but did not incorporate their models into real-world pronoun resolution systems.

Although we have not yet found a way to decide when to apply our models, we have shown that they have a large potential to improve performance of a real-world pronoun resolution system which, by itself, reaches state-of-the-art performance. How much of this potential can be harvested in a fully automated setting will have to be determined by future work.

Chapter 7

Conclusions and future work

Underspecification of German pronouns. The main interest of this thesis was to develop a procedure for coreference resolution that addresses the problem of local underspecification of mentions. While underspecification poses a problem in coreference resolution in general, we argued that it is particularly problematic regarding certain German pronouns that feature underspecified morphological properties.

We presented an entity-mention model which efficiently remedies the problem of inconsistent coreference decisions by incrementally disambiguating properties of mentions. Our main hypothesis stated that performance of pronoun resolution for German improves when a consistent solution for the problem of underspecification is devised. We found empirically that the entity-mention model improves performance of pronoun resolution compared to related work which does not address this issue.

Coupled with heuristics to resolve nominal mentions, the incremental entity-mention model achieved new state-of-the-art performance in German coreference and pronoun resolution. Whether our approach of incrementally disambiguating properties of mentions is beneficial for coreference resolution in other languages has to be determined by future work.

Evaluation of coreference and pronoun resolution. We argued that the common evaluation framework for coreference and pronoun resolution is not tailored to the specific requirements of downstream applications. By devising the ARCS metrics, we aimed at developing an evaluation framework that supports the view of prospective downstream applications.

We showed that evaluation of our approach to pronoun resolution yields varying performance levels when different requirements regarding the antecedents are applied, ranging from 90% accuracy of classifiers under idealized settings to 65% F-score when pronouns

are required to link to the first mentions of the entities they denote. Such a requirement is not unusual for downstream applications that seek mentions of specific target entities. Thus, pronoun resolution remains a challenging task.

In this light, we encourage future work to investigate the crucial link of pronouns to nominal antecedents, since pronouns are often followed by subsequent pronouns. If the first pronominal mention of an entity is resolved incorrectly, all pronouns linked subsequently to that first pronominal mention denote an incorrect underlying entity and are thus irrelevant from the perspective of downstream applications. We believe that paying attention to this problem will significantly improve the benefit that coreference and pronoun resolution systems provide for downstream applications.

The state-of-the-art in coreference resolution changes rapidly, and progress is often made in small steps. We outlined that evaluation of coreference is affected by a variety of factors. Therefore, it is often not clear why a particular system achieves better performance than another. In an effort to shed light on these differences, we have extended the ARCS framework to accommodate an in-depth comparison of system outputs. This comparison enables an arguably more informative view on the performance differences between systems than the comparison of small changes in averaged F-score. Thus, we encourage researchers to demonstrate in what regard their approach works better compared to related work. Together with the recent approaches on systematic and automated error analysis for coreference, we hope to have provided a tool for this purpose.

Semantics for pronoun resolution. We investigated distributional models that capture the semantic compatibility of antecedent candidates and contexts of pronouns. As an extension to related work, we proposed to take into account the additional arguments of a verb that governs a pronoun to determine compatibility with the antecedent candidates. We showed that the models have the potential of correcting a large amount of erroneous pronoun resolutions of the salience-based antecedent selection. However, we found that devising strategies to successfully integrate the models into the salience-based resolution approach in a real-world setting is difficult. Given the potential of error reduction and the leveling-off of the performance of salience-based approaches, we encourage future work to further pursue this direction.

An interesting approach would be to narrow down the set of verbs whose selectional preferences are applicable to pronoun resolution. We argued that not all verbs have a selectional preference which is narrow enough to be useful for pronoun resolution. We proposed to address this issue indirectly by requiring dissimilarity between the antecedent candidates in order for the verb selectional preferences to be taken into account. A different approach would be to narrow down the set of verbs that have specific selectional preferences. Furthermore, psycholinguistic research has investigated verbs that promote either their subjects or their direct objects for subsequent mentions by pronouns. For example, constructions like “Peter accused Paul that he [...]” clearly mark the object of the matrix verb as the antecedent for the pronoun in the subordinated clause. Traditional approaches to pronoun resolution would resolve this pronoun incorrectly because salience dictates subject preference and favors parallelism of grammatical roles. Thus, we believe that identifying verbs and constructions that violate these general salience patterns is a fruitful direction.

Appendix A

Implementation details

A.1 Resolution of first person pronouns to third person antecedents

While our approach mainly focuses on third person pronoun resolution, we also accommodate first person pronouns in the following fashion. During our traversal of the markables, we require pronouns and their antecedents to match regarding their person feature. We do so in order to reduce the number of potential antecedent candidates for a pronoun and to ensure coherence of the morphological properties within a coreference chain. That is, we would never resolve a first person pronoun to its third person antecedent, as in e.g. “Peter1 said: ‘I1 [...]’ ”. Therefore, we perform first person pronoun resolution to their third person antecedents as a separate step, after all markables have been processed. While there exist more complex approaches (Almeida et al., 2014, e.g.), we employ the following two heuristics.

We first try to attach first person pronouns on the coreference partition level. That is, we check if there are coreference chains consisting of first person pronouns only. For such a chain, we iterate all markables to find one that occurs at most one sentence before the first mention in the first person pronoun chain, is singular in number, and not neuter in gender (i.e. preferably a person). Also, this markable has to be the subject of a communication verb, like to say, which is asserted by a look-up in a list of such verbs. If such a markable is found, we check if it is a member of a coreference chain. If so, this coreference chain is merged with the first person pronoun chain at hand. Otherwise, the markable is prepended to the first person pronoun chain.
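A sketch of this first heuristic is given below, assuming markable objects that expose a sentence index, morphology, grammatical function, and governing verb; the attribute names and the verb list excerpt are illustrative assumptions.

    # Sketch: attach a chain consisting only of first person pronouns to a
    # preceding speaker-like markable (singular, non-neuter subject of a
    # communication verb, at most one sentence before the chain's first mention).
    COMM_VERBS = {"sagen", "erklären", "betonen"}  # excerpt of the verb list

    def attach_first_person_chain(chain, markables):
        first = chain[0]  # earliest first person pronoun in the chain
        for m in markables:
            if (0 <= first.sentence - m.sentence <= 1
                    and m.number == "sg" and m.gender != "neut"
                    and m.gram_function == "subj" and m.verb in COMM_VERBS):
                if m.chain is not None:
                    m.chain.extend(chain)  # merge with the markable's chain
                else:
                    chain.insert(0, m)     # prepend the speaker markable
                return True
        return False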

The second heuristic tries to find third person antecedents for first person pronouns in the buffer list. Recall that the buffer list contains markables that have not been resolved

to an antecedent. If a first person pronoun is found on the buffer list, we again look for an antecedent markable which adheres to the same constraint as in the first heuristic.

For the mention-pair model, we only apply the second heuristic, since the coreference partition is not available after the markable traversal, i.e. before the transitive merge of the found antecedent-anaphor pairs.

A.2 Resolution of nominal markables

In this section, we describe how our system processes noun markables, i.e. markables that feature either a common noun (NN) or named entity (NE) as their syntactic head. While the focus of this thesis is on pronoun resolution, we also resolve noun markables because i) if a non-pronominal antecedent candidate denotes an entity that has already occurred multiple times in a document, it is generally likely to be pronominalized, and ii) we aim to provide a full-fledged coreference resolution system for German.

Resolution of nominal markables can be coarsely divided into two categories. The first category subsumes cases where a nominal mention can be linked to an antecedent based on string matching of the syntactic heads of the markables, as in e.g. [A company − the company]. The second category denotes cases where the heads of the coreferring mentions do not match, as in e.g. [Monsanto − the company] or [the book − the novel]. In machine learning-based approaches to coreference resolution, features targeted at resolving pairs of nominal markables usually encode whether the syntactic heads of two markables match to capture the first category. Soon et al.(2001) showed that the feature that was placed highest in their decision tree was indeed the head match feature. Strube et al.(2002) reported significant performance gains for German coreference resolution when encoding minimum edit distance-based features which measure string similarity for lexeme-based head matching.1 A set of features captures semantic class compatibility based on e.g. WordNet classes to capture the second category of coreferring nominal mentions.

In our approach, we only consider pairs of nominal markables where the head lemmas of the two markables match, i.e. the first of the aforementioned categories. Resolving nominal pairs with non-matching heads requires complex models of semantic compatibility and relatedness, which is beyond the scope of this thesis. Also, Versley(2010) showed that applying such models in a real end-to-end coreference resolution system for German only marginally affects performance.

1 Cf. section 4.2.

Our approach to noun markable resolution based on head matching follows Versley (2010), who employed hard constraints for the task; it does not involve machine learning. We argue that head matching-based resolution of nouns does not involve any discriminatory features for which reasonable weights can be learnt. Consider for example the two markables “a red book” and “the book”. Based on the matching heads “book” we might assume that the markables are coreferent. We can extend the second markable by adding an adjective, like “the blue book”. Now, clearly the two markables cannot corefer, because their adjectives express exclusiveness. However, we might exchange the adjective with another, like “the nice book”. Now, the markable could be coreferent with “a red book” again. That is, the difficulty involved in deciding whether two markables with matching heads are coreferent lies in determining whether their additional arguments are compatible (including adjectives, prepositional phrases, genitive modifiers, relative clauses etc.). However, none of the features used in machine learning approaches in related work addresses this issue. The features generally capture whether the heads of two markables match, which is a binary feature. But the crucial variable is the compatibility of the additional arguments, and applying machine learning to the problem would require a model of this compatibility w.r.t. coreference relations. This model would have to capture e.g. that different color adjectives prevent coreference, but that color adjectives and qualitative adjectives, such as “nice”, can license coreference. However, such a model is to date out of reach. Still, there are certain linguistic criteria which can be harvested for the decision on whether two head-matching noun markables should be considered coreferent. We will discuss these criteria and show how they can be turned into constraints.

A.2.1 Constraint-based resolution of nominal markables

We deploy two separate approaches for resolving name markables (i.e. denoting named entities) and noun markables (NPs with a common noun as syntactic head). Potentially coreferring markables are identified based on matching head lemmas. Therefore, it is important to identify relevant heads of multi-word terms. For example, person entities are often introduced in newspaper texts by their full name and are subsequently mentioned only by their last name, i.e. [Angela Merkel − Merkel]. Therefore, we mark the last name of name markables as the head.

A.2.2 Name markables

For name markables, i.e. NPs with a named entity as their head, we query all previous markables to find one with a string matching head. If the potentially anaphoric markable at hand is a single-word term, e.g. Berlin, and we find a matching antecedent, we link the two markables. We also do so if the markable at hand is a multi-word term but does not denote a person. If the markable is a multi-word term and denotes a person, we check if we find a first name in the potential anaphor and its potential antecedent using a list of first names. If we find first names in both markables, they have to be identical. This prevents us from linking e.g. [Patrick Wagner − Barbara Wagner]. In newspaper texts, persons are often mentioned by their last name. In other domains, however, persons also frequently occur with their first name. This is the case for fictional texts, as well as e.g. mountaineering reports, as in the corpus text+berg (Volk et al., 2010), where we often encounter pairs like [Bergführer Peter Taugwalder - Peter] ([mountain guide Peter Taugwalder - Peter]). To accommodate such pairs, additional heuristics can be added to our approach.

Additionally, we accumulate and match nominal descriptors of name markables in the following fashion. We store noun predications in copulas where a name markable is the subject. For example, in the copula construction “Vita B. ist die Siegerin [...] (Vita B. is the winner [...])”, we extract “Siegerin” as a nominal description of the [Vita B.] markable. The entity is later mentioned as “die ehemalige Gewinnerin (the former winner)”. Using the extracted predication from the copula, we are able to detect the common noun mention of the named entity. We also capture appositions in similar fashion. In a markable like [Kanzlerin Angela Merkel], we identify Merkel as the head for string matching. However, we would miss subsequent noun mentions of the entity, i.e. [die Kanzlerin]. Therefore, we store nouns found in name markables as nominal descriptors and allow subsequent noun markables to link to them. This way, we are able to identify coreference between [Kanzlerin Angela Merkel − die Kanzlerin], although the markable heads do not match. Analogously, we identify nominal descriptors in the reversed construction, i.e. Angela Merkel, die Kanzlerin, and make them available for string matching. Furthermore, we allow for partial (substring) matches if the string of the anaphoric markable is at the end of the antecedent string. Doing so, we are able to capture coreference between e.g. [EU-Umweltkommissarin Ritt Bjerregaard] ([EU environment commissioner Ritt Bjerregaard]) and [die Umweltkommissarin] ([the environment commissioner]).

A.2.3 Noun markables

A major problem regarding the resolution of noun markables (i.e. markables whose syntactic head consists of a common noun, like [the company]) lies in deciding which noun markables should be considered to be anaphoric. This is known as the anaphoricity detection problem. The majority of noun phrases in newspaper articles are not anaphoric

(Recasens et al., 2013, inter alia) and thus we need strategies to determine the anaphoricity of markables before attempting to resolve them.

There are two main branches in the research that addresses this problem explicitly.2 One branch applies machine learning to the problem. A classifier is learnt that decides for every noun markable whether an antecedent should be sought (Recasens et al., 2013, inter alia). Coreference resolution is then only applied to those markables that the classifier has deemed to be anaphoric. The other branch develops heuristics based on linguistic constraints. For example, indefinite NPs (e.g. [a company]) are very unlikely to be anaphoric. Therefore, most of these approaches revolve around definite NPs only. Because the literature on this topic is vast and our focus lies on pronoun resolution, exploring the problem in detail is beyond the scope of this thesis. We point to the dissertation of Versley(2010) for a thorough discussion.

Our approach to resolving noun markables relies on said linguistic heuristics to determine pairs of head-matching, coreferring markables. We basically employ the same strategy as for name markables, but apply more constraints for matching noun markables to each other.

We loop through the markables and check for each markable headed by a common noun whether we find a head-matching noun markable. If so, we apply the following filters that have to be passed in order for the pair to be processed further:

• The potential anaphoric noun markable has to be at least two tokens long. This removes e.g. bare plural markables, such as [people], which are rarely anaphoric.

• Antecedent and anaphor have to match in their morphological properties, i.e. gender and number.

• The potential anaphoric noun markable cannot be indefinite ([a company]), all-quantified ([all companies]), or negated ([no company]).

Then, we impose the following heuristics for establishing coreference (a condensed sketch of the filters and heuristics follows the list):

• The antecedent is indefinite, the anaphor definite. This captures typical patterns of entities being introduced and subsequently mentioned in discourse, i.e. [a company − the company]. Note that we here do not apply constraints on potential modifiers in both antecedent and anaphor.

2 Note that most coreference resolution systems perform anaphoricity detection implicitly. That is, an antecedent is sought for every (definite) nominal markable. If one is found, the markable becomes anaphoric, else it is deemed non-anaphoric.

• If there are modifiers in the antecedent, i.e. [the successful and growing company], all modifiers in the anaphor have to be contained in the antecedent. This enables us to match e.g. [the company] or [the successful company], but prevents us from matching [the troubled company] etc. Note that we here do not apply constraints on the determiners. However, indefinite anaphors are filtered beforehand (see above).

• If the head token in the antecedent markable is not the last token in the respective NP (i.e. the head is followed by a PP), we require at least 60% of the tokens in the anaphor to match. This allows us to match [the king of France] to [the king], but prevents the match to [the king of England].

• Finally, if there are modifiers in the antecedent, but none in the anaphor, we allow the match, i.e. [the elderly man with the funny hat] − [the man].
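A condensed sketch of these filters and heuristics, assuming markables that expose tokens, modifiers, morphology, and definiteness flags (the attribute names are illustrative, not the actual implementation):

    # Sketch: decide whether a head-matching noun markable pair should be
    # considered coreferent, following the filters and heuristics listed above.
    def head_match_coreferent(antecedent, anaphor):
        # filters: anaphor length, morphological agreement, no indefinite,
        # all-quantified, or negated anaphors
        if len(anaphor.tokens) < 2:
            return False
        if (antecedent.gender, antecedent.number) != (anaphor.gender, anaphor.number):
            return False
        if anaphor.indefinite or anaphor.all_quantified or anaphor.negated:
            return False
        # heuristic: introduction pattern [a company - the company]
        if antecedent.indefinite and anaphor.definite:
            return True
        # heuristic: all anaphor modifiers must be contained in the antecedent
        # (this also covers the case of an unmodified anaphor)
        if set(anaphor.modifiers) <= set(antecedent.modifiers):
            # heuristic: head followed by a PP -> require 60% token overlap
            if antecedent.head_index < len(antecedent.tokens) - 1:
                overlap = len(set(anaphor.tokens) & set(antecedent.tokens))
                return overlap / len(anaphor.tokens) >= 0.6
            return True
        return False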

Additionally, we add two heuristics, one for processing hyphenated words and one for resolving remaining noun markables featuring a demonstrative determiner. For noun markables with hyphenated head words, we remove the left-hand side and check if we find a markable whose head string matches the right-hand side. These two markables are then processed by the filter batch outlined above, like all other markables. This allows us to match e.g. [die debis-Mitarbeiter] ([the debis employees]) to [die Mitarbeiter] ([the employees]).

For unresolved noun markables with a demonstrative determiner, we also allow the resolution to an antecedent with substring matching at string end. Note that we do not generally allow substring matching, since it yields many false positives. In this restricted setting, where anaphoricity is indicated by the demonstrative determiner, it is safer to allow it. Doing so, we are able to match e.g. [diese Reise] ([this trip]) to [Bildungsreise] ([educational trip]). That is, we are able to identify heads of compounds without the need for linguistically guided, proper compound splitting as in Versley(2010).

A.2.4 Evaluation

We apply the ARCS inferred antecedents metric3 to the nominal markables, which requires mentions to link to correct nominal antecedents and thus directly captures the performance of our approach in a pair-wise fashion. Table A.1 shows the results on our test set. Note that the results subsume both noun (i.e. common nouns) and name (named entities) markables, since ARCS does not distinguish between the two.

3 Cf. section 3.2.

Rec     Prec    F1      Acc.    True Pos.   Wrong Link.   False Neg.   False Pos.   Gold mentions
58.28   63.09   60.59   90.02   4809        533           2909         2280         8251

Table A.1: Evaluation of nominal markable resolution (nouns and named entities).

We see that we achieve higher Precision than Recall, which is reflected in the higher false negative (FN) count compared to the false positive (FP) count. Accuracy (Acc.), which only evaluates gold mentions (i.e. TP / (TP + WL)), indicates that our matching strategy is solid, i.e. when we resolve gold mentions, we find the correct antecedent in 90.02% of the cases. The comparison of accuracy to the F1-score shows that the problem in resolving definite NPs mainly lies in determining which NPs to resolve. Once this problem is solved, our matching strategy achieves high resolution accuracy. However, in a real-world setting, the referring mentions are not known, thus the measure is unrealistic in that regard.

Our F-score lies within the range of results for same-head resolution heuristics reported by Versley (2010), i.e. 56.6%-66.2%, who used the first 125 articles of an earlier version of the TüBa-D/Z as a test set. We thereby conclude that our approach achieves reasonable results.

A.3 Character ngram-based splitting of sparse compound nouns

German features compound nouns which can be constructed rather freely. Therefore, we are likely to encounter compounds that are not in our resources. In these cases, we want to test whether the head noun of the compound is represented in our resources. While there exist several compound splitters for German, we briefly introduce our own character ngram-based splitter which has several advantages over related work, at least for our purposes.

The main idea behind our approach is that different character ngrams in (German) words have a likelihood of indicating a word start, a word middle, and a word end. Loosely adapting nomenclature from morphology, we call these prefix, infix, and suffix ngrams, respectively. A good position for a split of a compound is located where we encounter a low likelihood of a word middle, preceded by a character sequence with a high likelihood of indicating a word ending, followed by a sequence that has a high likelihood of indicating a word start.

We devise simple probabilities to calculate the likelihood of character ngrams to indicate each of the ngram types (prefix, infix, suffix), which can be derived from a corpus. Given a noun, we start at the first character and extract all ngrams of size 3 < size < n that start at the word beginning as prefix ngrams. Analogously, we extract all ngrams of size 3 < size < n that end at the word end as suffix ngrams. For the infix ngrams, we cut the first and last character of the word and shift all ngrams of size 3 < size < n over the character sequence, re-starting at each character in the sequence. For example, we extract the following ngrams for the word “Dosenöffner”:

prefix   Dos, Dose, Dosen, Dosenö, Dosenöf, ...
infix    ose, osen, osenö, ..., sen, senö, senöf, ..., öff, öffn, öffne, ..., ffn, ffne
suffix   ner, fner, ffner, öffner, ...

Table A.2: Examples of character ngrams extracted from the word “Dosenöffner” using our approach.

Given the extracted ngrams and their counts as prefix, infix, and suffix occurrences, we calculate for each ngram the conditional probability of seeing each ngram type. That is, the conditional probability of seeing a word beginning given a specific ngram n is P(prefix | n) = |prefix, n| / |n|, etc. Table A.3 shows the 10 most likely and the 10 most unlikely character sequences per ngram type obtained over 20 million noun instances in the sdewac.
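The estimation step can be sketched as follows. This is a simplified illustration under our own assumptions about the interface (a plain list of corpus nouns); extract_ngrams and ngram_probabilities are not the names used in the actual implementation:

from collections import Counter, defaultdict

def extract_ngrams(word):
    """Return (prefix, infix, suffix) ngrams of size 3..len(word), as described above."""
    sizes = range(3, len(word) + 1)
    prefix = [word[:k] for k in sizes]
    suffix = [word[-k:] for k in sizes]
    inner = word[1:-1]                       # cut the first and last character
    infix = [inner[i:i + k] for i in range(len(inner))
             for k in sizes if i + k <= len(inner)]
    return prefix, infix, suffix

def ngram_probabilities(nouns):
    """Estimate P(type | ngram) for type in {prefix, infix, suffix} from a list of nouns."""
    counts = defaultdict(Counter)            # ngram -> counts per ngram type
    for noun in nouns:
        for kind, ngrams in zip(("prefix", "infix", "suffix"), extract_ngrams(noun)):
            for n in ngrams:
                counts[n][kind] += 1
    return {n: {k: c[k] / sum(c.values()) for k in ("prefix", "infix", "suffix")}
            for n, c in counts.items()}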

Type     likely                                                  unlikely
prefix   bundesr, bundesregier, bundesregie, bundesregi,         chul, chw, chs, nsc, lle, nne,
         bundesregierun, menschenrech, menschenrec,              ozen, mme, roze, nst
         vorj, vorjah, vorja
infix    ieru, ierun, itä, roze, gieru, gierun, egieru,          diskussion, dollar, niederlage,
         egierun, nsc, hrun                                      beispiel, vorjahr, situation,
                                                                 mio., oktober, januar, samstag
suffix   itpunkt, eitpunkt, itraum, eitraum, skanzlerin,         gierun, gieru, oze, kre, itä,
         undeskanzlerin, ndeskanzlerin, eskanzlerin,             gru, ierun, ieru, roze, nsc
         deskanzlerin, enagentur

Table A.3: Likely and unlikely character ngrams at word beginnings (prefix), within words (infix), and at word endings (suffix).

When splitting a word, we again shift the ngram window of size 3 ≤ size ≤ n over the word, where n is the length of the word minus 3 characters. We choose 3 characters, since that is the minimal size of a word that we consider as the head or tail of a compound. Our aim is to identify the character position within the compound where

1. the character sequence leading up to the position has a high likelihood of denoting a word ending,

2. the character sequence following the position has a high likelihood of a word be- ginning,

3. the infix probability is low.

Starting at the fourth character of the compound under scrutiny, we check how likely it is to see the character sequence before the fourth character as a word ending (1). To do so, we collect the suffix probability of all ngrams that end at the current position and take the probability of the most likely ngram.4 Next, we collect the probability of all prefix ngrams that occur after the current position (2). Again, we take the probability of the most likely ngram. Finally, we gather all infix ngrams that start at the current position (3). Here, we take the least likely infix ngram, because a low infix ngram probability indicates a good position for a split. To calculate a score for splitting the compound at the given position n, we use the following formula, which implements the idea outlined in the enumeration above:

score(n) = max p(prefix) + max p(suffix) − min p(infix) (A.1)

All positions within the compound are scored, and the highest scoring position is taken as the split location.
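Building on the ngram_probabilities sketch above, the scoring of equation A.1 and the selection of the best split might look as follows. The split position range, the simplified Fugen-S handling, the literal reading of "infix ngrams that start at the current position", and the capitalisation of the head are our assumptions, not the actual implementation:

def prob(probs, ngram, kind):
    """Look up P(kind | ngram); unseen ngrams get probability 0."""
    return probs.get(ngram, {}).get(kind, 0.0)

def score_position(word, pos, probs):
    """Score splitting `word` before character index `pos`, following equation A.1."""
    left, right = word[:pos], word[pos:]
    # Simplified Fugen-S heuristic: drop a trailing linking "s" before the
    # suffix lookup, since a Fugen-S does not occur at word endings.
    if len(left) > 3 and left.endswith("s"):
        left = left[:-1]
    suffixes = [left[-k:] for k in range(3, len(left) + 1)]     # sequences ending at pos
    prefixes = [right[:k] for k in range(3, len(right) + 1)]    # sequences starting at pos
    infixes = [word[pos:pos + k] for k in range(3, len(word) - pos + 1)]
    return (max((prob(probs, n, "suffix") for n in suffixes), default=0.0)
            + max((prob(probs, n, "prefix") for n in prefixes), default=0.0)
            - min((prob(probs, n, "infix") for n in infixes), default=0.0))

def best_split(word, probs):
    """Return (score, head) for the highest scoring split; the head is capitalised."""
    candidates = [(score_position(word, i, probs), word[i:].capitalize())
                  for i in range(3, len(word) - 2)]
    return max(candidates) if candidates else (float("-inf"), word)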

To evaluate our approach, we use the set of compounds extracted from GermaNet that have manual annotation of splits (Henrich and Hinrichs, 2011).5 The set comprises 54’572 compounds. Note that the purpose of our splitter is to determine the head of a compound in order to map it into our resources, i.e. the graph and the word2vec model, in cases where the compound is not represented in them. Therefore, we are only interested in whether our splitter correctly identifies the head of the compound and are not concerned with the lemmatization of the left part of the compound. Thus, we count a split as correct if the head matches the head in the test set. We compare our splitter to three available tools that feature compound splitting. GERTWOL (Haapalainen and Majorin, 1995), a rule-driven morphological analyzer for German, produces word boundaries in its output and is a widely used tool within the German Computational Linguistics and NLP community. The statistical machine translation pipeline Moses (Koehn et al., 2007) also features a compound splitting mechanism. It relies on counts of how often the potential parts of a compound are seen individually within a large corpus. The IMS splitter6 expands on the splitter integrated in Moses by incorporating lemmas and PoS tags in the analysis. This tool can also be forced to split compounds in cases where it would normally not do so. We also include this version (IMS F.), since all nouns in our test set are compounds. We chose these splitters for our evaluation since they are available.7 Table A.4 shows the results.

4 We check heuristically if there is a Fugen-S with a regular expression. If we find one, we remove it, because Fugen-S do not occur at word endings, but only in compounds.
5 http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml

System     Accuracy all  Accuracy selected
MOSES      15.83         91.01 (17.39)
IMS        48.47         94.99 (50.80)
IMS F.     84.86         90.54 (92.49)
GERTWOL    88.80         99.99 (88.80)
OUR APPR.  95.22         95.22 (100.00)

Table A.4: Compound splitting accuracy of different systems.

We list the accuracy (correctly identified head of the compound) given all compounds (Accuracy all) and the accuracy given only those compounds that are actually split by the systems (Accuracy selected). The table shows that our approach achieves the highest accuracy given all compounds. Performance in this category is especially low for IMS and MOSES, since these systems heavily rely on the corpus that is used during training. For training, we use the same data for all systems, i.e. the nouns extracted from sdewac. Thus, the IMS and MOSES systems will have seen many of the compounds in the training data and will thus not split them in the test set. We see this behaviour in the percentages of split compounds given in parentheses in the Accuracy selected column. The MOSES splitter only splits 17.39% of the compounds in the test set and achieves an accuracy of 91.01% for these. The IMS splitter splits more of the compounds (50.80%) and achieves a higher accuracy than MOSES. Interestingly, the GERTWOL system splits with almost perfect accuracy in the cases that it splits. However, for roughly 10% of the compounds in the test set it does not identify a word boundary.

The evaluation shows that our simple character ngram approach without linguistic knowledge performs surprisingly well for compound splitting. It outperforms the other statistical splitters and has a better coverage than GERTWOL. Henrich and Hinrichs (2011) also compared various splitters on this data set and measured accuracy for identifying correct split positions. The best performing approach (the combined hybrid compound splitter, CH-CS) achieved an accuracy of 94.83%. It combined the output of the other splitters in a voting scheme and incorporated knowledge on derivation morphology. It thus featured a high degree of complexity compared to our approach.

6 http://www.ims.uni-stuttgart.de/institut/mitarbeiter/wellermn/tools.html
7 Note that GERTWOL has a commercial licence.

A drawback of this evaluation is that we cannot measure the identification of compounds, since all words in the test set are actually compounds.8 We thus cannot assess how well our approach identifies compounds. The splitter outputs the scores (as given by equation A.1) for its splits. A score above 0 indicates a likely split, a score below 0 an unlikely split. Thus, the score can be used as a confidence measure. For example, for the word “Dosenöffner”, the splitter outputs the following analysis:

 0.31  Dosen Öffner
-1.41  Dosenöff Ner
-1.48  Dose Nöffner
...

The splitter identifies the most probable position for a split. However, the score for making the split is not very high (0.31), compared to e.g. the analysis of “Autobahnraststätte”:

 0.79  Autobahn Raststätte
-0.55  Auto Bahnraststätte
-0.72  Autobahnrast Stätte
...

For non-compounds, the scores for splitting are low, e.g. for “Beamter”, we get:

-0.43  Beam Ter
-1.01  Bea Mter

Given the scores, we could evaluate the splitter’s performance on compound identification by only considering splits above a certain score, e.g. 0. However, the intended use for our splitter is to attempt to map unseen compounds into our resources. We only apply the splitter in these cases and are thus not too worried about splitting non-compounds.
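Applied to the sketch above, such a confidence threshold would simply filter the best split (the helper name head_for_lookup and the threshold value 0 are illustrative assumptions):

def head_for_lookup(word, probs, threshold=0.0):
    """Return the compound head if the best split is confident enough, else the word itself."""
    score, head = best_split(word, probs)
    return head if score > threshold else word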

8 The data set used in Escartín (2014) is not available, unfortunately.

Bibliography

Almeida, M. S. C., Almeida, M. B., and Martins, A. F. T. (2014). A joint model for quotation attribution and coreference resolution. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 39–48, Gothenburg.

Aone, C. and Bennett, S. W. (1995). Evaluating automated and manual acquisition of anaphora resolution strategies. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pages 122–129, Massachusetts.

Ariel, M. (1988). Referring and accessibility. Journal of linguistics, 24(01):65–87.

Bagga, A. and Baldwin, B. (1998). Algorithms for Scoring Coreference Chains. In Proceedings of the First International Conference on Language Resources and Evaluation, pages 563–566, Granada.

Baldwin, B. (1997). CogNIAC: High precision coreference with limited knowledge and linguistic resources. In Proceedings of a Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, pages 38–45, Madrid.

Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. (2009). The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 43(3):209–226.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 238–247, Baltimore.

Barzilay, R. and Lapata, M. (2008). Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.

Bergsma, S., Lin, D., and Goebel, R. (2008a). Discriminative Learning of Selectional Preference from Unlabeled Text. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 59–68, Honolulu.


Bergsma, S., Lin, D., and Goebel, R. (2008b). Distributional Identification of Non-Referential Pronouns. In ACL-08: HLT, pages 10–18, Columbus.

Björkelund, A., Eckart, K., Riester, A., Schauffler, N., and Schweitzer, K. (2014). The Extended DIRNDL Corpus as a Resource for Coreference and Bridging Resolution. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik.

Björkelund, A. and Farkas, R. (2012). Data-driven Multilingual Coreference Resolution Using Resolver Stacking. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 49–55, Jeju.

Björkelund, A. and Kuhn, J. (2014). Learning Structured Perceptrons for Coreference Resolution with Latent Antecedents and Non-local Features. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 47–57, Baltimore.

Broscheit, S., Ponzetto, S. P., Versley, Y., and Poesio, M. (2010). Extending BART to Provide a Coreference Resolution System for German. In Proceedings of the International Conference on Language Resources and Evaluation, pages 164–167, Valletta.

Bullinaria, J. A. and Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior research methods, 39(3):510–526.

Cai, J. and Strube, M. (2010a). End-to-End Coreference Resolution via Hypergraph Partitioning. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 143–151, Beijing.

Cai, J. and Strube, M. (2010b). Evaluation metrics for end-to-end coreference resolution systems. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 28–36, Tokyo.

Cardie, C., Wagstaff, K., and Others (1999). Noun phrase coreference as clustering. In Proceedings of the Joint Sigdat Conference on empirical methods in natural language processing and very large corpora, pages 82–89, Maryland.

Chen, C. and Ng, V. (2013). Linguistically Aware Coreference Evaluation Metrics. In Proceedings of the 6th International Joint Conference on Natural Language Processing, pages 1366–1374, Nagoya.

Culotta, A., Wick, M., Hall, R., and McCallum, A. (2007). First-order probabilistic models for coreference resolution. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, pages 81–88, Rochester.

Dagan, I. and Itai, A. (1990). Automatic processing of large corpora for the resolution of anaphora. In Proceedings of the 13th conference on Computational linguistics, pages 330–332, Helsinki.

Dagan, I., Justeson, J., Lappin, S., Leass, H., and Ribak, A. (1995). Syntax and Lexical Statistics in Anaphora Resolution. Applied Artificial Intelligence, 9:633–644.

Dalton, J., Blanco, R., and Mika, P. (2011). Coreference aware web object retrieval. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 211–220, Glasgow.

Daumé III, H. and Marcu, D. (2005). A Large-scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 97–104, Vancouver.

De Beaugrande, R. and Dressler, W. U. (1981). Einführung in die Textlinguistik, volume 28. Niemeyer, Tübingen.

Denis, P. and Baldridge, J. (2007). A Ranking Approach to Pronoun Resolution. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1588–1593, San Francisco.

Denis, P. and Baldridge, J. (2008). Specialized Models and Ranking for Coreference Resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 660–669, Honolulu.

Denis, P. and Baldridge, J. (2009). Global joint models for coreference resolution and named entity classification. Procesamiento del Lenguaje Natural, 42(1):87–96.

Dorow, B. (2006). A graph model for words and their meanings. PhD thesis, University of Stuttgart.

Dorow, B., Laws, F., and Michelbacher, L. (2009). A graph-theoretic algorithm for automatic extension of translation lexicons. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 91–95, Athens.

Durrett, G. and Klein, D. (2013). Easy Victories and Uphill Battles in Coreference Resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1971–1981, Seattle.

Erk, K., Padó, S., and Padó, U. (2010). A flexible, corpus-driven model of regular and inverse selectional preferences. Computational Linguistics, 36(4).

Escartín, C. P. (2014). Chasing the Perfect Splitter: A Comparison of Different Compound Splitting Tools. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 3340–3347, Reykjavik.

Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD thesis, University of Stuttgart.

Faruqui, M. and Padó, S. (2010). Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. In Proceedings of KONVENS 2010, Saarbrücken.

Fernandes, E. R., dos Santos, C. N., and Milidiú, R. L. (2012). Latent Structure Perceptron with Feature Induction for Unrestricted Coreference Resolution. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 41–48, Jeju.

Fernandes, E. R., dos Santos, C. N., and Milidiú, R. L. (2014). Latent Trees for Coreference Resolution. Computational Linguistics, 40(4):801–835.

Filippova, K. (2010). Multi-sentence compression: Finding shortest paths in word graphs. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 322–330, Beijing.

Finkel, J. R. and Manning, C. D. (2008). Enforcing transitivity in coreference resolution. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pages 45–48, Columbus.

Ge, N., Hale, J., and Charniak, E. (1998). A Statistical Approach to Anaphora Resolution. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 161–170, Montreal.

Goldberg, Y. and Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Grosz, B. J., Weinstein, S., and Joshi, A. K. (1995). Centering: a framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225.

Gundel, J. K., Hedberg, N., and Zacharski, R. (1993). Cognitive status and the form of referring expressions in discourse. Language, 69(2):274–307.

Gurevych, I. (2005). Using the structure of a conceptual network in computing semantic relatedness. In Proceedings of the 2nd International Joint Conference on Natural Language Processing, pages 767–778, Jeju Island.

Haapalainen, M. and Majorin, A. (1995). GERTWOL und morphologische Disambiguierung für das Deutsche. In Proceedings of the 10th Nordic Conference on Computational Linguistics, Helsinki.

Halliday, M. A. K. and Hasan, R. (1976). Cohesion in English. Longman.

Hamp, B. and Feldweg, H. (1997). GermaNet - a Lexical-Semantic Net for German. In Proceedings of ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15, Madrid.

Hartrumpf, S. (2001). Coreference Resolution with Syntactico-Semantic Rules and Corpus Statistics. In Proceedings of the Fifth Computational Natural Language Learning Workshop, pages 137–144, Toulouse.

Hearst, M. A. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the 14th Conference on Computational Linguistics, pages 539–545, Nantes.

Henrich, V. and Hinrichs, E. W. (2011). Determining Immediate Constituents of Compounds in GermaNet. In Proceedings of Recent Advances in Natural Language Processing, pages 420–426, Hissar.

Hinrichs, E., Filippova, K., and Wunsch, H. (2005). What Treebanks Can Do For You: Rule-based and Machine-learning Approaches to Anaphora Resolution in German. In Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories, pages 77–88, Barcelona.

Hinrichs, E. W., Filippova, K., and Wunsch, H. (2007). A Data-driven Approach to Pronominal Anaphora Resolution in German. In Proceedings of the 5th International Conference on Recent Advances in Natural Language Processing, Borovets.

Hirshman, L. and Chinchor, N. (1998). MUC-7 coreference task definition. In Proceedings of the Seventh Message Understanding Conference (MUC-7).

Hobbs, J. R. (1978). Resolving pronoun references. Lingua, 44(4):311–338.

Holen, G. I. G. (2013). Critical Reflections on Evaluation Practices in Coreference Resolution. In Proceedings of the 2013 NAACL HLT Student Research Workshop, pages 1–7, Atlanta.

Kehler, A., Appelt, D., Taylor, L., and Simma, A. (2004). The (Non) Utility of Predicate-Argument Frequencies for Pronoun Interpretation. In Proceedings of the 2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 289–296, Boston.

Kim, J.-D., Nguyen, N., Wang, Y., Tsujii, J., Takagi, T., and Yonezawa, A. (2012). The Genia Event and Protein Coreference tasks of the BioNLP Shared Task 2011. BMC bioinformatics, 13(11):1–12.

Klebanov, B. and Wiemer-Hastings, P. (2002). The Role of Wor(l)d Knowledge In Pronominal Anaphora Resolution. In Proceedings of the International Symposium on Reference Resolution for NLP, Alicante.

Klenner, M. and Ailloud, É. (2008). Enhancing coreference clustering. In Proceedings of the Second Bergen Workshop on Anaphora Resolution (WAR II), pages 31–40, Bergen.

Klenner, M. and Ailloud, É. (2009). Optimization in Coreference Resolution is Not Needed: A Nearly-optimal Algorithm with Intensional Constraints. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 442–450, Athens.

Klenner, M. and Tuggener, D. (2010). Inkrementelle salienzbasierte Koreferenzanalyse für das Deutsche. In Proceedings of the 10th Edition of the KONVENS Conference, pages 37–46, Saarbrücken.

Klenner, M. and Tuggener, D. (2011a). An Incremental Entity-Mention Model for Coreference Resolution with Restrictive Antecedent Accessibility. In Recent Advances in Natural Language Processing, pages 178–185, Hissar.

Klenner, M. and Tuggener, D. (2011b). An Incremental Model for Coreference Resolution with Restrictive Antecedent Accessibility. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 81–85, Portland.

Klenner, M., Tuggener, D., Fahrni, A., and Sennrich, R. (2010). Anaphora Resolution with Real Preprocessing. Lecture Notes in Computer Science, 6233:215–225.

Kobdani, H. and Schütze, H. (2010). SUCRE: A Modular System for Coreference Resolution. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 92–95, Los Angeles.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., and Others (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180, Prague.

Kouchnir, B. (2004). A machine learning approach to German pronoun resolution. In Proceedings of the ACL 2004 Workshop on Student Research, Barcelona.

Kübler, S. and Zhekova, D. (2011). Singletons and Coreference Resolution Evaluation. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 261–267, Hissar.

Kummerfeld, J. J. K. and Klein, D. (2013). Error-Driven Analysis of Challenges in Coreference Resolution. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 265–277, Seattle.

Lahiri, S. (2014). Complexity of Word Collocation Networks: A Preliminary Structural Analysis. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 96–105, Gothenburg.

Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An introduction to latent semantic analysis. Discourse processes, 25(2-3):259–284.

Lappin, S. and Leass, H. J. (1994). An Algorithm for Pronominal Anaphora Resolution. Computational Linguistics, 20:535–561.

Laws, F., Michelbacher, L., and Dorow, B. (2010). A linguistically grounded graph model for bilingual lexicon extraction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 614–622, Beijing.

Lee, H., Chang, A., Peirsman, Y., Chambers, N., Surdeanu, M., and Jurafsky, D. (2013). Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39(4).

Levy, O. and Goldberg, Y. (2014a). Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 302–308, Baltimore.

Levy, O. and Goldberg, Y. (2014b). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185, Montreal.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Linke, A., Nussbaumer, M., Portmann, P. R., Willi, U., and Berchtold, S. (2004). Studienbuch Linguistik. Niemeyer, Tübingen.

Luo, X. (2005). On Coreference Resolution Performance Metrics. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 25–32, Vancouver.

Luo, X., Ittycheriah, A., Jing, H., Kambhatla, N., and Roukos, S. (2004). A Mention-synchronous Coreference Resolution Algorithm Based on the Bell Tree. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona.

Màrquez, L., Recasens, M., and Sapena, E. (2012). Coreference resolution: an empirical study based on SemEval-2010 shared Task 1. Language Resources and Evaluation, 47(3):661–694.

Martschat, S. and Strube, M. (2014). Recall error analysis for coreference resolution. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 2070–2081, Doha.

Martschat, S. and Strube, M. (2015). Latent Structures for Coreference Resolution. Transactions of the Association for Computational Linguistics, 3:405–418.

McCarthy, J. and Lehnert, W. (1995). Using decision trees for coreference resolution. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1050–1055, Montreal.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, Lake Tahoe.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013c). Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta.

Mitkov, R. (1998). Robust Pronoun Resolution with Limited Knowledge. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 869–875, Morristown.

Mitkov, R. (2001). Towards a more consistent and comprehensive evaluation of anaphora resolution algorithms and systems. Applied Artificial Intelligence, 15(3):253–276.

Müller, M.-C. (2008). Fully Automatic Resolution of ’it’, ’this’, and ’that’ in Unrestricted Multi-Party Dialog. PhD thesis, University of Tübingen.

Müller, M.-C. and Strube, M. (2001). Annotating Anaphoric and Bridging Relations with MMAX. In Proceedings of the Second SIGdial Workshop on Discourse and Dialogue, pages 1–6, Aalborg.

Naumann, K. (2007). Manual for the annotation of in-document referential relations. Technical report, University of Tübingen.

Ng, V. (2010). Supervised Noun Phrase Coreference Research: The First Fifteen Years. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 1396–1411, Uppsala.

Ng, V. and Cardie, C. (2002a). Combining Sample Selection and Error-driven Pruning for Machine Learning of Coreference Rules. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 55–62, Pennsylvania.

Ng, V. and Cardie, C. (2002b). Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 104–111, Pennsylvania.

Nicolae, C. and Nicolae, G. (2006). BestCut: A Graph Algorithm for Coreference Resolution. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 275–283, Sydney.

Nicolov, N., Salvetti, F., and Ivanova, S. (2008). Sentiment analysis: Does coreference matter. In AISB 2008 Convention Communication, Interaction and Social Intelligence, pages 37–41, Aberdeen.

Olariu, A. (2014). Efficient Online Summarization of Microblogging Streams. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 236–240, Gothenburg.

Poesio, M. (2004). The MATE/GNOME proposals for anaphoric annotation, revisited. In Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue, pages 154– 162, Cambridge.

Pradhan, S., Luo, X., Recasens, M., Hovy, E., Ng, V., and Strube, M. (2014). Scoring coreference partitions of predicted mentions: A reference implementation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 30–35, Baltimore.

Pradhan, S., Moschitti, A., Xue, N., Ng, H. T., Björkelund, A., Uryupina, O., Zhang, Y., and Zhong, Z. (2013). Towards Robust Linguistic Analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152, Sofia.

Pradhan, S., Moschitti, A., Xue, N., Uryupina, O., and Zhang, Y. (2012). CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. In Proceedings of the Sixteenth Conference on Computational Natural Language Learning, Jeju.

Pradhan, S., Ramshaw, L., Marcus, M., Palmer, M., Weischedel, R., and Xue, N. (2011). CoNLL-2011 Shared Task: Modeling Unrestricted Coreference in OntoNotes. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 1–27, Portland.

Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D., and Manning, C. D. (2010). A Multi-Pass Sieve for Coreference Resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 492–501, Massachusetts.

Rahman, A. and Ng, V. (2009). Supervised Models for Coreference Resolution. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 968–977, Edinburgh.

Recasens, M. and Hovy, E. (2010). BLANC: Implementing the Rand index for coreference evaluation. Natural Language Engineering, 17(4):485–510.

Recasens, M., Marneffe, M. D., and Potts, C. (2013). The Life and Death of Discourse Entities: Identifying Singleton Mentions. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 627–633, Atlanta.

Recasens, M., Màrquez, L., Sapena, E., Martí, M. A., Taulé, M., Hoste, V., Poesio, M., and Versley, Y. (2010). SemEval-2010 task 1: Coreference resolution in multiple languages. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 1–8, Los Angeles.

Rösiger, I. and Riester, A. (2015). Using prosodic annotations to improve coreference resolution of spoken text. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 83–88, Beijing.

Rothenhäusler, K. and Schütze, H. (2009). Unsupervised Classification with Dependency Based Word Spaces. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 17–24, Athens.

Scheible, S., Schulte im Walde, S., Weller, M., and Kisselew, M. (2013). A compact but linguistically detailed database for German verb subcategorisation relying on dependency parses from Web corpora: Tool, guidelines and resource. In Proceedings of the 8th Web as Corpus Workshop, pages 63–72, Lancaster.

Schiehlen, M. (2004). Optimizing Algorithms for Pronoun Resolution. In Proceedings of the 20th International Conference on Computational Linguistics, pages 390–396, Geneva.

Schiller, A., Teufel, S., Stöckert, C., and Thielen, C. (1999). Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report, IMS Stuttgart.

Schneider, G. (2008). Hybrid long-distance functional dependency parsing. PhD thesis, University of Zurich.

Schulte im Walde, S. (2010). Comparing Computational Models of Selectional Preferences - Second-order Co-Occurrence vs. Latent Semantic Clusters. In Proceedings of the 7th International Conference on Language Resources and Evaluation, pages 1381–1388, Valletta, Malta.

Sennrich, R., Volk, M., and Schneider, G. (2013). Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 601–609, Hissar.

Skut, W., Krenn, B., Brants, T., and Uszkoreit, H. (1997). An Annotation Scheme for Free Word Order Languages. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 88–95, Washington.

Soon, W. M., Ng, H. T., and Lim, D. C. Y. (2001). A Machine Learning Approach to Coreference Resolution of Noun Phrases. Computational Linguistics, 27(4):521–544.

Stede, M. (2004). The Potsdam Commentary Corpus. In Proceedings of the ACL 2004 Workshop on Discourse Annotation, pages 96–102, Barcelona.

Stoyanov, V., Gilbert, N., Cardie, C., and Riloff, E. (2009). Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-art. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 656–664, Singapore.

Strube, M. and Hahn, U. (1999). Functional Centering: Grounding Referential Coherence in Information Structure. Computational Linguistics, 25(3):309–344.

Strube, M., Rapp, S., and Müller, C. (2002). The Influence of Minimum Edit Distance on Reference Resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 312–319, Philadelphia.

Stuckardt, R. (2001). Design and Enhanced Evaluation of a Robust Anaphor Resolution Algorithm. Computational Linguistics, 27(4):479–506.

Telljohann, H., Hinrichs, E., and Kübler, S. (2004). The TüBa-D/Z Treebank: Annotating German with a Context-Free Backbone. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, pages 2229–2232, Lisbon.

Tuggener, D. (2014). Coreference Resolution Evaluation for Higher Level Applications. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 231–235, Gothenburg.

Tuggener, D. and Klenner, M. (2012). Non-negative Matrix Factorisation-based Verb Semantics for 3rd Person Pronoun Resolution. In Proceedings of the 2012 IEEE Sixth International Conference on Semantic Computing, pages 250–254, Palermo.

Tuggener, D. and Klenner, M. (2014). A Hybrid Entity-Mention Pronoun Resolution Model for German Using Markov Logic Networks. In Proceedings of the 12th Edition of the KONVENS Conference, pages 21–29, Hildesheim.

Turney, P. D., Pantel, P., and Others (2010). From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research, 37(1):141–188.

Uryupina, O. (2007). Knowledge acquisition for coreference resolution. PhD thesis, University of Saarland.

Uryupina, O. (2008). Error Analysis for Learning-based Coreference Resolution. In Proceedings of the 6th edition of the Language Resources and Evaluation Conference, pages 1914–1919, Marrakech.

Van De Cruys, T. (2011). Two Multivariate Generalizations of Pointwise Mutual Information. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 16–20, Portland.

Versley, Y. (2006). Disagreement dissected: Vagueness as a source of ambiguity in nominal (co-)reference. In Ambiguity in Anaphora Workshop Proceedings, pages 83–89, Malaga.

Versley, Y. (2010). Resolving Coreferent Bridging in German Newspaper Text. PhD thesis, University of Tübingen.

Vilain, M., Burger, J., Aberdeen, J., Connolly, D., and Hirschman, L. (1995). A Model-Theoretic Coreference Scoring Scheme. In Proceedings of the 6th Conference on Message Understanding, pages 45–52, Columbia.

Volk, M., Bubenhofer, N., Althaus, A., Bangerter, M., Furrer, L., and Ruef, B. (2010). Challenges in building a multilingual alpine heritage corpus. In Seventh international conference on Language Resources and Evaluation, pages 1653–1659, Malta.

Webster, K. and Curran, J. R. (2014). Limited memory incremental coreference resolution. In Proceedings of the 25th International Conference on Computational Linguistics, pages 2129–2139, Dublin.

Wunsch, H. (2006). Anaphora resolution - What helps in German. In Pre-Proceedings of the International Conference on Linguistic Evidence 2006, pages 1–5, Tübingen.

Wunsch, H. (2010). Rule-based and Memory-based Pronoun Resolution for German: A Comparison and Assessment of Data Sources. PhD thesis, University of Tübingen.

Wunsch, H., Kübler, S., and Cantrell, R. (2009). Instance Sampling Methods for Pronoun Resolution. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 478–483, Borovets.

Yang, X., Su, J., Lang, J., Tan, C. L., Liu, T., and Li, S. (2008a). An Entity-Mention Model for Coreference Resolution with Inductive Logic Programming. In Proceedings of ACL-08: HLT, pages 843–851, Columbus.

Yang, X., Su, J., and Tan, C. L. (2008b). A Twin-Candidate Model for Learning-Based Anaphora Resolution. Computational Linguistics, 34(3):327–356.

Yang, X., Su, J., Zhou, G., and Tan, C. L. (2004a). An NP-cluster Based Approach to Coreference Resolution. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva.

Yang, X., Su, J., Zhou, G., and Tan, C. L. (2004b). Improving Pronoun Resolution by Incorporating Coreferential Information of Candidates. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pages 127–134, Barcelona.

Yang, X., Zhou, G., Su, J., and Tan, C. L. (2003). Coreference Resolution Using Competition Learning Approach. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 176–183, Sapporo.

Zesch, T. and Gurevych, I. (2006). Automatically Creating Datasets for Measures of Semantic Relatedness. In Proceedings of the Workshop on Linguistic Distances, pages 16–24, Sydney.