
Incremental Coreference Resolution for German

Thesis presented to the Faculty of Arts and Social Sciences of the University of Zurich for the degree of Doctor of Philosophy

by Don Tuggener

Accepted in the spring semester 2016 on the recommendation of the doctoral committee:
Prof. Dr. Martin Volk (main advisor)
PD Dr. Gerold Schneider

Zurich, 2016

"Choose a job you love, and you will never have to work a day in your life."
Confucius

Abstract

The main contributions of this thesis are as follows:

1. We introduce a general model for coreference and explore its application to German.
• The model features an incremental discourse processing algorithm which allows it to coherently address issues caused by underspecification of mentions, which is an especially pressing problem regarding certain German pronouns.
• We introduce novel features relevant for the resolution of German pronouns. A subset of these features is made accessible through the incremental architecture of the discourse processing model.
• In evaluation, we show that the coreference model combined with our features provides new state-of-the-art results for coreference and pronoun resolution for German.

2. We elaborate on the evaluation of coreference and pronoun resolution.
• We discuss evaluation from the view of prospective downstream applications that benefit from coreference resolution as a preprocessing component. Addressing the shortcomings of the general evaluation framework in this regard, we introduce an alternative framework, the Application Related Coreference Scores (ARCS).
• The ARCS framework enables a thorough comparison of different system outputs and the quantification of their similarities and differences beyond the common coreference evaluation. We demonstrate how the framework is applied to state-of-the-art coreference systems.
This provides a method to track specific differences in system outputs, which assists researchers in comparing their approaches to related work in detail.

3. We explore semantics for pronoun resolution.
• Within the introduced coreference model, we explore distributional approaches to estimate the compatibility of an antecedent candidate and the occurrence context of a pronoun. To this end, we compare a state-of-the-art approach for word embeddings to syntactic co-occurrence profiles.
• In comparison to related work, we extend the notion of context and thereby increase the applicability of our approach. We find that a combination of both compatibility models, coupled with the coreference model, provides a large potential for improving pronoun resolution performance.

We make available all our resources, including a web demo of the system, at: http://pub.cl.uzh.ch/purl/coreference-resolution

Zusammenfassung

The main contributions of this thesis are as follows:

1. The thesis introduces a general model for coreference resolution and explores its application to German.
• The model includes an incremental discourse processing algorithm that makes it possible to handle the underspecification of entity mentions coherently, which is a particular problem when processing German pronouns.
• A set of new features for the resolution of German pronouns is introduced. Some of these features become accessible through the incremental architecture of the discourse processing algorithm.
• The evaluation shows that the coreference model, coupled with the new features, achieves new state-of-the-art results for coreference resolution for German.

2. The thesis addresses the evaluation of coreference and pronoun resolution.
• The common evaluation is examined from the perspective of applications that benefit from coreference resolution as a preprocessing step. An alternative evaluation approach (ARCS) is proposed that addresses deficits in the common evaluation.
• The proposed evaluation approach enables an in-depth comparison of systems and quantifies similarities and differences between system outputs in a way that goes deeper than the common evaluation. This approach allows researchers to compare their systems with others in detail and from different points of view.

3. The thesis investigates the inclusion of semantics in pronoun resolution.
• Within the introduced coreference model, distributional semantics is explored as an approach to determining the compatibility of an antecedent candidate and the context of a pronoun. To this end, a state-of-the-art approach for computing vector representations of words is compared with syntactic co-occurrence profiles of words.
• Compared to related work, the definition of the pronoun context is extended, which increases the applicability of the approach. The evaluation shows that the combination of both compatibility models, coupled with the introduced coreference model, offers high potential for improving pronoun resolution results.

The resources created in the course of this thesis, including a web demo of the system, are available at: http://pub.cl.uzh.ch/purl/coreference-resolution

Acknowledgements

Being able to do whatever you want must be considered a privilege. At the Institute of Computational Linguistics of the University of Zurich, I felt that I was granted this privilege while I investigated the subject of this dissertation. For entrusting me with this privilege, I am entirely grateful to Dr. Manfred Klenner, the main supervisor of this thesis, and Prof. Dr. Martin Volk, the head of the institute.
I am indebted to Manfred Klenner for supporting my research from the get-go, which started with an innocent little PROLOG programming project for pronoun resolution and has led to the completion of this thesis. His input and feedback were invaluable to this thesis. Martin Volk, by seeing and connecting the dots at the right moment, provided the financial (but also moral) support for this thesis. His motion to support collaborations of students and senior members of the institute made this thesis possible.

Being able to do whatever you want also has a sort of negative connotation, and perhaps rightfully so. This thesis was mainly written detached from any major project. This comes with the advantages of freedom from project deadlines and milestones, and independence from other project members. It also comes with the disadvantages of freedom from project deadlines and milestones, and independence from other project members: it requires a considerable amount of self-discipline and self-motivation. I am grateful to my family, especially my wife Tamara and our two PhD production babies, Siro and Mona, for keeping me motivated and determined. With that in mind, I dedicate this thesis to them.

I would also like to thank everyone at the Institute of Computational Linguistics at the University of Zurich for the companionship while researching and writing this thesis. Special thanks go to Dr. Simon Clematide, who introduced me to wapiti, and Johannes Graën, who kept the institute servers running.

Finally, I thank PD Dr. Gerold Schneider and Prof. Dr. Manfred Stede, who agreed to review this thesis and who accepted my invitation to join the PhD committee.

Contents

Abstract
Zusammenfassung
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Problem description
  1.2 Research questions and hypotheses
  1.3 Thesis outline
  1.4 Coreference and anaphoricity
    1.4.1 A brief linguistic introduction
      1.4.1.1 Re-occurrence of words
      1.4.1.2 Word substitution
      1.4.1.3 Pronominalization
    1.4.2 Significance for subsequent applications in CL and NLP
  1.5 Architecture of automated coreference resolution systems
    1.5.1 Gold standard annotation
    1.5.2 Preprocessing
    1.5.3 Markable extraction
    1.5.4 Markable resolution
  1.6 Chapter summary

2 Discourse processing models for coreference resolution
  2.1 Mention-pair model
    2.1.1 Issues of the mention-pair model
      2.1.1.1 Underspecification of antecedent candidates
      2.1.1.2 Redundant instances and skewed training sets
  2.2 Overcoming the limitations of the mention-pair model
    2.2.1 Mention ranking model
    2.2.2 Mention clustering models
    2.2.3 Entity-mention models
  2.3 Our incremental entity-mention model for German coreference resolution
  2.4 Chapter summary

3 Coreference resolution evaluation
  3.1 Issues in evaluation from the perspective of higher-level applications
  3.2 The ARCS evaluation framework for coreference resolution
    3.2.1 Quantification of the difference between system responses
    3.2.2 Assessment of feature potential
    3.2.3 Comparison of multiple system responses
  3.3 Error analysis in coreference resolution
  3.4 Evaluation of pronoun resolution
    3.4.1 Ratio-based evaluation