Coreference: Theory, Annotation, Resolution and Evaluation

by

Marta Recasens Potau

A Dissertation Presented to the Doctoral Program in Linguistics and Communication, Department of Linguistics, University of Barcelona, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

under the supervision of Dr. M. Antònia Martí Antonín (University of Barcelona) and Dr. Eduard Hovy (ISI – University of Southern California)

University of Barcelona, September 2010

Coreferència: Teoria, anotació, resolució i avaluació

per

Marta Recasens Potau

Memòria presentada dins del Programa de Doctorat Lingüística i Comunicació, Bienni 2006–2008, Departament de Lingüística General, Universitat de Barcelona, per optar al grau de Doctor

sota la direcció de la Dra. M. Antònia Martí Antonín (Universitat de Barcelona) i el Dr. Eduard Hovy (ISI – University of Southern California)

Universitat de Barcelona, Setembre 2010

Counting words isn't very revealing if you aren't listening to them, too.
– Geoffrey Nunberg, "The I's Don't Have It", Fresh Air (November 17, 2009)

Usage factors reveal language as a natural, organic social instrument, not an abstract logical one.
– Joan Bybee, Language, Usage and Cognition (2010:193)

To my loved ones, Mercè, Eduard, Elm and Mark, for being there.

Abstract

Coreference relations, as commonly defined, occur between linguistic expressions that refer to the same person, object or event. Resolving them is an integral part of discourse comprehension, as it allows language users to connect the pieces of discourse information concerning the same entity. Consequently, coreference resolution has become a major focus of attention in natural language processing, where it constitutes a task of its own. Despite the wealth of existing research, the current performance of coreference resolution systems has not reached a satisfactory level.

The thesis is broadly divided into two parts. In the first part, I examine three separate but closely related aspects of the coreference resolution task, namely (i) the encoding of coreference relations in large electronic corpora, (ii) the development of learning-based coreference resolution systems, and (iii) the scoring and evaluation of coreference systems. Throughout this research, insight is gained into foundational problems in the coreference resolution task that pose obstacles to its feasibility. Hence, my main contribution resides in a critical but constructive analysis of various aspects of the coreference task that, in the second part of the thesis, leads to rethinking the concept of coreference itself.

First, the annotation of the Spanish and Catalan AnCora corpora (totaling nearly 800k words) with coreference information reveals that the concept of referentiality is not a clear-cut one, and that some relations encountered in real data do not fit the prevailing either-or view of coreference. Degrees of referentiality, as well as relations that do not fall neatly into either coreference or non-coreference (or that accept both interpretations), are a major reason for the lack of inter-coder agreement in coreference annotation.

Second, experiments on the contribution of over forty-five learning features to coreference resolution show that, although the extended set of linguistically motivated features results in an overall significant improvement, this is smaller than

expected. In contrast, the simple head-match feature alone succeeds in obtaining a quite satisfactory score. It emerges that head match is one of the few features sufficiently represented for machine learning to work. The complex interplay between factors, and the fact that pragmatic and world knowledge do not lend themselves to being captured systematically in the form of pairwise learning features, are indicators that the way machine learning is currently applied may not be well suited to the coreference task. I advocate for entity-based systems like the one presented in this thesis, CISTELL, as the model best suited to address the coreference problem. CISTELL allows not only the accumulation and carrying of information from "inside" the text, but also the storing of background and world knowledge from "outside" the text.

Third, further experiments, as well as the SemEval shared task, demonstrate that the current evaluation of coreference resolution systems is obscured by a number of factors, including variations in the task definition, the use of gold-standard or automatically predicted mention boundaries, and the disagreement between the system rankings produced by the widely used evaluation metrics (MUC, B3, CEAF). The imbalance between the number of singletons and multi-mention entities in the data accounts for measurement biases toward either over- or under-clustering. The BLANC measure that I propose, which is a modified implementation of the Rand index, addresses this imbalance by dividing the score into coreference and non-coreference links.

Finally, the second part of the thesis concludes that abandoning the traditional categorical understanding of coreference is the first step to further the state of the art. To this end, the notion of near-identity is introduced within a continuum model of coreference. From a cognitive perspective, I argue for the variable granularity level at which discourse entities can be conceived. It is posited that three different categorization operations (specification, refocusing and neutralization) govern the shifts that discourse entities undergo as a discourse evolves, and so account for (near-)coreference relations. This new continuum model provides sound theoretical foundations for the coreference problem, in both the linguistic and the computational fields.

Resum

Les relacions de coreferència, segons la definició més comuna, s'estableixen entre expressions lingüístiques que es refereixen a una mateixa persona, objecte o esdeveniment. Resoldre-les és una part integral de la comprensió del discurs ja que permet als usuaris de la llengua connectar les parts del discurs que contenen informació sobre una mateixa entitat. En conseqüència, la resolució de la coreferència ha estat un focus d'atenció destacat del processament del llenguatge natural, on té una tasca pròpia. Tanmateix, malgrat la gran quantitat de recerca existent, els resultats dels sistemes actuals de resolució de la coreferència no han assolit un nivell satisfactori.

La tesi es divideix en dos grans blocs. En el primer, examino tres aspectes diferents però estretament relacionats de la tasca de resolució de la coreferència: (i) l'anotació de relacions de coreferència en grans corpus electrònics, (ii) el desenvolupament de sistemes de resolució de la coreferència basats en aprenentatge automàtic i (iii) la qualificació i avaluació dels sistemes de coreferència. En el transcurs d'aquesta investigació, es fa evident que la tasca de coreferència presenta una sèrie de problemes de base que constitueixen veritables obstacles per a la seva correcta resolució. Per això, la meva aportació principal és una anàlisi crítica i alhora constructiva de diferents aspectes de la tasca de coreferència que finalment condueix, en el segon bloc de la tesi, al replantejament del concepte mateix de coreferència.

En primer lloc, l'anotació amb coreferència dels corpus AnCora del castellà i el català (un total de 800.000 paraules) posa al descobert, d'una banda, que el concepte de referencialitat no està clarament delimitat i, d'una altra, que algunes relacions observades en dades d'ús real no encaixen dins la visió de la coreferència entesa en termes dicotòmics. Tant els graus de referencialitat com les relacions

que no són ni coreferencials ni no coreferencials (o que accepten totes dues interpretacions) són una de les raons principals que dificulten assolir un alt grau d'acord entre els anotadors d'aquesta tasca.

En segon lloc, els experiments realitzats sobre la contribució de més de quaranta-cinc trets d'aprenentatge automàtic a la resolució de la coreferència mostren que, tot i que el ventall de trets motivats lingüísticament porta a una millora significativa general, aquesta és més petita que l'esperada. En canvi, el senzill tret de mateix-nucli (head match) aconsegueix tot sol resultats prou satisfactoris. D'això se'n desprèn que es tracta d'un dels pocs trets suficientment representats per al bon funcionament de l'aprenentatge automàtic. La interacció complexa que es dóna entre els diversos factors, així com el fet que el coneixement pragmàtic i del món no es deixa representar sistemàticament en forma de trets d'aprenentatge de parells de mencions, són indicadors que la manera en què actualment s'aplica l'aprenentatge automàtic pot no ser especialment idònia per a la tasca de coreferència. Per això, considero que el millor model per adreçar el problema de la coreferència correspon als sistemes basats en entitats com CISTELL, que presento a la tesi. Aquest sistema permet no només emmagatzemar informació de "dins" del text sinó també recollir coneixement general i del món de "fora" del text.

En tercer lloc, altres experiments així com la tasca compartida del SemEval demostren l'existència de diversos factors que qüestionen la manera en què actualment s'avaluen els sistemes de resolució de la coreferència. Es tracta de variacions en la definició de la tasca, l'extracció de mencions a partir de l'estàndard de referència o predites automàticament, i el desacord entre els rànquings de sistemes donats per les mètriques d'avaluació més utilitzades (MUC, B3, CEAF). La desigualtat entre el nombre d'entitats unàries i el nombre d'entitats de múltiples mencions explica el biaix de les mesures o bé cap a un dèficit o bé cap a un excés de clusters. La mesura BLANC que proposo, una implementació modificada de l'índex de Rand, corregeix aquest desequilibri dividint la puntuació final entre relacions de coreferència i de no coreferència.

Finalment, la segona part de la tesi arriba a la conclusió que l'abandó de la visió tradicional i dicotòmica de la coreferència és el primer pas per anar més enllà de l'estat de l'art. Amb aquest objectiu s'introdueix la noció de quasi-identitat i s'ubica en un model de la coreferència entesa com a contínuum. Des d'una perspectiva cognitiva, dono raons a favor del nivell variable de granularitat en què concebem les entitats discursives. Es postulen tres operacions de categorització –l'especificació, el reenfocament i la neutralització– que regeixen els canvis que les entitats discursives experimenten a mesura que avança el discurs i, per tant, permeten explicar les relacions de (quasi-)coreferència. Aquest nou model proporciona fonaments teòrics sòlids al problema de la coreferència, tant en el camp lingüístic com en el computacional.

Acknowledgments

To cook my thesis, I have had the opportunity to work under the supervision of two very talented and enthusiastic chefs, M. Antònia Martí and Ed Hovy. I am deeply grateful to them for revealing to me the secret ingredients that make research fascinating, and for always being present—physically or electronically—to teach me treasured recipes that have become part of my cooking style. Over the past four years, they have guided me through all the steps required to prepare an appetizing thesis: planning, creativity, decision making, elegance, patience, artful serving, and careful and slow cooking. Thank you for your dedication to helping me master this art and for being much more than advisors.

Many thanks also to Costanza Navarretta, Massimo Poesio, and Mariona Taulé for accepting to be on my advisory committee. My sincere gratitude for your feedback and the ideas you contributed at various moments during the cooking process, which have helped make this a nutritious experience.

I have been very lucky to meet and interact with other extraordinarily skillful chefs who have suggested excellent recipes to try or have given sound advice on how to improve a dish. I am indebted to Jerry Hobbs, Horacio Rodríguez, Mihai Surdeanu, Lluís Màrquez, Ruslan Mitkov, Vincent Ng, Véronique Hoste, Antal van den Bosch, Olga Uryupina, and Manu Bertran. A huge thank you goes to Edgar Gonzàlez, who has always been willing to help me debug code whenever I got stuck on some computer problem. Thank you for teaching me how to taste Java.

I also feel very fortunate to have been surrounded by fantastic colleagues at two of the finest kitchens, the Linguistics Department at the University of Barcelona

and the Information Sciences Institute at the University of Southern California. Warm thanks go to my officemates Marta Vila and Aina Peris, with whom I have shared countless hours of friendship, hard work, fun, and tears. I have greatly enjoyed the company of many other colleagues, and I would like to thank them all for creating an inspiring working atmosphere: José Luis Ambite, Erika Barragan-Nunez, Rahul Bhagat, Oriol Borrega, Gully Burns, Congxing Cai, William Chang, Glòria de Valdivia, Steve Deneefe, Paramveer Dhillon, Victoria Fossum, Andrew Goodney, Paul Groth, Ulf Hermjakob, Dirk Hovy, Liang Huang, Tommy Ingulfsen, Zori Kozareva, Adam Lammert, Jon May, Rutu Mulkar-Mehta, Oana Nicolov, Montserrat Nofre, Anselmo Peñas, David Pynadath, Sujith Ravi, Santi Reig, John Roberto, Tom Russ, Emili Sapena, Ashish Vaswani, and Rita Zaragoza. I also extend my sincere gratitude to my "conference friends" Constantin Orasan, Laura Hasler, and Marta Ruiz Costa-Jussà, with whom I had fruitful discussions while enjoying Bulgarian, Czech, Moroccan, and Indian food. I want to single out Constantin for taking the time to read my dissertation and making a number of valuable suggestions.

I would like to thank my dearest parents, from whom I learned to savor life's little moments, for being my very first teachers and for always encouraging me to pursue my intellectual interests. Thanks also to my brother Elm, who knows real cooking, for making me the most delicious black rice that kept my energy levels up. I want to thank my great friends for not losing the habit of getting together for dinner, and for making an effort to understand what I spent so many hours working on. A special thanks goes to Laura, Joana, Sara, Lali, Cristina, Marc, Carlota, Marta, Maria, Laia, Manel, Antonio, Tina, and Martí (for all those glasses of orxata!). Last but certainly not least, thank you Mark for adding that magic touch of joy and love and for bringing your flavor into my life.

* * *

This work was supported by an FPU Grant (AP2006-00994) from the Spanish Ministry of Education and Science.

Contents

Abstract

Resum

Acknowledgments

List of Tables

List of Figures

1 Introduction
    1.1 Point of departure
    1.2 Thesis outline
    1.3 Key terminology
    1.4 Related work
        1.4.1 Corpus creation
        1.4.2 Learning features
        1.4.3 Classification and clustering
        1.4.4 Evaluation
    1.5 Connecting thread
        1.5.1 Methodological framework
        1.5.2 Signs of disruption
        1.5.3 New directions
    1.6 Major contributions

I CORPUS ANNOTATION WITH COREFERENCE

2 AnCora-CO: Coreferentially Annotated Corpora for Spanish and Catalan
    2.1 Introduction
    2.2 The corpora
    2.3 Linguistic issues
    2.4 Annotation scheme
        2.4.1 Mentions
        2.4.2 Coreference chains
    2.5 Annotation tool
    2.6 Distributional statistics
    2.7 Inter-annotator agreement
        2.7.1 Reliability study
        2.7.2 Sources of disagreement
    2.8 Conclusions

II COREFERENCE RESOLUTION AND EVALUATION

3 A Deeper Look into Features for Coreference Resolution
    3.1 Introduction
    3.2 Previous work
    3.3 Pairwise comparison features
    3.4 Experimental evaluation
        3.4.1 Sample selection
        3.4.2 Feature selection
        3.4.3 Model reliability
    3.5 Conclusion

4 Coreference Resolution across Corpora: Languages, Coding Schemes, and Preprocessing Information
    4.1 Introduction
    4.2 Background
    4.3 Experimental setup
        4.3.1 System description
        4.3.2 Baselines and models
        4.3.3 Features
        4.3.4 Evaluation
    4.4 Parameter 1: Language
    4.5 Parameter 2: Annotation scheme
    4.6 Parameter 3: Preprocessing
    4.7 Conclusion

5 BLANC: Implementing the Rand Index for Coreference Evaluation
    5.1 Introduction
    5.2 Coreference resolution and its evaluation: an example
    5.3 Current measures and desiderata for the future
        5.3.1 Current measures: strong and weak points
        5.3.2 Desiderata for a coreference evaluation measure
    5.4 BLANC: BiLateral Assessment of Noun-phrase Coreference
        5.4.1 Implementing the Rand index for coreference evaluation
        5.4.2 Identification of mentions
    5.5 Discriminative power
        5.5.1 Results on artificial data
        5.5.2 Results on real data
        5.5.3 Plots
    5.6 Conclusion

6 SemEval-2010 Task 1: Coreference Resolution in Multiple Languages
    6.1 Introduction
    6.2 Linguistic resources
        6.2.1 Source corpora
        6.2.2 Preprocessing systems
    6.3 Task description
        6.3.1 Data format
        6.3.2 Evaluation settings
        6.3.3 Evaluation metrics
    6.4 Participating systems
    6.5 Results and evaluation
    6.6 Conclusions

III COREFERENCE THEORY

7 On Paraphrase and Coreference
    7.1 Introduction
    7.2 Converging and diverging points
        7.2.1 Meaning and
        7.2.2 Sameness
        7.2.3 Linguistic units
        7.2.4 Discourse function
    7.3 Mutual benefits

8 Identity, Non-identity, and Near-identity: Addressing the complexity of coreference
    8.1 Introduction
    8.2 Background
        8.2.1 What reference is about
        8.2.2 Categorizing the projected world
        8.2.3 Building DEs
        8.2.4 Identity in the discourse model
        8.2.5 Summary
    8.3 Coreference along a continuum
        8.3.1 Definition
        8.3.2 Fauconnier's mental spaces
        8.3.3 Continuum
        8.3.4 Specification, refocusing and neutralization
    8.4 Types of (near-)identity relations
    8.5 Stability study
        8.5.1 Method
        8.5.2 Results and discussion
    8.6 Conclusion

9 Conclusions and Future Directions
    9.1 Conclusions
    9.2 Future directions


Bibliography

Appendices

A System Outputs
    A.1 OntoNotes file nbc 0030
    A.2 OntoNotes file voa 0207

B Near-identity Excerpts
    B.1 Experiment 1
    B.2 Experiment 2
    B.3 Experiment 3

C Answers to the Near-identity Task
    C.1 Experiment 1
    C.2 Experiment 2
    C.3 Experiment 3

List of Tables

1.1 Summary of the largest coreferentially annotated corpora
1.2 Summary of seminal coreference feature sets
1.3 Summary of coreference resolution system performances
1.4 Information contained in a basket
1.5 BLANC results for Table 4.3, Chapter 4
1.6 BLANC results for Table 4.5, Chapter 4
1.7 BLANC results for Table 4.7, Chapter 4

2.1 Coverage of different coreference coding schemes
2.2 Sample of mentions with an identity link (AnCora-CO-Es)
2.3 Distribution of mentions according to POS and chain position (%)
2.4 Distribution of coreftype and corefsubtype tags (%)
2.5 Distribution of entity tags according to number of mentions (%)
2.6 Partial agreement matrix
2.7 Observed coincidence matrix (Text 1)
2.8 Expected coincidence matrix (Text 1)
2.9 Delta matrix (Text 1)

3.1 Classical features
3.2 Language-specific features
3.3 Corpus-specific features
3.4 Novel features
3.5 Characteristics of the AnCora-Es datasets
3.6 Distribution of representative and balanced data sets
3.7 Effect of sample selection on performance
3.8 Results of the forward selection procedure

4.1 Corpus statistics for the large portion of OntoNotes and AnCora
4.2 Mention types (%) in Table 4.1 datasets
4.3 CISTELL results varying the corpus language
4.4 Corpus statistics for the aligned portion of ACE and OntoNotes on gold-standard data
4.5 CISTELL results varying the annotation scheme on gold-standard data
4.6 Corpus statistics for the aligned portion of ACE and OntoNotes on automatically parsed data
4.7 CISTELL results varying the annotation scheme on automatically preprocessed data

5.1 Comparison of evaluation metrics on the examples in Fig. 5.3
5.2 Distribution of mentions into entities in ACE-2004 and AnCora-Es
5.3 Performance of state-of-the-art coreference systems on ACE
5.4 The BLANC confusion matrix
5.5 The BLANC confusion matrix for the example in Fig. 5.1
5.6 Definition: Formula for BLANC
5.7 Performance of the example in Fig. 5.1
5.8 Different system responses for a gold standard Gold1
5.9 Decomposition of the system responses in Table 5.8
5.10 Performance of the systems in Table 5.8
5.11 P and R scores for the systems in Table 5.8
5.12 Different system responses for a gold standard Gold2
5.13 Performance for the systems in Table 5.12
5.14 Different coreference resolution models run on ACE-2004
5.15 Decomposition of the system responses in Table 5.14
5.16 Performance of state-of-the-art systems on ACE according to BLANC

6.1 Size of the task datasets
6.2 Format of the coreference annotations
6.3 Main characteristics of the participating systems
6.4 Baseline scores
6.5 Official results of the participating systems

7.1 Paraphrase–coreference matrix

8.1 Results of Experiment 3

C.1 Coders' answers to Experiment 1
C.2 Coders' answers to Experiment 2
C.3 Coders' answers to Experiment 3

List of Figures

1.1 Depiction of CISTELL's coreference resolution process
1.2 Depiction of CISTELL's basket growing process
1.3 Pairwise scatterplots of MUC, B3, CEAF and BLANC

2.1 XML file format
2.2 Left screenshot of the coreference annotation tool in AnCoraPipe
2.3 Right screenshot of the coreference annotation tool in AnCoraPipe

5.1 Example of coreference (from ACE-2004)
5.2 The problem of comparing the gold partition with the system partition for a given text (Fig. 5.1)
5.3 Example entity partitions (from Luo (2005))
5.4 An example not satisfying constraint (3)
5.5 An example not satisfying constraint (4)
5.6 The BLANC score curve as the number of right coreference links increases
5.7 The BLANC score surface for data from Table 5.8

8.1 Language and cognition
8.2 Mental space configuration of (4)
8.3 Mental space configurations of the coreference continuum
8.4 Mental space configurations of (5) and (6)
8.5 Mental space configurations of (7-a) and (7-b)
8.6 Mental space configuration of (12)

CHAPTER 1

Introduction

This thesis is about coreference. It is the story of a project that set out to annotate a corpus with coreference relations in order to train the first machine-learning coreference resolution systems for Spanish and Catalan, but that ended up developing an alternative theoretical view of coreference. This view grew from the felt need to revisit the definition of coreference, to reconsider what can be solved automatically, and to rethink evaluation criteria. The whys and wherefores of this turn are covered in the next 200 pages. The structure of this thesis is intentionally chronological so as to capture the four-year development that shaped the arguments I put forward. The original idea gradually changed, leading to the completion of this thesis. Thus, I begin by placing readers in the same position I was in when I started, so that they can follow the logical progression from beginning to end.

1.1 Point of departure

The point of departure for this thesis was the problem of coreference resolution, one of the challenging tasks for Natural Language Processing (NLP). It is usually defined as either "the problem of identifying which noun phrases (NPs) or mentions refer to the same real-world entity in a text or dialogue" (Ng, 2009; Stoyanov et al., 2009; Finkel and Manning, 2008) or, in slightly different terms, "the task of grouping all the mentions of entities in a document into equivalence classes so that all the mentions in a given class refer to the same discourse entity" (Bengtson and Roth, 2008; Denis and Baldridge, 2009). Accordingly, mentions 1, 2 and 3 in (1) are said to corefer, as all three refer to Eyjafjallajökull.1

1 Because coreference is a discourse phenomenon, the examples provided throughout cannot usually be kept short.

(1) [The Eyjafjallajökull volcano, one of Iceland's largest,]1 had been dormant for nearly two centuries before returning gently to life in the late evening of March 20, 2010, noticeable at first by the emergence of a red cloud glowing above the vast glacier that covers [it]2. In the following days, fire fountains jetted from a dozen vents on [the volcano]3, reaching as high as 100 meters.2

Note that different linguistic expressions are used: a proper noun, a pronoun, and a definite NP. This, however, is not a requirement for coreference: mentions 1, 2 and 3 in (2) would also be coreferent, the only difference being a loss of discourse cohesion.
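To make the equivalence-class view quoted above concrete, the partition that a resolver should produce for (1) can be pictured as a small data structure. The sketch below is purely illustrative and not part of the thesis or of any cited system; the mention indices and the extra singleton entity are assumptions made for the example.

```python
# Illustrative only: the output of coreference resolution viewed as a partition
# of mentions into entities (equivalence classes). Indices are hypothetical.
mentions = {
    1: "The Eyjafjallajokull volcano, one of Iceland's largest",
    2: "it",
    3: "the volcano",
    4: "the vast glacier",   # refers to a different entity than mentions 1-3
}

# An entity is the set of mentions that share the same referent.
entities = [
    {1, 2, 3},   # the multi-mention entity for the volcano
    {4},         # a singleton entity
]

def corefer(i, j):
    """Two mentions corefer iff they belong to the same equivalence class."""
    return any(i in entity and j in entity for entity in entities)

assert corefer(1, 3) and not corefer(1, 4)
```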

(2) [Eyjafjallajökull]1 had been dormant for nearly two centuries before returning gently to life in the late evening of March 20, 2010, noticeable at first by the emergence of a red cloud glowing above the vast glacier that covers [Eyjafjallajökull]2. In the following days, fire fountains jetted from a dozen vents on [Eyjafjallajökull]3, reaching as high as 100 meters.

Coreference, like identity, is defined as an either-or relation: two mentions are either coreferent (i.e., they have the same referent) or non-coreferent (i.e., they have different referents). Early work on coreference resolution derived from the sister task of anaphora resolution (Mitkov, 2002), which involves solving the reference of (anaphoric) pronouns and definite NPs whose interpretation depends on a previous expression, i.e., identifying their antecedent in the text. Although related, coreference resolution goes one step further as it requires resolving the reference of all mentions in the text (pronouns, proper nouns, definite and indefinite NPs, etc.), including those that do not depend on another expression for their interpretation. We, as language users, can quickly and unconsciously work out the reference of every linguistic expression, linking the information provided by those that refer to the same entity. However, the underlying process of how this is done is still unclear. The question of how to make explicit, in a systematic way, the knowledge behind such practices remains a difficult one, and hence the challenge that coreference resolution poses to NLP. There is nonetheless a strong interest in automatically identifying coreference links, as they are key to "understanding" a text and are therefore needed by NLP applications such as information extraction (McCarthy and Lehnert, 1995), text summarization (Azzam et al., 1999; Steinberger et al., 2007), question answering (Morton, 2000; Vicedo and Ferrández, 2006) and machine translation, where the antecedent of a pronoun has to be identified before it can be translated. Coreference links are also useful for other tasks like sentiment analysis (Nicolov et al., 2008), textual entailment (Mirkin et al., 2010; Abad et al., 2010), citation matching and databases (Wick et al., 2009), machine reading (Poon et al., 2010), for learning narrative schemas (Chambers and Jurafsky, 2008) and for recovering implicit

2 The New York Times (April 20, 2010).

arguments (Gerber and Chai, 2010; Ruppenhofer et al., 2010).

Adding to the existing research on coreference resolution, this thesis arose from the interest in improving state-of-the-art coreference resolution by making extensive use of machine learning in a linguistically informed way. My aim was to uncover underlying patterns of coreference relations and to generalize how different linguistic cues interact and are weighed against each other. In addition, English being the primary focus of NLP research, I was also motivated by the need to develop language resources for Spanish and Catalan, such as a coreferentially annotated corpus and a coreference resolution system.

Corpus annotation was seen as an opportunity to bring new perspectives to coreference resolution. My working hypothesis was that annotating coreference relations by placing emphasis on both the quantity of the annotated data and the quality of this annotation would have an immediate positive effect on the model learned by machine learning methods and, in turn, on the performance of coreference resolution systems. Apart from the quantitative aspect, which mainly implied a time-consuming and expensive procedure, the challenge of annotation came down to gaining a full understanding of the coreference phenomenon:

Statistical models of anaphora resolution so far have only scratched the surface of the phenomenon, and the contributions to the linguistic understanding of the phenomenon have been few. (Poesio et al., forthcoming:85)

The notions of coreference and anaphora are difficult to define precisely and to operationalize consistently. Furthermore, the connections between them are extremely complex. (Stoyanov et al., 2009:657)

In my approach to the problem, I conducted a study of empirical cases and considered language-specific properties in order to define a linguistically accurate coding scheme (Chapter 2). I wanted annotation to escape from computational demands such as those imposed by the MUC guidelines (Hirschman and Chinchor, 1997), which make the coreference task definition dependent on supporting, as the first priority, the MUC information extraction tasks. At a later stage, and at a more theoretical level, I was driven by the challenges of comparing coreference and paraphrase, another phenomenon with which it bears some resemblance (Chapter 7), and of properly defining "identity of reference," a notion taken for granted but conceptually very complex (Chapter 8): Do Postville and the old Postville refer to the same entity? Is the broken glass the same as the unbroken piece? These are long-debated questions captured by so-called identity paradoxes such as Heraclitus' river and Theseus's ship.

Armed with a gold-standard corpus—annotated not only with coreference relations but also with morphological, syntactic and semantic information—I faced the challenge of resolution. On the one hand, I intended to draw on the annotation experience to define linguistically motivated features (Chapter 3). My interest was in exploring the feature space with the help of machine learning techniques, which

are known for their efficiency in handling large numbers of features. On the other hand, the limitations of mention-pair models were to be addressed by designing an entity-based system that would take whole entity clusters into consideration (Chapter 4). In this way, more linguistic information could be used to decide whether or not to add a mention into an entity.

As soon as the first performance scores of the prototype system presented in this thesis were obtained, I met the challenge of evaluation. Although several metrics exist to measure the performance of coreference resolution systems (Vilain et al., 1995; Bagga and Baldwin, 1998; Luo, 2005), there is still no agreement on a standard. Criteria needed to be established to assess the quality of a coreference output: What should be more rewarded in (1), linking as coreferent mentions 1, 2 and 3 plus the vast glacier, or linking only mentions 1 and 3? Or what should be more penalized, linking mention 1 and the emergence of a red cloud glowing above the vast glacier that covers it, or linking the late evening of March 20, 2010 and the vast glacier? Despite the many published results in the literature, different assumptions in the evaluation methodology hinder an appropriate comparison of state-of-the-art performance scores. I applied several criteria to analyze the pros and cons of the currently used coreference metrics (Chapter 5). The official SemEval shared task (Chapter 6) provided further insight into the challenges posed by evaluation and system comparisons (Chapter 4).

1.2 Thesis outline

The present thesis consists of a collection of seven papers sandwiched between an introductory and a concluding chapter that provide the necessary glue to make the thesis constitute a whole. The seven papers are the following:

Part I: Corpus Annotation with Coreference

1. Recasens, Marta and M. Antònia Martí. 2010. AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315–345.

Part II: Coreference Resolution and Evaluation

2. Recasens, Marta and Eduard Hovy. 2009. A deeper look into features for coreference resolution. In S. Lalitha Devi, A. Branco, and R. Mitkov (eds.), Anaphora Processing and Applications (DAARC 2009), LNAI 5847:29–42. Springer-Verlag, Berlin.

3. Recasens, Marta and Eduard Hovy. 2010. Coreference resolution across corpora: Languages, coding schemes, and preprocessing information. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), pages 1423–1432, Uppsala, Sweden.


4. Recasens, Marta and Eduard Hovy. To appear. BLANC: Implementing the Rand index for coreference evaluation. Natural Language Engineering.

5. Recasens, Marta, Lluís Màrquez, Emili Sapena, M. Antònia Martí, Mariona Taulé, Véronique Hoste, Massimo Poesio, and Yannick Versley. 2010. SemEval-2010 Task 1: Coreference resolution in multiple languages. In Proceedings of the ACL 5th International Workshop on Semantic Evaluation (SemEval 2010), pages 1–8, Uppsala, Sweden.

Part III: Coreference Theory

6. Recasens, Marta and Marta Vila. 2010. On paraphrase and coreference. Computational Linguistics, 36(4):639–647.

7. Recasens, Marta, Eduard Hovy, and M. Antònia Martí. In revision. Identity, non-identity, and near-identity: Addressing the complexity of coreference. Submitted to Lingua.

The first six papers have been or soon will be published in peer-reviewed journals or conference proceedings, and the last one is currently under review. They are co-authored by either one or both of my advisors, with two exceptions: paper 5 required collaboration between several research groups to organize the shared task, and paper 6 resulted from joint work with another graduate student at the University of Barcelona who researches paraphrase. In all cases I am listed as the first author. The papers are reprinted here reformatted to make the typography of the thesis consistent, and the bibliographies and appendices are integrated into a single bibliography and appendix section at the end.

The thesis is organized in three parts. The first part, comprising the next chapter, focuses on the annotation of corpora with coreference information, using the case of the Spanish and Catalan AnCora corpora. The second part of the thesis, comprising Chapters 3 through 6, discusses my experience in developing and evaluating coreference resolution systems. More specifically, I concentrate on the feature set definition, the interdependence between system and corpus parameters, the behavior of coreference evaluation metrics, and the SemEval shared task that set up a testbed to compare different systems. The third part of the thesis, corresponding to Chapters 7 and 8, provides a better theoretical understanding of coreference, first by delimiting the scope of the concept as opposed to that of paraphrase, and second by presenting a continuum approach to coreference that introduces the notion of near-identity and opens up a new avenue for future research.

In the remainder of this chapter, I first define key terms in coreference research and review previous work to put my work into perspective. Then, I provide the links and connections between the seven papers by making explicit how the outcomes of the different stages influenced each other and led me to pursue new directions. Finally, I recapitulate the major contributions to the state of the art.

1.3 Key terminology

To begin with, and in order to clarify notions commonly used in the field of coreference resolution, I give a brief explanation of the terms relevant to this study and offer some remarks on them.

Mention and entity The MUC and ACE programs3 (Hirschman and Chinchor, 1997; Doddington et al., 2004) have popularized these two terms in the field of coreference resolution. As defined by ACE, an entity is "an object or set of objects in the world." In addition, the ACE program restricts entities to a few specific types (person, organization, location, etc.). A mention, on the other hand, is a textual reference to an entity. In other words, then, an entity corresponds to the collection of mentions referring to the same object. A couple of observations are in order. First, mentions are referential NPs, which means that they exclude expletive pronouns (e.g., It is raining), attributive or predicative NPs (e.g., He is a member of the company), and idiomatic NPs (e.g., It's raining cats and dogs). Some approaches, however, assume a broader interpretation and use "mention" as a synonym for "NP." Second, the claim that entities are in the world requires further consideration, as discussed next.

Discourse model and discourse entity/referent Discourse representation theories, concerned with the representation of discourse and the processes involved in the comprehension and production of discourse, consider that linguistic reference is not a mapping from linguistic expressions to the real world but to constructs in the discourse model built as the text progresses. In the following excerpt, Prince (1981:235) summarizes the basic notions:

Let us say that a text is a set of instructions from a speaker to a hearer on how to construct a particular discourse model. The model will contain discourse entities, attributes, and links between entities. A discourse entity is a discourse-model object, akin to Karttunen's (1976) discourse referent; it may represent an individual (existent in the real world or not), a class of individuals, an exemplar, a substance, a concept, etc. Following Webber (1979) entities may be thought of as hooks on which to hang attributes. All discourse entities in a discourse model are represented by NPs in a text, though not all NPs in a text represent discourse entities.

Henceforth, by “entity” I mean “discourse entity.”

3 The Message Understanding Conferences (MUC) and the Automatic Content Extraction (ACE) evaluation were initiated and financed by the DARPA agency of the U.S. Department of Defense, and the National Institute of Standards and Technology of the U.S. Department of Commerce, respectively, to encourage the development of new and better methods of information extraction.


Anaphor and antecedent An anaphor designates a linguistic expression that depends on a previous one (its antecedent) for its interpretation. The pronoun they has Airlines as an antecedent in Airlines are still uncertain about when they can return to a regular schedule. While anaphora is a textual relation that requires the reader to go back in the text to interpret an empty (or almost empty) textual element, coreference occurs at the referential level. The terms "anaphor" and "antecedent" properly belong to the domain of anaphora resolution, but they are also used in coreference resolution in two ways: a correct and an incorrect one. They are correctly used to refer to antecedent-anaphor pairs that are part of a coreference chain, like airlines and they, or Eyjafjallajökull and the volcano in (1) above; whereas they are incorrectly used to refer indistinctly to any mention pair of a coreference chain, like the Eyjafjallajökull volcano and Eyjafjallajökull, as the latter does not require the former to be interpreted.

Discourse-new/first mention and subsequent mention The counterparts of "antecedent" and "anaphor" in the field of coreference resolution are discourse-new or first mention and subsequent mention, respectively. A discourse-new or first mention introduces a (new) entity in the text; it is thus its first mention, like Eyjafjallajökull1 in (2) above. A subsequent mention, in contrast, is any later mention of an entity already introduced in the text, like Eyjafjallajökull2 and Eyjafjallajökull3 in (2). Notice that subsequent mentions can be, but are not always, anaphoric. Nevertheless, the term "non-anaphoric" is often incorrectly used to mean 'discourse-new,' and "anaphoric" incorrectly used to mean 'subsequent mention.'

Singleton and multi-mention entity Depending on the size of the entity, i.e., the number of mentions it contains, it is convenient to distinguish between singletons (or singleton entities) if they have only one mention, and multi-mention entities if they have two or more mentions. The former are also called isolated mentions, as they make an isolated reference to an entity. Coreference relations, then, are only possible within multi-mention entities.

Pairwise/mention-pair model and entity-based/entity-mention model There are two basic kinds of coreference resolution models. On the one hand, pairwise (or mention-pair) models work in terms of pairs of mentions, classifying two mentions as either coreferent or non-coreferent, and then combining all the pairwise decisions to partition the document mentions into coreference chains. On the other hand, entity-based (or entity-mention) models are meant to improve the classification by determining not the probability that a mention corefers with a previous mention but the probability that a mention refers to a previous entity, i.e., a set of mentions already classified as coreferent. Thus, the latter often employ clustering strategies.
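The contrast between the two model families can be sketched in a few lines of Python. This is a simplified illustration under assumed interfaces, not the architecture of CISTELL or of any cited system: pair_coref and entity_coref stand in for trained classifiers, and the clustering steps are reduced to their bare minimum.

```python
# Simplified sketch of the two resolution strategies described above.
from itertools import combinations

def mention_pair_resolution(mentions, pair_coref):
    """Classify every pair of mentions, then merge the pairwise decisions into chains."""
    parent = {m: m for m in mentions}
    def find(m):                                  # union-find root lookup
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m
    for a, b in combinations(mentions, 2):
        if pair_coref(a, b):                      # independent pairwise decision
            parent[find(a)] = find(b)             # merge the two partial chains
    chains = {}
    for m in mentions:
        chains.setdefault(find(m), []).append(m)
    return list(chains.values())

def entity_mention_resolution(mentions, entity_coref):
    """Compare each new mention against whole partial entities, not single mentions."""
    entities = []
    for m in mentions:                            # mentions in textual order
        target = next((e for e in entities if entity_coref(m, e)), None)
        if target is not None:
            target.append(m)                      # add the mention to an existing entity
        else:
            entities.append([m])                  # otherwise open a new (singleton) entity
    return entities

# Toy run: a pairwise classifier that links only "m1" and "m2".
print(mention_pair_resolution(["m1", "m2", "m3"],
                              lambda a, b: {a, b} == {"m1", "m2"}))
# -> [['m1', 'm2'], ['m3']]
```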

Link-based measure and class-based measure In parallel to the two types of resolution strategies, coreference systems can be evaluated using two different types of measures: link-based performance metrics are based on the number of correct, missing and spurious links (i.e., pairs of coreferent mentions) identified by the system, whereas class-based performance metrics treat entities as clusters and take into consideration not only multi-mention entities but also singletons.
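The difference between the two families of measures can be illustrated with a toy comparison. The snippet below is only a schematic illustration and does not implement MUC, B3, CEAF or BLANC; the partitions and the cluster-level check are assumptions chosen for the example.

```python
# Toy contrast between a link-based and a class-based view of the same output.
from itertools import combinations

def links(entities):
    """All coreference links (unordered mention pairs) implied by a partition."""
    return {frozenset(pair) for e in entities for pair in combinations(sorted(e), 2)}

gold   = [{"m1", "m2", "m3"}, {"m4"}]     # one 3-mention entity plus one singleton
system = [{"m1", "m2"}, {"m3", "m4"}]     # a hypothetical system response

# Link-based view: only pairs of coreferent mentions are compared, so the
# singleton {"m4"} contributes nothing.
correct_links = links(gold) & links(system)
print(len(correct_links), "correct link(s) out of", len(links(gold)), "gold links")

# Class-based view: whole clusters are compared, so singletons also count.
exact_clusters = [e for e in system if e in gold]
print(len(exact_clusters), "system cluster(s) identical to a gold entity")
```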

End-to-end system and coreference module The term coreference resolution system is ambiguous between an end-to-end system, a system capable of determining coreference on plain text, that is, of identifying the mentions and their boundaries automatically before predicting coreference relations, and a coreference module, which strictly identifies coreference relations assuming that the data contain gold-standard linguistic information at various levels (mention boundaries, PoS, parse trees, etc.).

True/gold mentions and system mentions The mentions contained in the gold standard, produced by a human expert, are referred to as true or gold mentions, as opposed to the set of mentions contained in the system output, which are called system mentions. As a follow-up to the previous paragraph, true and system mentions do not usually coincide when using an end-to-end coreference system, but they do in the case of a coreference module.

1.4 Related work

A number of good surveys (Poesio et al., forthcoming; Ng, 2010; Ng, 2003) provide a wide overview of computational approaches to coreference and anaphora resolution. This section is meant to give the reader not another extensive account, but a compact overview of the previous and latest research on the main issues related to this thesis in order to supplement and bind together the "Background" sections already provided in each paper.

From a thematic perspective, the subject matter can be broken down into the four areas that roughly correspond to the key steps followed in developing a coreference resolution system: (i) corpus creation, (ii) learning features, (iii) classification and clustering, and (iv) evaluation. I will deal with each of them in turn, highlighting the main trends and milestones. The reader is referred to the original papers for details.

1.4.1 Corpus creation

Research on coreference resolution with a view to practical applications requires corpora annotated with coreference information for two main reasons: (i) to train machine-learning systems, and (ii) to test automatic systems on large-scale data.


In addition, corpora annotated with coreference information are valuable in usage-based linguistics to study language based on authentic usage data. Annotating coreference, however, is not a trivial task, and there have been numerous proposals for coding schemes. In fact, each corpus usually defines its own scheme, as there is no agreed-upon standard. From a global viewpoint, it is possible to distinguish between application-oriented and linguistically-oriented approaches.

Application-oriented approaches The MUC and ACE corpora (Hirschman and Chinchor, 1997; Doddington et al., 2004) were specifically designed for shared tasks on information extraction. Consequently, annotation decisions—like the set of coreferent elements or the scope of the identity relation—are subordinated to the needs of the tasks, even if it is at the cost of linguistic accuracy. Thus, they do not annotate the entire NP with all its modifiers but only up to the head, and verbal or clausal mentions are ignored. Moreover, ACE limits the set of NPs to seven semantic types (relevant to the information-extraction domain): person, organization, geo-political entity, location, facility, vehicle, and weapon. In a similar vein, MUC and ACE tailor the definition of "identity of reference" to suit the needs of information extraction, treating nominal predicates and appositions also as coreferent. For this reason, MUC has been trenchantly criticized by van Deemter and Kibble (2000) for conflating "elements of genuine coreference with elements of anaphora and predication in unclear and sometimes contradictory ways."

Linguistically-oriented approaches As a response to the MUC approach, the MATE meta-scheme, those derived from it (GNOME, ARRAU [Poesio, 2004a; Poesio and Artstein, 2008]) as well as others like OntoNotes (Pradhan et al., 2007b) aim to create corpora not for a specific task but for research on coreference at large. They allow a wider range of syntactic types (i.e., mentions other than NPs), and nominal mentions map onto NPs including all their modifiers. MATE goes further and contemplates linguistic phenomena typical of Romance languages such as elliptical subjects and incorporated clitics. An explicit separation is kept between the identity relation on one hand and the appositive relation (for nominal predicates and appositive phrases) on the other. In addition, MATE suggests the annotation of relations other than identity (set membership, subset, possession, bound anaphora, etc.) as well as ambiguity.

Despite the distinction between these two directions, the diversity of existing schemes and idiosyncratic tags incorporated by different corpora is a reflection of the lack of a general and satisfactory theory of coreference that does not solely rely on the simple "identity of reference" definition. The choice of corpus when developing or testing a coreference system is not a minor issue, and the way corpora are annotated has greatly determined the design and architecture of systems.

Language resources Table 1.1 summarizes the over-25k-word corpora that have been annotated with coreference information, with newspaper texts as the dominant genre.

9 Corpus Reference Language Genre Size ACE-2 Doddington et al. (2004) English News 180k ACE-2003, Arabic, News, weblogs 100- ACE-2004, Chinese, 350k ACE-2005 English ACE-2007 Arabic News, weblogs 220k Chinese News, weblogs 250k English News, dialogues, 300k weblogs, forums Spanish News 200k AnATAr Hammami et al. (2009) Arabic News, textbook, novel, 77k technical manual AnCora-Ca Recasens and Mart´ı (2010) Catalan News 400k AnCora-Es Spanish News 400k ARRAU Poesio and Artstein (2008) English Dialogues, spoken nar- 100k ratives, news, GNOME C-3 Nicolae et al. (2010) English News, aptitude tests 75k COREA Hendrickx et al. (2008) Dutch News, spoken language, 325k encyclopedia entries DAD Navarretta (2009b) Danish, News, law texts, 25k Italian narratives GNOME Poesio (2004a) English Museum labels, leaflets, 50k dialogues I-CAB Magnini et al. (2006) Italian News 250k KNACK-2002 Hoste and De Pauw (2006) Dutch News 125k LiveMemories Rodr´ıguez et al. (2010) Italian News, Wikipedia, 150k dialogues, blogs MUC-6 Grishman and Sundheim (1996) English News 30k MUC-7 Hirschman and Chinchor (1997) English News 25k NAIST Text Iida et al. (2007) Japanese News 970k NP4E Hasler et al. (2006) English News 50k Switchboard Calhoun et al. (2010) English Telephone conversations 200k OntoNotes 2.0 Pradhan et al. (2007a) English News 500k Arabic News 100k Chinese News 550k Potsdam Stede (2004) German News 33k Commentary PDT 2.0 Kucovˇ a´ and Hajicovˇ a´ (2004) Czech News 800k TuBa-D/Z¨ Hinrichs et al. (2005) German News 800k Venex Poesio et al. (2004a) Italian News, dialogues 40k

Table 1.1: Summary of the largest coreferentially annotated corpora

For the sake of completeness, I also include the two corpora that are a contribution of this thesis (AnCora-Ca and AnCora-Es).4 They clearly fill the gap of resources for both Catalan and Spanish. No Catalan data annotated with coreference existed before AnCora-Ca, and AnCora-Es surpasses the Spanish ACE-2007 corpus not only in size but also in the limitations imposed by ACE-type entities. Given that the focus of the present thesis is coreference, Table 1.1 does not include corpora that are only annotated with anaphoric pronouns, like the Spanish Cast3LB corpus (Navarro, 2007). The ongoing ANAWIKI annotation project aims to collect large amounts of coreference data for English via a Web collaboration game called Phrase Detectives (Poesio et al., 2008). Although most of the available resources are for English, the past years have seen a growing interest in providing other languages with coreferentially annotated data.

1.4.2 Learning features

Before large data sets annotated with coreference information became available in the mid-1990s, the immediate ancestors to today's machine-learning coreference systems were pronominal anaphora resolution systems that relied on a set of hand-crafted rules (Hobbs, 1978; Rich and LuperFoy, 1988; Carbonell and Brown, 1988; Alshawi, 1990; Kameyama, 1998; Tetreault, 2001; Palomar et al., 2001), especially in the form of constraints and preferences.

Constraints and preferences Given a pronoun to resolve, constraints rule out incompatible antecedents, whereas preferences score the remaining candidates in order to select the best antecedent. They are based on information from different linguistic levels, as displayed in the first row of Table 1.2, although the biggest emphasis is on (Hobbs, 1978; Carbonell and Brown, 1988) and Centering theory (Kameyama, 1998; Tetreault, 2001). There was, however, an increasing tendency to replace knowledge-rich systems with knowledge-poor ones that would do without semantic and world knowledge (Lappin and Leass, 1994) or even without assuming full syntactic parsing (Kennedy and Boguraev, 1996; Baldwin, 1997; Mitkov, 1998).

Heuristics of this type served to capture the most important rules governing antecedent–pronoun relations. However, the greater complexity of coreference resolution accounts in part for the shift from heuristic to machine-learning approaches in the past decade. Applying machine learning to large-scale data sets enables the ordering and weighing of large feature sets more efficiently than rule-based heuristic approaches. Both Aone and Bennett (1995) and McCarthy and Lehnert (1995) report that their rule-based classifiers are outperformed by their learning-based counterparts.
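As a concrete, deliberately simplified illustration of the constraint-and-preference scheme just described, the sketch below filters candidate antecedents with two agreement constraints and then scores the survivors with two preferences. The feature names, weights and toy candidates are assumptions made for the example, not taken from any of the systems cited above.

```python
# Hedged sketch: constraints filter, preferences score what remains.
def resolve_pronoun(pronoun, candidates):
    # Constraints rule out incompatible antecedents outright.
    compatible = [c for c in candidates
                  if c["gender"] == pronoun["gender"]
                  and c["number"] == pronoun["number"]]
    if not compatible:
        return None
    # Preferences score the remaining candidates (higher is better).
    def score(c):
        s = 0
        s += 2 if c["role"] == "subject" else 0        # subject/salience preference
        s += 1 if c["sentence_distance"] == 0 else 0   # recency preference
        return s
    return max(compatible, key=score)

candidates = [
    {"text": "Airlines", "gender": "n", "number": "pl",
     "role": "subject", "sentence_distance": 0},
    {"text": "a regular schedule", "gender": "n", "number": "sg",
     "role": "object", "sentence_distance": 0},
]
pronoun = {"text": "they", "gender": "n", "number": "pl"}
print(resolve_pronoun(pronoun, candidates)["text"])    # -> Airlines
```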

4 The name AnCora (ANnotated CORporA) is used generically to designate the Spanish and Catalan corpora with all their layers of annotation. Suffixes can be attached to refer to a specific part: AnCora-Ca designates the Catalan portion, AnCora-Es designates the Spanish portion, AnCora-CO designates the coreference annotations of the corpora, etc.

Constraints and preferences for pronoun resolution (Hobbs, 1978; Rich and LuperFoy, 1988; Carbonell and Brown, 1988; Alshawi, 1990; Lappin and Leass, 1994; Kennedy and Boguraev, 1996; Baldwin, 1997; Mitkov, 1998; Kameyama, 1998; Tetreault, 2001): 1. Gender agreement, 2. Number agreement, 3. Binding constraints, 4. mi is subject, 5. Animacy, 6. Selectional constraints, 7. Mention is embedded within a quantified or negated structure, 8. Case-role parallelism, 9. Syntactic parallelism, 10. mi is in a topicalized structure, 11. Sentence distance, 12. Recency, 13. Grammatical role, 14. Person agreement, 15. Frequency of mention, 16. The postconditions of the action containing mi violate the preconditions of the action containing mj, 17. Mention is embedded, 18. Mention is in an existential construction, 19. Centering constraints, 20. mi is definite, 21. mi is the first NP in the sentence, 22. mi is the object of verbs such as discuss, present, illustrate, describe, etc., 23. mi is in the heading of the section, 24. Mention is not part of a prepositional phrase, 25. Mention is a domain term [...]

Basic coreference feature set (Soon et al., 2001): 1. mi is a pronoun, 2. mj is a pronoun, 3. mj is a definite, 4. mj is a demonstrative, 5. mi and mj are proper names, 6. String match (without determiners), 7. Number agreement, 8. Gender agreement, 9. Semantic class agreement, 10. mj is an appositive of mi, 11. One mention is an alias of the other, 12. Sentence distance

Extended coreference feature set (Ng and Cardie, 2002b): 1. mi and mj are pronominal/proper names/non-pronominal and the same string, 2. The words of mi and mj intersect, 3. The prenominal modifiers of one mention are a subset of those of the other, 4. mi and mj are proper names/non-pronominal and one is a substring of the other, 5. mi and mj are definites, 6. mi and mj are embedded, 7. mi and mj are part of a quoted string, 8. mi is a subject, 9. mj is a subject, 10. mi and mj are subjects, 11. mi and mj match in animacy, 12. mi and mj have the same maximal NP projection, 13. mj is a nominal predicate of mi, 14. mi is an indefinite and not appositive, 15. mi and mj are not proper names but contain mismatching proper names, 16. mi and mj have an ancestor-descendent relationship in WordNet, 17. WordNet distance, 18. Paragraph distance [...]

Additional features (Strube et al., 2002; Luo et al., 2004; Nicolae and Nicolae, 2006; Ponzetto and Strube, 2006; Uryupina, 2006; Ng, 2007; Yang and Su, 2007; Bengtson and Roth, 2008): 1. Minimum edit distance between mi and mj strings, 2. Head match, 3. Word distance, 4. Mention distance, 5. One mention is an acronym of the other, 6. Pair of actual mention strings, 7. Number of different capitalized words in two mentions, 8. Semantic role, 9. WordNet similarity score for all synset pairs of mi and mj, 10. The first paragraph of the Wikipedia page titled mi contains mj (or vice versa), 11. The Wikipedia page titled mi contains a hyperlink to the Wikipedia page titled mj (or vice versa), 12. Saliency, 13. One mention is a synonym/antonym/hypernym of the other in WordNet, 14. mi and mj appear within two words of a verb of diction, 15. Modifiers match, 16. Aligned modifiers relation, 17. Semantic similarity, 18. Parse tree path from mj to mi [...]

Cluster-level features (Luo et al., 2004; Daumé III and Marcu, 2005; Ng, 2005; Culotta et al., 2007; Poon and Domingos, 2008; Yang et al., 2008; Rahman and Ng, 2009): 1. Feature X is true of any pair, 2. All pairs share a feature X, 3. The majority of pairs share a feature X, 4. Feature X is false of any pair, 5. All mention pairs are predicted to be coreferent, 6. Most mention pairs are predicted to be coreferent, 7. Decayed density, 8. Entity-to-mention ratio, 9. Size of the hypothesized entity chain, 10. Count of how many NPs are of each mention type, 11. Probability that a pair has incompatible gender values [...]

Table 1.2: Summary of seminal coreference feature sets (mi and mj stand for two different mentions where i < j)


Feature vectors In the classic supervised learning setup (see 1.4.3 below), learning instances are created by pairing two mentions mi and mj, and labeling them as either true/coreferent (positive instance) or false/non-coreferent (negative instance): ⟨mi, mj, boolean⟩ is true if and only if mi and mj are coreferent. Pairs ⟨mi, mj⟩ are represented by a feature vector consisting of unary features (i.e., information about one of the mentions, e.g., its number) and binary features (i.e., information about the relation between the two mentions, e.g., number agreement). Table 1.2 summarizes the learning features most frequently used (mainly for English), many of which borrow from constraints and preferences. To date, most coreference resolution systems (Bengtson and Roth, 2008; Denis and Baldridge, 2009; Stoyanov et al., 2010) have been modeled after Soon et al.'s (2001) limited but successful feature set, improved with the extensions proposed by Ng and Cardie (2002b). One of the features that has emerged as the most successful is the appositive one (Soon et al., 2001; Poon and Domingos, 2008).
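As a concrete illustration of this setup, the sketch below builds ⟨mi, mj, boolean⟩ instances with a small mix of unary and binary features. The Mention class, the feature names and the exhaustive pairing are illustrative placeholders of my own, not a description of any particular system; Soon et al. (2001), for instance, only pair each anaphoric mention with its closest antecedent (positive instance) and the mentions in between (negative instances).

```python
# Illustrative sketch of pairwise instance creation; names are placeholders.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Mention:
    text: str
    head: str
    number: str        # "sg" or "pl"
    sentence_idx: int
    entity_id: int     # gold entity label, used only to derive the class value

def extract_features(mi, mj):
    """Unary features describe one mention; binary features relate the two."""
    return {
        "mi_number": mi.number,                      # unary
        "mj_number": mj.number,                      # unary
        "number_agreement": mi.number == mj.number,  # binary
        "head_match": mi.head.lower() == mj.head.lower(),
        "sentence_distance": mj.sentence_idx - mi.sentence_idx,
    }

def make_instances(mentions):
    """Pair every mention mi with every later mention mj (i < j)."""
    return [(extract_features(mi, mj), mi.entity_id == mj.entity_id)
            for mi, mj in combinations(mentions, 2)]
```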

Additional features While the morphosyntactic and surface features reported in Table 1.2 succeed in solving a majority of coreference relations, they seem to reach a limit, especially in the case of definite NPs and proper names (Vieira and Poesio, 2000; Haghighi and Klein, 2009), that can only be surpassed with the use of deep semantic and world knowledge. The latest models have tried to provide a useful approximation to such knowledge by drawing semantic patterns from resources like WordNet and the World Wide Web (Ponzetto and Strube, 2006; Uryupina, 2006; Ng, 2007), but any improvement, although significant, is small. In this regard, Kehler et al. (2004) point out that predicate-argument statistics mined from naturally-occurring data do not improve performance (for pronoun resolution), as the cases for which statistics hurt are potentially more damning than those for which they help.
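To give a flavor of this kind of approximation, the sketch below derives two WordNet-based signals for a mention pair's head nouns (an ancestor-descendant check and a similarity score, in the spirit of features 16-17 of the extended set in Table 1.2), using NLTK's WordNet interface. NLTK is simply a convenient stand-in here; the systems cited above relied on their own WordNet access code and on richer Web-mined patterns.

```python
# Illustrative WordNet features over two head nouns; NLTK is an assumption,
# not the toolkit used by the systems discussed in this section.
from nltk.corpus import wordnet as wn

def wordnet_features(head_i, head_j):
    synsets_i = wn.synsets(head_i, pos=wn.NOUN)
    synsets_j = wn.synsets(head_j, pos=wn.NOUN)
    if not synsets_i or not synsets_j:
        return {"wn_ancestor": False, "wn_similarity": 0.0}

    def is_ancestor(a, b):
        # True if synset a is a (proper) hypernym ancestor of synset b
        return a in b.closure(lambda s: s.hypernyms())

    # Ancestor-descendant relationship between any pair of senses
    ancestor = any(is_ancestor(si, sj) or is_ancestor(sj, si)
                   for si in synsets_i for sj in synsets_j)
    # Highest path similarity over all synset pairs, a rough inverse of
    # "WordNet distance"
    similarity = max((si.path_similarity(sj) or 0.0)
                     for si in synsets_i for sj in synsets_j)
    return {"wn_ancestor": ancestor, "wn_similarity": similarity}

# e.g. wordnet_features("colleague", "associate") should flag an
# ancestor-descendant relation if this WordNet version encodes "associate"
# as a hypernym of "colleague" (cf. the basket example in Table 1.4).
```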

Cluster-level features A promising way of incorporating more knowledge without getting entangled in the construction of expensive resources seems to be the design of more global models with room for cluster-level features that make it possible to take into consideration not only two but all the mentions of a (partial) entity (Luo et al., 2004; Culotta et al., 2007; Poon and Domingos, 2008). The design of such features, however, is a complex matter, and most of them derive directly from the old pairwise ones. Apart from incorporating new knowledge sources to strengthen the feature set, performance can also increase substantially with feature selection (Ng and Cardie, 2002b; Hoste, 2005) and training instance selection (Harabagiu et al., 2001; Ng and Cardie, 2002b; Uryupina, 2004; Hoste, 2005), although less attention is usually paid to these. The focus of research has largely shifted from incorporating new features to applying new resolution models, as discussed in 1.4.3.

Languages other than English Only recently has the validity of the coreference features been tested for languages other than English, at the same time that coreferentially annotated corpora have become available for these languages. See, for instance, Hoste (2005) for Dutch, Versley (2007) or Klenner and Ailloud (2009) for German, Poesio et al. (2010) for Italian, and Nilsson (2010) for Swedish. Prior to the research reported in this thesis, the case of Spanish and Catalan remained virtually unexplored—except for a few rule-based pronoun resolution systems for Spanish (Palomar et al., 2001; Ferrández et al., 1999).

1.4.3 Classification and clustering

A few rule-based coreference systems were built for the MUC-6 and MUC-7 conferences (Appelt et al., 1995; Gaizauskas et al., 1995; Garigliano et al., 1997), but the fact that these conferences conducted large-scale evaluations and developed a considerable amount of annotated data for that purpose contributed to the growth in applying machine-learning methods to the coreference task. The years since then have seen increasing research in coreference resolution systems, and it is to their resolution strategies that I now pay attention. Pronoun resolution systems have continued to be developed, though (Yang et al., 2004; Navarretta, 2004; Kehler et al., 2004; Hinrichs et al., 2007), and especially for computational models of dialogue (Strube and Müller, 2003; Frampton et al., 2009). Performance scores are reported in Table 1.3 and will be discussed in 1.4.4.

Two steps Different models have been proposed to partition the mentions of a document into a set of entities on the basis of the linguistic information encoded by the features summarized in 1.4.2. Soon et al.'s (2001) formulation of the task, inspired by the early systems of Aone and Bennett (1995) and McCarthy and Lehnert (1995), has become a standard starting point for anyone attempting to build a coreference system. Under this conception, the coreference task is modeled as a two-step procedure:

1. A classification phase that decides whether two mentions corefer or not. It is a binary classification problem in which the probability that mention mi and mention mj corefer is estimated as:

P_C(mi, mj) = P(COREFERENT | mi, mj)

2. A clustering phase that converts the set of pairwise classifications into clusters of mentions, creating one cluster for each entity. This phase requires coordinating the possibly contradictory coreference classification decisions from the first phase.

Coreference systems can differ along both dimensions independently. In the classification stage, the coreference probability of two mentions can be predicted by

training different machine-learning algorithms such as decision trees (Soon et al., 2001; Ng and Cardie, 2002b; Ng, 2005; Yang and Su, 2007), maximum entropy classifiers (Luo et al., 2004; Ng, 2005; Nicolae and Nicolae, 2006; Ponzetto and Strube, 2006), the RIPPER rule learner (Ng and Cardie, 2002b; Hoste, 2005; Ng, 2005), SVMs (Uryupina, 2007; Rahman and Ng, 2009), memory-based learning (Klenner and Ailloud, 2009; Hoste, 2005), or averaged perceptrons (Bengtson and Roth, 2008; Stoyanov et al., 2009). In the clustering stage, a broad distinction can be drawn between local and global models or, in other words, between mention-pair and entity-based models.

Mention-pair models Mention-pair models can follow different strategies such as link-first (Soon et al., 2001; Strube et al., 2002) and best-first, the most widely used (Ng and Cardie, 2002b; Yang and Su, 2007; Bengtson and Roth, 2008). The former compares each mention in turn to each preceding mention, from right to left, and the process terminates as soon as the beginning of the text is reached or the classifier returns a coreference probability above 0.5 for a mention pair, in which case the two mentions are clustered into the same entity. In contrast, best-first computes the probability of all mentions preceding the mention under analysis and picks the one with the highest coreference probability (above 0.5), thus making the most confident decision. Link-first and best-first models are mention-pair models that present a major drawback in that they are only locally optimized. Since coreference is a transitive relation,5 these models simply perform the transitive closure of the pairwise decisions, but do not ensure the global consistency of the entity. For example, Mr. Clinton may be correctly coreferred with Clinton, but then particular pairwise features may make the model incorrectly believe that Clinton is coreferent with a nearby occurrence of she, and since the clustering stage is independent from the pairwise classification, the incompatibility between the gender of Mr. Clinton and that of she will be ignored in building the final cluster. This is what marks the divide between local and global or entity-based models.
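The two clustering strategies can be sketched as follows, assuming a trained pairwise classifier coref_prob(mi, mj) that returns a coreference probability; the function names are illustrative rather than taken from any of the cited systems.

```python
# Sketch of the two mention-pair clustering strategies; coref_prob is assumed
# to be a trained pairwise classifier returning a probability in [0, 1].

def link_first(mentions, coref_prob, threshold=0.5):
    """Closest-first linking in the style of Soon et al. (2001)."""
    links = []
    for j in range(len(mentions)):
        for i in range(j - 1, -1, -1):   # scan preceding mentions right to left
            if coref_prob(mentions[i], mentions[j]) > threshold:
                links.append((i, j))
                break                    # stop at the first positive antecedent
    return links

def best_first(mentions, coref_prob, threshold=0.5):
    """Best-first linking in the style of Ng and Cardie (2002b)."""
    links = []
    for j in range(len(mentions)):
        scored = [(coref_prob(mentions[i], mentions[j]), i) for i in range(j)]
        if scored:
            prob, i = max(scored)        # most confident preceding candidate
            if prob > threshold:
                links.append((i, j))
    return links
```

In both cases the final entities are simply the transitive closure of the selected links (e.g., computed with a union-find structure), which is how the gender clash in the Mr. Clinton example above can slip through.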

Entity-based models Unlike mention-pair models, entity-based ones take advantage of the information provided by other mentions in a preceding, possibly partially-formed, entity. This can be especially helpful when it is difficult to judge whether or not two mentions are coreferent simply from the pair alone, and might provide a mechanism to either recover a missed link or avoid a spurious one. Obviously, transitivity restrictions cannot be enforced by any system that works only on links. To this end, the latest coreference systems work on clusters that allow assessing how well a particular mention matches an entity as a whole. One of the first systems to move in this direction was that of Luo et al. (2004), who consider all clustering possibilities (i.e., entity partitions) by searching in a Bell

5 By the transitivity property, it follows that if mention a is coreferent with b and b is coreferent with c, then a is coreferent with c.

tree representation, and cast the coreference resolution problem as finding the best path from the root node to the leaves (where each leaf is a possible partition). The different partition hypotheses are built by using either a standard mention-pair classifier or an entity-mention one, which determines the probability that a mention refers to a given entity. Surprisingly enough, however, the latter underperforms the former. Nicolae and Nicolae (2006) argue that "despite the fact that the Bell tree is a complete representation of the search space, the search in it is optimized for size and time, while potentially losing optimal solutions." In addition, Luo et al. (2004) allude to the lower number of features (twenty times fewer) used by the entity-mention model as a possible reason for the drop in performance. Since Luo et al.'s (2004) proposal for a global search in a Bell tree, other ways of globally optimizing the clustering decision have been suggested: a first-order probabilistic model that allows features based on first-order logic over a set of mentions (Culotta et al., 2007); integer linear programming to enforce the transitivity constraint (Finkel and Manning, 2008; Klenner and Ailloud, 2009); a graph-cut algorithm on a graph representation where the nodes represent mentions and the edges are weighted by the pairwise coreference probability (Nicolae and Nicolae, 2006; Ng, 2009); a conditionally-trained graph model (McCallum and Wellner, 2005; Wick and McCallum, 2009); an online learning model that learns the optimal search strategy itself (Daumé III and Marcu, 2005); or inductive logic programming to learn from the relational knowledge of a mention, an entity, and the mentions in the entity with a set of first-order rules (Yang et al., 2008). Although these models ensure global consistency, not all of them include cluster-level features (McCallum and Wellner, 2005). The design of appropriate cluster-level features has been little explored.
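The kind of cluster-level check that such models make possible can be sketched as follows; the attribute names and the majority-vote criterion are illustrative choices of mine, not a reconstruction of any of the systems just cited.

```python
# Sketch of an entity-level compatibility check that a cluster-based model can
# apply but a purely pairwise model cannot; attribute names are illustrative.

def compatible_with_entity(mention, entity, coref_prob, threshold=0.5):
    # Hard constraint over the whole cluster: no gender clash with any member,
    # e.g. "she" against a partial entity containing "Mr. Clinton".
    genders = {m.gender for m in entity if m.gender is not None}
    if mention.gender is not None and genders and mention.gender not in genders:
        return False
    # Soft cluster-level feature: do most pairwise decisions vote for coreference?
    votes = [coref_prob(m, mention) > threshold for m in entity]
    return sum(votes) > len(votes) / 2
```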

Unsupervised models It is among globally optimized systems that we find the few unsupervised systems that exist for coreference resolution: Haghighi and Klein (2007) employ a non-parametric Bayesian generative model based on a hierarchical Dirichlet process, and Poon and Domingos (2008) introduce mention relations like apposition and nominal predicates by developing an unsupervised learning algorithm for Markov logic. Both Haghighi and Klein (2007) and Poon and Domingos (2008) impose a prior on the number of clusters, which is not the case with Ng's (2008) generative system. Ng (2008) modifies the expectation-maximization (EM) algorithm so that the number of clusters does not have to be predetermined, and redefines the E-step to calculate the n most probable coreference partitions using a Bell tree (Luo et al., 2004). Haghighi and Klein's (2010) generative, unsupervised system is meant to address semantic compatibility between headwords by exploiting a large inventory of distributional entity types. Finally, Cardie and Wagstaff's (1999) early system lies between supervised and unsupervised learning. It applies clustering on feature vectors that represent mentions with the aim of creating one cluster for each entity, but it is not fully unsupervised because the distance metric used for comparison relies on fixed weights that are heuristically set.


Ranking models The strategy of ranking can be considered an intermediate step between local and global models. A ranker allows more than one candidate mention to be examined simultaneously and, by determining which candidate is most probable, it directly captures the competition among them. The first ranking model by Connolly et al. (1994), which ranks two (a positive and a negative) candidate mentions at a time, is used by Yang et al. (2003) under the name of twin-candidate model and by Iida et al. (2003) under the name of tournament model for Japanese zero anaphora. The candidates, however, are compared in a pairwise fashion. In contrast, Denis and Baldridge's (2008) ranker considers the entire candidate set at once. Ng (2005) makes a different use of ranking and recasts the coreference task as ranking candidate partitions generated by different mention-pair systems. Thus he can benefit from the strengths of different methods. It is not a truly global approach, however, as the candidate partitions are all generated by mention-pair models. Finally, Rahman and Ng (2009) propose a cluster-ranking approach that combines the strengths of mention rankers and entity-mention models.
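The difference between pairwise thresholding and ranking can be made explicit with a small sketch: a ranker normalizes scores over the entire candidate set, so candidates compete directly rather than being accepted or rejected one by one. The score function and the softmax normalization are illustrative choices, not a description of Denis and Baldridge's (2008) model.

```python
# Sketch of ranking over the whole candidate set at once; `score` is an
# arbitrary compatibility function and the softmax normalization is illustrative.
import math

def rank_antecedent(anaphor, candidates, score):
    if not candidates:
        return None
    scores = [score(c, anaphor) for c in candidates]
    z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]   # candidates compete for probability mass
    best = max(range(len(candidates)), key=lambda k: probs[k])
    return candidates[best]
```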

Enhancing strategies Other methods that have been used for enhancing coreference resolution are the separation of resolution modules for pronouns, proper nouns, and full NPs (Morton, 2000; Müller et al., 2002; Hoste, 2005; Ng, 2005; Haghighi and Klein, 2007; Luo, 2007; Denis and Baldridge, 2008); and explicitly determining the probability of a mention being discourse-new, either as a separate classifier in a cascade setup (Ng and Cardie, 2002a; Bean and Riloff, 1999; Vieira and Poesio, 2000; Uryupina, 2003; Yang et al., 2003; Kabadjov, 2007; GuoDong and Fang, 2009) or by coordinating discourse-new and coreference probabilities together (Luo, 2007; Ng, 2009), including the joint inference or learning of discourse-new detection and coreference resolution (Denis and Baldridge, 2007; Poon and Domingos, 2008; Rahman and Ng, 2009). The learning process can also be boosted by filtering out expletive pronouns (Evans, 2000; Boyd et al., 2005; Bergsma et al., 2008) and non-referential indefinite NPs (Byron and Gegg-Harrison, 2004). For Danish, Navarretta (2009a) develops an automatic classifier of neuter pronouns and demonstrative pronouns into a range of functions including non-referential, cataphoric, deictic, anaphoric with an NP antecedent, anaphoric with a clausal or sentential antecedent, and vague (i.e., the antecedent is implicit in discourse).

I will now focus on the performance scores of the different learning-based systems discussed so far and present the problem of evaluation. Notice that Table 1.3 in the next section also includes Haghighi and Klein's (2009) system, an isolated case of a rule-based approach in recent years. It achieves performance comparable to state-of-the-art learning-based systems despite using only a few syntactic and semantic constraints (e.g., head match, agreement, apposition). It clearly contrasts with some of the complex learning algorithms that have been implemented.

1.4.4 Evaluation

Like other NLP tasks, evaluating a coreference resolution system includes not only assessing its performance, but also weighing its overall benefits compared to the state of the art. Quantifying how well a system performs is not straightforward. Byron (2001) and Mitkov and Hallett (2007) bring attention to inconsistencies when reporting results in the sister task of pronoun resolution. They point out that the sorts of pronouns in scope vary between studies and that most algorithms benefit from post-edited outputs, both of which have an obvious effect on standard precision and recall. Similar problems—and even worse—are encountered in the evaluation of coreference systems.

True and system mentions In coreference resolution, a major difficulty for defining an appropriate performance metric arises from not knowing the total number of entities beforehand. This is aggravated by the fact that the mentions resolved by the system (system mentions) might not directly map onto the mentions of the gold standard (true mentions) if they are automatically detected. Moreover, the different mentions considered by different annotation schemes (e.g., only mentions in multi-mention entities by MUC, only mentions of specific semantic types by ACE) have a direct effect on the complexity of solving a specific text. As a result, performances tend to vary considerably across different corpora (Stoyanov et al., 2009). This is why scores are separated according to the test data in Table 1.3.

Current scoring metrics Just as the MUC program can be considered the starting point of coreference resolution systems and of large-scale coreferentially annotated data, it was also the first to define a scoring metric, known as the MUC metric (Vilain et al., 1995). Despite its wide use, numerous weaknesses of this metric have been pointed out on several occasions (Bagga and Baldwin, 1998; Ng, 2005; Nicolae and Nicolae, 2006; Denis and Baldridge, 2008; Finkel and Manning, 2008) and alternative metrics have been proposed, among which B3 (Bagga and Baldwin, 1998) and CEAF (Luo, 2005) remain the most widely used. These measures are discussed in more detail in Chapter 5, but a brief summary of their formulas follows:6

• MUC-metric

R = (# common links in true and system partition) / (# minimum links for true partition)

P = (# common links in true and system partition) / (# minimum links for system partition)

• B3

R = (1/n) · Σ_{i=1..n} (# common mentions in true and system entity of mention_i) / (# mentions in true entity of mention_i)

P = (1/n) · Σ_{i=1..n} (# common mentions in true and system entity of mention_i) / (# mentions in system entity of mention_i)

• CEAF-φ3

R/P = (# common mentions in best one-to-one aligned true and system entities) / (# mentions in true/system partition)

6 Each metric is computed in terms of recall (R), a measure of completeness, and precision (P), a measure of exactness. The F-score corresponds to the harmonic mean: F-score = 2 · P · R / (P + R).

The B3 metric was designed to address the two main shortcomings presented by MUC, namely its preference for systems that create large entities and its ignorance of correctly clustered singletons. The CEAF metric was in turn proposed to solve a weakness found in B3, namely that an entity can be used more than once when aligning true and system entities. The acknowledged drawbacks notwithstanding, the MUC metric has continued to be used for two main reasons: (i) for comparison purposes with the oldest systems that only report MUC scores, and (ii) due to the lack of agreement on a standard metric, as none has been shown to be clearly superior. Consequently, the evaluation of coreference systems is currently done by providing one, two, or all three scoring metrics, as displayed in Table 1.3.
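For concreteness, the sketch below implements the MUC and B3 formulas above over partitions represented as lists of sets of mention ids. It is an illustrative implementation rather than one of the official scorers, and CEAF is omitted because it additionally requires the best one-to-one alignment of true and system entities (typically computed with the Kuhn-Munkres algorithm).

```python
# Illustrative MUC and B3 scorers; `key` and `response` are partitions given
# as lists of sets of mention ids. Not the official scoring scripts.

def muc_recall(key, response):
    num = den = 0
    for entity in key:
        # Number of pieces this key entity is split into by the response;
        # mentions absent from the response each count as their own piece.
        touched, missing = set(), 0
        for m in entity:
            hits = [i for i, r in enumerate(response) if m in r]
            if hits:
                touched.add(hits[0])
            else:
                missing += 1
        num += len(entity) - (len(touched) + missing)   # common links
        den += len(entity) - 1                          # minimum links
    return num / den if den else 0.0

def b3_recall(key, response):
    mentions = [m for entity in key for m in entity]
    total = 0.0
    for m in mentions:
        key_entity = next(e for e in key if m in e)
        resp_entity = next((e for e in response if m in e), set())
        total += len(key_entity & resp_entity) / len(key_entity)
    return total / len(mentions) if mentions else 0.0

def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def score(key, response, metric):
    recall = metric(key, response)
    precision = metric(response, key)   # precision = recall with the roles swapped
    return precision, recall, f_score(precision, recall)

# e.g. score(key, response, b3_recall) returns (P, R, F) for the B3 metric.
```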

State-of-the-art scores The MUC metric is the only one for which we have the scores of almost all systems, but this is rather useless since making quality judgments based on a faulty metric (see the arguments in Chapter 5) would be clearly misleading. Denis and Baldridge (2008) strongly advocate that coreference results should never be presented in terms of MUC scores alone. Further evidence that it is not possible to rely on the MUC metric alone comes from the different system rankings obtained by MUC and B3 (compare, for instance, the antepenultimate and the previous row in Table 1.3). The problem is that some systems report results using either B3 or CEAF, and these are clearly not commensurate. All in all, it is only reasonable to conclude that there is no clear winner for the state of the art. The lack of a reliable metric, the use of different corpora (and of different portions of the same corpus) and the reliance on true or system mention boundaries (e.g., unlike Harabagiu et al. [2001], Soon et al. [2001] and Ng and Cardie [2002b] assume no preprocessing at all of the data sets) make any comparison between different systems meaningless. Conducting a qualitative evaluation, on the other hand, is only possible for the few systems that have been released,7 and their output appears to link a considerable number of non-coreferent phrases as coreferent, and vice versa.

7 Publicly available coreference systems include OpenNLP (http://opennlp.sourceforge.net), BART (Versley et al., 2008), Reconcile (Stoyanov et al., 2010), the Illinois Coreference Package (Bengtson and Roth, 2008), CoRTex (Denis and Baldridge, 2008), CherryPicker (Rahman and Ng, 2009), and ARKref (http://www.ark.cs.cmu.edu/ARKref).

Columns: System | Mentions | scores as reported, in the order MUC (P, R, F), B3 (P, R, F), CEAF (P, R, F), where available.

MUC-6 data
Cardie and Wagstaff (1999) | System | 54.6 52.7 53.6
Morton (2000) | System | 79.6 44.5 57.1
Harabagiu et al. (2001) | True | 92 73.9 81.9
Soon et al. (2001) | System | 67.3 58.6 62.6
Ng and Cardie (2002b) | System | 78.0 64.2 70.4
Yang et al. (2003) | True | 80.5 64.0 71.3
Luo et al. (2004) | True | 85.7 76.8
McCallum and Wellner (2005) | True | 80.5 64.0 71.3
Choi and Cardie (2007) | System | 69.3 70.5 69.9
Haghighi and Klein (2007) | True | 80.4 62.4 70.3
Finkel and Manning (2008) | True | 89.7 55.1 68.3 90.9 49.7 64.3
Poon and Domingos (2008) | True | 83.0 75.8 79.2
Haghighi and Klein (2009) | True | 87.2 77.3 81.9 84.7 67.3 75.0 72.0 72.0 72.0
Stoyanov et al. (2009) | System | 68.5 70.9

MUC-7 data
Soon et al. (2001) | System | 65.5 56.1 60.4
Ng and Cardie (2002b) | System | 70.8 57.4 63.4
Yang et al. (2003) | True | 75.4 50.1 60.2
Uryupina (2007) | System | 67.0 50.5 65.4
Stoyanov et al. (2009) | System | 62.8 65.9

ACE-2 data
Luo et al. (2004) | True | 73.1
Nicolae and Nicolae (2006) | True | 91.1 88.2 89.6 82.7 82.7 82.7
Nicolae and Nicolae (2006) | System | 52.0 82.4 63.8 41.2
Denis and Baldridge (2007) | True | 77.1 63.6 69.7
Yang and Su (2007) | System | 73.9 56.5 64.0
Denis and Baldridge (2008) | True | 75.7 67.9 71.6 79.8 66.8 72.7 67.0 67.0 67.0
Finkel and Manning (2008) | True | 83.3 52.7 64.1 90.2 62.6 73.8
Poon and Domingos (2008) | True | 68.4 68.5 68.4 71.7 66.9 69.2
Ng (2009) | System | 69.2 55.0 61.3 59.6 63.7 61.6
Stoyanov et al. (2009) | System | 66.0 78.3

ACE-2003 data
Ponzetto and Strube (2006) | System | 84.2 61.0 70.7
Ng (2008) | True | 69.9 51.6 59.3 61.1 61.1 61.1
Ng (2008) | System | 63.3 48.8 54.7 55.6 60.0 57.7
Yang et al. (2008) | System | 60.5 63.4 61.8
Stoyanov et al. (2009) | System | 67.9 79.4

ACE-2004 data
Luo and Zitouni (2005) | True | 82.0 82.0 82.0
Culotta et al. (2007) | True | 86.7 73.2 79.3
Haghighi and Klein (2007) | True | 65.0 61.8 63.3
Bengtson and Roth (2008)a | True | 82.7 69.9 75.8 88.3 74.5 80.8
Poon and Domingos (2008) | True | 69.1 69.2 69.1
Haghighi and Klein (2009)a | True | 74.8 77.7 79.6 79.6 78.5 79.0 73.3 73.3 73.3
Haghighi and Klein (2009)a | System | 67.5 61.6 64.4 77.4 69.4 73.2
Stoyanov et al. (2009)a | System | 62.0 76.5
Wick and McCallum (2009) | True | 78.1 63.7 70.1 87.9 76.0 81.5
Haghighi and Klein (2010)a | System | 67.4 66.6 67.0 81.2 73.3 77.0

ACE-2005 data
Luo (2007) | True | 84.8 84.8 84.8
Rahman and Ng (2009) | True | 83.3 69.9 76.0 74.6 56.0 64.0 63.3 63.3 63.3
Rahman and Ng (2009) | System | 75.4 64.1 69.3 70.5 54.4 61.4 62.6 56.7 59.5
Stoyanov et al. (2009) | System | 67.4 73.7
Haghighi and Klein (2010)b | System | 74.6 62.7 68.1 83.2 68.4 75.1
Haghighi and Klein (2010)c | System | 77.0 66.9 71.6 55.4 74.8 63.8

a ACE-2004 test set utilized in Culotta et al. (2007). b ACE-2005 test set utilized in Stoyanov et al. (2009). c ACE-2005 test set utilized in Rahman and Ng (2009).

Table 1.3: Summary of coreference resolution system performances

1.5 Connecting thread

The four aspects of coreference covered by this thesis—theory, annotation, resolution, and evaluation—are closely interrelated yet separable. For space and clarity reasons, I limit myself to mainly one facet of the problem in each paper, while always putting it into perspective as to how it affects and connects with the rest of the problem. In this section, the different perspectives are integrated. I discuss the overall framework, the findings across the papers that show signs of disruption, and the new directions that I have taken to meet the needs of the coreference resolution task.

1.5.1 Methodological framework

From the outset, I took a corpus-based approach to the coreference problem, setting the use of real data as a priority. My main concern was to study coreference as it occurs in natural data. Consequently, many problems posed in this thesis fall outside the focus of theoretical linguistics and psycholinguistics, where analyses tend to be confined to carefully constructed or selected examples. Because they are rarely longer than two sentences, such examples lack the particular relations that are only possible in the context of a long written discourse. More precisely, I focused my research on newspaper texts. A total of six corpora were used:

AnCora (Recasens and Martí, 2010) A Catalan and a Spanish corpus of 500k words each, mainly from newspapers and news agencies (El Periódico, EFE, ACN). Manual annotation exists for arguments and thematic roles, predicate semantic classes, NEs, WordNet nominal senses, and coreference relations (developed as part of this thesis work).

ACE (Doddington et al., 2004) The set of English data for the ACE 2003, 2004 and 2005 programs includes newswire, newspaper and broadcast news from the TDT collection. They are annotated with ACE entity types (e.g., person, organization, location, facility, geo-political entity, etc.), entity subtypes, mention class (e.g., specific, generic, attributive, etc.), and mentions of the same entity are grouped together. The corpora were created and are distributed by the Linguistic Data Consortium.

OntoNotes (Pradhan et al., 2007a) The English OntoNotes Release 2.0 corpus covers newswire and broadcast news data: 300k words from The Wall Street Journal, and 200k words from the TDT-4 collection, respectively. OntoNotes builds on the Penn Treebank for syntactic annotation and on the Penn PropBank for predicate argument structures. Semantic annotations include NEs, word senses (linked to an ontology), and coreference information. The OntoNotes corpus is distributed by the Linguistic Data Consortium.

KNACK-2002 (Hoste and De Pauw, 2006) A Dutch corpus containing 267 documents from the Flemish weekly magazine Knack. They are manually annotated with coreference information on top of semi-automatically annotated PoS tags, phrase chunks, and NEs.

TüBa-D/Z (Hinrichs et al., 2005) A German newspaper treebank based on data taken from the daily issues of "die tageszeitung" (taz). It currently comprises 794k words manually annotated with semantic and coreference information. Due to licensing restrictions of the original texts, a taz-DVD must be purchased in order to obtain a corpus license.

LiveMemories (Rodríguez et al., 2010) An Italian corpus under construction that will include texts from the Italian Wikipedia, blogs, news articles, and dialogues (MapTask). They are being annotated according to the ARRAU annotation scheme with coreference, agreement and NE information on top of automatically parsed data.

In line with functional linguistics (Halliday and Hasan, 1976; Gundel et al., 1993), I adopted a discourse representation approach to reference, locating the coreference phenomenon within the discourse model projected by language users, and replacing the notion of "world referents" by "discourse referents" (Karttunen, 1976; Webber, 1979; Kamp, 1981). The discourse model is built up dynamically and is continuously updated, including not only the entities explicitly mentioned, but also those that can be inferred from them. In the computational field, the view that entities belong to the real world has dominated (Ng, 2009; Finkel and Manning, 2008), while only a minority have opted for the discourse model hypothesis (Poesio, 2004a; Bengtson and Roth, 2008; Denis and Baldridge, 2009). Sticking to the real world might seem to avoid unnecessary theoretical jargon, but it quickly runs into a host of conceptual problems, starting with fictional and hypothetical entities: What real-world entity does Superman point to? Or what real-world entity do you refer to when making plans about your next car? In brief then, in the definition of coreference given by van Deemter and Kibble (2000),

NP1 and NP2 corefer if and only if Referent(NP1) = Referent(NP2), where Referent(NP) is short for "the entity referred to by NP."

I further specify Referent(NP) as "the discourse entity referred to by NP in the discourse model." That said, my guiding principle was to achieve a good compromise between linguistic accuracy and computational possibilities, adopting operational definitions whenever possible. In order to make the scope more manageable, some limitations were necessary. This thesis focuses on intra-document coreference,8 including identity-of-reference anaphora but excluding strictly anaphoric phenomena (Hirst, 1981) such as identity-of-sense anaphora (3), ellipsis (4), bound anaphora (5), and bridging anaphora (6), which requires the reader to draw an inference to identify the textual anchor through a relation other than identity.

(3) Lyle drove [a car]. Maria drove [one], too.

8 Given its discourse function, genuine coreference occurs within a single discourse unit, or across a collection of documents linked by topic. This work views cross-document coreference as an NLP application which assumes that there is an underlying global discourse that enables various documents to be treated as a single macro-document.


(4) George was bought a huge box of [chocolates] but few [ø] were left by the end of the day.
(5) [Every TV network] reported [its] profits.
(6) I looked into [the room]. [The ceiling] was very high.

Similarly, attributive (7), predicative (8) and appositive (9) relations fall outside the scope of the current work, as they are not referential relations, thus following annotation schemes like MATE and OntoNotes that distinguish between coreference and predicative links (Poesio, 2004b; Pradhan et al., 2007b).

(7) The [Eyjafjallajökull] volcano, one of Iceland's largest, had been dormant for nearly two centuries.
(8) The Eyjafjallajökull volcano is [one of Iceland's largest].
(9) The Eyjafjallajökull volcano, [one of Iceland's largest], had been dormant for nearly two centuries.

A second limitation was to concentrate on referential acts performed by NPs. Under NPs I include pronouns, proper nouns, and phrases headed by a common noun, which correspond to ACE "pronominals," "names," and "nominals," respectively. The terms "pronouns" and "full NPs" are used to distinguish the last two groups from the first one. Non-nominal expressions, however, were not completely excluded, as verbs, clauses, and discourse segments (Webber's [1979] discourse deixis) were also included in the annotation of AnCora (see preliminary work in Recasens [2008]). From a computational perspective, I chose to investigate machine learning techniques as they offer great potential to discover patterns and general tendencies not discernible to the human eye, and coreference resolution had been a target of learning-based approaches since the mid-90s. Memory-based learning was applied using TiMBL v.6.1.0 (Daelemans and Bosch, 2005) after testing several other machine learning packages like maximum entropy (Berger et al., 1996) and decision tree models (Quinlan, 1993). The preference for TiMBL was based mainly on its robustness to sparse feature spaces and its user-friendly properties like the display of the feature ranking. For the evaluation, I followed a twofold strategy: (i) the most widely accepted measures (MUC-metric, B3 and CEAF) were used to quantitatively assess system performance, and (ii) the accuracy of the system outputs was also evaluated qualitatively by manually performing an error analysis of a sample of automatically annotated texts.

1.5.2 Signs of disruption

During the course of this research, I accumulated a series of findings that suggested there were root problems in the coreference task. Thus, in a change of direction, my attention was diverted from refining the resolution stage to reconsidering the

coreference problem from the start, toward formulating a workable solution. I highlight here the major signs of disruption that were noticed, without entering into the technical details that can be found in the papers that follow.

Degrees of referentiality Annotating a corpus forces one to consider every single relation in the data instead of simply selecting the easy, clear-cut relations. In the present case, given that I aimed to mark coreference relations between NPs (Chapter 2), a need arose early not only to distinguish attributive (7), predicative (8) and appositive (9) phrases, but also to further separate all referential NPs from non-referential ones (10).

(10) The Eyjafjallajökull volcano, one of Iceland's largest, had been dormant for [nearly two centuries] before returning gently to [life] in the late evening of March 20, 2010, noticeable at [first] by the emergence of a red cloud glowing above the vast glacier that covers it. In the following days, fire fountains jetted from a dozen vents on the volcano, reaching as high as [100 meters].

Although it is possible to identify certain classes of non-referentiality like duration phrases (e.g., nearly two centuries), measure phrases (e.g., 100 meters) and idioms (e.g., at first), the referential/non-referential distinction tends to blur at the margins. As an example, consider life in (10). It is on the border of becoming grammaticalized, hence the lack of determiner, similar to phrases like go to school or go home. Such cases support Fraurud's (1992) idea that being a discourse referent is not a matter of either-or. Contemplating degrees of individuation, i.e., different levels of representation in the discourse model, seems to be more in accordance with natural data.

Overlooked singletons Given that coreference is a binary relation, it is common for annotation efforts to mark only multi-mention entities. This is not problematic if referentiality is also marked, but otherwise it is, as then all non-coreferent NPs are counted as singletons by default—there being no other way to extract singletons from the manual annotation—and this is obviously at the cost of having non-referential NPs introduce considerable noise among mentions. I encountered such a difficulty when comparing OntoNotes and AnCora (Chapter 4) as well as when extracting the SemEval task datasets from OntoNotes, KNACK-2002 and TüBa-D/Z (Chapter 6). In retrospect, the issue of singletons has more implications than initially thought. Detecting non-coreference is as important as detecting coreference, and this is why several coreference systems model so-called (but wrongly called) "anaphoricity" to avoid treating a mention as a subsequent mention when it is not (Ng, 2004; Luo, 2007; Denis and Baldridge, 2007). Mentions classified as non-anaphoric, however, will still be considered again and again as first mentions throughout the resolution process. Rather, the preponderance of isolated mentions (60% of all NPs,


Table 4.1; 53% of all mentions, Table 2.3) suggests that what would be of great help for a coreference system is an automatic classifier of singletons that filters out mentions that do not need to be considered as either subsequent or first mentions. I did some preliminary experiments in this direction, but no linguistic properties of NPs seem to be distinctive of isolated mentions. On the evaluation side, the large number of singletons accounts for the all-singletons baseline (i.e., a system outputting every mention as a separate entity) setting such a high score (Chapters 4 and 6), especially for corpora that are not restricted to any entity types, like AnCora, OntoNotes, TüBa-D/Z and LiveMemories. The difference in the number of singletons between OntoNotes and ACE is the reason why resolution systems score higher on the former than on the latter according to class-based measures like B3 and CEAF. Stoyanov et al. (2009) come to a similar conclusion with respect to the reason why ACE systems obtain higher B3 scores than MUC systems. The major source of dissatisfaction with current metrics is precisely the treatment of singletons.
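As a pointer to what such a filter might look like, the sketch below trains a binary singleton classifier over a handful of mention-level properties and discards mentions it is very confident about before resolution. scikit-learn and the particular features are my own choices for illustration; as noted above, my preliminary experiments found no NP property that is clearly distinctive of isolated mentions, so this should be read as a starting point rather than a working recipe.

```python
# Illustrative singleton filter; the library choice (scikit-learn) and the
# feature set are assumptions, not the thesis' experimental setup.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def mention_features(m):
    return {
        "is_pronoun": m.is_pronoun,
        "is_proper_name": m.is_proper_name,
        "is_definite": m.is_definite,
        "grammatical_role": m.grammatical_role,
        "ne_type": m.ne_type or "NONE",
    }

def train_singleton_filter(mentions, is_singleton):
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit([mention_features(m) for m in mentions], is_singleton)
    return model

def filter_mentions(model, mentions, threshold=0.9):
    """Drop a mention only if the model is very confident it is a singleton."""
    kept = []
    for m in mentions:
        p_singleton = model.predict_proba([mention_features(m)])[0][1]
        if p_singleton < threshold:
            kept.append(m)
    return kept
```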

Entity distribution The focus on the key role of coreference chains in discourse cohesion and coherence has naturally led to the assumption that a good number of entities in a discourse are referred to multiple times. I have already argued against this on the basis of singletons. But there is another remark to be made. The majority of multi-mention entities are mentioned not many but just a few times. It emerges from the annotated data that the average size per entity ranges from three to four mentions, with two mentions being the most frequent size (Table 2.5). In sum, the overall picture is that the distribution of entities in a discourse is skewed, with a majority of singletons playing a peripheral role, and a second large group of entities that are mentioned a couple of times, leaving the number of entities that are mentioned more than twice at approximately two per document. This brings to the fore the split between short-term and long-term referents made by Karttunen (1976), or between local-focus and global-focus entities by Grosz and Sidner (1986). There have been very few attempts—Daumé III and Marcu's (2005) measure of "decayed density" is one—to make such a distinction in coreference systems, but it would probably prove useful. In order for a system to be able to decide on the centrality of an entity, a strategy that goes beyond pairwise learning features needs to be implemented.

Borderline coreference A key observation is that not all entity types have the same coreference potential. People and organizations (typically introduced by proper nouns) tend to be coreferred more often than locations or dates, for instance. This supports the claim that the ontological type of entities makes a difference in terms of individuation (Fraurud, 1992) and so in coreference. And it is precisely among the most highly individuated types, namely people and organizations, that most inter-coder disagreements occur. Examples like (11) and (12) are hard to classify as either coreferent or non-coreferent.

(11) For centuries here, [the people] have had almost a mystical relationship with Popo, believing the volcano is a god. Tonight, [they] fear it will turn vengeful.
(12) According to Aznar, Parliamentary Monarchy "is not only the expression of [the modern Spain], but it is also a symbol of stability and permanence" . . . According to Aznar, the Crown can "guarantee and express" that [Spain] can have "more ambitions, intentions, and goals."

These borderline relations are not a marginal phenomenon. I had to deal with them not only during the annotation of the AnCora corpus, but also when training the proposed system on the ACE and OntoNotes corpora. The troublesome nature of these examples became fully apparent when the same texts annotated by both ACE (13-a) and OntoNotes (13-b) were compared (square brackets identify the mentions annotated as coreferent).

(13) a. Last night in Tel Aviv, [Jews] attacked a restaurant that employs Palestinians. "We want war," [the crowd] chanted.
     b. Last night in Tel Aviv, Jews attacked a restaurant that employs Palestinians. "[We] want war," [the crowd] chanted.

The different annotation is clear evidence—as is the low or just reasonable level of inter-annotator agreement reported by previous efforts (Poesio and Vieira, 1998; Müller, 2007; Poesio and Artstein, 2005)—of a severe weakness in the definition of coreference. The prevailing definition is too general to account for naturally occurring data at large. It makes no mention of metonymy, which abounds in natural language, nor of underspecified entities, which I briefly touched upon in connection with abstract objects (Recasens, 2008). The picture that emerges is of a much more complicated phenomenon.

No universal rules It was only gradually that I strengthened my understanding of the complex picture of coreference, as machine learning experiments were carried out in parallel. These in turn helped lay bare the shortcomings of the coreference resolution task and revealed the limitations of machine learning to address the problem. The success that machine learning has seen in NLP tasks such as PoS tagging or syntactic parsing owes much to the fact that their outputs can be inferred from surface properties such as the surrounding context (n-grams) or the distribution of tags and words in some relevant contexts. Following a similar path, coreference resolution has tried to exploit morphological, syntactic and semantic formal cues. But it has not achieved the same satisfactory results as PoS tagging or syntactic parsing. This failure can be attributed to three main causes (Chapter 3). First, there appear to be very few rules that systematically govern coreference relations (Hoste, 2005; Denis, 2007). This is also demonstrated by the plateau obtained with increased amounts of training data (Chapter 4). Second, the learning features

that have been used so far fail to capture pragmatic cues that are needed to detect specific, unique coreference relations. And new effects are still being discovered (Arnold and Griffin, 2007). Third, a complex set of yet uncontrolled interactions between features is at play (Luo and Zitouni, 2005), which explains the only moderate improvement in performance achieved by Uryupina's (2008) 351 features versus Soon et al.'s (2001) 12 features. This also explains the importance of feature selection (Ng and Cardie, 2002b; Hoste, 2005). All these prevent learning algorithms from evolving a general, universal enough, model of coreference. Psycholinguistic and cognitive studies for English (Arnold et al., 2000; Gordon et al., 1993; Crawley et al., 1990; Stevenson et al., 1994; Kehler et al., 2008), Spanish (Carreiras and Gernsbacher, 1992) and Catalan (Mayol and Clark, 2010) have provided empirical evidence for many of the features used by coreference resolution systems in the last fifteen years. To the extent that experiments have demonstrated that these features prompt readers to favor certain pronoun interpretations over others, it has been assumed that they should equally work in automatic systems, either in the form of constraints and preferences or in the form of learning features. Nevertheless, an important reason why they have not been as successful as expected is the gap between laboratory or focus-oriented linguistic studies and real data, where all the different phenomena occur and interact at once. This is not to deny the value and importance of these studies, but to evaluate their contribution to the task at hand. As Krahmer (2010) points out, psycholinguists and computational linguists have different goals. Whereas the former are interested in showing a general effect and learning about human memory, the latter are interested in achieving overall good performance and so care about the points in the dataset that are processed incorrectly by their model. Broader theories of reference such as Accessibility Theory (Ariel, 1988), Givenness Hierarchy (Gundel et al., 1993) or Centering Theory (Grosz et al., 1995) provide interesting criteria according to which referring expressions are arranged along a scale. Nevertheless, the multiple factors involved in assessing the degree of "accessibility" or "cognitive status" explain why such notions are so hard to capture computationally and run into implementation problems (Poesio et al., 2004b). Distance, for instance, is a crucial factor determining degree of accessibility, but it is not the only one (Ariel, 2001). In a similar vein, Tetreault (1999) discusses inconsistent uses of gender, Barbu et al.'s (2002) corpus-based study finds that almost a quarter of plural pronouns corefer with a constituent other than a plural NP, and Poesio et al. (2004b) point out that entity coherence between utterances is much weaker than expected. The examples below are meant to illustrate some of the problems that coreference resolution systems should be—but are not—able to cope with. Real data includes numerous counter-examples to features such as number agreement (14), definiteness as an accessibility marker (15), indefiniteness as a marker of new information (16), and even head match (17)–(18).

(14) a. [Madeleine Albright] meets tomorrow with [Ehud Barak] and [Palestinian Authority President Yasser Arafat]. [They]'re expected to meet in the afternoon.
     b. Madeleine Albright meets tomorrow with [Ehud Barak] and [Palestinian Authority President Yasser Arafat]. [They]'re expected to meet separately with Albright.
(15) a. [The Eyjafjallajökull volcano, one of Iceland's largest,] had been dormant for nearly two centuries . . . fire fountains jetted from a dozen vents on [the volcano], reaching as high as 100 meters.
     b. [The Eyjafjallajökull volcano, one of Iceland's largest,] had been dormant for nearly two centuries.
(16) a. [A new study detailing the uncompensated work burden on family doctors] points to the need to change how they are paid.
     b. [Postville] might be catching up with the rest of America . . . The plants helped spur economic development in [a town that had long been stagnant].
(17) a. ABC's Gillian Finley begins in [Palestinian Gaza]. In [Gaza] today, Israeli soldiers opened fire on school boys, throwing stones.
     b. President Clinton was in [Northern Ireland] when he heard the Supreme Court decision . . . Clinton thanked the government of [Ireland] for accepting two prisoners.
(18) a. [A hundred artists] will participate in the event . . . Of [the numerous artists willing to participate in this celebration], half will do it at the beginning, and the other half at the end of the celebration.
     b. [A hundred artists] will participate in the event . . . It has not been possible to count with [all the numerous artists willing to participate in this celebration] due to time limitations.

Pragmatics Since learning algorithms are not provided with any explicit information about pragmatics or world knowledge, it is very hard for them to discriminate between (14-a) and (14-b). Similarly, in the case of definite NPs—the most frequent form of NP in Spanish and Catalan—although they are typically considered to refer back to a previously introduced entity (15-a), over 50% of the time they are the form of isolated or first mentions (15-b) (Fraurud, 1990; Poesio and Vieira, 1998; Recasens et al., 2009b; Recasens, 2009). Indefinite NPs are assumed to accomplish the opposite function, i.e., to mention an unknown entity for the first time (16-a), but again this rule is not without exceptions (16-b). As for proper names, they can just as easily introduce a new entity as corefer with a previously introduced one (Table 2.3). In fact, all theories that arrange referring expressions on a scale (Ariel, 1988; Gundel et al., 1993) agree that additional, pragmatic factors can override the principles they propose. To add but one more example, Hervás and Finlayson (2010) show that 18% of referring expressions in news and narrative are

descriptive, i.e., they provide additional information not required for distinction. This can be contrary to the principle that considers long definite descriptions to be low accessibility markers (Ariel, 1988). Within this context of non-generalizable features, head match appears as the most robust feature. Of them all, it is clearly the feature that solves the most relations with the least error, although it does not work all the time either (Vieira and Poesio, 2000; Soon et al., 2001; Uryupina, 2008; Elsner and Charniak, 2010), as exemplified by the positive cases (17-a) and (18-a) versus the negative ones (17-b) and (18-b). Nevertheless, solving coreference relations involving proper nouns or full NPs that do not have the same head remains one of the hardest problems (Haghighi and Klein, 2010). Improving recall at a minimal cost in precision seems feasible only up to about 80% F-score. The remaining 20% is actually very harmful for the linguistic quality of the results, although little attention has been paid to this issue. I return to this point below. Surely the features that have been proposed are all important in some respect, but we lack something: the way different knowledge sources and different features contribute and interact together escapes the current methods.

Preprocessing effects Another expectation that was not fully met concerns the use of automatic preprocessing information versus perfect morphological and syntactic information. In order to determine the extent to which performance drops when gold-standard information is not available, the same system was run on the same texts varying the source of preprocessing information (Chapter 4). To my surprise, the performance drop was not very pronounced. I found two explanations. First, relevant features like head match are unaffected by the quality of preprocessing. Second, the learning algorithm reranks features in such a way that morphological information gains position over syntactic information, thus downplaying the noise brought by automatic preprocessing tools. It appears to be the case that, with a good set of basic shallow features, learning can do almost as well without rich features that depend on deep parsing. The consequences of using automatic preprocessing are more severely felt in the detection of mention boundaries (Stoyanov et al., 2009), which is identified by Uryupina (2008) as a main cause of spurious links. Using automatically predicted mention boundaries causes drastic drops in performance, as shown in Table 1.3, if the scores of systems that are tested on both true and system mentions are compared (Nicolae and Nicolae, 2006; Ng, 2008; Haghighi and Klein, 2009; Rahman and Ng, 2009). This was the cause of the design error in the definition of the gold and regular evaluation settings of the SemEval task (Chapter 6): giving true mention boundaries in only the former setting prevented us from being able to draw conclusions about performance improvements thanks to the use of gold preprocessing. Contrary to Stoyanov et al. (2009), who argue that the experimental setting with gold mention boundaries is "rather unrealistic" and makes the coreference task substantially easier, I believe that the problem of mention detection is

of a different order. It pertains to syntax in the first place, and to referentiality detection in the second place. Thus, evaluating the tasks of coreference resolution and mention detection as a single task is not only confusing but makes different performances incommensurate.

The head-match and all-singletons baselines More and more features as well as newer and more sophisticated resolution models have been suggested in recent years, but one is struck when their results are compared with those obtained by two naïve baselines (Chapter 4): (i) linking all mentions that share the same head ("the head-match baseline"), and (ii) not predicting any coreferent mention but only singletons ("the all-singletons baseline"). The superiority of head match is supported by the fact that it forms the basis of Stoyanov et al.'s (2009) coreference performance prediction measure that, given a dataset, predicts its resolution complexity. Additionally, Markert and Nissim (2005) identify the large number of definite NPs that are covered by simple head matching as one of the problems in comparing algorithm results for coreference. What is worrying is that these baselines, especially the all-singletons one, are hardly ever reported in coreference resolution papers. Cardie and Wagstaff (1999) do give the performance of head match on the MUC-6 test data, admitting that it "performs better than one might expect." This baseline is also given for the MUC-7 test data by Soon et al. (2001) and Uryupina (2007). The 5-percentage-point difference between their respective baselines is explained by the use of system mentions. Thus, they test on different sets of mentions. Soon et al.'s (2001) system only outperforms head match by 5%, while Uryupina's (2007) outperforms her baseline by 15%. Depending on the corpus, the all-singletons baseline alone can achieve scores as high as 84% B3 and 73% CEAF (for OntoNotes). The same baseline falls to 67% B3 and 50% CEAF for ACE due to the smaller number of singletons. When adding head match, the ACE scores go up to 76% B3 and 66% CEAF (note the large number of proper nouns in the ACE corpora). In contrast, overmerging mentions results in a higher MUC score for ACE (68%) than for OntoNotes (56%). The significance of these figures lies in the fact that state-of-the-art systems like the one by Luo et al. (2004) obtain 77% B3, 73% CEAF, and 81% MUC (on the ACE-2 dataset), which are not so far from these naïve baselines, especially if the enormous effort invested in those systems is taken into account.
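The two baselines are trivial to state precisely, which makes their competitiveness all the more striking; the sketch below assumes mentions carrying an id and a head string (illustrative attribute names). Their output can be scored directly with measures like the B3 sketch in 1.4.4 and compared against published systems.

```python
# The two naive baselines discussed above; mention attributes are illustrative.

def all_singletons_baseline(mentions):
    """Every mention forms its own entity."""
    return [{m.id} for m in mentions]

def head_match_baseline(mentions):
    """Link all mentions that share the same (lower-cased) head."""
    entities = {}
    for m in mentions:
        entities.setdefault(m.head.lower(), set()).add(m.id)
    return list(entities.values())
```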

Uninformative scores Shortcomings in the way scoring measures are computed lead to high scores for simple baselines like all-singletons or its opposite, overmerging. With the MUC score, for instance, clustering a mention into the wrong entity is penalized twice as much as merging two gold entities (Poesio et al., forthcoming), thus its bias toward overmerging. Not only MUC but also B3 and CEAF are biased toward different types of output (Chapters 4 and 5). Consequently, the system's final score turns out to depend more on the corpus characteristics than on

the resolution strategy. MUC will rank highest the system outputting the largest number of links, and B3 will reward high-recall systems that are good at detecting singletons and at linking mentions with the same head—even if linking non-coreferent same-head mentions (Elsner and Charniak, 2010). CEAF, while reaching a better compromise, is still strongly influenced by large numbers of singletons. The disagreement between the three measures in ranking different systems makes them useless for their purpose, namely evaluating coreference resolution models. Thus, it was not possible to draw definite conclusions about the SemEval shared task (Chapter 6), as each measure ranked the participating systems in a different order (Table 6.5). The quality problem just worsens as the practice of reporting only performance scores—and no sample output—becomes more common. At least, Soon et al. (2001) and Uryupina (2008) perform an error analysis. Experience reveals that by looking at the actual output, we might realize that mentions such as the new president and the old president are linked because of sheer head match. Such an error is still very common in state-of-the-art systems (Haghighi and Klein, 2010). The quality of state-of-the-art coreference systems is far from satisfactory, and the current evaluation metrics, instead of being helpful in this respect, contribute to obscuring the evaluation. Obviously, a qualitative analysis presupposes an understanding of the task definition, and this brings us back to some of the earlier points I raised. In the absence of such an understanding, the race toward developing state-of-the-art systems risks becoming a race toward tuning systems to the test set. It invalidates the notion of the existence of a true and independent gold standard. This is actually where we started: annotated corpora—in which the understanding of the phenomenon is reflected—determine the subsequent sequence of events.

1.5.3 New directions

Having observed the drawbacks above, I felt that in order for the coreference task to be feasible and, most importantly, useful for the NLP community, several foundational aspects had to be reconsidered. This was a challenging endeavor and far beyond the scope of this work, yet I was unavoidably led to take the first three opening steps in this direction, which are presented in this section. The first two steps occurred along the way as a direct result of ongoing research efforts; the third one was more ambitious but necessary to complete this thesis.

CISTELL The entity-mention system built by Luo et al. (2004) opened a new, promising way to solve coreference, yet it remains "an area that needs further research," as they point out. The approach taken here adds to the body of work on entity-mention models by devising a system, CISTELL, that handles discourse entities as (growing) baskets (Chapter 4).9 The notion of a growing basket is akin to Heim's (1983) file card in file change semantics, where a file card stands for each discourse entity so that the information of subsequent references can be stored in it as the discourse progresses.

9Cistell is the Catalan word for ‘basket.’

Mention: his professional colleagues

Attribute           Value
Is a pronoun        false
Head                colleagues
Is-a                co-worker, fellow worker
Hypernym            associate
Gender              —
Number              plural
Specifier           his
Counter             —
NE type             —
Modifiers           professional
Sentence head       be
Grammatical role    oblique
Word position       17
Mention position    6
Sentence position   2

Table 1.4: Information contained in a basket

At the beginning of the discourse, a basket is allocated to each mention, containing the mention attributes such as head, type, modifiers, etc., as exemplified in Table 1.4. Some information is directly extracted from the text, and some from external resources like WordNet. The convenient property of baskets is that they can grow by swallowing other baskets and incorporating their attributes. A sketch of this process is displayed in Fig. 1.1, which shows a document in the middle of being processed, with baskets symbolically represented. When two baskets are classified as coreferent, they are immediately clustered into a growing basket (which can grow further). The general resolution process is inspired by Popescu-Belis et al. (1998).

Fig. 1.2 is a still illustration of the growing process, showing the different moves that CISTELL can make at a given point. The highlighted basket is the one under consideration. It can either be swallowed and contribute to making the big basket on the left grow further (Fig. 1.2(a)); or be swallowed by one of the smaller growing baskets on the right (Fig. 1.2(b)); or stay a singleton, with a chance of growing later in the discourse (Fig. 1.2(c)).

The crux of the matter is the growing process, namely to decide whether or not two baskets should be merged for growing. The decision takes on a different form depending on whether the two baskets are singletons or whether one of them has already started to grow. The former decision is straightforward as it is only based on the coreference probability given by the pairwise classifier, while the latter can be made in several ways, taking into consideration the different pairwise decisions with each of the baskets in the growing basket.


Figure 1.1: Depiction of CISTELL’s coreference resolution process

Figure 1.2: Depiction of CISTELL’s basket growing process. (a) Growing to the left, (b) Growing to the right, (c) Singleton.

I call each such decision a match. A parameter is defined that specifies the number of matches required for a basket to be swallowed by a growing entity. Of the different values that were tried (any-one-match, any-two-matches, . . . , all-matches), the best results were obtained when all the matches had to be coreferentially positive. This requirement is referred to as strong match. At the opposite end, allowing any single positive match to license the merge (i.e., weak match) results in overmerging. Contrary to Luo et al. (2004), then, the better performance of the strong-match (versus weak-match) strategy provides evidence of the beneficial effects of using a more global strategy. Keeping track of the history of each discourse entity helps capture the largest amount of information about an entity that the text provides. On the other hand, however, global strategies can have adverse effects if not properly designed. Distance features, for instance, only make sense if the notion of antecedence is taken into account (Kehler, 1997). Thus, they work for pronouns and full NPs, but can introduce noise in the case of proper nouns, which are more sensitive to other features such as entity frequency (Haghighi and Klein, 2010).

In principle, baskets are unlimited and can be enriched with as many attributes as appropriate and available—with information from "inside" the text as well as background and world knowledge from "outside" the text—thus making it possible to encode many of the features needed to solve coreference relations. The coreference classifier is jointly trained for coreference resolution and discourse-new detection. This is achieved by generating negative training instances that, unlike Soon et al. (2001), include not only coreferent mentions but also singletons. Although it was not feasible to exploit CISTELL to its full potential given the obstacles that were encountered along the way (Section 1.5.2), it provides a framework of possibilities to accommodate the theoretical turn that I introduce below (see "Coreference continuum").
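The match-counting decision can be made concrete with a small sketch. This is only an illustration of the strategy described above, not CISTELL's actual code: pairwise_coreferent stands in for the pairwise classifier, and the parameter names are mine.

def should_swallow(growing_basket, candidate, pairwise_coreferent, required_matches="all"):
    """Decide whether `candidate` (a single-mention basket) is swallowed by
    `growing_basket` (a list of already-merged mentions).

    required_matches="all" corresponds to the strong-match setting (every pairwise
    decision must be positive); 1 corresponds to weak match (any positive decision
    suffices); other integers give the intermediate any-k-matches settings."""
    decisions = [pairwise_coreferent(mention, candidate) for mention in growing_basket]
    if required_matches == "all":
        return all(decisions)                  # strong match
    return sum(decisions) >= required_matches  # any-k match; k = 1 is weak match

With required_matches="all" the sketch reproduces the setting that gave the best results in the experiments reported above; with 1 it reproduces weak match, which tends to overmerge.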

BLANC To overcome the observed shortcomings of the widely-used measures MUC, B3, and CEAF, I drew on the Rand index (Rand, 1971) to devise BLANC (BiLateral Assessment of Noun-phrase Coreference), a new measure that takes not only coreference but also non-coreference links into account and, most importantly, balances them equally (Chapter 5). In this way, singletons are neither ignored nor given greater importance than multi-mention entities, regardless of their frequency of occurrence. The interesting property of implementing Rand for coreference is that the sum of all coreference and non-coreference links together is constant for a given set of n mentions. In BLANC's default setting, the two types of links (coreference and non-coreference) count the same toward the final score, which enhances the weight of multi-mention entities. Yet BLANC is defined as a weighted measure with a parameter α that allows the user to decide whether more weight should be given to coreference or to non-coreference links. Splitting the reward in half between coreference and non-coreference links means that an upper threshold of 50% recall is set for each type of link. This directly penalizes systems that output a large number of singletons and too few coreference links or, in the extreme case, "coreference" systems that do nothing but return singletons. Not ignoring either of the two link types, and achieving the most satisfactory balance between them, are the two main desiderata behind the definition of BLANC.
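To make the link-based definition concrete, here is a minimal sketch of the score under the assumption that the gold and system annotations partition the same set of mentions. It is not the official implementation (Chapter 5 gives the full definition), and the function and variable names are mine.

from itertools import combinations

def blanc(key, response, alpha=0.5):
    """Toy BLANC over identical mention sets; `key` and `response` are lists of
    sets of mention identifiers. alpha weights the coreference-link F-score
    against the non-coreference-link F-score (0.5 is the default equal split)."""
    def links(partition):
        entity_of = {m: i for i, e in enumerate(partition) for m in e}
        mentions = sorted(entity_of)
        coref, non_coref = set(), set()
        for a, b in combinations(mentions, 2):
            (coref if entity_of[a] == entity_of[b] else non_coref).add((a, b))
        return coref, non_coref

    key_c, key_n = links(key)
    resp_c, resp_n = links(response)

    def f1(gold, system):
        if not gold and not system:
            return 1.0
        right = len(gold & system)
        p = right / len(system) if system else 0.0
        r = right / len(gold) if gold else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    return alpha * f1(key_c, resp_c) + (1 - alpha) * f1(key_n, resp_n)

Because every pair of mentions is either a coreference or a non-coreference link, the two link sets always add up to n(n-1)/2, which is the property of the Rand index that BLANC exploits.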


                      MUC                      B3                CEAF          BLANC
                P      R      F         P      R      F        P/R/F       P      R    blanc
AnCora - Spanish
1.              –      –      –        100   73.32  84.61      73.32     49.21  50.00  49.60
2.            55.03  37.72  44.76     91.12  79.88  85.13      75.96     77.63  58.57  62.90
3.            48.22  44.24  46.14     86.21  80.66  83.34      76.30     74.08  59.54  63.53
4.            45.64  51.88  48.56     80.13  82.28  81.19      75.79     69.14  66.80  67.89
5.            45.68  36.47  40.56     86.10  79.09  82.45      77.20     69.82  62.69  65.43
6.            43.10  35.59  38.98     85.24  79.67  82.36      75.23     69.05  62.79  65.27
7.            45.73  65.16  53.75     68.50  87.71  76.93      69.21     55.80  79.52  58.15
OntoNotes - English
1.              –      –      –        100   72.68  84.18      72.68     49.24  50.00  49.62
2.            55.14  39.08  45.74     90.65  80.87  85.48      76.05     77.36  62.64  67.19
3.            47.10  53.05  49.90     82.28  83.13  82.70      75.15     73.32  66.92  69.59
4.            47.94  55.42  51.41     81.13  84.30  82.68      78.03     71.53  70.36  70.93
5.            48.27  47.55  47.90     84.00  82.27  83.13      78.24     70.67  66.39  68.27
6.            50.97  46.66  48.72     86.19  82.70  84.41      78.44     74.82  67.87  70.75
7.            47.46  66.72  55.47     70.36  88.05  78.22      71.21     55.73  77.42  58.17

Table 1.5: BLANC results for Table 4.3, Chapter 4. 1. = ALL SINGLETONS; 2. = HEAD MATCH; 3. = HEAD MATCH + PRON; 4. = STRONG MATCH; 5. = SUPER STRONG MATCH; 6. = BEST MATCH; 7. = WEAK MATCH

The measure was tested on different corpora ranging from ACE—with its restricted semantic types—to OntoNotes and AnCora. Since BLANC is a new measure, I did not include it in the experiments with CISTELL reported in Chapter 4, yet it is worthwhile to do so here. Tables 1.5, 1.6, and 1.7 reproduce Tables 4.3, 4.5, and 4.7, but include the BLANC scores next to MUC, B3, and CEAF. It emerges that BLANC clearly stands apart from B3 and CEAF with respect to the all-singletons baseline (row number 1). Consequently, the remaining BLANC scores are always below B3 and usually below or slightly above CEAF, due to the lack of the initial boost from singleton identification. At the other end, and unlike MUC, BLANC also punishes overmerging quite notably (WEAK MATCH, row number 7, tends to include many—correct and incorrect—links). Interestingly enough, B3 and CEAF never agree on the best-ranked system, while BLANC and CEAF do two out of six times. It appears from the scatterplots in Fig. 1.3 that the corpus should be distinguished as a relevant variable when looking for significant correlations between the measures, especially in the case of BLANC.

10All correlations are measured using Kendall’s tau (τ) and are significant at p < 0.01.

                      MUC                      B3                CEAF          BLANC
                P      R      F         P      R      F        P/R/F       P      R    blanc
OntoNotes scheme
1.              –      –      –        100   72.68  84.18      72.68     49.24  50.00  49.62
2.            55.14  39.08  45.74     90.65  80.87  85.48      76.05     77.36  62.64  67.19
3.            47.10  53.05  49.90     82.28  83.13  82.70      75.15     73.32  66.92  69.59
4.            46.81  53.34  49.86     80.47  83.54  81.97      76.78     68.80  69.19  68.99
5.            46.51  40.56  43.33     84.95  80.16  82.48      76.70     66.36  62.01  63.83
6.            52.47  47.40  49.80     86.10  82.80  84.42      77.87     71.80  67.60  69.46
7.            47.91  64.64  55.03     71.73  87.46  78.82      71.74     55.30  76.13  57.45
ACE scheme
1.              –      –      –        100   50.96  67.51      50.96     47.29  50.00  48.61
2.            82.35  39.00  52.93     95.27  64.05  76.60      66.46     88.80  60.99  66.35
3.            70.11  53.90  60.94     86.49  68.20  76.27      68.44     81.56  64.71  69.66
4.            64.21  64.21  64.21     76.92  73.54  75.19      70.01     75.51  71.07  73.04
5.            60.51  56.55  58.46     76.71  69.19  72.76      66.87     72.34  65.77  68.38
6.            67.50  56.69  61.62     82.18  71.67  76.57      69.88     76.94  69.00  72.17
7.            63.52  80.50  71.01     59.76  86.36  70.64      64.21     62.74  83.19  66.45

Table 1.6: BLANC results for Table 4.5, Chapter 4. 1. = ALL SINGLETONS; 2. = HEAD MATCH; 3. = HEAD MATCH + PRON; 4. = STRONG MATCH; 5. = SUPER STRONG MATCH; 6. = BEST MATCH; 7. = WEAK MATCH

                      MUC                      B3                CEAF          BLANC
                P      R      F         P      R      F        P/R/F       P      R    blanc
OntoNotes scheme
1.              –      –      –        100   72.66  84.16      72.66     49.07  50.00  49.53
2.            56.76  35.80  43.90     92.18  80.52  85.95      76.33     79.95  62.34  67.32
3.            47.44  54.36  50.66     82.08  83.61  82.84      74.83     73.44  68.30  70.53
4.            52.66  58.14  55.27     83.11  85.05  84.07      78.30     73.86  74.74  74.29
5.            51.67  46.78  49.11     85.74  82.07  83.86      77.67     71.27  67.51  69.20
6.            54.38  51.70  53.01     86.00  83.60  84.78      78.15     74.31  70.96  72.50
7.            49.78  64.58  56.22     75.63  87.79  81.26      74.62     60.00  78.89  64.22
ACE scheme
1.              –      –      –        100   50.42  67.04      50.42     47.32  50.00  48.62
2.            81.25  39.24  52.92     94.73  63.82  76.26      65.97     87.43  61.09  66.36
3.            69.76  53.28  60.42     86.39  67.73  75.93      68.05     81.05  64.50  69.37
4.            58.85  58.92  58.89     73.36  70.35  71.82      66.30     72.08  67.69  69.60
5.            56.19  50.66  53.28     75.54  66.47  70.72      63.96     70.68  63.56  66.23
6.            63.38  49.74  55.74     80.97  68.11  73.99      65.97     73.36  65.24  68.29
7.            60.22  78.48  68.15     55.17  84.86  66.87      59.08     60.02  80.08  62.27

Table 1.7: BLANC results for Table 4.7, Chapter 4. 1. = ALL SINGLETONS; 2. = HEAD MATCH; 3. = HEAD MATCH + PRON; 4. = STRONG MATCH; 5. = SUPER STRONG MATCH; 6. = BEST MATCH; 7. = WEAK MATCH

Figure 1.3: Pairwise scatterplots of MUC, B3, CEAF and BLANC (OntoNotes in black, ACE in red)

There is a positive correlation between BLANC and CEAF in both ACE (τ = 0.82)10 and OntoNotes (τ = 0.57), between BLANC and MUC in both ACE (τ = 0.53) and OntoNotes (τ = 0.48), and, to a lesser degree, between BLANC and B3, but only in ACE (τ = 0.43). CEAF is also positively correlated in ACE with B3 (τ = 0.62) and MUC (τ = 0.46). The correlation between BLANC and the other measures is lower in OntoNotes, as this corpus has a larger number of singletons. As I discussed, the distinguishing characteristic of BLANC is its treatment of singletons. It is therefore better suited than the other measures for comparing coreference performance on corpora that are annotated with the complete set of mentions.

The quantitative analysis is supported by a qualitative evaluation. Appendix A provides a sample of system outputs for the same two documents from OntoNotes. For the first document (nbc 0030, Appendix A.1), all four measures rank the STRONG MATCH output as the best, but disagree on the ranking of CISTELL's other three outputs. B3 and CEAF, as opposed to MUC, clearly penalize WEAK MATCH due to their bias toward singletons. BLANC scores the non-STRONG-MATCH outputs very similarly and in a much lower range, thus hinting at their inferior quality.

For the second document (voa 0207, Appendix A.2), the winning output, being the least noisy, is SUPER STRONG MATCH, and it is also the one scored highest by all four measures. Again, MUC stands alone in ranking WEAK MATCH high. As evidence of the higher discriminative power of BLANC, STRONG MATCH and BEST MATCH are given the same score by CEAF, while BLANC shows a slight preference for the former, which agrees with a qualitative analysis of the two outputs.

The BLANC measure was publicly used for the first time in the SemEval competition, in keeping with the goals of the task (Chapter 6). Apart from the outputs of CISTELL, Appendix A.1 also includes the outputs of the six participating systems. It is thus possible to compare the numerical scores presented in Table 6.5 with the automatically annotated texts. From a qualitative perspective, the best output is that of CORRY-C, which is also the one ranked best by CEAF and BLANC. In contrast, B3 ranks RELAXCOR first, whose output includes a very small number of coreference links, and MUC rewards SUCRE, which tends to produce too many links. This provides further evidence for the singleton bias of B3 and the overmerging bias of MUC, respectively. TANL-1 and UBIU are ranked last by most metrics, which accords with the poor quality of their outputs. Finally, it is not possible to compare BART with the other outputs on the same level because it only participated in the regular setting, so scores for true mentions are not available.

Coreference continuum It is surprising that so few studies (Versley, 2008; Poesio and Artstein, 2005; Charolles and Schnedecker, 1993) have considered the theoretical implications of, and problems with, the current definition of coreference. If even humans do not agree on what is and what is not a coreference relation, unrealistic expectations should not be imposed on the performance of automatic coreference resolution systems. As already pointed out in Section 1.5.2, there are a number of relations in real data—as exemplified by (12) and (13)—that cannot be accounted for in terms of either coreference or non-coreference.

I identify the source of the problem as the assumption that coreference is explicable in terms of strict, categorical identity. Instead, I introduce the notion of near-identity in conjunction with a continuum model of coreference (Chapter 8). This provides room for understanding and modeling cases in which the identity relation is not total but partial: X and Y are the same with regard to particular features, but differ with regard to at least one feature. In everyday language, we often utter statements like You and I are the same when the intended meaning is that you and I have many things in common, or occupy a similar social position, although we are obviously two different persons and differ in many respects.

Moving into the domain of coreference, it is suggested that the (non-)coreference judgments we make are directly connected to the level of granularity at which the entities of a particular discourse are categorized. This level of granularity is in turn determined by the communicative purposes and coherence of the discourse.


As an example, consider the categorization of Spain in two different contexts: a historical description will favor a non-coreferential reading of Spain and the modern Spain, while a different context, like a current news article, will probably use the two indistinctly, favoring a coreferential reading. Positing a coreference continuum is consistent with the various degrees at which referentiality seems to operate (Fraurud, 1992). In fact, the more referential an entity is, the more we can specify it. While it makes perfect sense to conceive of Spain at a general, synchronous level, or to split it into 17th-century Spain, modern Spain, etc., it would be odd to think of the fork that you are eating with in terms of yesterday's fork or tomorrow's fork. The fact that not all entities can be conceived at the same level of specificity brings us back to my earlier point about the role of ontological types in entity individuation. Just as categories are organized along a continuum, coreference occurs along a continuum from identity to non-identity, with a range of near-identity relations between these extremes.

In line with Bybee (2010), who considers that linguistic structure derives from the application of domain-general cognitive processes (e.g., categorization, chunking), I suggest three cognitive operations of categorization that account for near-identity relations holding between entities that share most, but not all, feature values. Depending on whether a discourse entity augments, overrides or nullifies a feature value of an existing entity, I distinguish between specification, refocusing and neutralization, respectively. The first two create new indexical features by showing a different facet of a complex entity, whereas the third conflates two or more similar entities into a single category, thus neutralizing potential indexical features. This three-way distinction provides a flexible framework that, among other benefits, offers an operational rationale for the treatment of metonymy from the perspective of coreference, which remains a controversial area and a major cause of inter-annotator disagreement.

Specification, refocusing and neutralization are represented within the framework of Fauconnier's (1985; 1997) theory of mental spaces. It is a conceptually appropriate framework, and it also fits well with the ideas behind the CISTELL system. Mental spaces are abstract mental structures that we construct while we think and talk for purposes of local understanding, and onto which discourse entities are projected. Fauconnier recognizes that the tools of formal logic fail when confronted with the full range of natural language phenomena. Mental spaces, but not formal logic, can explain cases such as the split self (19) and split coreference (20), whose meaning requires splitting a single referent into two.

(19) a. If [I] were you, I'd hate [me].
     b. If [I] were you, I'd hate [myself].

(20) If [Woody Allen] had been born twins, [they] would have been sorry for each other, but [he] wasn't and so he's only sorry for himself.

Entities in a discourse are conceptualized by discourse participants with a set of associated features that have specific values according to the particular space. When

a new entity is introduced with properties that either clash with or remove detail from an existing discourse entity, a new mental space with the corresponding near-identical entity needs to be built. Split coreference is one of the relations that require a mental space shift. I have identified ten feature types that require such shifts when their values are changed, and organized them into a hierarchy (Recasens et al., 2010a). This hierarchy captures the most frequent ways in which near-identity relations arise. The details come toward the end of the thesis in Chapter 8. Appendix B supplements that chapter by providing the collection of near-identity excerpts that were used in the inter-annotator agreement study leading to the near-identity typology, and Appendix C reports the annotators' classification of the highlighted relations in the excerpts.
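Purely as an illustration of the three categorization operations introduced above, the following toy sketch treats a discourse entity as a dictionary of feature values; the feature names and the decision logic are invented for the example and do not reproduce the formal model of Chapter 8.

def categorization_operation(entity, new_mention):
    """Toy classification of how `new_mention` relates to `entity`, feature by feature."""
    for feature, new_value in new_mention.items():
        old_value = entity.get(feature)
        if old_value is None and new_value is not None:
            return "specification"   # augments: adds a previously unset facet
        if old_value is not None and new_value is None:
            return "neutralization"  # nullifies: removes a distinguishing facet
        if old_value != new_value:
            return "refocusing"      # overrides: shows a different facet
    return "identity"                # no feature value changes: plain coreference

spain = {"name": "Spain", "time": None}
print(categorization_operation(spain, {"name": "Spain", "time": "17th century"}))
# -> "specification", as in the shift from 'Spain' to '17th-century Spain'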

1.6 Major contributions

The primary contribution of this thesis is the critical but constructive insight it offers into various aspects of the coreference resolution task, ranging from resolution and evaluation to corpus annotation and theory. The final outcome is the development of a new approach to coreference based on the synthesis of corpus data, domain-general cognitive operations and Fauconnier's (1985; 1997) mental space theory. Unlike the prevailing either-or approach, the continuum model of coreference that I propose allows for middle-ground relations of near-identity to account satisfactorily for naturally occurring data. The major contributions of this thesis are:

• The Spanish and Catalan AnCora corpora (totaling nearly 800k words) annotated with coreference information, together with a coding scheme that addresses shortcomings of previous approaches and that incorporates specific tags for Spanish and Catalan.11

• A list of over forty-five learning features that are tested in a pairwise model of coreference resolution in Spanish. They are shown to be weakly informative on their own but support complex and unpredictable interactions.

• An entity-mention model of coreference resolution called CISTELL that combines and integrates pairwise decisions at the discourse level.

• An exposition of weaknesses in the widely-used coreference evaluation measures that obscure the evaluation of coreference systems and hinder direct comparisons between state-of-the-art systems.

• The implementation of the Rand index for coreference evaluation in the new BLANC measure, which takes into account—equally by default—both coreference and non-coreference links.

11http://clic.ub.edu/corpus/en/ancora


• A comparison between coreference and paraphrase, highlighting both their similarities and their differences in order to suggest areas of mutual collaboration between coreference resolution and paraphrase extraction.

• The organization of a SemEval shared task on coreference resolution, and the public release on the Web of the datasets, scorers, and documentation that were developed for the task.12

• A small corpus of 60 near-identity extracts, a typology of near-identity relations and an inter-annotator agreement study that proves its stability.

• A continuum model of coreference, ranging from identity to non-identity through near-identity relations, and three cognitive operations of categorization (i.e., specification, neutralization and refocusing) that account for the different stages along this continuum.

• The identification of a number of weaknesses of the current approach to coreference resolution that suggest a need to reconsider various aspects of the task.

These contributions constitute the contents of the following Chapters 2 to 8.

12http://stel.ub.edu/semeval2010-coref/

Part I

CORPUS ANNOTATION WITH COREFERENCE


CHAPTER 2

AnCora-CO: Coreferentially Annotated Corpora for Spanish and Catalan

Marta Recasens and M. Antònia Martí

University of Barcelona

Published in Language Resources and Evaluation, 44(4):315–345

Abstract

This article describes the enrichment of the AnCora corpora of Spanish and Catalan (400k each) with coreference links between pronouns (including elliptical subjects and clitics), full noun phrases (including proper nouns), and discourse segments. The coding scheme distinguishes between identity links, predicative relations, and discourse deixis. Inter-annotator agreement on the link types is 85%-89% above chance, and we provide an analysis of the sources of disagreement. The resulting corpora make it possible to train and test learning-based algorithms for automatic coreference resolution, as well as to carry out bottom-up linguistic descriptions of coreference relations as they occur in real data.

Keywords Coreference · Anaphora · Corpus annotation · Annotation scheme · Reliability study

2.1 Introduction

Producing a text requires us to make multiple references to the entities the discourse is about. Correspondingly, for a proper understanding of the text, we have to identify the entity each linguistic unit refers to and link those that are coreferent, that is, those that stand in an identity-of-reference relation. Following Webber's (1979) discourse model, coreference does not take place between real-world entities but between discourse entities, i.e., the (mental) entities in a listener's evolving model of the discourse, which may or may not correspond to something in the outside world.

Although often treated together with anaphora, coreference is different (van Deemter and Kibble, 2000). Coreference involves the semantico-referential level of language: in order to identify those expressions (whether anaphoric or non-anaphoric) that refer to the same discourse entity, we must first understand their semantics and find their referents. Anaphora, in contrast, occurs at the textual level: in order to interpret an empty (or almost empty) textual element—an anaphor—like el cicle 'the cycle' in (1-a),1 we need to go back in the text to find its antecedent (el seu primer cicle de concerts 'their first cycle of concerts'). Thus, anaphora and coreference work independently, although they can co-occur. We distinguish anaphoric coreference (1-a) from definite coreference (1-b), where the last expression (Endemol, productora del programa Gran Hermano 'Endemol, the production company for the Big Brother programme') is understood without the need of going back in the text. Finally, (1-c) shows that not all anaphoric relations are coreferent: les de moros i cristians 'those of Moors and Christians' is anaphoric, since the lexical head festes 'festivals' is retrieved from the previous expression festes de lluita de classes 'class struggle festivals,' but each expression refers to a different entity, i.e., they do not corefer.

(1) a. (Cat.) Els integrants del Cor Vivaldi assagen les peces del seu primer cicle de concerts. En aquesta primera edició del cicle ...
'The members of the Vivaldi Choir are rehearsing the compositions for their first cycle of concerts. In this first event of the cycle ...'

b. (Sp.) El director general de Telefónica Media, Eduardo Alonso, dijo hoy que la alianza con la productora Endemol ha beneficiado más a la empresa holandesa que a Telefónica. . . . esta alianza ha beneficiado más a John de Mol y a los socios de Endemol, productora del programa Gran Hermano.
'The director-general of Telefónica Media, Eduardo Alonso, said today that the alliance with the Endemol production company has benefitted the Dutch company more than Telefónica. . . . this alliance has been of more benefit to John de Mol and the partners of Endemol, the production company for the Big Brother programme.'

c. (Cat.) A algú se li acudirà organitzar festes de lluita de classes, igual que existeixen les de moros i cristians.
'Somebody will think of organizing class struggle festivals, just as there are those of Moors and Christians.'

The goal of anaphora resolution is to fill the empty (or almost empty) expressions

1All the examples throughout the article have been extracted from the AnCora-CO corpora. Those preceded by (Cat.) come from Catalan and those by (Sp.) from Spanish.

in a text, i.e., to find an antecedent for each anaphoric unit so that the latter is linked to the mention its interpretation depends on. Coreference resolution, on the other hand, aims to establish which (referential) noun phrases (NPs) in the text point to the same discourse entity, thus building coreference chains. Hence, while the outputs of anaphora resolution are antecedent-anaphor pairs, the outputs of coreference resolution are collections of mentions2 of different types (referential pronouns and their antecedents, proper nouns, definite NPs, discourse segments, etc.) that refer to the same discourse entity. Solving coreference can imply solving anaphora, i.e., anaphoric coreference. This article presents a language resource that can be used for coreference resolution as well as for limited anaphora resolution.3

Given its cohesive nature, coreference is a key element in the comprehensive interpretation of a text and, by extension, an interesting object of study both in computational and theoretical linguistics. By building the coreference chains present in a text, we can identify all the information about one entity. From a computational perspective, the identification of coreference links is crucial for a number of applications such as information extraction, text summarization, question answering, and machine translation (McCarthy and Lehnert, 1995; Steinberger et al., 2007; Morton, 1999). From a linguistic point of view, capturing the way a discourse entity is repeatedly referred to throughout a discourse makes it possible to obtain the different ways an entity can be linguistically expressed. Besides, empirical data on the way coreference relations are actually expressed provide a way to test hypotheses about the cognitive factors governing the use of referring expressions, such as those suggested by Ariel (1988) and Gundel et al. (1993).

The importance of the coreference resolution task in information extraction led to its inclusion in two Message Understanding Conferences (MUC)—1995 and 1998—and in the more recent ACE evaluation programs, as well as the Anaphora Resolution Exercise (ARE) (Orasan et al., 2008). It will also be one of the tasks at SemEval-2010 (Recasens et al., 2009a). Due to the complexity inherent in coreference, limitations of rule-based approaches (Hobbs, 1978; Baldwin, 1997; Lappin and Leass, 1994; Mitkov, 1998) may be overcome by machine learning techniques, which make it possible to automate the acquisition of knowledge from annotated corpora (Soon et al., 2001; Ng and Cardie, 2002b; Luo et al., 2004). The information extraction conception behind MUC and ACE is basically interested in finding all the information about a particular entity, thus conflating, for example, referential and predicative links. Since this lack of precision in defining coreference (against predicative links and other related phenomena) is problematic, one of our goals was to delimit the boundaries of the concept of "coreference" so as to annotate a corpus in a systematic and coherent way.

2Following the terminology of the Automatic Content Extraction (ACE) program (Doddington et al., 2004), a mention is defined as an instance of reference to an object, and an entity is the collection of mentions referring to the same object in a document.
3To obtain anaphoric coreference pronouns from AnCora-CO, one just needs to extract the pronouns that are included in an entity. By convention, we can assume that their antecedent corresponds to the previous mention in the same entity.
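Footnote 3 describes how antecedent-anaphor pairs can be read off the entity annotation; the following minimal sketch follows that convention (the data structures are illustrative and do not reflect the actual corpus format):

def anaphoric_pronoun_pairs(entities):
    """`entities` maps an entity id to its mentions in textual order; each mention
    is a (surface_form, is_pronoun) pair. Returns (antecedent, pronoun) pairs."""
    pairs = []
    for mentions in entities.values():
        for previous, current in zip(mentions, mentions[1:]):
            if current[1]:  # the anaphoric pronoun...
                pairs.append((previous[0], current[0]))  # ...takes the previous mention as antecedent
    return pairs

print(anaphoric_pronoun_pairs({"entity1": [("el president", False), ("ø", True)]}))
# [('el president', 'ø')]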


This article describes the annotation of the Spanish and Catalan AnCora corpora (Section 2.2) with coreference information. Currently, AnCora-CO comprises two 400,000-word corpora annotated with coreference links (distinguishing identity from discourse deixis and predicative relations) between pronouns, full noun phrases (including proper nouns), and discourse segments. AnCora-CO makes it possible to train corpus-based coreference resolution systems for Spanish and Catalan, as well as to infer linguistic knowledge about the way coreference relations occur in real data. Three main assets make AnCora-CO a valuable language resource: its size, its target languages, and the quality of its annotation—the coding scheme is the result of a study that takes into account linguistic evidence and schemes previously proposed for English (Section 2.3). The following sections provide details about the coding scheme (Section 2.4), the annotation tool (Section 2.5), statistics on the tags (Section 2.6), and inter-annotator agreement (Section 2.7). The article concludes with a discussion of the results (Section 2.8).

2.2 The corpora

Corpora annotated with coreference information are scarce. Those most widely used have been developed for English within the MUC and ACE evaluation programs (Hirschman and Chinchor, 1997; Doddington et al., 2004). However, both datasets call for improvement from a linguistic perspective: the former has been criticized for the underlying theoretical implications of the coding guidelines (van Deemter and Kibble, 2000), whereas the latter restricts coreference to relations between seven specific entity types.4 Other domain-specific corpora have also been or are being developed for English within ongoing annotation tasks (Mitkov et al., 2000; Poesio, 2004a; Hovy et al., 2006; Poesio and Artstein, 2008).

Coreferentially annotated corpora are even scarcer for languages other than English. Among these few we find Czech, German and Dutch (Kučová and Hajičová, 2004; Hinrichs et al., 2004; Stede, 2004; Hoste, 2005). For Spanish, there is the coreferentially annotated corpus developed for ACE-2007,5 but again the coreference links annotated are limited to the set of ACE-like entity types. There are also two small corpora of Spanish oral narratives and dialogues (Blackwell, 2003; Taboada, 2008), but they are highly restricted to pronominal references for the purpose of studying the neo-Gricean maxims and centering theory, respectively.

The annotation of coreference in AnCora constitutes an additional layer added on top of existing in-line annotations (Taulé et al., 2008): morphological (POS and lemmas), syntactic (constituents and functions) and semantic (argument structures, thematic roles, semantic verb classes, NEs, and WordNet nominal senses). The AnCora-CO corpus is split into two datasets: the Spanish corpus (AnCora-CO-Es) and the Catalan corpus (AnCora-CO-Ca). Each consists of 400,000 words derived

4ACE-2004 entity types include: person, organization, geo-political entity, location, facility, vehicle and weapon.
5http://projects.ldc.upenn.edu/ace/docs/Spanish-Entities-Guidelines_v1.6.pdf

from newspaper and newswire articles: 200,000 words from the Spanish and Catalan versions of El Periódico newspaper, and 200,000 words from the EFE newswire agency6 in the Spanish corpus and from the ACN newswire agency7 in the Catalan corpus. AnCora-CO is the largest multilayer annotated corpus of Spanish and Catalan. It is freely available from http://clic.ub.edu/corpus/en/ancora.8

2.3 Linguistic issues

Given that coreference is a pragmatic linguistic phenomenon highly dependent on the situational context, it does not fall under the topics traditionally dealt with by descriptive Spanish or Catalan grammars, apart from some occasional references (Bosque and Demonte, 1999; Solà, 2002). When analysing real data, we come across a wide range of units (e.g., pronouns in quoted speech) and relations (e.g., metonymic relations) which cannot easily be identified as coreferent or otherwise. Besides, although there are theoretical linguistic studies for English, coreference shows certain language-specific patterns. For instance, Spanish and Catalan make extensive use of elliptical pronouns in subject position, whereas English uses overt pronouns and shows a different distribution of definite NPs.

This endeavour at annotation met two needs—that of delimiting the boundaries of the concept of "identity of reference," and the need to deal with specific aspects of Spanish and Catalan. The design of the annotation scheme for AnCora-CO began by considering corpus data and listing problematic issues which the scheme needed to address specifically. Our approach was to develop a coding scheme with sufficient criteria to decide which tags had to be used and for what; that is, a scheme from which the corpora could be consistently annotated. What follows is a discussion of key issues concerning coreference annotation—illustrated with real data from the two languages—providing an overview of the coreference annotation in AnCora-CO by explaining how each issue was dealt with in the actual annotation.

1. Elliptical pronouns. Spanish and Catalan are pro-drop languages that allow pronominal subjects to be omitted if no contrast is being made. Coreference relations can thus involve elliptical elements.9

6http://www.efe.es
7http://www.acn.cat
8At present, a total of 300,000 words for each AnCora-CO corpus are freely downloadable from the Web. An additional subset of 100,000 words is being kept for test purposes in future evaluation programs.
9Elliptical subject pronouns are marked with ø and with the corresponding pronoun in brackets in the English translation.


(2) (Cat.) La mitjana d'edat dels ramaders és de 47 anys i ø tenen una jornada laboral de 73 hores setmanals.
'The average age of the stock farmers is 47 years and (they) have a 73-hour working week.'

Since elliptical subjects were inserted when AnCora was syntactically annotated (they have their own NP node), it is easy to include them when coding a coreference link. Elliptical subjects that are pleonastic –which are not as frequent as they are in English– are not annotated, as in the Catalan pattern ø és que... 'It is that...'

2. Clitic pronouns. Object personal pronouns appear as clitic forms in the two languages under consideration. Postverbal clitics take a different form in each language: Spanish clitics are adjoined to the verbal head (3-a), while the clitic is joined with a hyphen in Catalan (3-b).

(3) a. (Sp.) La intención es reconocer el gran prestigio que tiene la maratón y unirlo con esta gran carrera.
'The aim is to recognize the great prestige that the Marathon has and join|it with this great race.'

b. (Cat.) ø va demanar un esforç per assimilar l'euro amb rapidesa i no deixar-ho per més endavant.
'(She/He) called for an effort to assimilate the euro quickly and not postpone-it for later.'

Clitic pronouns are generally referential, except for inherent clitics that form a single unit of meaning with the verb (e.g., Sp. jugársela, Cat. jugar-se-la 'to risk it'). For spelling reasons, incorporated clitics do not have their own token in AnCora-Es. Hence, the verbal node is annotated for coreference,10 while Catalan clitics have their own NP node.

3. Quoted speech. Deictic first and second person pronouns (4-a) become anaphoric in quoted speech, and can thus be linked to the corresponding speaker. The first person plural pronoun presents two atypical uses that need to be taken into account. The royal we (4-b), which is used when somebody speaks not in his/her own name but as the leader of a nation or institution, is linked to such an organization if this appears explicitly in the text. Similarly, the editorial we (4-c) is commonly used in newspaper articles when referring to a generic person as we, as if the writer were speaking on behalf of a larger group of citizens. Since there is no explicit group to which these pronouns can be linked, first mentions are considered to have no antecedent,

10Two guiding principles in the morphological annotation of AnCora were (a) to preserve the original text intact, and (b) to assign standard categories to tokens, so that a category such as "verb-pronoun" for verbs with incorporated clitics was ruled out.


and subsequent mentions are linked with the closest previous editorial we pronoun.

(4) a. (Sp.) El guardameta del Atlético de Madrid, A. Jiménez, cumplió ayer uno de sus sueños al vencer al Barcelona. "ø Nunca había ganado al Barcelona".
'The Atlético de Madrid goalkeeper, A. Jiménez, yesterday realized one of his dreams by defeating Barcelona. "(I) had never beaten Barcelona".'

b. (Cat.) En paraules d'un dels directius de l'agència, "Ramón y Cajal ens va deixar tirats".
'In the words of one of the agency's board members, "Ramón y Cajal left us in the lurch".'

c. (Cat.) L'efecte 2000 era un problema real, encara que tots hem ajudat a magnificar-lo.
'The 2000 effect was a real problem, even though we all helped to magnify it.'

4. Possessives. Possessive determiners and possessive pronouns might have two coreferential links: one for the thing(s) possessed (5-a) and one for the possessor (5-b). The former is marked at the NP level, whereas the latter is marked at the POS level.11

(5) a. (Cat.) La diversitat pel que fa a la nacionalitat dels músics d'Il Gran Teatro Amaro és un dels factors importants, tot i que els seus components sempre han mostrat interès.
'The diversity of nationality among the musicians of Il Gran Teatro Amaro is one of the important factors, although its members have always shown interest.'

b. (Cat.) La diversitat pel que fa a la nacionalitat dels músics d'Il Gran Teatro Amaro és un dels factors importants, tot i que els seus components sempre han mostrat interès.
'The diversity of nationality among the musicians of Il Gran Teatro Amaro is one of the important factors, although its members have always shown interest.'

5. Embedded NPs. Coreference often involves NPs embedded within a larger NP. For instance, between the NPs el presidente de los Estados Unidos 'the president of the U.S.' and el presidente del país 'the president of the country,' two links are encoded: one between the entire NPs, and one between los Estados Unidos 'the U.S.' and el país 'the country.' However, if an embedded NP functions as an apposition, then the maximal NP principle applies,

11Possessive determiners are not considered NPs according to the syntactic annotation scheme.


by which only the largest stretch of NP is to be annotated. For this reason, a phrase such as la ciudad de Los Angeles 'the city of Los Angeles' is considered to be atomic. The maximal NP rule also applies to constructions of the type "the members of (the set)." In los jugadores de Argentina 'the players of Argentina,' Argentina refers to the football team12 rather than the country, and, since the team is equivalent to the players, coreference is marked for the entire NP.

6. Split antecedent. Plural NPs can refer to two or more individuals mentioned separately in the text.

(6) a. (Sp.) ø Propongo abrir la campaña con un debate político general y cerrarla con otro, aunque Ríos advirtió que él está dispuesto a que en esos debates participen los cabezas de otros partidos.
'(I) intend to start the campaign with a general political debate and end|it with another one, although Ríos indicated that he is prepared to allow the heads of other parties to participate in those debates.'

b. (Cat.) Un partit obert fins al final per les ocasions de gol a les dues porteries . . . El Racing va buscar la porteria contrària.
'A game open until the end due to the goal-scoring chances at both ends . . . Racing plugged away at the opposing goalmouth.'

Cases like (6-a) are resolved by building an entity resulting from the addition of two or more entities: entity1+entity2. . . The converse (6-b), however, is not annotated: mentions that are subentities of a previous entity are not linked, since this implies a link type other than coreference, namely part-of or set-member.

7. Referential versus attributive NPs. Not all NPs are referential; they can also be attributive. Schemes such as MUC and ACE treat appositive (7-a) and predicative (7-b) phrases as coreferential. Regarding MUC, van Deemter and Kibble (2000) criticize it for conflating "elements of genuine coreference with elements of anaphora and predication in unclear and sometimes contradictory ways." Besides, if attributive NPs are taken as coreferential, then other predicate-like NPs, such as the object complement of the verb consider, should be too (7-c), which might easily result in incorrect annotations.

(7) a. (Cat.) El grup de teatre Proscenium. ‘The theatrical company Proscenium.’

12The fact that Argentina is marked as NE-organization provides a clue for the annotators to apply the maximal NP principle. This principle, however, turned out to be a source of inter-annotator disagreement (see Section 2.7.2).


b. (Cat.) L’agrupament d’explotacions lleteres es´ l’unic´ cam´ı. ‘The unification of dairy operations is the only way.’ c. (Sp.) El Daily Telegraph considera a Shearer “el hombre del partido”. ‘The Daily Telegraph considers Shearer “the man of the match”.’

To remain faithful to the linguistic distinction between referential and attributive NPs, nominal predicates and appositional phrases are not treated as coreference in AnCora-CO. However, given that NPs identifying an entity by its properties can be useful for automatic coreference resolution, such relations are kept under the "predicative link" tag (see Section 2.4.2), which parallels the division between identical and appositive types followed in the OntoNotes annotation (Pradhan et al., 2007b). Keeping referential and attributive links apart makes it possible to use AnCora-CO at the user's discretion: either under a fine-grained definition of coreference or under a coarse one, obliterating the distinction between the two links in the latter case.

8. Generic versus specific NPs. Coreference links can occur on a specific or a more generic level. We decided that these two levels should not be mixed in the same coreference chain, since the referential level is not the same. This is especially relevant for time-dependent entities, since a generic celebration (e.g., the Olympic Games) differs from specific instantiations (e.g., the Barcelona Olympic Games). Likewise, a function type (e.g., the unemployment rate) takes different values according to time and place (e.g., the lowest unemployment rate in Spain at 6.6%). Thus, these NPs are not annotated as coreferent.

9. Metonymy. The referent referred to by a word can vary when that word is used within a discourse, as echoed by Kripke's (1977) distinction between "semantic reference" and "speaker's reference." Consequently, metonymy13 can license coreference relations between words with different semantic references (8).

(8) (Sp.) Rusia llegó a la conclusión . . . Moscú proclamó . . .
'Russia came to the conclusion . . . Moscow proclaimed . . . '

Metonymy within the same newspaper article is annotated as a case of identity, since, despite the rhetorical device, both mentions pragmatically corefer. It is just a matter of how the entity is codified in the text. The transitivity test (see Section 2.4.2 below) helps annotators ensure that the identity of reference is not partial but complete.

13Metonymy is the use of a word for an entity which is associated with the entity originally denoted by the word, e.g., dish for the food on the dish.


10. Discourse deixis. Some NPs corefer with a previous discourse segment (9).14 Since linking NPs with non-NP antecedents adds complexity to the task, and not all coreference resolution systems might be able to handle such relations, discourse deixis is kept separate as a different link type (see Section 2.4.2).

(9) (Sp.) Un pirata informático consiguió robar los datos de 485.000 tarjetas de crédito ... El robo fue descubierto...
'A hacker managed to steal data from 485,000 credit cards ... The theft was uncovered ...'

11. Bound anaphora. Although this relation has been treated as coreference in annotation schemes such as MUC, it expresses a relation other than coreference and therefore is not annotated in AnCora-CO. If in (10-a) cada una 'each' were taken as coreferent, then by the transitivity test15 it would follow that se quedaron con dos EDF y Mitsubishi 'EDF and Mitsubishi took two,' i.e., that a total of two licenses—not four—were bought. In contrast, coreference is allowed in (10-b) since, by being distributed into each of the components, cada equipo 'each team' results in a whole that equals the sum of the parts.

(10) a. (Sp.) EDF y Mitsubishi participaron en la licitación de licencias para construir centrales eléctricas y se quedaron con dos cada una.
'EDF and Mitsubishi participated in the bidding for licenses to build power stations and took two each.'

b. (Sp.) Brasil buscará el pase a la final ante los vigentes campeones, los australianos. Los números uno de cada equipo, Rafter y Kuerten, abrirán el fuego en la primera jornada.
'Brasil will be looking to pass to the final against the current champions, the Australians. The number ones of each team, Rafter and Kuerten, will open the first day's play.'

12. Bridging reference. Bridging relations (Clark, 1977) are also left out of annotation since they go beyond our scope. Bridging holds between two elements in which the second element is interpreted by an inferential process ("bridge") from the first, but the two elements do not corefer. A bridging inference between l'Escola Coral 'the Choral School' and els alumnes 'the students' (11) is triggered by the definite article in the latter NP.

14Given the length of some discourse segments, in the examples of discourse deixis coreferent mentions are underlined in order to distinguish them clearly from their antecedent. 15We are replacing cada una ‘each’ with the coreferent candidate EDF y Mitsubishi ‘EDF and Mitsubishi.’ In the English translation, an inversion of verb-subject order is required.


                                    MUC ACE MATE AnCora-CO
1. Elliptical pronouns              ✓ ✓
2. Clitic pronouns                  ✓ ✓
3. Quoted speech                    ✓ ✓ ✓ ✓
4. Possessives                      ✓ ✓ ✓ ✓
5. Embedded NPs                     ✓ ✓ ✓ ✓
6. Split antecedent                 ✓ ✓
7. Referential versus attributive   ✓ ✓ ✓
8. Generic versus specific          ✓ ✓
9. Metonymy                         ✓ ✓ ✓
10. Discourse deixis                ✓ ✓
11. Bound anaphora                  ✓ ✓
12. Bridging reference              ✓

Table 2.1: Coverage of different coreference coding schemes

(11) (Cat.) L’Orfeo´ Manresa` posa en marxa el mes d’octubre l’Escola Coral. Es tracta d’un projecte destinat a despertar en els alumnes la passio´ pel cant coral. ‘The Manresa Orfeo starts the Choral School in October. It is a project aimed at arousing among the students a passion for choral singing.’

2.4 Annotation scheme

Despite the existence of a few coreference annotation schemes, there is no standard as yet, a shortcoming largely accounted for by the complexities of the linguistic phenomenon (see Section 2.3). Due to space constraints, we will not go into detail about the various annotation schemes used in former annotation endeavours. Instead, Table 2.1 sums up three of the most widely-used existing schemes by showing whether or not they include (✓) the issues outlined in Section 2.3. The first two were used to encode the corpora for the MUC and ACE programs (Hirschman and Chinchor, 1997; Doddington et al., 2004); the MATE meta-scheme (Davies et al., 1998; Poesio, 2004b) is different in that it is not linked with a specific corpus but constitutes a proposal for dialogue annotation with a wide range of potential tags from which the designer can build his own scheme. The final column in Table 2.1 sets the coding scheme used in the AnCora-CO corpora against the other two, highlighting the arguments put forward in the previous section.

The MUC and ACE schemes depend to a great extent on the evaluation tasks for which the corpora were originally developed, which makes them either inconsistent or limited from a linguistic point of view. In contrast, the flexibility offered by the MATE meta-scheme and its proposals for languages other than English has

prompted us to adopt it—taking into account subsequent revisions and implementations (Poesio, 2004b; Poesio and Artstein, 2008)—as the model on which we base our annotation scheme for the AnCora-CO corpora.16 Our aim is for AnCora-CO to be used to train/test coreference resolution systems as well as for linguistic enquiries and research on coreference. Consequently, the annotated features in our scheme are not only thought of as useful learning features but are also linguistically motivated. In order to set limits that render the annotation task feasible, we elected to restrict it to:

(a) Coreference links, ruling out any consideration of bound anaphora and bridging relations.

(b) NP reference. Other expressions like clauses and sentences are only encoded if they are subsequently referred to by an NP.

The task of coreference annotation involves two types of activities: marking of mentions and marking of coreference chains (entities).

2.4.1 Mentions

Given that AnCora already contains other annotation layers, the starting point for the marking of mentions was the existing rich hierarchical syntactic annotation. On the one hand, identifying mention candidates by using the output of the manual syntactic annotation freed coders from worrying about the exact boundaries of NPs. On the other hand, the existing syntactic tags constrained some decisions concerning coreference annotation. Nine types of syntactic nodes were eligible to be mentions:

(a) sn (NP)
(b) grup.nom (nominal group in a conjoined NP)
(c) relatiu (relative pronoun)
(d) d (possessive determiner)17
(e) p (possessive pronoun)17
(f) v (verb)18
(g) grup.verb (verbal group)
(h) S (clause)
(i) sentence

Units (a)-(f) are those considered as potential mentions in a coreference chain, while units (g)-(i) are only included in a coreference chain if they are subsequently

16http://clic.ub.edu/corpus/webfm send/15
17The POS of possessive determiners and pronouns contains the entity corresponding to the possessor; the entire NP contains the entity corresponding to the thing(s) possessed.
18Verb nodes can only be a mention if they contain an incorporated clitic. The intention in annotating the verb is actually annotating the reference of the clitic, and this applies in Spanish only.

referred to by one of the other units. To indicate whether (a)-(f) mentions are referential or not, the attribute entityref is specified with one out of five possible values (the absence of the attribute is one of the values). The first three values identify the set of referential mentions, i.e., mention candidates to participate in a coreference link (see Section 2.4.2 below).

1. Named entity ("ne"). The concept of named entity (NE) has its origins in the Named Entity Recognition and Classification tasks, an offspring of Information Extraction systems, and it is still central today in the NLP field, being a core element in the ACE competition. Information about NEs in AnCora comes from existing semantic annotations (Borrega et al., 2007), where NEs are defined as those nouns whose referent is unique and unambiguous, e.g., Obama; onze del matí '11 am.' They fall into six semantic types: person, organization, location, date, number and others (publications, prizes, laws, etc.). Coreference annotation takes into account weak NEs, as these are the ones marked at the NP level.19 They are either NPs containing a proper noun (e.g., Los Angeles; la ciudad de Los Angeles 'the city of Los Angeles'), or definite NPs whose head is a common noun modified by a national or a relational adjective (e.g., el gobierno vasco 'the Basque government').

2. Specific ("spec"). Specific mentions corefer with an NE and have the form of an anaphoric pronoun (12-a) or a full NP that contains no proper noun or trigger word (12-b).

(12) a. (Sp.) Klébanov[entityref="ne"] manifestó que ø[entityref="spec"] no puede garantizar el éxito al cien por cien.
'Klébanov stated that (he) cannot guarantee 100% success.'

b. (Cat.) En un sentit similar s'ha manifestat Jordi Pujol[entityref="ne"] . . . El president[entityref="spec"] ha recordat . . .
'To a similar effect Jordi Pujol voiced his opinion . . . The president recalled . . . '

3. Non-named entity (“nne”). This value identifies mentions that refer to an entity with no specific name (13); that is, referential mentions which are neither “spec” nor “ne.”

(13) (Sp.) La expansión de la piratería en el Sudeste de Asia puede destruir las economías de la región.
'The extension of piracy in South-East Asia could destroy the economies of the region.'

4. Lexicalized ("lex"). Lexicalized mentions are non-referential mentions that are part of a set phrase or idiom (14-a), including clitics inherent in pronominal verbs (14-b).

19Strong NEs correspond strictly to the POS level (nouns, e.g., Los Angeles).


(14) a. (Sp.) Dar las gracias. ‘To give thanks.’ b. (Cat.) Passar-les magres. ‘To have a hard time.’20

5. No entityref attribute indicates that the mention is non-referential (and other than lexicalized). It can be an attributive NP (15-a), a nominal predicate (15-b), an appositive phrase, a predicative complement (15-c), a negated NP (15-d), an interrogative pronoun (15-e), a measure NP (15-f), or the Catalan partitive pronoun en.

(15) a. (Sp.) Sistema de educación. ‘Education system.’
b. (Sp.) La hipótesis de la colisión era la más probable. ‘The collision hypothesis was the most likely.’
c. (Sp.) Julio Valdés fue elegido como el quinto mejor futbolista de Centroamérica. ‘Julio Valdés was chosen as the fifth best football player in Central America.’
d. (Sp.) No se les exige ninguna prueba de capacitación. ‘No proficiency test is required of them.’
e. (Sp.) Las dudas sobre quién ganará las elecciones. ‘The doubts as to who is going to win the elections.’
f. (Sp.) Andrés Palop estará cuatro meses de baja. ‘Andrés Palop will be on leave for four months.’

A second attribute, homophoricDD, is meant to identify Halliday and Hasan’s (1976) homophoric definite descriptions, which are proper-noun-like and generic definite NPs that refer to something in the cultural context or world view, e.g., (Cat.) la ira ‘the anger’, l’actualitat ‘the present time’, les dones ‘women.’ A test for homophoricity is whether the mention can be the first mention of an entity in a text, i.e., requiring no previous introduction. The NEs that appear in newspaper articles are usually assumed to be already hearer-old and, if not, they are accompanied by a relative clause or an appositive. Therefore, this attribute is not specified for NEs, but only for mentions that are entityref=“nne” and definite (introduced by the definite article). Notice that, unlike English, generic NPs in Spanish and Catalan are introduced by the definite article. The third attribute specific to mentions is title. It is assigned the value “yes” if the mention is part of a newspaper headline or subheading.

20The original version with the inherent clitic is untranslatable into English.


2.4.2 Coreference chains

Coreferent mentions are assigned an entity attribute whose value specifies an entity number (“entity#”). Hence, the collection of mentions referring to the same discourse entity all have the same entity number. Our set of coreference relations restricts those proposed in MATE to three, which correspond to the three values that the coreftype attribute can take. A coreftype is specified for all mentions coreferent with a previous one. Additionally, mentions linked either by a discourse deixis or a predicative relation contain a corefsubtype attribute with semantic information. The different coreference types and subtypes are now discussed and exemplified, thus highlighting the range of relations covered by our scheme. The annotation guidelines explicitly went for high precision at the expense of possibly low recall: coders were told to avoid any dubious link.
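As a minimal illustration of how the entity attribute induces coreference chains, the following Python sketch groups hypothetical mention records by their entity number; the dictionary keys (id, entity, coreftype) mirror the attributes described in this chapter, but the record format itself is an assumption made for the example, not the actual AnCora-CO XML.

```python
from collections import defaultdict

def build_chains(mentions):
    """Group mentions into coreference chains by their shared entity number.

    `mentions` is a hypothetical list of dicts such as
    {"id": "m3", "entity": "entity1", "coreftype": "ident"}.
    Mentions without an entity value are treated as singletons here (an assumption).
    """
    chains = defaultdict(list)
    for m in mentions:
        if m.get("entity"):
            chains[m["entity"]].append(m["id"])
    return dict(chains)

# All mentions sharing "entity1" end up in the same chain, which is exactly what
# the transitivity test described below licenses.
print(build_chains([
    {"id": "m0", "entity": "entity1", "coreftype": None},
    {"id": "m3", "entity": "entity1", "coreftype": "ident"},
    {"id": "m5", "entity": None},
]))
# -> {'entity1': ['m0', 'm3']}
```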

• Identity (“ident”). This tag marks referential mentions that point to the same discourse entity as a previous mention in the text. What we call a “transitivity test” is performed to check whether an identity relation holds between two mentions: if mention A can occupy the slot that mention B occupies in the text with no change in meaning, then A and B corefer.21 Table 2.2 shows a sample of mention pairs from different entities (AnCora-CO-Es). The sixth row illustrates an instance of a split antecedent that results from the union of Entity 1 and Entity 4.

• Discourse deixis (“dx”). Following the terminology proposed by Webber (1988), this tag is used for mentions that corefer with a previous verb, clause, or one or more sentences (16). The set of possible antecedents is given by the underlying syntactic annotations: mentions of types (g)-(i), i.e., verbs, clauses, and sentences.

(16) a. (Sp.) Un pirata informático consiguió robar los datos de 485.000 tarjetas de crédito . . . El robo fue descubierto.
‘A hacker managed to steal data from 485,000 credit cards. . . . The theft was uncovered.’
b. (Cat.) El 1966, la monja va vomitar sang. El fet es va repetir al cap de sis mesos.
‘In 1966, the nun brought up blood. The incident recurred six months later.’
c. (Sp.) El jefe de las Fuerzas Armadas de Sudáfrica, el general Nyanda, afirmó en su primera visita oficial a Angola que las Fuerzas Armadas de este país “consiguieron destruir buena parte de las fuerzas convencionales de UNITA.” El general sudafricano hizo estas declaraciones.

21The transitivity test extends to all the mentions in the same entity so that if mention A corefers with mention B, and mention B corefers with mention C, then it is possible to replace mention C by mention A with no change in meaning, and vice versa.


Entity      Mention(a)                                                    Mention(b)                                                     Mention(b) form
Entity1     el cuarto socio de CGO ‘the fourth partner of CGO’            IJM Corporation Berhad                                         Proper noun
Entity2     Buenos Aires                                                  la capital argentina ‘the Argentinian capital’                 Definite NP
Entity3     acciones ‘shares’                                             acciones ‘shares’                                              Bare NP
Entity4     tres de las empresas de CGO ‘three of the companies of CGO’   ø                                                              Elliptical subject
Entity1+4   los socios de CGO ‘the partners of CGO’                       que ‘that’                                                     Relative pronoun
Entity5     Ecuador                                                       le ‘it’                                                        Third person pronoun
Entity6     mi equipo ‘my team’                                           nosotros ‘we’                                                  First person pronoun
Entity7     Emil Zapotek                                                  un superhombre capaz de ganar ‘a superman capable of winning’  Indefinite NP
Entity8     Barça                                                         ganarle ‘beat|them’                                            Clitic pronoun

Table 2.2: Sample of mentions with an identity link (AnCora-CO-Es)

‘The head of the Armed Forces of South Africa, general Nyanda, stated on his first official visit to Angola that the Armed Forces of this country “managed to destroy a large part of UNITA’s conventional forces.” The South African general made these declarations.’

Since discourse-deictic mentions can make reference to different aspects of a previous discourse segment, they take a corefsubtype attribute, which can be of three types:
• Token (16-a). The mention refers to the same event-token (i.e., same spatial and temporal coordinates) as the previous segment.
• Type (16-b). The mention refers to an event of the same type as the segment, but not the same token.
• Proposition (16-c). The mention refers to the segment as a linguistic object, i.e., the proposition itself.

Existing corpora annotated with discourse deixis are small (Eckert and Strube, 2000; Navarretta, 2007). The coreference annotation in the ongoing OntoNotes project—developing three large corpora for English, Chinese and Arabic—includes discourse deixis but only considers the heads of VPs as possible antecedents (Pradhan et al., 2007b). This is the most straightforward solution, but it might fail to capture the precise extension of the antecedent. The coreference annotation of AnCora-CO is done on top of the already existing syntactic annotation, which conditions in some cases the coreference annotation because a discourse segment can be considered to be the


antecedent from a linguistic perspective, but the segment might not be a syntactic constituent.

• Predicative (“pred”). This tag identifies attributes of a mention that are expressed by a nominal predicate (17-a), an appositive phrase (17-b,c), or a parenthetical phrase (17-d). These relations are not coreferential, but keeping track of predicative information can be helpful when training a computational coreference resolution system, since an attribute often adds information by renaming or further defining a mention. Besides, as stated previously, by including predicative links we give users the chance to decide whether or not they prefer to collapse the distinction between coreference and predication.

(17) a. (Sp.) Unión Fenosa Inversiones es una empresa del grupo español Unión Fenosa.
‘Unión Fenosa Inversiones is a company in the Spanish group Unión Fenosa.’
b. (Cat.) Han demanat una entrevista amb el conseller d’Indústria, Antoni Subirà.
‘They have asked for an interview with the Minister of Industry, Antoni Subirà.’
c. (Cat.) Hi podrà participar tothom, actuant com a moderadora Montserrat Clua, membre de la facultat d’Antropologia de la Universitat Autònoma de Barcelona.
‘Everybody will be able to participate. Montserrat Clua, a member of the faculty of Anthropology at the Autonoma University of Barcelona, will act as a moderator.’
d. (Sp.) Los ministros de Defensa de la Unión Europea (UE) celebrarán el próximo lunes en Bruselas una conferencia.
‘The Ministers of Defence of the European Union (EU) will be attending a conference in Brussels next Monday.’

Predicative link types contain a corefsubtype that indicates a semantic distinction, specifying whether the attribution is:
• Definite. A definite attribution occurs in both equative and identificational clauses, in which a defining feature of the subject is described (17-b,d). It might be expressed by a proper noun, a phrase introduced by the definite article, or a bare NP.22
• Indefinite. A characterizing but non-identificative feature of the mention (17-a,c) is expressed.

Negated or modal predicates (18) are not annotated since they either say what the mention is not, or provide a description dependent on a subjective perspective.

22In Spanish and Catalan, unlike English, equative appositive and copular phrases often omit the definite article.


Figure 2.1: The XML file format exemplified with the sentence La Comisión Europea anunció hoy que ø ha recibido la notificación ‘The European Commission announced today that (it) received the notification.’ Notice that the bold entity number “entity1” marks the identity coreference relation between la Comisión Europea ‘the European Commission’ and an elliptical subject (‘it’)


(18) (Sp.) Andalucía no es propiedad del PSOE. ‘Andalusia is not the property of the PSOE.’

2.5 Annotation tool

The corpus was created using AnCoraPipe (Bertran et al., 2008), an annotation tool developed at the University of Barcelona for the purpose of accommodating and unifying the attribute-value pairs of each coding level. To this end, the tool uses the


Figure 2.2: Left screenshot of the coreference annotation tool in AnCoraPipe

same XML data storage format for each stage (Fig. 2.1). Given that the previous annotation layers of AnCora were already encoded in an in-line fashion, AnCoraPipe employs this format, unlike other similar tools, such as MMAX2 (Müller and Strube, 2006), which support standoff markup. Although the advantages of standoff coding are well known (Ide, 2000), especially in resolving the conflict of overlapping hierarchies of data elements, the conversion of AnCora-CO to a standoff data architecture remains a project for the future.
The tool efficiently handles annotation on multiple linguistic levels, and coders can easily switch from one level to another (e.g., to correct mistakes found in another layer). In this way, the required annotation time is reduced and the integration of the coders’ work is seamless. The corpora in the local machine are associated with a server so that, as soon as an annotator modifies a file, the latter is uploaded to the server before other users add further annotations.
AnCoraPipe provides an additional tool for coreference annotation that makes the process faster and more user-friendly (Figs. 2.2, 2.3). Mentions that are annotated with an entity number appear highlighted in the text in different colours.
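To make the in-line format concrete, the following sketch reads a tiny fragment in the spirit of Fig. 2.1 with Python's standard xml.etree library. Only the attribute names (entityref, entity, coreftype) follow the scheme described in Section 2.4; the element names and the wd attribute are assumptions made for the example, not the actual AnCora-CO DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-line fragment; the tag names and the wd attribute are made up,
# only entityref/entity/coreftype follow the annotation scheme described above.
SAMPLE = """
<sentence>
  <sn entityref="ne" entity="entity1" wd="La Comision Europea"/>
  <sn entityref="spec" entity="entity1" coreftype="ident" wd="(elliptical subject)"/>
</sentence>
"""

chains = {}
for node in ET.fromstring(SAMPLE).iter("sn"):
    chains.setdefault(node.get("entity"), []).append(node.get("wd"))

print(chains)  # {'entity1': ['La Comision Europea', '(elliptical subject)']}
```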


Figure 2.3: Right screenshot of the coreference annotation tool in AnCoraPipe

Attribute-values can easily be added, changed, or removed. Fig. 2.2 shows the left side of the screen and Fig. 2.3 shows the right side of the screen. The screen is divided into four panels:
Top left (Fig. 2.2, top). The raw text contained in one file (i.e., one newspaper article).
Bottom left (Fig. 2.2, bottom). The selected syntactic nodes being labelled.
Top right (Fig. 2.3, top). The attribute-value information.
Bottom right (Fig. 2.3, bottom). The collection of annotated multi-mention entities.
In Fig. 2.2, the NP El nou Pla General que aquesta nit ha d’aprovar el ple is the mention currently considered as a potential coreferent mention. In order to add it to an entity (i.e., a coreference chain), the coder clicks on the corresponding entity in the bottom-right panel (Fig. 2.3). The values of the remaining attributes for this mention are selected in the top-right panel (Fig. 2.3). All mentions with the same entity number (“entity1” in this example) corefer.


A total of seven annotators contributed to the process of enriching AnCora with coreference information, although throughout the process the average number of coders working at any given time was never more than three. They were all graduates or final-year undergraduates of linguistics, and were paid for their work. The annotation process was divided into two stages: a first pass in which all mention attributes and coreference links were coded, and a second pass in which the newly annotated files were revised.

2.6 Distributional statistics

This section provides distributional statistics for the coreference tags in the corpora, which are very similar for the two languages under consideration. AnCora-CO-Es (422,887 tokens) contains 134,247 NPs, of which 24,380 (18.16%) are not marked as referential mentions. AnCora-CO-Ca (385,831 tokens) contains 122,629 NPs, of which 24,906 (20.31%) are non-referential. Table 2.3 shows the distribution of mentions, and provides details of the number of mentions sorted by POS. We distinguish between isolated, first, and subsequent mentions. It emerges that about 1/2 of the mentions are isolated, 1/6 are first mentions, and 1/3 are subsequent mentions. Coreferential mentions are split into pronouns (1/3) and full NPs (2/3).
The number of entities, including those containing a single mention, is 89,206 in Spanish, and 81,386 in Catalan. The distribution of coreftype and corefsubtype tags over mentions marked as coreferent is presented in Table 2.4. These are pairwise links, which means that 17,884 non-single-mention entities include 45,909 links (AnCora-CO-Es), and 16,545 non-single-mention entities include 41,959 links (AnCora-CO-Ca). Table 2.5 shows the distribution of entities according to their size (i.e., the number of mentions they contain).
These statistics reveal interesting linguistic issues which could open up many avenues for future research. Notice, for instance, the high percentage of definite NPs that are isolated or first mentions, which confirms the findings of the studies conducted by Fraurud (1990) and Poesio and Vieira (1998) in Swedish and English, respectively. The number of first-mention definites in Spanish and Catalan is even higher (see Recasens et al. (2009b) for a more detailed exploration).
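The isolated/first/subsequent breakdown underlying Table 2.3 can be recomputed from the entity annotation alone. A minimal sketch follows, assuming mentions are available in textual order as (mention id, entity id) pairs, which is a simplification of the actual corpus format.

```python
from collections import Counter

def chain_positions(mentions):
    """Classify each mention as isolated, first or subsequent from its entity id.

    `mentions` is a hypothetical list of (mention_id, entity_id) pairs in textual order.
    """
    sizes = Counter(entity for _, entity in mentions)
    seen, counts = set(), Counter()
    for _, entity in mentions:
        if sizes[entity] == 1:
            counts["isolated"] += 1
        elif entity not in seen:
            counts["first"] += 1
        else:
            counts["subsequent"] += 1
        seen.add(entity)
    return counts

# Toy document: entity e1 has three mentions, e2 and e3 are singletons.
print(chain_positions([("m0", "e1"), ("m1", "e2"), ("m2", "e1"), ("m3", "e3"), ("m4", "e1")]))
# counts: isolated=2, first=1, subsequent=2
```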

2.7 Inter-annotator agreement

There is widespread agreement on the fact that coders’ judgments in semantic and pragmatic annotation tasks such as coreference are very subjective and, consequently, that the resulting annotations need to be tested for reliability. To this end, inter-annotator agreement is assessed. Consistency can only be achieved if the coding instructions are appropriate for the data, and annotators understand how to apply them. A reliability study on a sample of the corpus makes it possible to pinpoint both the strengths and weaknesses of the coding scheme, and make the necessary changes before proceeding to the annotation of the entire corpus.


                          AnCora-CO-Es                       AnCora-CO-Ca
POS                Isolated(a)  First(b)  Subseq.(c)   Isolated(a)  First(b)  Subseq.(c)
Pronoun
  Personal              0.26      0.08      1.76           0.35       0.07      2.20
  Elliptical            0.44      0.18      5.74           0.44       0.13      4.98
  Relative              1.41      0.01      4.43           0.68       0.01      4.97
  Demonstrative         0.13      0.06      0.15           0.19       0.06      0.16
  Subtotal              2.25      0.33     12.08           1.67       0.28     12.31
Full NP
  Bare common N        10.42      0.71      0.90          11.68       0.87      0.99
  Bare proper N         5.72      2.02      5.76           5.71       1.79      4.93
  Indefinite            5.06      1.43      0.88           5.01       1.60      0.97
  Definite             17.73      7.19     11.46          19.32       7.63     12.78
  Demonstrative         0.59      0.16      0.96           0.69       0.15      1.17
  Possessive(d)         2.14      0.41      0.58            –          –         –
  Numeral               2.96      0.37      0.22           2.58       0.39      0.15
  Subtotal             44.62     12.28     20.76          44.99      12.44     20.99
Coordinated             2.77      0.35      0.29           3.28       0.38      0.31
Misc.                   3.49      0.35      0.41           2.93       0.40      0.03
Total                  53.13     13.32     33.55          52.88      13.49     33.63

a Isolated mentions are entities with a single mention in the text.
b First mentions are the first reference to a multi-mention entity.
c Subsequent mentions are references to a multi-mention entity other than first mentions.
d Possessive NPs are always preceded by the definite article in Catalan, so they are included in the count of definites.

Table 2.3: Distribution of mentions according to POS and chain position (%)

Coreftype           Corefsubtype    AnCora-CO-Es   AnCora-CO-Ca
Identity                                89.11          91.42
Discourse deixis                         2.50           2.35
                    Token                1.88           1.74
                    Type                 0.22           0.34
                    Proposition          0.40           0.27
Predicative                              8.39           6.23
                    Definite             6.48           4.90
                    Indefinite           1.91           1.33

Table 2.4: Distribution of coreftype and corefsubtype tags (%)


Entity size      AnCora-CO-Es   AnCora-CO-Ca
1 mention            79.95          79.66
2 mentions           11.15          11.25
3-5 mentions          6.46           6.64
6-10 mentions         1.72           1.77
> 10 mentions         0.72           0.68

Table 2.5: Distribution of entity tags according to number of mentions (%)

Different agreement coefficients have been used by the discourse processing community, but there is no standardized metric for agreement on coreference. In their survey, Artstein and Poesio (2008) point out the main problems in using percent agreement and the kappa coefficient (Siegel and Castellan, 1988; Carletta, 1996). On the one hand, percent agreement does not yield values that can be compared across studies, since some agreement is due to chance, and the amount of chance agreement is affected by two factors that vary from one study to another:

(a) The number of categories (the fewer categories, the higher the agreement expected by chance).
(b) The distribution of items among categories (the more common a category, the higher the agreement expected by chance).

On the other hand, kappa is corrected for chance agreement, but it is not appropriate for all types of agreement because it assumes that all disagreements are equal. A third coefficient, alpha (α), overcomes the two previous limitations by being both chance-corrected and weighted (Krippendorff, 1980).

2.7.1 Reliability study

In this section we present a reliability study on the annotation scheme presented in Section 2.4, as applied to data from AnCora-CO. Given the high cost of conducting such studies, time, budget and personnel constraints prompted us to limit the scope of the experiment to the core tag of the coreference coding scheme (the coreftype attribute) and to data from the Spanish corpus as a representative sample. Taking into account that most work on reference is limited to pronominal anaphors and has used kappa, we were mainly interested in analyzing to what extent coders agreed on assigning identity versus non-coreference relations for both pronominal and non-pronominal NPs. Specifically, we set out to:

1. Examine the coverage and tag definitions of the coding scheme.
2. Test the adequacy and clarity of the annotation guidelines.
3. Identify cases raising significant issues, with a view to establishing a typology of sources of disagreement.


The results show that the annotation of AnCora-CO is reliable to an acceptable degree.23 Thus, the corpora can serve as a valuable language resource on which to base studies of coreference in Catalan and Spanish, as well as reference on a more general level.

2.7.1.1 Subjects

Six volunteer undergraduates (with no previous experience in corpus annotation) and two linguistics graduates (two of the annotators who had worked on the corpus) participated in the experiment, all of them students at the University of Barcelona and native bilingual Spanish-Catalan speakers.

2.7.1.2 Materials

A total of four newspaper texts from the AnCora-CO-Es corpus were used: two24 (838 tokens, 261 mentions) in the training stage, and the other two25 (1,147 tokens, 340 mentions) in the testing stage. In both cases, the second text was more complex than the first one, being longer and including a higher number of ambiguities and discourse-deictic relations. Given the shortage of time, the chosen texts were short, but each one included at least two instances of every link type.

2.7.1.3 Tools

The annotations were performed on three computers with Windows XP using the PALinkA annotation tool (Orasan, 2003).26

2.7.1.4 Procedure

The experiment was run in four ninety-minute sessions: two training sessions and two testing sessions. Annotators were given the set of mentions (NPs) and had to decide for each of them whether it was coreferent or not. If so, the appropriate value for the coreftype attribute had to be selected, in addition to the entity. During the first two sessions, coders familiarized themselves with the annotation tool and guidelines, and feedback was provided to each of them after the mock annotation of two texts. In the last two sessions, they annotated the two test texts separately from each other.

23It is common practice among researchers in Computational Linguistics to consider 0.8 the absolute minimum value of α to accept for any serious purpose (Artstein and Poesio, 2008).
24Files 11177_20000817 and 16468_20000521.
25Files 17704_20000522 (Text 1, 62 coreferent mentions) and 17124_0001122 (Text 2, 88 coreferent mentions).
26At the time of the experiment, AnCoraPipe (the annotation tool that was used for the actual annotation) was not ready yet.


Mention  Coder A  Coder B  Coder C  Coder D  Coder E  Coder F  Coder G  Coder H
m0          1        1        1        1        1        1        1        1
m1          1        1        1        1        1        1        1        1
m2          3        3        3        3        3        3        3        3
m3          1        1        1        1        1        1        1        1
m4          1        1        1        1        1        1        1        1
m5          1        1        1        1        1        1        1        1
m6          1        1        1        1        1        1        1        1
m7          1        1        1        1        1        1        1        1
m8          1        1        1        1        1        1        1        1
m9          1        1        1        1        1        1        1        1
m10         1        1        1        4        1        4        1        1

Table 2.6: Partial agreement matrix for Text 1 (Each value identifies a different link type: 1 = non-coreference; 2 = discourse deixis; 3 = predicative; 4 = identity)

2.7.1.5 Results

Artstein and Poesio (2008) make the point that coreference encoding differs from other annotation tasks in that coders do not assign a specific label to each category but create collections of coreferent mentions. Passonneau (2004) proposes using the emerging coreference chains (i.e., entities) as the labels, and recommends the MASI (Measuring Agreement on Set-valued Items) distance metric (Passonneau, 2006) to allow for partial agreement. In our experiment, it turned out that disagreements emerged from different decisions on the link type assigned to a mention rather than on the same mention being assigned to different entities by different coders. As a result, we decided to use two agreement values to separate the two aspects: (a) link type (treating non-coreference as a type), and (b) entity number. The first was measured by Krippendorff’s α, as disagreements are not all alike. The second was measured by kappa, as there was no need for weighted agreement.
To measure link type, the four coreftype links (non-coreference, identity, predicative, discourse deixis) were used as the possible labels that could be assigned to each mention. Passonneau (2004) employs a coder-by-item agreement matrix where the row labels are the items (mentions), the column labels are the coders, and the cell contents indicate the value that a specific coder assigned to a specific item. This kind of matrix was used to enter the results of the experiment (Table 2.6), where a numerical value identifies each link type. Krippendorff’s α was computed with the freely available KALPHA macro written for SPSS (Hayes and Krippendorff, 2007), yielding the following results: α = .85 ([.828,.864] 95% CI) for Text 1, and α = .89 ([.872,.896] 95% CI) for Text 2. Krippendorff’s α ranges between -1 and 1, where 1 signifies perfect agreement and 0 signifies no difference from chance agreement (rather than no agreement).
To measure entity number, a coder-by-item agreement matrix similar to the previous one (Table 2.6) was used, but in this case the row labels only contain the mentions that were linked by an identity or predicative relation,27 and the cells contain the entity number they were assigned. In fact, there was just a single case in which coders disagreed (see (19), below, in Section 2.7.2). Thus, high kappa values were obtained: κ=.98 for Text 1, and κ=1 for Text 2.
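For readers who wish to reproduce the computation outside SPSS, here is a minimal sketch of Krippendorff's α for nominal data over a coder-by-item matrix like Table 2.6. It assumes no missing values and an unweighted (nominal) distance, so it will not exactly replicate the weighted KALPHA figures reported in this section; it only illustrates the coincidence-matrix computation.

```python
from collections import defaultdict
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal, unweighted Krippendorff's alpha.

    `units` is a list of items, each item being the list of labels assigned by the
    coders (no missing values); at least two categories must occur overall.
    """
    o = defaultdict(float)                      # coincidence matrix o[(c, k)]
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(range(m), 2):  # ordered pairs of different coders
            o[(labels[a], labels[b])] += 1.0 / (m - 1)
    n_c = defaultdict(float)                    # marginal totals per category
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    d_o = sum(v for (c, k), v in o.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - d_o / d_e

# A few rows of Table 2.6 (m0-m3 and m10; 1 = non-coreference, 3 = predicative, 4 = identity).
rows = [[1] * 8, [1] * 8, [3] * 8, [1] * 8, [1, 1, 1, 4, 1, 4, 1, 1]]
print(round(krippendorff_alpha_nominal(rows), 3))
```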


            Non-coref     Dx    Pred    Ident
Non-coref      690.71   4.14    7.29    34.86
Dx               4.14   9.71     .00      .14
Pred             7.29    .00   89.14      .57
Ident           34.86    .14     .57   331.43

Table 2.7: Observed coincidence matrix (Text 1)

            Non-coref     Dx    Pred    Ident
Non-coref      446.81   8.50   58.89   222.80
Dx               8.50    .15    1.12     4.23
Pred            58.89   1.12    7.67    29.32
Ident          222.80   4.23   29.32   110.64

Table 2.8: Expected coincidence matrix (Text 1)

2.7.1.6 Discussion

In the observed coincidence matrix (Table 2.7) for link type, the disagreements between observers cluster around the diagonal containing perfect matches. The expected coincidence matrix (Table 2.8) can be interpreted as what would be expected under conditions of chance. The delta matrix (Table 2.9) shows how α weights the coincidences: a mismatch between non-coreference and discourse deixis is less penalized—subtler decision—than one between non-coreference and predicative, while the stiffest penalization is for disagreement between non-coreference and identity, which are the labels at either end of the spectrum.
Even now, according to Artstein and Poesio (2008), it is “the lack of consensus

27Discourse-deictic relations were left out of the quantitative study since coders only received the set of NPs as possible mentions. They had free choice to select the discourse segment antecedents. For the qualitative analysis in this respect, see Section 2.7.2 below.

            Non-coref         Dx       Pred      Ident
Non-coref         .00  141000.25  185761.00  439569.00
Dx          141000.25        .00    3080.25   82656.25
Pred        185761.00    3080.25        .00   53824.00
Ident       439569.00   82656.25   53824.00        .00

Table 2.9: Delta matrix (Text 1)

on how to interpret the values of agreement coefficients” that accounts for “the reluctance of many in Computational Linguistics to embark on reliability studies.” In his work, Krippendorff (1980) suggests α=.8 as a threshold value, which is supported by more recent efforts (Artstein and Poesio, 2005). In both texts, we obtained an α coefficient above .8, which is high enough to claim good reliability as far as the four-way distinction between non-coreference : identity : discourse deixis : predicative is concerned.
Contrary to our expectations, Text 2 yields a higher reliability score, which is possibly due to the different size: Text 1 contains 152 mentions, and Text 2 contains 188 mentions. Even though the second text contains some tricky coreference relations, it also contains many clear cases of non-coreferential mentions, which increase the intercoder agreement. The high alpha results from the fact that the coding guidelines define precisely the relations covered by each link type, thus separating identity from predicative links and ruling out less well-defined relations such as bridging. Likewise, the preference expressed in the annotation manual for excluding any link in case of doubt or—as in cases of only partial identity—accounts for the almost full agreement obtained for entity number. The guidelines discuss how to deal with recurrent non-prototypical cases of coreference, although there will always be new cases not covered by the manual, or obscure to coders, which account for the margin up to full agreement.
The general pattern is that two out of the eight coders (which can already be seen from the agreement matrix, Table 2.6) account for the majority of disagreements, and they do not deviate in the same direction, which provides further support of the validity of the guidelines as most mistakes can be attributed to certain coders’ poorer understanding of the annotation task. If these two outliers are removed and α is recomputed with the other six coders, the results improve up to α = .87 ([.857,.898] 95% CI) for Text 1, and α = .90 ([.882,.913] 95% CI) for Text 2. The remaining disagreements are broken down in the next section.

2.7.2 Sources of disagreement

A reliability study informs about intercoder agreement and also enables disagreements to be analyzed so as to improve data reliability and better understand the linguistic reality. Detecting sources of unreliability provides an insight into weaknesses of the annotation guidelines, the complexity of the linguistic phenomenon under analysis and the aptitude of the coders. After computing the exact reliability agreement, we compared qualitatively the output of the eight coders, going into more detail than with the four-way distinction of the coreftype attribute. We grouped the major sources of disagreement under seven headings.
1. Different metonymic interpretation. Metonymy accounts for the only case of disagreement on entity number, giving rise to two different plausible interpretations. The qualitative analysis uncovered the fact that las dos delegaciones ‘the two delegations’ in (19) can be linked either with the two


spokesmen involved (the head of the Armed Forces of South Africa and general Joao de Matos) or with the two respective countries (South Africa and Angola).

(19) (Sp.) El jefe de las Fuerzas Armadas de Sudáfrica, el general Nyanda, afirmó en su primera visita oficial a Angola que . . . En su visita, el general Nyanda estuvo acompañado por el general Joao de Matos . . . Según fuentes próximas al Ministerio de Defensa, durante las conversaciones entre las dos delegaciones ...
‘The head of the Armed Forces of South Africa, general Nyanda, stated during his first official visit to Angola that . . . On his visit, general Nyanda was accompanied by general Joao de Matos . . . According to sources close to the Ministry of Defence, during the conversations between the two delegations ...’

2. Violations of the maximal NP principle. Three disagreements were caused by the coders’ failure to notice that the reference of an embedded mention (20-b) coincided with the entire NP mention (20-a), thus disagreeing on the mention annotated as coreferent. (20-a) and (20-b) show the two different mentions selected as coreferent with su reinado ‘his reign’ by different coders. It is only the entire NP (20-a) that should be annotated as coreferent since it refers to Juan Carlos I’s reign by its duration, thus coinciding with the element referenced by reinado ‘reign.’

(20) a. (Sp.) los veinticinco años de reinado de Juan Carlos I
‘the twenty-five years of reign of Juan Carlos I’
b. (Sp.) los veinticinco años de reinado de Juan Carlos I
‘the twenty-five years of reign of Juan Carlos I’

3. Idiolinks. Each coder produced at least one link that none of the rest did. They were usually the result of unclear coreference or a bridging relation. In (21) the reference of the two mentions overlaps but is not identical: what the King has promoted is just a part of what the King has done for the country. Even if coders were told not to annotate cases of bridging, it seems it was hard for them to ignore these relations if they saw one.

(21) (Sp.) lo que el Rey ha impulsado ... lo que el Rey ha hecho por el país
‘what the King has promoted ... what the King has done for the country’

4. Referential versus attributive NPs. The divide between referential and attributive mentions turned out to be unclear to two coders, who linked the two attributive NPs in (22).


(22) (Sp.) misión de paz . . . fuerzas de paz
‘peacekeeping mission . . . peacekeeping forces’

5. Discourse deixis. Even though the computation of Krippendorff’s α only took into account whether annotators agreed on the mentions in a discourse-deictic relation (and they did in the four cases found in the test texts), the qualitative analysis revealed that they did not always coincide in the syntactic node of the discourse segment chosen as antecedent. In the following example, half of the coders selected the previous clause (23-a) while the other half selected the entire previous sentence (23-b) as the antecedent of the mention estas declaraciones ‘these declarations.’

(23) a. (Sp.) El jefe de las Fuerzas Armadas de Sudáfrica, el general Nyanda, afirmó en su primera visita oficial a Angola que las Fuerzas Armadas de este país “consiguieron destruir buena parte de las fuerzas convencionales de UNITA”. El general sudafricano hizo estas declaraciones.
‘The head of the Armed Forces of South Africa, general Nyanda, stated on his first official visit to Angola that the Armed Forces of this country “managed to destroy a large part of UNITA’s conventional forces”. The South African general made these declarations.’
b. (Sp.) El jefe de las Fuerzas Armadas de Sudáfrica, el general Nyanda, afirmó en su primera visita oficial a Angola que las Fuerzas Armadas de este país “consiguieron destruir buena parte de las fuerzas convencionales de UNITA”. El general sudafricano hizo estas declaraciones.
‘The head of the Armed Forces of South Africa, general Nyanda, stated on his first official visit to Angola that the Armed Forces of this country “managed to destroy a large part of UNITA’s conventional forces”. The South African general made these declarations.’

6. Missed links. Each coder missed one or two links. The reason for this was either sheer oversight or because s/he did not recognize them as an instance of coreference.

7. Misunderstandings. The two coders that produced the most naïve annotations were misled by cases where two NP heads matched semantically (i.e., same string) but not referentially.

(24) (Sp.) El próximo envío de tropas sudafricanas en el marco de la Misión de la ONU en el vecino Congo . . . el envío de 5.500 cascos azules para la RDC


‘The next dispatch of South-African troops within the framework of the UN Mission in the neighbouring Congo’ . . . ‘the dispatch of 5,500 blue berets for the DRC’

In a nutshell, most of the problems can be attributed to a lack of training (i.e., familiarity with the guidelines) on the part of the coders, as well as oversights or ambiguities left unresolved in the discourse itself. After carrying out the study, it became apparent that the guidelines were clear and adequate, and that, assuming coders go through a period of training, many disagreements that were just a matter of error or misapplication could be resolved through revision. Therefore, we decided that a two-pass procedure was required to annotate the whole corpus: each text was annotated twice by two different coders, thus always revising the links from the first pass and checking for missing ones. The qualitative analysis of the sources of disagreements shows the subtleties of the task of coreference annotation and hence the need for qualified linguists to build a reliable language resource, in line with Kilgarriff (1999).

2.8 Conclusions

We presented the enrichment of the AnCora corpora with coreference information, which heralded the advent of the AnCora-CO corpora. The Spanish and Catalan corpora constitute a language resource that can be used for both studying coreference relations and training automatic coreference resolution systems. The AnCora-CO corpora contain coreference annotations for Spanish and Catalan conjoined with morphological, syntactic and semantic information, thus making it possible to rely on a wide range of learning features to train computational systems. This can be especially helpful for coreference resolution, which is known to be a very challenging task, given that many sources of knowledge come into play. In this respect, AnCora-CO opens new avenues for carrying out research on the way coreference links—both between pronouns and full NPs—are established by language users.
Given the subjectivity of discourse phenomena like coreference, there is a need to understand the linguistic problem so as to produce thorough and useful annotation guidelines (Zaenen, 2006). This was our main guiding principle. The annotation scheme designed to annotate coreference draws on the MATE/GNOME/ARRAU scheme, but restricting it to coreference. Special attention was paid to finding a balance between the hypothetical requirements of a machine-learning coreference resolution system and the way in which the linguistic reality allows itself to be encoded. The key to our approach lies in three central factors. First, relations are split into three kinds: identity of reference, discourse deixis, and predication. Other relations such as bridging are not included in order to keep a consistent definition of coreference. Second, what is meant by “identity of reference” is clarified with the help of real examples to reduce ambiguities to a great extent. The transitivity test is used as an indicator of coreference.

Third, mentions are individually tagged with three attributes containing information (entity reference, homophoric definite description, title) that can be used to group mentions into referential/non-referential, and first/subsequent mentions.
The quality of the scheme was assessed by computing intercoder agreement in a reliability study with eight coders. We used kappa to measure agreement on entity number, and Krippendorff’s alpha to test the reliability of the link type attribute, which is the core of the scheme as it separates non-coreferential from identity, predicative and discourse-deictic mentions. Once a mention was chosen as being coreferent, the choice of entity was widely agreed upon. The high inter-annotator agreement demonstrated the reliability of the annotation, whereas the dissection of the disagreements served to suggest a typology of errors and determine the best procedure to follow. We leave for future work a large-scale reliability study that explores further issues such as the identification of antecedents in discourse deixis.
In order to do the markup, the AnCoraPipe annotation tool was customised to meet our needs. Since the XML format enables the corpora to be easily extended with new annotation levels, AnCora-CO can be further extended to include, for example, coding of nominal argumental structures, discourse markers, etc. In addition, we intend to convert the current in-line annotation to a standoff format. By developing the AnCora-CO corpora we have provided Spanish and Catalan with two new language resources.

Acknowledgements

This work was supported by the FPU Grant (AP2006-00994) from the Spanish Ministry of Education and Science, and the Lang2World (TIN2006-15265-C06-06) and Ancora-Nom (FFI2008-02691-E/FILO) projects. Special thanks to Mariona Taulé for her invaluable advice, Manuel Bertran for customising the AnCoraPipe annotation tool, and the annotators who participated in the development of AnCora-CO and the reliability study: Oriol Borrega, Isabel Briz, Irene Carbó, Sandra García, Iago González, Esther López, Jesús Martínez, Laura Muñoz, Montse Nofre, Lourdes Puiggròs, Lente Van Leeuwen, and Rita Zaragoza. We are indebted to three anonymous reviewers for their comments on earlier versions of this work.


Part II

COREFERENCE RESOLUTION AND EVALUATION


CHAPTER 3

A Deeper Look into Features for Coreference Resolution

Marta Recasens* and Eduard Hovy**

*University of Barcelona   **USC Information Sciences Institute

Published in S. Lalitha Devi, A. Branco, and R. Mitkov (eds.), Anaphora Processing and Applications (DAARC 2009), LNAI 5847:29-42, Springer-Verlag, Berlin Heidelberg

Abstract

All automated coreference resolution systems consider a number of features, such as head noun, NP type, gender, or number. Although the particular features used are one of the key factors determining performance, they have not received much attention, especially for languages other than English. This paper delves into a considerable number of pairwise comparison features for coreference, including old and novel features, with a special focus on the Spanish language. We consider the contribution of each of the features as well as the interaction between them. In addition, given the problem of class imbalance in coreference resolution, we analyze the effect of sample selection. From the experiments with TiMBL (Tilburg Memory-Based Learner) on the AnCora corpus, interesting conclusions are drawn from both linguistic and computational perspectives.

Keywords Coreference resolution · Machine learning · Features

3.1 Introduction

Coreference resolution, the task of identifying which mentions in a text point to the same discourse entity, has been shown to be beneficial in many NLP applications

such as Information Extraction (McCarthy and Lehnert, 1995), Text Summarization (Steinberger et al., 2007), Question Answering (Morton, 1999), and Machine Translation. These systems need to identify the different pieces of information concerning the same referent, produce coherent and fluent summaries, disambiguate the references to an entity, and resolve anaphoric pronouns.
Given that many different types of information—ranging from morphology to pragmatics—play a role in coreference resolution, machine learning approaches (Soon et al., 2001; Ng and Cardie, 2002b) seem to be a promising way to combine and weigh the relevant factors, overcoming the limitations of constraint-based approaches (Lappin and Leass, 1994; Mitkov, 1998), which might fail to capture global patterns of coreference relations as they occur in real data. Learning-based approaches decompose the task of coreference resolution into two steps: (i) classification, in which a classifier is trained on a corpus to learn the probability that a pair of NPs are coreferent or not; and (ii) clustering, in which the pairwise links identified at the first stage are merged to form distinct coreference chains.
This paper focuses on the classification stage and, in particular, on (i) the features that are used to build the feature vector that represents a pair of mentions,1 and (ii) the selection of positive and negative training instances. The choice of the information encoded in the feature vectors is of utmost importance as they are the basis on which the machine learning algorithm learns the pairwise coreference model. Likewise, given the highly skewed distribution of coreferent vs. non-coreferent classes, we will consider whether sample selection is helpful. The more accurate the classification is, the more accurate the clustering will be.
The goal of this paper is to provide an in-depth study of the pairwise comparison stage in order to decrease as much as possible the number of errors that are passed on to the second stage of coreference resolution. Although there have been some studies in this respect (Uryupina, 2007; Bengtson and Roth, 2008; Hoste, 2005), they are few, oriented to English or Dutch, and dependent on poorly annotated corpora. To our knowledge, no previous study has systematically compared a large number of features relying on gold-standard corpora, and experiments with sample selection have only been based on small corpora. For the first time, we consider the degree of variance of the learnt model on new data sets by reporting confidence intervals for precision, recall, and F-score measures.
The paper is organized as follows. In the next section, we review previous work. In Section 3.3, we list our set of 47 features and discuss the linguistic motivations behind them. These features are tested by carrying out different machine learning experiments with TiMBL in Section 3.4, where the effect of sample selection is also assessed. Finally, main conclusions are drawn in Section 3.5.

1This paper is restricted to computing features over a pair of mentions—without considering a more global approach—hence pairwise comparison features.
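A schematic rendering of this two-step decomposition may help fix ideas. In the sketch below, the mention records, the featurize function and the closest-first linking rule are assumptions made for illustration, and LogisticRegression is merely a stand-in for the memory-based learner (TiMBL) used in the experiments.

```python
from itertools import combinations
from sklearn.linear_model import LogisticRegression  # stand-in for the TiMBL learner

def train_pairwise(training_docs, featurize):
    """Step (i): learn a pairwise coreference classifier.

    `training_docs` is a hypothetical list of documents, each a list of mention dicts
    in textual order carrying a gold entity id; `featurize` maps a mention pair to a
    numeric vector (e.g., the features of Section 3.3). All pairs are used here for
    simplicity; sample selection is discussed in Section 3.4.
    """
    X, y = [], []
    for mentions in training_docs:
        for m1, m2 in combinations(mentions, 2):
            X.append(featurize(m1, m2))
            y.append(int(m1["entity"] == m2["entity"]))
    return LogisticRegression(max_iter=1000).fit(X, y)

def resolve(mentions, featurize, classifier):
    """Step (ii): closest-first clustering of the pairwise decisions.

    Each mention is linked to the nearest preceding mention classified as coreferent;
    otherwise it starts a new chain.
    """
    chain_of = {}
    for j in range(len(mentions)):
        chain_of[j] = j
        for i in range(j - 1, -1, -1):
            if classifier.predict([featurize(mentions[i], mentions[j])])[0] == 1:
                chain_of[j] = chain_of[i]
                break
    return chain_of  # mention index -> chain id
```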


3.2 Previous work

Be it in the form of hand-crafted heuristics or feature vectors, what kind of knowledge is represented is a key factor for the success of coreference resolution. Although theoretical studies point out numerous linguistic factors relevant for the task, computational systems usually rely on a small number of shallow features, especially after the burst of statistical approaches. In learning-based approaches, the relative importance of the factors is not manually coded but inferred automatically from an annotated corpus. Training instances for machine learning systems are feature vectors representing two mentions (m1 and m2) and a label (“coreferent” or “non-coreferent”) allowing the classifier to learn to predict, given a new pair of NPs, whether they do or do not corefer.
The feature set representing m1 and m2 that was employed in the decision tree learning algorithm of Soon et al. (2001) has been taken as a starting point by most subsequent systems. It consists of only 12 surface-level features (all boolean except for the first): (i) sentence distance, (ii) m1 is a pronoun, (iii) m2 is a pronoun, (iv) string match (after discarding determiners), (v) m2 is a definite NP, (vi) m2 is a demonstrative NP, (vii) number agreement, (viii) WordNet semantic class agreement,2 (ix) gender agreement, (x) both m1 and m2 are proper nouns (capitalized), (xi) m1 is an alias of m2 or vice versa, and (xii) m1 is an apposition to m2. The strongest indicators of coreference turned out to be string match, alias, and appositive.
Ng and Cardie (2002b) expanded the feature set of Soon et al. (2001) from 12 to a deeper set of 53, including a broader range of lexical, grammatical, and semantic features such as substring match, comparison of the prenominal modifiers of both mentions, animacy match, WordNet distance, whether one or both mentions are pronouns, definite, embedded, part of a quoted string, subject function, and so on. The incorporation of additional knowledge succeeds at improving performance but only after manual feature selection, which points out the importance of removing irrelevant features that might be misleading. Surprisingly, however, some of the features in the hand-selected feature set do not seem very relevant from a linguistic point of view, like string match for pronominal mentions.
More recent attempts have explored some additional features to further enrich the set of Ng and Cardie (2002b): backward features describing the antecedent of the candidate antecedent (Yang et al., 2004), semantic information from Wikipedia, WordNet and semantic roles (Ponzetto and Strube, 2006), and most notably, Uryupina’s (2007) thesis, which investigates the possibility of incorporating sophisticated linguistic knowledge into a data-driven coreference resolution system trained on the MUC-7 corpus. Her extension of the feature set up to a total of 351 nominal features (1096 boolean/continuous) leads to a consistent improvement in the system’s performance, thus supporting the hypothesis that complex linguistic

2Possible semantic classes for an NP are female, male, person, organization, location, date, time, money, percent, and object.

factors of NPs are a valuable source of information. At the same time, however, Uryupina (2007) recognizes that by focusing on the addition of sophisticated features she overlooked the resolution strategy and some phenomena might be overrepresented in her feature set.
Bengtson and Roth (2008) show that with a high-quality set of features, a simple pairwise model can outperform systems built with complex models on the ACE dataset. This clearly supports our stress on paying close attention to designing a strong, linguistically motivated set of features, which requires a detailed analysis of each feature individually as well as of the interaction between them. Some of the features we include, like modifiers match, are also tested by Bengtson and Roth (2008) and, interestingly, our ablation study comes to the same conclusion: almost all the features help, although some more than others.
Hoste’s (2005) work is concerned with optimization issues such as feature and sample selection, and she stresses their effect on classifier performance. The study we present is in line with Uryupina (2007), Bengtson and Roth (2008) and Hoste (2005), but introduces a number of novelties. First, the object language is Spanish, which presents some differences as far as coreference is concerned. Second, we use a different corpus, AnCora, which is twenty times as large as MUC and, unlike ACE, it includes a non-restricted set of entity types. Third, the coreference annotation of the AnCora corpus sticks to a linguistic definition of the identity relationship more accurate than that behind the MUC or ACE guidelines. Fourth, we do not rely on the (far from perfect) output of preprocessing modules but take advantage of the gold standard annotations in the AnCora corpus in order to focus on their real effect on coreference resolution.
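To make the 12-feature representation of Soon et al. (2001) concrete, here is a hedged sketch over hypothetical mention dictionaries; the attribute names and the two helper tests are illustrative simplifications, not the original implementation.

```python
DETERMINERS = {"el", "la", "los", "las", "un", "una", "the", "a", "an"}

def strip_determiners(text):
    return " ".join(w for w in text.lower().split() if w not in DETERMINERS)

def is_alias(m1, m2):
    # Crude acronym test over capitalized words, purely illustrative.
    acronym = "".join(w[0] for w in m1["text"].split() if w[:1].isupper())
    return m2["text"].replace(".", "") == acronym

def soon_features(m1, m2):
    """The 12 surface-level features of Soon et al. (2001), in the order listed above."""
    return [
        m2["sent"] - m1["sent"],                                         # (i) sentence distance
        m1["is_pronoun"],                                                # (ii)
        m2["is_pronoun"],                                                # (iii)
        strip_determiners(m1["text"]) == strip_determiners(m2["text"]),  # (iv) string match
        m2["is_definite"],                                               # (v)
        m2["is_demonstrative"],                                          # (vi)
        m1["number"] == m2["number"],                                    # (vii) number agreement
        m1["semclass"] == m2["semclass"],                                # (viii) semantic class agreement
        m1["gender"] == m2["gender"],                                    # (ix) gender agreement
        m1["is_proper"] and m2["is_proper"],                             # (x) both proper nouns
        is_alias(m1, m2) or is_alias(m2, m1),                            # (xi) alias
        m1.get("appositive_of") == m2["id"],                             # (xii) m1 is an apposition to m2
    ]
```

A pair vector of this kind, labelled coreferent or non-coreferent, is exactly the kind of training instance described above.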

3.3 Pairwise comparison features

The success of machine learning systems depends largely on the feature set employed. Learning algorithms need to be provided with an adequate representation of the data, that is to say, a representation that includes the “relevant” information, to infer the best model from an annotated corpus. Identifying the constraints on when two NPs can corefer is a complex linguistic problem that still remains open. Hence, there is a necessity for an in-depth study of features for coreference resolution from both a computational and a linguistic perspective. This section makes a contribution in this respect by considering a total of 47 features, making explicit the rationale behind them.

• Classical features (Table 3.1). The features that have been shown to obtain better results in previous works (Soon et al., 2001; Ng and Cardie, 2002b; Luo et al., 2004) capture the most basic information on which coreference depends, but form a reduced feature set that does not account for all kinds of coreference relations.

– PRON_m1 and PRON_m2 specify whether the mentions are pronouns


Feature          Definition              Value
PRON_m1          m1 is a pronoun         true, false
PRON_m2          m2 is a pronoun         true, false
HEAD_MATCH       Head match              true, false, ?(a)
WORDNET_MATCH    EuroWordNet match       true, false, ?(a)
NP_m1            m1 NP type              common, proper, article, indefinite, possessive, relative,
                                         demonstrative, numeral, interrogative, personal, exclamative
NP_m2            m2 NP type              common, proper, article, indefinite, possessive, relative,
                                         demonstrative, numeral, interrogative, personal, exclamative
NE_m1            m1 NE type              person, organization, location, date, number, other, null
NE_m2            m2 NE type              person, organization, location, date, number, other, null
NE_MATCH         NE match                true, false, ?(b)
SUPERTYPE_MATCH  Supertype match         true, false, ?(a)
GENDER_AGR       Gender agreement        true, false
NUMBER_AGR       Number agreement        true, false
ACRONYM          m2 is an acronym of m1  true, false, ?(c)
QUOTES           m2 is in quotes         true, false
FUNCTION_m1      m1 function             subject, d-obj, i-obj, adjunct, prep-obj, attribute,
                                         pred-comp, agent, sent-adjunct, no function
FUNCTION_m2      m2 function             subject, d-obj, i-obj, adjunct, prep-obj, attribute,
                                         pred-comp, agent, sent-adjunct, no function
COUNT_m1         m1 count                #times m1 appears in the text
COUNT_m2         m2 count                #times m2 appears in the text
SENT_DIST        Sentence distance       #sentences between m1 and m2
MENTION_DIST     Mention distance        #NPs between m1 and m2
WORD_DIST        Word distance           #words between m1 and m2

a Not applicable. This feature is only applicable if neither m1 nor m2 are pronominal or conjoined.
b Not applicable. This feature is only applicable if both mentions are NEs.
c Not applicable. This feature is only applicable if m2 is an acronym.

Table 3.1: Classical features


Feature          Definition                                  Value
ELLIP_m1         m1 is an elliptical pronoun                 true, false
ELLIP_m2         m2 is an elliptical pronoun                 true, false
GENDER_PRON      Gender agreement restricted to pronouns     true, false, ?
GENDER_MASCFEM   Gender agreement restricted to masc./fem.   true, false, ?
GENDER_PERSON    Gender agreement restricted to persons      true, false, ?
ATTRIBa_m1       m1 is attributive type A                    true, false
ATTRIBa_m2       m2 is attributive type A                    true, false
ATTRIBb_m1       m1 is attributive type B                    true, false
ATTRIBb_m2       m2 is attributive type B                    true, false

Table 3.2: Language-specific features

Feature        Definition                  Value
NOMPRED_m1     m1 is a nominal predicate   true, false
NOMPRED_m2     m2 is a nominal predicate   true, false
APPOS_m1       m1 is an apposition         true, false
APPOS_m2       m2 is an apposition         true, false
PRONTYPE_m1    m1 pronoun type             elliptical, 3-person, non-3-person, demonstrative,
                                           possessive, indefinite, numeric, other, ?
PRONTYPE_m2    m2 pronoun type             elliptical, 3-person, non-3-person, demonstrative,
                                           possessive, indefinite, numeric, other, ?
EMBEDDED       m2 is embedded in m1        true, false
MODIF_m1       m1 has modifiers            true, false
MODIF_m2       m2 has modifiers            true, false

Table 3.3: Corpus-specific features

Feature          Definition                                Value
FUNCTION_TRANS   Function transition                       100 different values (e.g., subject_subject, subject_d-obj)
COUNTER_MATCH    Counter match                             true, false, ?
MODIF_MATCH      Modifiers match                           true, false, ?
VERB_MATCH       Verb match                                true, false, ?
NUMBER_PRON      Number agreement restricted to pronouns   true, false, ?
TREE-DEPTH_m1    m1 parse tree depth                       #nodes in the parse tree from m1 up to the top
TREE-DEPTH_m2    m2 parse tree depth                       #nodes in the parse tree from m2 up to the top
DOC_LENGTH       Document length                           #tokens in the document

Table 3.4: Novel features


since these show different patterns of coreference, e.g., gender agreement is of utmost importance for pronouns but might be violated by non-pronouns (Hoste, 2005).
– HEAD_MATCH is the top classical feature for coreference, since lexical repetition is a common coreference device.
– WORDNET_MATCH uses the Spanish EuroWordNet3 and is true if any of the synset’s synonyms of one mention matches any of the synset’s synonyms of the other mention.
– NP type plays an important role because not all NP types have the same capability to introduce an entity into the text for the first time, and not all NP types have the same capability to refer to a previous mention in the text.
– The fact that in newspaper texts there is usually at least one person and a location about which something is said accounts for the relevance of the NE type feature, since NE types like person and organization are more likely to corefer and be coreferred than others.
– SUPERTYPE_MATCH compares the first hypernym of each mention found in EuroWordNet.
– As a consequence of the key role played by gender and number in anaphora resolution, GENDER_AGR and NUMBER_AGR have been inherited by coreference systems. See below, however, for finer distinctions.
– The rationale behind QUOTES is that a mention in quotes identifies a mention that is part of direct speech, e.g., if it is a first- or second-person pronoun, its antecedent will be found in the immediate discourse.
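Since EuroWordNet offers no standard Python interface, the two synset-based comparisons above (WORDNET_MATCH and SUPERTYPE_MATCH) can only be sketched against hypothetical lookup tables; the dictionaries passed in below stand in for the EuroWordNet synonym and hypernym queries.

```python
def wordnet_match(m1, m2, synonyms):
    """True if any synset synonym of m1's head matches any synset synonym of m2's head.

    `synonyms` is a hypothetical lookup table: head lemma -> set of synset synonyms.
    """
    s1 = synonyms.get(m1["head"], {m1["head"]})
    s2 = synonyms.get(m2["head"], {m2["head"]})
    return bool(s1 & s2)

def supertype_match(m1, m2, hypernym):
    """True if the first EuroWordNet hypernym of the two heads coincides.

    `hypernym` is a hypothetical lookup table: head lemma -> immediate hypernym lemma.
    """
    h1, h2 = hypernym.get(m1["head"]), hypernym.get(m2["head"])
    return h1 is not None and h1 == h2
```

For instance, with a synonym table that maps incremento ‘the increase’ to a synset containing subida ‘the rise’, the two mentions discussed below would match despite their different grammatical gender.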

• Language-specific features (Table 3.2). There are some language-specific issues that have a direct effect on the way coreference relations occur in a language. In the case of Spanish, we need to take into account elliptical subjects, grammatical gender, and nouns used attributively.

– There is a need to identify elliptical pronouns in Spanish because, unlike overt pronouns, they get their number from the verb, have no gender, and always appear in subject position, as shown in (1), where the elliptical subject pronoun is marked with ø and with the corresponding pronoun in brackets in the English translation.

(1) Klebánov manifestó que ø no puede garantizar el éxito al cien por cien.
‘Klebánov stated that (he) cannot guarantee 100% success.’

3Nominal synsets are part of the semantic annotation of AnCora. EuroWordNet covers 55% of the nouns in the corpus.


– Since Spanish has grammatical gender, two non-pronominal nouns with different gender might still corefer, e.g., el incremento ‘the increase’ (masc.) and la subida ‘the rise’ (fem.). Gender agreement is an appropriate constraint only for pronouns.
– GENDER_MASCFEM does not consider those NPs that are not marked for gender (e.g., elliptical pronouns, companies).
– GENDER_PERSON separates natural from grammatical gender by only comparing the gender if one of the mentions is an NE-person.4
– Attributive NPs5 are non-referential, hence non-markables. ATTRIBa and ATTRIBb identify two Spanish constructions where these NPs usually occur:
Type A. Common, singular NPs following the preposition de ‘of’, e.g., educación ‘education’ in sistema de educación ‘education system.’
Type B. Proper nouns immediately following a generic name, e.g., Mayor ‘Main’ in calle Mayor ‘Main Street’.
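A sketch of the three gender variants just listed, again over hypothetical mention dictionaries; None stands for the “?” (not applicable) value in Table 3.2, and the exact applicability conditions are assumptions based on the descriptions above.

```python
def gender_pron(m1, m2):
    """Gender agreement checked only when at least one mention is a pronoun
    (an assumption about the exact restriction)."""
    if not (m1["is_pronoun"] or m2["is_pronoun"]):
        return None
    return m1["gender"] == m2["gender"]

def gender_mascfem(m1, m2):
    """Gender agreement checked only for NPs overtly marked masculine or feminine,
    skipping, e.g., elliptical pronouns and company names."""
    if m1["gender"] not in ("masc", "fem") or m2["gender"] not in ("masc", "fem"):
        return None
    return m1["gender"] == m2["gender"]

def gender_person(m1, m2):
    """Natural-gender agreement: only compared when one mention is an NE of type person."""
    if "person" not in (m1["ne"], m2["ne"]):
        return None
    return m1["gender"] == m2["gender"]
```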

• Corpus-specific features (Table 3.3). The definition of coreference in the AnCora corpus differs from that of the MUC and ACE corpora in that it separates identity from other kinds of relation such as apposition, predication, or bound anaphora. This is in line with van Deemter and Kibble’s (2000) criticism of MUC. Predicative and attributive NPs do not have a referential function but an attributive one, qualifying an already introduced entity. They should not be allowed to corefer with other NPs. Consequently, the use we make of nominal-predicate and appositive features is the opposite to that made by systems trained on the MUC or ACE corpora (Soon et al., 2001; Luo et al., 2004). Besides, the fact that AnCora contains gold standard annotation from the morphological to the semantic levels makes it possible to include additional features that rely on such rich information.

– We employ NOMPRED to filter out predicative mentions.

– We employ APPOS to filter out attributively used mentions.

– Gold standard syntactic annotation makes it possible to assess the efficacy of the EMBEDDED and MODIF features in isolation from any other source of error. First, a nested NP cannot corefer with the embedding one. Second, depending on the position a mention occupies in the coreference chain, it is more or less likely that it is modified.

• Novel features (Table 3.4). We suggest some novel features that we believe relevant and that the rich annotation of AnCora enables.

4Animals are not included since they are not explicitly identified as NEs. 5Attributively used NPs qualify another noun.


– FUNCTION_TRANS is included because although FUNCTION_m1 and FUNCTION_m2 already encode the function of each mention separately, there may be information in their joint behaviour.6 E.g., subject_subject can be relevant since two consecutive subjects are likely to corefer:

(2) [...] explicó Alonso, quien anunció la voluntad de Telefónica Media de unirse a grandes productoras iberoamericanas. Por otra parte, Alonso justificó el aplazamiento.
‘[...] explained Alonso, who announced the will of Telefónica Media to join large Latin American production companies. On the other hand, Alonso justified the postponement.’

– COUNTER_MATCH prevents two mentions that contain different numerals from coreferring (e.g., 134 millones de euros ‘134 million euros’ and 194 millones de euros ‘194 million euros’), as they point to a different number of referents.

– Modifiers introduce extra information that might imply a change in the referential scope of a mention (e.g., las elecciones generales ‘the general elections’ and las elecciones autonómicas ‘the regional elections’). Thus, when both mentions are modified, the synonyms and immediate hypernym of the head of each modifying phrase are extracted from EuroWordNet for each mention. MODIF_MATCH is true if one of them matches between the two mentions.

– The verb, as the head of the sentence, imposes restrictions on its arguments. In (3), the verb participate selects for a volitional agent, and the fact that the two subjects complement the same verb hints at their coreference link. VERB_MATCH is true if either the two verbal lemmas or any synonym or immediate hypernym from EuroWordNet match.

(3) Un centenar de artistas participará en el acto [...] el acto se abrirá con un brindis en el que participarán todos los protagonistas de la velada.
‘One hundred artists will participate in the ceremony [...] the ceremony will open with a toast in which all the protagonists of the evening gathering will participate.’

– NUMBER_PRON is included since non-pronominal mentions that disagree in number might still corefer.

– DOC_LENGTH can be helpful since the longer the document, the more coreferent mentions, and a wider range of patterns might be allowed.

A small sketch of how such pairwise features can be computed is given below.

6The idea of including conjoined features is also exploited by Bengtson and Roth (2008) and Luo et al. (2004).
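To make the pairwise setting concrete, the following Python sketch computes three of the features above (HEAD_MATCH, WORDNET_MATCH and COUNTER_MATCH) over a toy mention representation. The dictionary-based mention format and the synonym sets are illustrative placeholders, not the actual AnCora or EuroWordNet interfaces.

import re

# Illustrative pairwise features over a toy mention representation.
# Each mention is a dict with its surface string, head lemma, and the synonym
# set of its head synset (a stand-in for the EuroWordNet synsets in AnCora).

def head_match(m1, m2):
    return m1["head"].lower() == m2["head"].lower()

def wordnet_match(m1, m2):
    syns1 = m1["synonyms"] | {m1["head"]}
    syns2 = m2["synonyms"] | {m2["head"]}
    return bool(syns1 & syns2)

def counter_match(m1, m2):
    """False if the mentions contain different numerals; '?' if either has none."""
    nums1 = set(re.findall(r"\d+", m1["text"]))
    nums2 = set(re.findall(r"\d+", m2["text"]))
    if not nums1 or not nums2:
        return "?"
    return nums1 == nums2

m1 = {"text": "el incremento", "head": "incremento", "synonyms": {"subida", "aumento"}}
m2 = {"text": "la subida", "head": "subida", "synonyms": {"incremento", "aumento"}}
print(head_match(m1, m2), wordnet_match(m1, m2), counter_match(m1, m2))
# -> False True ?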


                    Training set   Test set
# Words                  298,974     23,022
# Entities                64,421      4,893
# Mentions                88,875      6,759
  # NEs                   25,758      2,023
  # Nominals              53,158      4,006
  # Pronominals            9,959        730

Table 3.5: Characteristics of the AnCora-Es datasets

3.4 Experimental evaluation

This section describes our experiments with the features presented in Section 3.3 as well as with different compositions of the training and test data sets. We finally assess the reliability of the most appropriate pairwise comparison model.

Data. The experiments are based on the AnCora-Es corpus (Recasens and Martí, 2010), a corpus of newspaper and newswire articles. It is the largest Spanish corpus annotated, among other levels of linguistic information, with PoS tags, syntactic constituents and functions, named entities, nominal WordNet synsets, and coreference links.7 We randomly split the freely available labelled data into a training set of 300k words and a test set of 23k words. See Table 3.5 for a description.

Learning algorithm. We use TiMBL, the Tilburg memory-based learning classifier (Daelemans and Bosch, 2005), which is a descendant of the k-nearest neighbor approach. It is based on analogical reasoning: the behavior of new instances is predicted by extrapolating from the similarity between (old) stored representations and the new instances. This makes TiMBL particularly appropriate for training a coreference resolution model, as the feature space tends to be very sparse and it is very hard to find universal rules that work all the time. In addition, TiMBL outputs the information gain of each feature—very useful for studies on feature selection—and allows the user to easily experiment with different feature sets by ignoring specified features. Given that the training stage is done without abstraction but by simply storing training instances in memory, it is considerably faster than other machine learning algorithms. We select parameters to optimize TiMBL on a held-out development set. The distance metric parameter is set to overlap, and the number of nearest neighbors (k parameter) is set to 5 in Section 3.4.1, and to 1 in Section 3.4.2.8

7AnCora is freely available from http://clic.ub.edu/corpus/en/ancora. 8When training the model on the full feature vectors, the best results are obtained when TiMBL uses 5 nearest neighbors for extrapolation. However, because of the strong skew in the class space, in some of the hill-climbing experiments we can only use 1 nearest neighbor. Otherwise, with 5 neighbors the majority of neighbors are of the negative class for all the test cases, and the positive class is never predicted (recall=0).
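For readers unfamiliar with memory-based learning, the sketch below re-implements its core idea—store all training instances and classify by majority vote among the k nearest neighbours under the overlap metric. It is only a toy illustration of the principle, not TiMBL’s actual implementation (which adds feature weighting, among other refinements).

from collections import Counter

# Minimal sketch of memory-based classification with the overlap metric.

def overlap_distance(x, y):
    """Number of feature positions on which two instances disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)

def knn_classify(instance, memory, k=5):
    """memory is a list of (feature_vector, label) pairs stored at training time."""
    neighbours = sorted(memory, key=lambda ex: overlap_distance(instance, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy usage: three stored training instances, one new mention pair to classify.
memory = [((True, "subject", True), "coref"),
          ((False, "d-obj", True), "non-coref"),
          ((True, "subject", False), "coref")]
print(knn_classify((True, "d-obj", True), memory, k=1))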


                              Training set                      Test set
                     Representative    Balanced    Representative    Balanced
Positive instances         105,920      105,920           8,234        8,234
Negative instances         425,942      123,335          32,369        9,399

Table 3.6: Distribution of representative and balanced data sets

            Training set      Test set              P       R       F
Model A     Representative    Representative    84.73   73.44   78.68
Model B     Representative    Balanced          88.43   73.44   80.24
Model C     Balanced          Representative    66.28   80.24   72.60
Model D     Balanced          Balanced          83.46   87.32   85.34

Table 3.7: Effect of sample selection on performance

3.4.1 Sample selection

When creating the training instances, we run into the problem of class imbalance: there are many more negative examples than positive ones. Positive training instances are created by pairing each coreferent NP with all preceding mentions in the same coreference chain. If we generate negative examples for all the preceding non-coreferent mentions, which would conform to the real distribution, then the number of positive instances is only about 7% (Hoste, 2005). In order to reduce the vast number of negative instances, previous approaches usually take only those mentions between two coreferent mentions, or they limit the number of previous sentences from which negative mentions are taken. Negative instances have so far been created only for those mentions that are coreferent. In a real task, however, the system must decide on the coreferentiality of all mentions.

In order to investigate the impact of keeping the highly skewed class distribution in the training set, we create two versions for each data set: a representative one, which approximates the natural class distribution, and a balanced one, which results from down-sampling negative examples. The total number of negatives is limited by taking only 5 non-coreferent mentions randomly selected among the previous mentions (back to the beginning of the document). The difference is that in the balanced sample, non-coreferent mentions are selected for each coreferent mention, whereas in the representative sample they are selected for all mentions in the document. See Table 3.6 for statistics of the training and test sets.

Combining each training data set with each test set gives four possible combinations (Table 3.7) and we compute the performance of each of the models. The output of the experiments is evaluated in terms of precision (P), recall (R) and F-score (F). Although the best performance is obtained when testing the model on the balanced sample (models B and D), making a balanced test set involves knowledge about the different classes in the test set, which is not available in non-experimental situations. Therefore, being realistic, we must carry out the evaluation on a data

set that follows the natural class distribution. We focus our attention on models A and C. Down-sampling on the training set increases R but at the cost of too dramatic a decrease in P. Because of the smaller number of negative instances in the training data, it is more likely for an instance to be classified as positive, which harms P and F. As observed by Hoste (2005), we can conclude that down-sampling does not lead to an improvement with TiMBL, and so we opt for using model A.
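The following Python sketch illustrates the instance-creation scheme just described: positives pair each coreferent mention with every earlier mention of its entity, and negatives are down-sampled to at most five randomly chosen earlier non-coreferent mentions. The flat entity-id list used to represent a document is a simplification introduced here for illustration only.

import random

# Sketch of the instance creation described above. `entity_of[i]` is the entity
# id of mention i (singletons get their own id). Positives pair a mention with
# every earlier mention of its entity; negatives are down-sampled to at most
# `n_neg` earlier non-coreferent mentions. In the balanced setting, negatives
# are generated only for mentions that corefer with an earlier mention.

def make_instances(entity_of, n_neg=5, balanced=False):
    positives, negatives = [], []
    for j in range(len(entity_of)):
        earlier = list(range(j))
        antecedents = [i for i in earlier if entity_of[i] == entity_of[j]]
        positives += [(i, j) for i in antecedents]
        if balanced and not antecedents:
            continue                      # balanced sample: skip non-coreferent mentions
        others = [i for i in earlier if entity_of[i] != entity_of[j]]
        negatives += [(i, j) for i in random.sample(others, min(n_neg, len(others)))]
    return positives, negatives

# Toy usage: five mentions, where mentions 0, 2 and 4 belong to the same entity.
pos, neg = make_instances([0, 1, 0, 2, 0], balanced=True)
print(pos)    # [(0, 2), (0, 4), (2, 4)]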

3.4.2 Feature selection

This section considers the informativeness of the features presented in Section 3.3. We carry out two different feature selection experiments: (i) an ablation study, and (ii) a hill-climbing forward selection.

In the first experiment, we test each feature by running TiMBL on different subsets of the 47 features, each time removing a different one. The majority of features have low informativeness, as no single feature brings about a statistically significant loss in performance when omitted.9 Even the removal of HEAD_MATCH, which is reported in the literature as one of the key features in coreference resolution, causes a statistically non-significant decrease of .15 in F. We conclude that some other features together learn what HEAD_MATCH learns on its own. Features that individually make no contribution are ones that filter referentiality, of the kind ATTRIBb_m2, and ones characterising m1, such as PRON_m1. Finally, some features, in particular the distance and numeric measures, seem even to harm performance. However, there is a complex interaction between the different features. If we train a model that omits all features that seem irrelevant and harmful at the individual level, then performance on the test set decreases. This is in line with the ablation study performed by Bengtson and Roth (2008), who conclude that all features help, although some more than others.

Forward selection is a greedy approach that consists of incrementally adding new features—one at a time—and eliminating a feature whenever it causes a drop in performance. Features are chosen for inclusion according to their information gain values, as produced by TiMBL, most informative earliest. Table 3.8 shows the results of the selection process. In the first row, the model is trained on a single (the most informative) feature. From there on, one additional feature is added in each row; an initial “-” marks the harmful features that are discarded (those producing a statistically significant decrease in either P or R, and in F). P and R scores that represent statistically significant gains and drops with respect to the previous feature vector are marked with an asterisk (*) and a dagger (†), respectively. Although the F-score keeps rising steadily in general terms, informative features with a statistically significant improvement in P are usually accompanied by a significant decrease in R, and vice versa.

The results show several interesting tendencies. Although HEAD_MATCH is

9Statistical significance is tested with a one-way ANOVA followed by a Tukey’s post-hoc test.


Feature vector            P        R        F          Feature vector            P        R        F
HEAD_MATCH              92.94    17.43    29.35        COUNTER_MATCH           81.76    63.64    71.57
PRON_m2                 57.58†   61.14*   59.30        MODIF_m1                81.08    64.67    71.95
ELLIP_m2                65.22*   53.04†   58.50        PRONTYPE_m1             81.70    64.84    72.30
-ELLIP_m1               89.74*   34.09†   49.41        GENDER_AGR              81.60    65.12    72.44
WORDNET_MATCH           65.22    53.04    58.50        NOMPRED_m1              81.89    65.04    72.50
NE_MATCH                65.22    53.04    58.50        GENDER_PERSON           87.95*   64.78    74.61
-PRON_m1                86.73*   38.74†   53.56        FUNCTION_m2             87.06    65.96    75.06
NUMBER_PRON             69.04*   58.20*   63.16        FUNCTION_m1             85.88†   69.82*   77.02
-GENDER_PRON            86.64*   37.39†   52.24        QUOTES                  85.83    70.11    77.18
VERB_MATCH              80.31*   55.53†   65.66        COUNT_m2                85.62    70.73    77.47
SUPERTYPE_MATCH         80.22    55.56    65.65        COUNT_m1                84.57    71.35    77.40
MODIF_m2                78.18    61.68*   68.96        NE_m1                   83.82    72.48    77.74
NUMBER_AGR              79.94    61.81    69.71        ACRONYM                 83.99    72.46    77.80
ATTRIBb_m2              80.08    61.85    69.80        NE_m2                   83.48    73.14    77.97
ATTRIBa_m2              80.14    61.84    69.81        NP_m2                   82.81    73.55    77.91
ATTRIBa_m1              80.22    61.83    69.84        NP_m1                   82.27    74.05    77.94
ATTRIBb_m1              80.23    61.82    69.83        FUNCTION_TRANS          82.29    73.94    77.89
EMBEDDED                80.33    61.78    69.84        TREE-DEPTH_m2           80.54    72.98    76.57
GENDER_MASCFEM          81.33    62.96    70.98        -TREE-DEPTH_m1          78.25†   72.52    75.27
APPOS_m1                81.46    62.96    71.02        -SENT_DIST              78.17†   72.16    75.05
APPOS_m2                81.44    62.95    71.01        -DOC_LENGTH             79.36*   70.36†   74.79
MODIF_MATCH             81.35    63.10    71.08        MENTION_DIST            79.52    72.10    75.63
NOMPRED_m2              81.38    63.37    71.26        WORD_DIST               79.14    71.73    75.25
PRONTYPE_m2             81.70    63.59    71.52

Table 3.8: Results of the forward selection procedure (features are added in order down the left-hand column and then down the right-hand column)

the most relevant feature, it obtains a very low R, as it cannot handle coreference relationships involving pronouns or relations between full NPs that do not share the same head. Therefore, when PRON_m2 is added, R is highly boosted. With only these two features, P, R and F reach scores near the 60s. The rest of the features make a small—yet important in sum—contribution. Most of the features have a beneficial effect on performance, which provides evidence for the value of building a feature vector that includes linguistically motivated features. This includes some of the novel features we argue for, such as NUMBER_PRON and VERB_MATCH. Surprisingly, distance features seem to be harmful. However, if we retrain the full model with the k parameter set to 5 and leave out the numeric features, F does not increase but goes down. Again, the complex interaction between the features is manifested.
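The hill-climbing forward selection used for Table 3.8 can be outlined as in the sketch below. The `train_and_eval` callback stands in for training TiMBL on the selected feature columns and returning an F-score, and the statistical significance test is reduced to a simple threshold, so this is a schematic outline rather than the exact procedure.

# Schematic outline of greedy forward selection. Features are tried in
# decreasing order of information gain; a feature is kept unless it causes a
# drop in performance beyond `drop_threshold` (a stand-in for the significance
# test used in the actual experiments).

def forward_selection(features_by_gain, train_and_eval, drop_threshold=0.0):
    selected, best_f = [], 0.0
    for feature in features_by_gain:          # most informative first
        candidate = selected + [feature]
        f = train_and_eval(candidate)
        if f >= best_f - drop_threshold:
            selected, best_f = candidate, f   # keep the feature
        # otherwise the feature is discarded (marked "-" in Table 3.8)
    return selected, best_f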

3.4.3 Model reliability

In closing this section, we would like to stress an issue to which attention is hardly ever paid: the need for computing the reliability of a model’s performance. Because of the intrinsic variability in any data set, the performance of a model trained on one training set and tested on another will never be maximal. In addition to the two experiments varying feature and sample selection reported above, we actually carried out numerous other analyses of different combinations. Every change in the sample selection resulted in a change of the feature ranking produced by TiMBL.


For example, starting the hill-climbing experiment with a different feature would also lead to a different result, with a different set of features deemed harmful. Similarly, changing the test set will result in different performance of even the same model. For this reason, we believe that merely reporting system performances is not enough. It should become common practice to inspect evaluations taken over different test sets and to report the model’s averaged performance, i.e., its F, R, and P scores, each bounded by confidence intervals. To this end, we randomly split the test set into six subsets and evaluated each output. Then we computed the mean, variance, standard deviation, and confidence intervals of the six results for each of P, R, and F-score. The exact performance of our pairwise comparison model for coreference (model A in Table 3.7) is 81.91±4.25 P, 69.57±8.13 R, and 75.12±6.47 F.
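A minimal sketch of this reliability computation is shown below, assuming a normal-approximation 95% confidence interval over the per-subset scores; the exact interval construction used in the experiments may differ, and the example scores are invented.

import math
import statistics

# Mean and (normal-approximation) 95% confidence interval over the F-scores
# obtained on the six test subsets.

def mean_with_ci(scores, z=1.96):
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)                 # sample standard deviation
    half_width = z * sd / math.sqrt(len(scores))
    return m, half_width

f_scores = [73.2, 77.8, 70.9, 76.4, 78.0, 74.4]   # invented example values
m, hw = mean_with_ci(f_scores)
print(f"F = {m:.2f} ± {hw:.2f}")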

3.5 Conclusion

This paper focused on the classification stage of an automated coreference resolution system for Spanish. In the pairwise classification stage, the probability that a pair of NPs are or are not coreferent was learnt from a corpus. The more accurate this stage is, the more accurate the subsequent clustering stage will be. Our detailed study of the informativeness of a considerable number of pairwise comparison features and the effect of sample selection added to the small body of literature (Uryupina, 2007; Bengtson and Roth, 2008; Hoste, 2005) on these two issues.

We provided a list of 47 features for coreference pairwise comparison and discussed the linguistic motivations behind each one: well-studied features included in most coreference resolution systems, language-specific ones, corpus-specific ones, as well as extra features that we considered interesting to test. Different machine learning experiments were carried out using the TiMBL memory-based learner. The features were shown to be weakly informative on their own, but to support complex and unpredictable interactions. In contrast with previous work, many of the features relied on gold standard annotations, pointing out the need for automatic tools for elliptical-pronoun detection and deep parsing.

Concerning the selection of the training instances, down-sampling was discarded as it did not improve performance with TiMBL. Instead, better results were obtained when the training data followed the same distribution as the real-world data, achieving 81.91±4.25 P, 69.57±8.13 R, and 75.12±6.47 F-score. Finally, we pointed out the importance of reporting confidence intervals in order to show the degree of variance that the learnt model carries.

Acknowledgments We are indebted to M. Antònia Martí for her helpful comments. This research was supported by the FPU Grant (AP2006-00994) from the Spanish Ministry of Education and Science, and the Lang2World (TIN2006-15265-C06-06) and Ancora-Nom (FFI2008-02691-E/FILO) projects.

CHAPTER 4

Coreference Resolution across Corpora: Languages, Coding Schemes, and Preprocessing Information

Marta Recasens* and Eduard Hovy**

*University of Barcelona   **USC Information Sciences Institute

Published in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), pages 1423–1432, Uppsala, Sweden

Abstract This paper explores the effect that different corpus configurations have on the performance of a coreference resolution system, as measured by MUC, B3, and CEAF. By varying separately three parameters (language, annotation scheme, and preprocessing information) and applying the same coreference resolution system, the strong bonds between system and corpus are demonstrated. The experiments reveal problems in coreference resolution evaluation relating to task definition, coding schemes, and features. They also expose systematic biases in the coreference evaluation metrics. We show that system comparison is only possible when corpus parameters are in exact agreement.

4.1 Introduction

The task of coreference resolution, which aims to automatically identify the expressions in a text that refer to the same discourse entity, has been a growing research topic in NLP ever since MUC-6 made available the first coreferentially annotated corpus in 1995. Most research has centered around the rules by which

mentions are allowed to corefer, the features characterizing mention pairs, the algorithms for building coreference chains, and coreference evaluation methods. The surprisingly important role played by different aspects of the corpus, however, is an issue to which little attention has been paid. We demonstrate the extent to which a system will be evaluated as performing differently depending on parameters such as the corpus language, the way coreference relations are defined in the corresponding coding scheme, and the nature and source of preprocessing information.

This paper unpacks these issues by running the same system—a prototype entity-based architecture called CISTELL—on different corpus configurations, varying three parameters. First, we show how much language-specific issues affect performance when trained and tested on English and Spanish. Second, we demonstrate the extent to which the specific annotation scheme (used on the same corpus) makes evaluated performance vary. Third, we compare the performance using gold-standard preprocessing information with that using automatic preprocessing tools.

Throughout, we apply the three principal coreference evaluation measures in use today: MUC, B3, and CEAF. We highlight the systematic preferences of each measure to reward different configurations. This raises the difficult question of why one should use one or another evaluation measure, and how one should interpret their differences in reporting changes of performance score due to ‘secondary’ factors like preprocessing information.

To this end, we employ three corpora: ACE (Doddington et al., 2004), OntoNotes (Pradhan et al., 2007a), and AnCora (Recasens and Martí, 2010). In order to isolate the three parameters as far as possible, we benefit from a 100k-word portion (from the TDT collection) that is common to both ACE and OntoNotes. We apply the same coreference resolution system in all cases. The results show that a system’s score is not informative by itself, as different corpora or corpus parameters lead to different scores. Our goal is not to achieve the best performance to date, but rather to expose various issues raised by the choices of corpus preparation and evaluation measure and to shed light on the definition, methods, evaluation, and complexities of the coreference resolution task.

The paper is organized as follows. Section 4.2 sets our work in context and provides the motivations for undertaking this study. Section 4.3 presents the architecture of CISTELL, the system used in the experimental evaluation. In Sections 4.4, 4.5, and 4.6, we describe the experiments on three different datasets and discuss the results. We conclude in Section 4.7.

4.2 Background

The bulk of research on automatic coreference resolution to date has been done for English and used two different types of corpus: MUC (Hirschman and Chinchor, 1997) and ACE (Doddington et al., 2004). A variety of learning-based systems have been trained and tested on the former (Soon et al., 2001; Uryupina, 2006),

on the latter (Culotta et al., 2007; Bengtson and Roth, 2008; Denis and Baldridge, 2009), or on both (Finkel and Manning, 2008; Haghighi and Klein, 2009). Testing on both is needed given that the two annotation schemes differ in some aspects. For example, only ACE includes singletons (mentions that do not corefer) and ACE is restricted to seven semantic types.1 Also, despite a critical discussion in the MUC task definition (van Deemter and Kibble, 2000), the ACE scheme continues to treat nominal predicates and appositive phrases as coreferential.

A third coreferentially annotated corpus—the largest for English—is OntoNotes (Pradhan et al., 2007a; Hovy et al., 2006). Unlike ACE, it is not application-oriented, so coreference relations between all types of NPs are annotated. The identity relation is kept apart from the attributive relation, and it also contains gold-standard morphological, syntactic and semantic information.

Since the MUC and ACE corpora are annotated with only coreference information,2 existing systems first preprocess the data using automatic tools (POS taggers, parsers, etc.) to obtain the information needed for coreference resolution. However, given that the output from automatic tools is far from perfect, it is hard to determine the level of performance of a coreference module acting on gold-standard preprocessing information. OntoNotes makes it possible to separate the coreference resolution problem from other tasks.

Our study adds to the previously reported evidence by Stoyanov et al. (2009) that differences in corpora and in the task definitions need to be taken into account when comparing coreference resolution systems. We provide new insights as the current analysis differs in four ways. First, Stoyanov et al. (2009) report on differences between MUC and ACE, while we contrast ACE and OntoNotes. Given that ACE and OntoNotes include some of the same texts but annotated according to their respective guidelines, we can better isolate the effect of differences as well as add the additional dimension of gold preprocessing. Second, we evaluate not only with the MUC and B3 scoring metrics, but also with CEAF. Third, all our experiments use true mentions3 to avoid effects due to spurious system mentions. Finally, including different baselines and variations of the resolution model allows us to reveal biases of the metrics.

Coreference resolution systems have been tested on languages other than English only within the ACE program (Luo and Zitouni, 2005), probably due to the fact that coreferentially annotated corpora for other languages are scarce. Thus there has been no discussion of the extent to which systems are portable across languages. This paper studies the case of English and Spanish.4

Several coreference systems have been developed in the past (Culotta et al., 2007; Finkel and Manning, 2008; Poon and Domingos, 2008; Haghighi and Klein, 2009; Ng, 2009). It is not our aim to compete with them. Rather, we conduct three

1The ACE-2004/05 semantic types are person, organization, geo-political entity, location, facility, vehicle, weapon. 2ACE also specifies entity types and relations. 3The adjective true contrasts with system and refers to the gold standard. 4Multilinguality is one of the focuses of SemEval-2010 Task 1 (Recasens et al., 2010b).

experiments under a specific setup for comparison purposes. To this end, we use a different, neutral, system, and a dataset that is small and different from official ACE test sets, despite the fact that this prevents our results from being compared directly with other systems.

4.3 Experimental setup

4.3.1 System description

The system architecture used in our experiments, CISTELL, is based on the incrementality of discourse. As a discourse evolves, it constructs a model that is updated with the new information gradually provided. Key elements in this model are the entities the discourse is about, as they form the discourse backbone, especially those that are mentioned multiple times. Most entities, however, are only mentioned once. Consider the growth of the entity Mount Popocatépetl in (1).5

(1) We have an update tonight on [this, the volcano in Mexico, they call El Popo]m3 . . . As the sun rises over [Mt. Popo]m7 tonight, the only hint of the fire storm inside, whiffs of smoke, but just a few hours earlier, [the volcano]m11 exploding spewing rock and red-hot lava. [The fourth largest mountain in North America, nearly 18,000 feet high]m15, erupting this week with [its]m20 most violent outburst in 1,200 years.

Mentions can be pronouns (m20), they can be a (shortened) string repetition using either the name (m7) or the type (m11), or they can add new information about the entity: m15 provides the supertype and informs the reader about the height of the volcano and its ranking position.

In CISTELL,6 discourse entities are conceived as ‘baskets’: they are empty at the beginning of the discourse, but keep growing as new attributes (e.g., name, type, location) are predicated about them. Baskets are filled with this information, which can appear within a mention or elsewhere in the sentence. The ever-growing amount of information in a basket allows richer comparisons to new mentions encountered in the text.

CISTELL follows the learning-based coreference architecture in which the task is split into classification and clustering (Soon et al., 2001; Bengtson and Roth, 2008) but combines them simultaneously. Clustering is identified with basket-growing, the core process, and a pairwise classifier is called every time CISTELL considers whether a basket must be clustered into a (growing) basket, which might contain one or more mentions. We use a memory-based learning classifier trained with TiMBL (Daelemans and Bosch, 2005). Basket-growing is done in four different ways, explained next.

5Following the ACE terminology, we use the term mention for an instance of reference to an object, and entity for a collection of mentions referring to the same object. Entities containing one single mention are referred to as singletons. 6‘Cistell’ is the Catalan word for ‘basket.’


4.3.2 Baselines and models

In each experiment, we compute three baselines (1, 2, 3), and run CISTELL under four different models (4, 5, 6, 7).

1. ALLSINGLETONS. No coreference link is ever created. We include this baseline given the high number of singletons in the datasets, since some evaluation measures are affected by large numbers of singletons.

2. HEAD MATCH. All non-pronominal NPs that have the same head are clustered into the same entity.

3. HEAD MATCH + PRON. Like HEAD MATCH, plus allowing personal and possessive pronouns to link to the closest noun with which they agree in gender and number.

4. STRONG MATCH. Each mention (e.g., m11) is paired with previous mentions7 starting from the beginning of the document (m1–m11, m2–m11, etc.). When a pair (e.g., m3–m11) is classified as coreferent, additional pairwise checks are performed with all the mentions contained in the (growing) entity basket (e.g., m7–m11). Only if all the pairs are classified as coreferent is the mention under consideration attached to the existing growing entity. Otherwise, the search continues.8 (A sketch of this strategy is given after this list.)

5. SUPER STRONG MATCH. Similar to STRONG MATCH but with a threshold. Coreference pairwise classifications are only accepted when the TiMBL distance is smaller than 0.09.9

6. BEST MATCH. Similar to STRONG MATCH but following Ng and Cardie (2002b)’s best-link approach. Thus, the mention under analysis is linked to the most confident mention among the previous ones, using TiMBL’s confidence score.

7. WEAK MATCH. A simplified version of STRONG MATCH: not all mentions in the growing entity need to be classified as coreferent with the mention under analysis. A single positive pairwise decision suffices for the mention to be clustered into that entity.10
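The following Python sketch outlines the STRONG MATCH strategy under a simplified representation in which mentions are integer indices and `classify(i, j)` stands in for the TiMBL pairwise classifier; it illustrates the clustering logic only and is not CISTELL’s actual code.

# Sketch of STRONG MATCH: mention j is attached to the first growing entity
# whose every member is classified as coreferent with j; otherwise j starts a
# new (singleton) entity.

def strong_match(num_mentions, classify):
    entities = []                                  # each entity is a list of mention ids
    for j in range(num_mentions):
        attached = False
        for i in range(j):                         # search from the start of the document
            entity = next(e for e in entities if i in e)
            if classify(i, j) and all(classify(m, j) for m in entity):
                entity.append(j)
                attached = True
                break
        if not attached:
            entities.append([j])
    return entities

# Toy usage: mentions 0 and 2 corefer, everything else does not.
gold_pairs = {(0, 2)}
classify = lambda i, j: (min(i, j), max(i, j)) in gold_pairs
print(strong_match(4, classify))   # [[0, 2], [1], [3]]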

4.3.3 Features

We follow Soon et al. (2001), Ng and Cardie (2002b) and Luo et al. (2004) to generate most of the 29 features we use for the pairwise model. These include

7The opposite search direction was also tried but gave worse results. 8Taking the first mention classified as coreferent follows Soon et al. (2001)’s first-link approach. 9Since TiMBL is a memory-based learner, the closer the distance to an instance, the more confident the decision. We chose 0.09 because it appeared to offer the best results. 10STRONG and WEAK MATCH are similar to Luo et al. (2004)’s entity-mention and mention-pair models.

features that capture information from different linguistic levels: textual strings (head match, substring match, distance, frequency), morphology (mention type, coordination, possessive phrase, gender match, number match), syntax (nominal predicate, apposition, relative clause, grammatical function), and semantic match (named-entity type, is-a type, supertype).

For Spanish, we use 34 features as a few variations are needed for language-specific issues such as zero subjects (Recasens and Hovy, 2009).

4.3.4 Evaluation

Since they sometimes provide quite different results, we evaluate using three coreference measures, as there is no agreement on a standard.

• MUC (Vilain et al., 1995). It computes the number of links common between the true and system partitions. Recall (R) and precision (P) result from dividing it by the minimum number of links required to specify the true and the system partitions, respectively.

• B3 (Bagga and Baldwin, 1998). R and P are computed for each mention and averaged at the end. For each mention, the number of common mentions between the true and the system entity is divided by the number of mentions in the true entity or in the system entity to obtain R and P, respectively. (A sketch of this computation follows the list.)

• CEAF (Luo, 2005). It finds the best one-to-one alignment between true and system entities. Using true mentions and the φ3 similarity function, R and P are the same and correspond to the number of common mentions between the aligned entities divided by the total number of mentions.
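A compact B3 sketch is given below, assuming the gold and system partitions cover the same set of (true) mentions; the toy partitions at the end are invented and serve only as an illustration.

# Illustrative computation of B3 precision and recall. Each partition is a
# list of sets of mention identifiers over the same set of true mentions.

def b_cubed(gold, system):
    gold_of = {m: e for e in gold for m in e}
    sys_of = {m: e for e in system for m in e}
    mentions = list(gold_of)
    recall = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions)
    precision = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions)
    return precision / len(mentions), recall / len(mentions)

# Toy partitions (invented for illustration):
gold = [frozenset({"m1", "m2"}), frozenset({"m3", "m4", "m5"})]
system = [frozenset({"m1", "m2", "m3"}), frozenset({"m4", "m5"})]
print(b_cubed(gold, system))    # ~0.73 precision, ~0.73 recall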

4.4 Parameter 1: Language

The first experiment compared the performance of a coreference resolution system on a Germanic and a Romance language—English and Spanish—to explore to what extent language-specific issues such as zero subjects11 or grammatical gender might influence a system.

Although OntoNotes and AnCora are two different corpora, they are very similar in those aspects that matter most for the study’s purpose: they both include a substantial amount of texts belonging to the same genre (news) and manually annotated from the morphological to the semantic levels (POS tags, syntactic constituents, NEs, WordNet synsets, and coreference relations). More importantly, very similar coreference annotation guidelines make AnCora the ideal Spanish counterpart to OntoNotes.

11Most Romance languages are pro-drop allowing zero subject pronouns, which can be inferred from the verb.


                      #docs    #words    #mentions   #entities   #singleton   #multi-mention
                                                                   entities         entities
AnCora      Train       955   299,014       91,904      64,535       54,991            9,544
            Test         30     9,851        2,991       2,189        1,877              312
OntoNotes   Train       850   301,311       74,692      55,819       48,199            7,620
            Test         33     9,763        2,463       1,790        1,476              314

Table 4.1: Corpus statistics for the large portion of OntoNotes and AnCora

                           AnCora   OntoNotes
Pronouns                    14.09       17.62
  Personal pronouns          2.00       12.10
  Zero subject pronouns      6.51           –
  Possessive pronouns        3.57        2.96
  Demonstrative pronouns     0.39        1.83
Definite NPs                37.69       20.67
Indefinite NPs               7.17        8.44
Demonstrative NPs            1.98        3.41
Bare NPs                    33.02       42.92
Misc.                        6.05        6.94

Table 4.2: Mention types (%) in Table 4.1 datasets

Datasets Two datasets of similar size were selected from AnCora and OntoNotes in order to rule out corpus size as an explanation of any difference in performance. Corpus statistics about the distribution of mentions and entities are shown in Tables 4.1 and 4.2. Given that this paper is focused on coreference between NPs, the number of mentions only includes NPs. Both AnCora and OntoNotes annotate only multi-mention entities (i.e., those containing two or more coreferent mentions), so singleton entities are assumed to correspond to NPs with no coreference annotation.

Apart from a larger number of mentions in Spanish (Table 4.1), the two datasets look very similar in the distribution of singletons and multi-mention entities: about 85% and 15%, respectively. Multi-mention entities have an average of 3.9 mentions per entity in AnCora and 3.5 in OntoNotes. The distribution of mention types (Table 4.2), however, differs in two important respects: AnCora has a smaller number of personal pronouns as Spanish typically uses zero subjects, and it has a smaller number of bare NPs as the definite article accompanies more NPs than in English.

Results and discussion Table 4.3 presents CISTELL’s results for each dataset. They make evident problems with the evaluation metrics, namely the fact that the generated rankings are contradictory (Denis and Baldridge, 2009). They are consistent across the two corpora though: MUC rewards WEAK MATCH the most, B3


                                    MUC                          B3                 CEAF
                            P       R       F        P       R       F            P/R/F
AnCora - Spanish
1. ALLSINGLETONS            –       –       –      100     73.32   84.61          73.32
2. HEAD MATCH             55.03   37.72   44.76    91.12   79.88   85.13          75.96
3. HEAD MATCH + PRON      48.22   44.24   46.14    86.21   80.66   83.34          76.30
4. STRONG MATCH           45.64   51.88   48.56    80.13   82.28   81.19          75.79
5. SUPER STRONG MATCH     45.68   36.47   40.56    86.10   79.09   82.45          77.20
6. BEST MATCH             43.10   35.59   38.98    85.24   79.67   82.36          75.23
7. WEAK MATCH             45.73   65.16   53.75    68.50   87.71   76.93          69.21
OntoNotes - English
1. ALLSINGLETONS            –       –       –      100     72.68   84.18          72.68
2. HEAD MATCH             55.14   39.08   45.74    90.65   80.87   85.48          76.05
3. HEAD MATCH + PRON      47.10   53.05   49.90    82.28   83.13   82.70          75.15
4. STRONG MATCH           47.94   55.42   51.41    81.13   84.30   82.68          78.03
5. SUPER STRONG MATCH     48.27   47.55   47.90    84.00   82.27   83.13          78.24
6. BEST MATCH             50.97   46.66   48.72    86.19   82.70   84.41          78.44
7. WEAK MATCH             47.46   66.72   55.47    70.36   88.05   78.22          71.21

Table 4.3: CISTELL results varying the corpus language

rewards HEAD MATCH the most, and CEAF is divided between SUPER STRONG MATCH and BEST MATCH. These preferences seem to reveal weaknesses of the scoring methods that make them biased towards a type of output. The model preferred by MUC is one that clusters many mentions together, thus getting a large number of correct coreference links (notice the high R for WEAK MATCH), but also many spurious links that are not duly penalized. The resulting output is not very desirable.12 In contrast, B3 is more P-oriented and scores conservative outputs like HEAD MATCH and BEST MATCH first, even if R is low. CEAF achieves a better compromise between P and R, as corroborated by the quality of the output.

The baselines and the system runs perform very similarly in the two corpora, but slightly better for English. It seems that language-specific issues do not result in significant differences—at least for English and Spanish—once the feature set has been appropriately adapted, e.g., including features about zero subjects or removing those about possessive phrases. Comparing the feature ranks, we find that the features that work best for each language largely overlap and are language independent, like head match, is-a match, and whether the mentions are pronominal.

4.5 Parameter 2: Annotation scheme

In the second experiment, we used the 100k-word portion (from the TDT collection) shared by the OntoNotes and ACE corpora (330 OntoNotes documents

12Due to space constraints, the actual output cannot be shown here. We are happy to send it to interested requesters.


                      #docs    #words    #mentions   #entities   #singleton   #multi-mention
                                                                   entities         entities
OntoNotes   Train       297    87,068       22,127      15,983       13,587            2,396
            Test         33     9,763        2,463       1,790        1,476              314
ACE         Train       297    87,068       12,951       5,873        3,599            2,274
            Test         33     9,763        1,464         746          459              287

Table 4.4: Corpus statistics for the aligned portion of ACE and OntoNotes on gold-standard data

occurred as 22 ACE-2003 documents, 185 ACE-2004 documents, and 123 ACE-2005 documents). CISTELL was trained on the same texts in both corpora and applied to the remainder. The three measures were then applied to each result.

Datasets Since the two annotation schemes differ significantly, we made the results comparable by mapping the ACE entities (the simpler scheme) onto the information contained in OntoNotes.13 The mapping allowed us to focus exclusively on the differences expressed in both corpora: the types of mentions that were annotated, the definition of identity of reference, etc. Table 4.4 presents the statistics for the OntoNotes dataset merged with the ACE entities.

The mapping was not straightforward due to several problems: there was no match for some mentions due to syntactic or spelling reasons (e.g., El Popo in OntoNotes vs. Ell Popo in ACE). ACE mentions for which there was no parse tree node in the OntoNotes gold-standard tree were omitted, as creating a new node could have damaged the tree. Given that only seven entity types are annotated in ACE, the number of OntoNotes mentions is almost twice as large as the number of ACE mentions. Unlike OntoNotes, ACE mentions include premodifiers (e.g., state in state lines), national adjectives (e.g., Iraqi) and relative pronouns (e.g., who, that). Also, given that ACE entities correspond to types that are usually coreferred (e.g., people, organizations, etc.), singletons only represent 61% of all entities, while they are 85% in OntoNotes. The average entity size is 4 in ACE and 3.5 in OntoNotes.

A second major difference is the definition of coreference relations, illustrated here:

(2) [This] was [an all-white, all-Christian community that all the sudden was taken over ... by different groups].

(3) [ [Mayor] John Hyman] has a simple answer.

(4) [Postville] now has 22 different nationalities ... For those who prefer [the old Postville], Mayor John Hyman has a simple answer.

In ACE, nominal predicates corefer with their subject (2), and appositive phrases

13Both ACE entities and types were mapped onto the OntoNotes dataset.


                                    MUC                          B3                 CEAF
                            P       R       F        P       R       F            P/R/F
OntoNotes scheme
1. ALLSINGLETONS            –       –       –      100     72.68   84.18          72.68
2. HEAD MATCH             55.14   39.08   45.74    90.65   80.87   85.48          76.05
3. HEAD MATCH + PRON      47.10   53.05   49.90    82.28   83.13   82.70          75.15
4. STRONG MATCH           46.81   53.34   49.86    80.47   83.54   81.97          76.78
5. SUPER STRONG MATCH     46.51   40.56   43.33    84.95   80.16   82.48          76.70
6. BEST MATCH             52.47   47.40   49.80    86.10   82.80   84.42          77.87
7. WEAK MATCH             47.91   64.64   55.03    71.73   87.46   78.82          71.74
ACE scheme
1. ALLSINGLETONS            –       –       –      100     50.96   67.51          50.96
2. HEAD MATCH             82.35   39.00   52.93    95.27   64.05   76.60          66.46
3. HEAD MATCH + PRON      70.11   53.90   60.94    86.49   68.20   76.27          68.44
4. STRONG MATCH           64.21   64.21   64.21    76.92   73.54   75.19          70.01
5. SUPER STRONG MATCH     60.51   56.55   58.46    76.71   69.19   72.76          66.87
6. BEST MATCH             67.50   56.69   61.62    82.18   71.67   76.57          69.88
7. WEAK MATCH             63.52   80.50   71.01    59.76   86.36   70.64          64.21

Table 4.5: CISTELL results varying the annotation scheme on gold-standard data

corefer with the noun they are modifying (3). In contrast, they do not fall under the identity relation in OntoNotes, which follows the linguistic understanding of coreference according to which nominal predicates and appositives express properties of an entity rather than refer to a second (coreferent) entity (van Deemter and Kibble, 2000). Finally, the two schemes frequently disagree on borderline cases in which coreference turns out to be especially complex (4). As a result, some features will behave differently, e.g., the appositive feature has the opposite effect in the two datasets.

Results and discussion From the differences pointed out above, the results shown in Table 4.5 might be surprising at first. Given that OntoNotes is not restricted to any semantic type and is based on a more sophisticated definition of coreference, one would not expect a system to perform better on it than on ACE. The explanation is given by the ALLSINGLETONS baseline, which is 73–84% for OntoNotes and only 51–68% for ACE. The fact that OntoNotes contains a much larger number of singletons—as Table 4.4 shows—results in an initial boost of performance (except with the MUC score, which ignores singletons). In contrast, the score improvement achieved by HEAD MATCH is much more noticeable on ACE than on OntoNotes, which indicates that many of its coreferent mentions share the same head.

The systematic biases of the measures that were observed in Table 4.3 appear again in the case of MUC and B3. CEAF is divided between BEST MATCH and STRONG MATCH. The higher value of the MUC score for ACE is another indication of its tendency to reward correct links much more than to penalize spurious ones (ACE has a larger proportion of multi-mention entities).
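The singleton-driven boost can be made concrete with a small back-of-the-envelope check: with true mentions, the ALLSINGLETONS response always obtains B3 precision 1, and each mention in a gold entity of size n contributes 1/n to recall, so B3 recall reduces to the ratio of gold entities to mentions. The snippet below verifies this against the test-set statistics of Table 4.4 (the derivation follows from the B3 definition in Section 4.3.4 and is added here only for illustration).

# B3 recall of the ALLSINGLETONS baseline with true mentions:
# each mention in a gold entity of size n scores 1/n, so summing over all
# mentions gives (#gold entities) and dividing by (#mentions) gives R.

for corpus, entities, mentions in [("OntoNotes", 1790, 2463), ("ACE", 746, 1464)]:
    recall = 100 * entities / mentions
    print(f"{corpus}: B3 R = {recall:.2f}")   # 72.68 and 50.96, as in Table 4.5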


                      #docs    #words    #mentions   #entities   #singleton   #multi-mention
                                                                   entities         entities
OntoNotes   Train       297    80,843       16,945      12,127       10,253            1,874
            Test         33     9,073        1,931       1,403        1,156              247
ACE         Train       297    80,843       13,648       6,041        3,652            2,389
            Test         33     9,073        1,537         775          475              300

Table 4.6: Corpus statistics for the aligned portion of ACE and OntoNotes on automatically parsed data

The feature rankings obtained for each dataset generally coincide as to which features are ranked best (namely NE match, is-a match, and head match), but differ in their particular ordering.

It is also possible to compare the OntoNotes results in Tables 4.3 and 4.5, the only difference being that the first training set was three times larger. Contrary to expectation, the model trained on a larger dataset performs just slightly better. The fact that more training data does not necessarily lead to an increase in performance conforms to the observation that there appear to be few general rules (e.g., head match) that systematically govern coreference relationships; rather, coreference appeals to individual unique phenomena appearing in each context, and thus after a point adding more training data does not add much new generalizable information. Pragmatic information (discourse structure, world knowledge, etc.) is probably the key, if ever there is a way to encode it.

4.6 Parameter 3: Preprocessing

The goal of the third experiment was to determine how much the source and nature of preprocessing information matters. Since it is often stated that coreference resolution depends on many levels of analysis, we again compared the two corpora, which differ in the amount and correctness of such information. However, in this experiment, entity mapping was applied in the opposite direction: the OntoNotes entities were mapped onto the automatically preprocessed ACE dataset. This exposes the shortcomings of automated preprocessing in ACE for identifying all the mentions identified and linked in OntoNotes.

Datasets The ACE data was morphologically annotated with a tokenizer based on manual rules adapted from the one used in CoNLL (Tjong Kim Sang and De Meulder, 2003), with TnT 2.2, a trigram POS tagger based on Markov models (Brants, 2000), and with the built-in WordNet lemmatizer (Fellbaum, 1998). Syntactic chunks were obtained from YamCha 1.33, an SVM-based NP-chunker (Kudoh and Matsumoto, 2000), and parse trees from Malt Parser 0.4, an SVM-based parser (Hall et al., 2007).

Although the number of words in Tables 4.4 and 4.6 should in principle be


                                    MUC                          B3                 CEAF
                            P       R       F        P       R       F            P/R/F
OntoNotes scheme
1. ALLSINGLETONS            –       –       –      100     72.66   84.16          72.66
2. HEAD MATCH             56.76   35.80   43.90    92.18   80.52   85.95          76.33
3. HEAD MATCH + PRON      47.44   54.36   50.66    82.08   83.61   82.84          74.83
4. STRONG MATCH           52.66   58.14   55.27    83.11   85.05   84.07          78.30
5. SUPER STRONG MATCH     51.67   46.78   49.11    85.74   82.07   83.86          77.67
6. BEST MATCH             54.38   51.70   53.01    86.00   83.60   84.78          78.15
7. WEAK MATCH             49.78   64.58   56.22    75.63   87.79   81.26          74.62
ACE scheme
1. ALLSINGLETONS            –       –       –      100     50.42   67.04          50.42
2. HEAD MATCH             81.25   39.24   52.92    94.73   63.82   76.26          65.97
3. HEAD MATCH + PRON      69.76   53.28   60.42    86.39   67.73   75.93          68.05
4. STRONG MATCH           58.85   58.92   58.89    73.36   70.35   71.82          66.30
5. SUPER STRONG MATCH     56.19   50.66   53.28    75.54   66.47   70.72          63.96
6. BEST MATCH             63.38   49.74   55.74    80.97   68.11   73.99          65.97
7. WEAK MATCH             60.22   78.48   68.15    55.17   84.86   66.87          59.08

Table 4.7: CISTELL results varying the annotation scheme on automatically preprocessed data

the same, the latter contains fewer words as it lacks the null elements (traces, ellipsed material, etc.) manually annotated in OntoNotes. Missing parse tree nodes in the automatically parsed data account for the considerably lower number of OntoNotes mentions (approx. 5,700 fewer mentions).14 However, the proportions of singleton:multi-mention entities as well as the average entity size do not vary.

Results and discussion The ACE scores for the automatically preprocessed models in Table 4.7 are about 3% lower than those based on OntoNotes gold-standard data in Table 4.5, providing evidence for the advantage offered by gold-standard preprocessing information. In contrast, the similar—if not higher—scores of OntoNotes can be attributed to the use of the annotated ACE entity types. The fact that these are annotated not only for proper nouns (as predicted by an automatic NER) but also for pronouns and full NPs is a very helpful feature for a coreference resolution system.

Again, the scoring metrics exhibit similar biases, but note that CEAF prefers HEAD MATCH + PRON in the case of ACE, which is indicative of the noise brought by automatic preprocessing.

A further insight is offered from comparing the feature rankings with gold-standard syntax to those with automatic preprocessing. Since we are evaluating now on the ACE data, the NE match feature is also ranked first for OntoNotes. Head and is-a match are still ranked among the best, yet syntactic features are not.

14In order to make the set of mentions as similar as possible to the set in Section 4.5, OntoNotes singletons were mapped from the ones detected in the gold-standard treebank.


Instead, features like NP type have moved further up. This reranking probably indicates that if there is noise in the syntactic information due to automatic tools, then morphological and syntactic features switch their positions. Given that the noise brought by automatic preprocessing can be harmful, we tried leaving out the grammatical function feature. Indeed, the results increased about 2–3%, STRONG MATCH scoring the highest. This points out that conclusions drawn from automatically preprocessed data about the kind of knowledge relevant for coreference resolution might be mistaken. Using the most successful basic features can lead to the best results when only automatic preprocessing is available.

4.7 Conclusion

Regarding evaluation, the results clearly expose the systematic tendencies of the evaluation measures. The way each measure is computed makes it biased towards a specific model: MUC is generally too lenient with spurious links, B3 scores too high in the presence of a large number of singletons, and CEAF does not agree with either of them. It is a cause for concern that they provide contradictory indications about the core of coreference, namely the resolution models—for example, the model ranked highest by B3 in Table 4.7 is ranked lowest by MUC. We always assume evaluation measures provide a ‘true’ reflection of our approximation to a gold standard in order to guide research in system development and tuning.

Further support to our claims comes from the results of SemEval-2010 Task 1 (Recasens et al., 2010b). The performance of the six participating systems shows similar problems with the evaluation metrics, and the singleton baseline was hard to beat even by the highest-performing systems.

Since the measures imply different conclusions about the nature of the corpora and the preprocessing information applied, should we use them now to constrain the ways our corpora are created in the first place, and what preprocessing we include or omit? Doing so would seem like circular reasoning: it invalidates the notion of the existence of a true and independent gold standard. But if apparently incidental aspects of the corpora can have such effects—effects rated quite differently by the various measures—then we have no fixed ground to stand on.

The worrisome fact that there is currently no clearly preferred and ‘correct’ evaluation measure for coreference resolution means that we cannot draw definite conclusions about coreference resolution systems at this time, unless they are compared on exactly the same corpus, preprocessed under the same conditions, and all three measures agree in their rankings.

Acknowledgments We thank Dr. M. Antònia Martí for her generosity in allowing the first author to visit ISI to work with the second. Special thanks to Edgar Gonzàlez for his kind help with conversion issues. This work was partially supported by the Spanish Ministry of Education through an FPU scholarship (AP2006-00994) and the TEXT-MESS 2.0 Project (TIN2009-13391-C04-04).


CHAPTER 5

BLANC: Implementing the Rand Index for Coreference Evaluation

Marta Recasens* and Eduard Hovy**

*University of Barcelona   **USC Information Sciences Institute

To appear in Natural Language Engineering

Abstract This article addresses the current state of coreference resolution evaluation, in which different measures (notably, MUC, B3, CEAF, and ACE-value) are applied in different studies. None of them is fully adequate, and their measures are not commensurate. We enumerate the desiderata for a coreference scoring measure, discuss the strong and weak points of the existing measures, and propose the BiLateral Assessment of Noun-phrase Coreference, a variation of the Rand index created to suit the coreference task. The BiLateral Assessment of Noun-phrase Coreference rewards both coreference and non-coreference links by averaging the F-scores of the two types, does not ignore singletons—the main problem with the MUC score—and does not inflate the score in their presence—a problem with the B3 and CEAF scores. In addition, its fine granularity is consistent over the whole range of scores and affords better discrimination between systems.

5.1 Introduction

Coreference resolution is the task of determining which expressions in a text refer to the same entity or event. At heart, the problem is one of grouping into ‘equivalence classes’ all mentions that corefer and none that do not, which is a kind of

clustering. But since documents usually contain many referring expressions, many different combinations are possible, and measuring partial cluster correctness, especially since sameness is transitive, makes evaluation difficult. One has to assign scores to configurations of correct and incorrect links in a way that reflects intuition and is consistent. Different assignment policies have resulted in different evaluation measures that deliver quite different patterns of scores. Among the different scoring measures that have been developed, four are generally used: MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), CEAF (Luo, 2005), and the ACE-value (Doddington et al., 2004).

Unfortunately, despite the measures being incommensurate, researchers often use only one or two measures when evaluating their systems. For example, some people employ the (older) MUC measure in order to compare their results with previous work (Haghighi and Klein, 2007; Yang et al., 2008); others adopt the more recent advances and use either B3, CEAF, or the ACE-value (Culotta et al., 2007; Daumé III and Marcu, 2005); and a third group includes two or more scores for the sake of completeness (Luo et al., 2004; Bengtson and Roth, 2008; Ng, 2009; Finkel and Manning, 2008; Poon and Domingos, 2008).

This situation makes it hard to successfully compare systems, hindering the progress of research in coreference resolution. There is a pressing need to (1) define what exactly a scoring metric for coreference resolution needs to measure; (2) understand the advantages and disadvantages of each of the existing measures; and (3) reach agreement on a standard measure(s). This article addresses the first two questions—we enumerate the desiderata for an adequate coreference scoring measure, and we compare the different existing measures—and proposes the BiLateral Assessment of Noun-phrase Coreference (BLANC) measure. BLANC adapts the Rand index (Rand, 1971) to coreference, addressing observed shortcomings in a simple fashion to obtain a fine granularity that allows better discrimination between systems.

The article is structured as follows. Section 5.2 considers the difficulties of evaluating coreference resolution. Section 5.3 gives an overview of the existing measures, highlighting their advantages and drawbacks, and lists some desiderata for an ideal measure. In Section 5.4, the BLANC measure is presented in detail. Section 5.5 shows the discriminative power of BLANC by comparing its scores to those of the other measures on artificial and real data, and provides illustrative plots. Finally, conclusions are drawn in Section 5.6.

5.2 Coreference resolution and its evaluation: an example

Coreference resolution systems assign each mention (usually a noun phrase) in the text to the entity it refers to and thereby link coreferent mentions into chains.1

1Following the terminology of the Automatic Content Extraction (ACE) program, a mention is defined as an instance of reference to an object, and an entity is the collection of mentions referring to the same object in a document.


[Eyewitnesses]m1 reported that [Palestinians]m2 demonstrated today Sunday in [the West Bank]m3 against [the [Sharm el-Sheikh]m4 summit to be held in [Egypt]m6]m5. In [Ramallah]m7, [around 500 people]m8 took to [[the town]m9's streets]m10 chanting [slogans]m11 denouncing [the summit]m12 and calling on [Palestinian leader Yasser Arafat]m13 not to take part in [it]m14.

Figure 5.1: Example of coreference (from ACE-2004)

Some entities are expressed only once (singletons), whereas others are referred to multiple times (multi-mention entities). Only multi-mention entities contain coreferent mentions. For example, in the text segment of Fig. 5.1, we find the following:

• Nine singletons: {eyewitnesses}G1, {Palestinians}G2, {the West Bank}G3, {Sharm el-Sheikh}G4, {Egypt}G5, {around 500 people}G6, {the town’s streets}G7, {slogans}G8, {Palestinian leader Yasser Arafat}G9

• One two-mention entity: {Ramallah, the town}G10

• One three-mention entity: {the Sharm el-Sheikh summit to be held in Egypt, the summit, it}G11

In evaluating the output produced by a coreference resolution system, we need to compare the true set of entities (the gold partition, GOLD, produced by a human expert) with the predicted set of entities (the system partition, SYS, produced by the system or human to be evaluated). The mentions in GOLD are known as true mentions, and the mentions in SYS are known as system mentions. Let a system produce the following partition for the same example in Fig. 5.1:

• Seven singletons: {eyewitnesses}S1, {Palestinians}S2, {the West Bank}S3, {around 500 people}S4, {the town’s streets}S5, {slogans}S6, {Palestinian leader Yasser Arafat}S7

• Two two-mention entities: {Sharm el-Sheikh, Egypt}S8, {the Sharm el-Sheikh summit to be held in Egypt, the summit}S9

• One three-mention entity: {Ramallah, the town, it}S10

Schematically, the comparison problem is illustrated in Fig. 5.2. Some links are missed and others are wrongly predicted; e.g., entity S9 is missing one mention (compare with G11), whereas S10 includes a wrong mention, and two non-coreferent mentions are linked under S8. The difficulty of evaluating coreference resolution arises from the interaction of the issues that have to be addressed simultaneously: Should we focus on the number of correct coreference links? Or should we instead take each equivalence class as the unit of evaluation? Do we reward singletons with the same weight that we reward a multi-mention entity? Different decisions will result in different evaluation scores, which will determine how good SYS is considered to be in comparison with GOLD.

Figure 5.2: The problem of comparing the gold partition with the system partition for a given text (Fig. 5.1)

The evaluation measures developed to date all make somewhat different decisions on these points. While these decisions have been motivated in terms of one or another criterion, they also have unintended unsatisfactory consequences. We next review some current measures and identify the desiderata for a coreference measure.

5.3 Current measures and desiderata for the future

5.3.1 Current measures: strong and weak points

This section reviews the main advantages and drawbacks of the principal coreference evaluation measures. The main difference resides in the way they conceptualize how a coreference set within a text is defined: either in terms of links, i.e., the pairwise links between mentions (MUC, Pairwise F1, Rand), or in terms of classes or clusters, i.e., the entities (B3, CEAF, ACE-value, mutual information). Although the two approaches are equivalent in that knowing the links allows building the coreference classes, and knowing the classes allows inferring the links, differences in design produce evaluation metrics that vary to such an extent that there is still no widely agreed-upon standard today. Table 5.1 shows how the different system outputs in Fig. 5.3 (borrowed from Luo (2005)) are scored by the various scoring algorithms presented next.

System response   MUC-F   B3-F   CEAF   F1     H      Rand
(a)               94.7    86.5   83.3   80.8   77.8   84.8
(b)               94.7    73.7   58.3   63.6   57.1   62.1
(c)               90.0    54.5   41.7   48.3   0      31.8
(d)               —       40.0   25.0   —      48.7   68.2

Table 5.1: Comparison of evaluation metrics on the examples in Fig. 5.3

Figure 5.3: Example entity partitions (from Luo (2005))

MUC (Vilain et al., 1995). This is the oldest and most widely used measure, defined as part of the MUC-6 and MUC-7 evaluation tasks on coreference resolution. It relies on the notion that the minimum number of links needed to specify either GOLD or SYS is the total number of mentions minus the number of entities. The MUC measure computes the number of coreference links common to GOLD and SYS. To obtain recall (R), this number is divided by the minimum number of links required to specify GOLD. To obtain precision (P), it is divided by the minimum number of links required to specify SYS.

As observed by Bagga and Baldwin (1998) and Luo (2005), the MUC metric is severely flawed for two main reasons. First, it is indulgent, as it is based on the minimal number of missing and wrong links, which often leads to counterintuitive results. Classifying one mention into a wrong entity counts as one P and one R error, while completely merging two entities counts as a single R error, although the latter is further away from the real answer. As a result, the MUC score is too lenient with systems that produce overmerged entities (entity sets containing many referring expressions), as shown by system responses (b) and (c) in Table 5.1. If all mentions in each document of the MUC test sets2 are linked into one single entity, the MUC metric gives a score higher than any published system (Finkel and Manning, 2008). Second, given that it only takes coreference links into account, the addition of singletons to SYS does not make any difference. It is only when a singleton mention is misclassified into a multi-mention entity that the MUC score decreases. This is why the entry for system response (d) in Table 5.1 is empty.

2The MUC-6 and MUC-7 corpora were only annotated with multi-mention entities (Hirschman and Chinchor, 1997).
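To make the link-based computation concrete, here is a minimal Python sketch of the MUC scorer. The function names and the representation of partitions as lists of sets of mention identifiers are our own assumptions, not part of any released scorer; recall follows the formulation above, and precision is recall with the roles of GOLD and SYS swapped.

    def muc_recall(gold, system):
        """Vilain et al. (1995): R = sum(|G| - |p(G)|) / sum(|G| - 1), where p(G)
        is the partition of gold entity G induced by the system entities
        (mentions found in no system entity count as singleton pieces)."""
        numerator, denominator = 0, 0
        for entity in gold:
            pieces = set()
            for mention in entity:
                owner = next((i for i, e in enumerate(system) if mention in e), None)
                pieces.add(owner if owner is not None else ('missing', mention))
            numerator += len(entity) - len(pieces)
            denominator += len(entity) - 1
        # If GOLD contains only singletons, MUC is undefined; return 0 here.
        return numerator / denominator if denominator else 0.0

    def muc_f(gold, system):
        r = muc_recall(gold, system)
        p = muc_recall(system, gold)  # precision = recall with the roles swapped
        return 2 * p * r / (p + r) if p + r else 0.0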


                ACE-2004 (English)      AnCora-Es (Spanish)
                #         %             #         %
Mentions        28,880    100.00        88,875    100.00
Entities        11,989    100.00        64,421    100.00
Singletons      7,305     60.93         55,264    85.79
2-mention       2,126     17.73         4,825     7.49
3-mention       858       7.16          1,711     2.66
4-mention       479       4.00          869       1.35
5-mention       287       2.39          485       0.75
6-10-mention    567       4.73          903       1.40
>11-mention     367       3.06          364       0.57

Table 5.2: Distribution of mentions into entities in two corpora: the English ACE-2004 and the Spanish AnCora-Es

B3 (Bagga and Baldwin, 1998). To penalize clustering too many mentions in the same entity, this metric computes R and P for each mention, including singletons. The total number of intersecting mentions between the GOLD and SYS entities is computed and divided by the total number of mentions in the GOLD entity to obtain R, or in the SYS entity to obtain P. The average over the individual mention scores gives the final scores. Although B3 addresses the shortcomings of MUC, it presents a drawback in that scores squeeze up too high due to singletons: when many singletons are present, scores rapidly approach 100%. This leaves little numerical room for comparing systems, and forces one to consider differences in the second and third decimal places when scores are high (while such differences are meaninglessly small in lower ranges). It is not possible to observe this in Table 5.1, as the truth in Fig. 5.3 does not contain any singleton. However, it turns out that singletons are the largest group in real texts (see Table 5.2): about 86% of the entities if the entire set of mentions is considered, as in the AnCora corpora; and 61% of the entities in the ACE corpora, where the coreference annotation is restricted to seven semantic types (person, organization, geo-political entity, location, facility, vehicle, and weapon). A side effect is that B3 scores are inflated, obscuring the intuitively appropriate level of accuracy of a system in terms of coreference links.
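For comparison with the link-based sketch above, the following is a minimal sketch of the mention-based B3 computation, under the assumption of true mentions (GOLD and SYS contain exactly the same mentions); the names are illustrative only.

    def b_cubed(gold, system):
        """B3 (Bagga and Baldwin, 1998): per-mention precision and recall,
        averaged over all mentions. Partitions are lists of sets of mention ids."""
        def entity_of(partition, mention):
            return next(e for e in partition if mention in e)

        mentions = [m for entity in gold for m in entity]
        precision = recall = 0.0
        for m in mentions:
            g, s = entity_of(gold, m), entity_of(system, m)
            overlap = len(g & s)           # mentions shared by the two entities
            recall += overlap / len(g)     # against the gold entity size
            precision += overlap / len(s)  # against the system entity size
        precision /= len(mentions)
        recall /= len(mentions)
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f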

CEAF (Luo, 2005). Luo (2005) considers that B3 can give counterintuitive results due to the fact that an entity can be used more than once when aligning the entities in GOLD and SYS. In Fig. 5.3, B3-R is 100% for system response (c) even though the true set of entities has not been found; conversely, B3-P is 100% for system response (d) even though not all the SYS entities are correct. Thus, he proposes CEAF, which finds the best one-to-one mapping between the entities in GOLD and SYS, i.e., each SYS entity is aligned with at most one GOLD entity, and the best alignment is the one maximizing the similarity. Depending on the similarity function, Luo (2005) distinguishes between the mention-based CEAF and the entity-based CEAF, but we will focus on the former as it is the most widely used. It employs Luo's (2005) φ3 similarity function. When true mentions are used, the R and P scores are the same. They correspond to the number of common mentions between every two aligned entities divided by the total number of mentions.

CEAF, however, suffers from the singleton problem just as B3 does. This accounts for the fact that the B3 and CEAF scores are usually higher than the MUC score on corpora where singletons are annotated (e.g., ACE, AnCora), because a great percentage of the score is simply due to the resolution of singletons. In addition, CEAF's entity alignment might cause a correct coreference link to be ignored if its entity finds no alignment in GOLD (Denis and Baldridge, 2009). Finally, all entities are weighted equally, irrespective of the number of mentions they contain (Stoyanov et al., 2009), so that creating a wrong entity composed of two small entities is penalized to the same degree as creating a wrong entity composed of a small and a large entity.

ACE-value (Doddington et al., 2004). The ACE-value, the official metric in the ACE program, is very task-specific and not really useful for the general coreference problem, which is not limited to a set of specific semantic types. A score is computed by subtracting a normalized cost from 1. The normalized cost corresponds to the sum of the errors produced by unmapped and missing mentions/entities as well as wrong mentions/entities,3 normalized against the cost of a system that does not output any entity. Each error has an associated cost that depends on the type of ACE entity and on the kind of mention, but these costs have changed between successive evaluations. The ACE-value is hard to interpret (Luo, 2005): a score of 90% does not mean that 90% of the system entities or mentions are correct, but that the cost of the system, relative to the one producing no entity, is 10%.

Pairwise F1. Also known as positive-link-identification F-score. If reported, this metric is always included in addition to MUC, B3 and/or CEAF, as it is meant to give some further insight not provided by the other metrics (Choi and Cardie, 2007; Poon and Domingos, 2008; Haghighi and Klein, 2009). Pairwise F1 simply computes P, R, and F over all pairs of coreferent mentions. As noted by Haghighi and Klein (2009), merging or separating entities is over-penalized quadratically in the number of mentions. Besides, it ignores the correct identification of singletons.
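Pairwise F1 is nearly a one-liner once coreferent pairs are enumerated; the sketch below uses our own illustrative helper names and the same list-of-sets representation as the earlier sketches.

    from itertools import combinations

    def coreferent_pairs(partition):
        """All unordered pairs of mentions that corefer within a partition."""
        return {frozenset(pair) for entity in partition
                for pair in combinations(entity, 2)}

    def pairwise_f1(gold, system):
        g, s = coreferent_pairs(gold), coreferent_pairs(system)
        true_positives = len(g & s)
        p = true_positives / len(s) if s else 0.0
        r = true_positives / len(g) if g else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0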

Mutual information, H (Popescu-Belis, 2000). The H measure draws on information theory to evaluate coreference resolution. GOLD and SYS are seen as the two ends of a communication channel, GOLD being the sender or speaker, and SYS being the receiver or hearer. The coreference information of GOLD and SYS corresponds to the entropy of GOLD and SYS, respectively. The GOLD and SYS partitions are then compared on the basis of their mutual coreference information. R is obtained by subtracting the conditional entropy of GOLD given SYS (loss of information) from the entropy of GOLD. P is obtained by subtracting the conditional entropy of SYS given GOLD (irrelevant information gain) from the entropy of SYS. Both values are then normalized. This measure has hardly been used for reporting the results of real systems, and it emerges from the results reported by Popescu-Belis (2000) that H is not superior to the other existing measures. Popescu-Belis concludes that each metric, by focusing on different aspects of the data, provides a different perspective on the quality of the system answer.

3In the ACE evaluation program, mentions and entities in SYS that are not mapped onto any mention or entity in GOLD receive a false alarm penalty.

Rand index (Rand, 1971). The Rand index is a general clustering evaluation metric that measures the similarity between two clusterings (i.e., partitions) by considering how each pair of data points is assigned in each clustering. Stated in coreference terms, the Rand index equals the number of mention pairs that are either placed in the same entity or assigned to separate entities in both GOLD and SYS, normalized by the total number of mention pairs in each partition. The motivations behind this measure are three (where we replace 'point' by 'mention', 'cluster' by 'entity', and 'clustering' by 'entity partition'): (1) every mention is unequivocally assigned to a specific entity; (2) entities are defined just as much by those mentions which they do not contain as by those mentions which they do contain; and (3) all mentions are of equal importance in the determination of the entity partition. The only use of the Rand index for coreference resolution appears in Finkel and Manning (2008). Although Rand has the potential to capture the coreference problem well, it is not useful if applied as originally defined, due to the significant imbalance between the number of coreferent mentions and the number of singletons (Table 5.2). The extremely high number of mention pairs that are found in different entities in both GOLD and SYS explains the high figures obtained by all systems reported in Finkel and Manning (2008), and by system response (d) in Table 5.1. Hence the low discriminatory power of Rand. The BLANC measure that we introduce in Section 5.4 implements Rand in a way suited to the coreference problem.

It is often hard for researchers working on coreference resolution to make sense of the state of the art. Compare, for example, the scores shown in Table 5.3, which correspond to various systems4 and two baselines: (1) all singletons (i.e., no coreference link is created, but each mention is considered to be a separate entity), and (2) one entity (i.e., all document mentions are clustered into one single entity). The only measure for which we have the results of all systems is MUC, but this is the one with the largest number of drawbacks, as evidenced by the high score of the one-entity baseline. It is clear that the measures do not produce the same ranking of the systems, other than the fact that they all rank Luo et al. (2004) and Luo and Zitouni (2005) as the best systems for each data set. This sort of discrepancy makes it impossible in the long term to conduct research on this question: which measure should one trust, and why?

4Scores published here but missing in the original papers were computed by us from the authors' outputs.

System   MUC-F   B3-F   CEAF   ACE-value
ACE-2
All-singletons baseline — 55.9 38.8
One-entity baseline 76.5 17.3 21.7
Luo et al. (2004) 80.7 77.0 73.2 89.8
Finkel and Manning (2008) 64.1 73.8
Poon and Domingos (2008) 68.4 69.2 63.9
Denis and Baldridge (2009) 70.1 72.7 66.2
Ng (2009) 61.3 61.6
ACE-2004
All-singletons baseline — 59.0 41.8
One-entity baseline 74.4 17.8 21.4
Luo and Zitouni (2005) 86.0 83.7 82.0 91.6
Haghighi and Klein (2007) 63.3
Bengtson and Roth (2008) 75.8 80.8 75.0
Poon and Domingos (2008) 69.1 71.2 65.9
Wick and McCallum (2009) 70.1 81.5

Table 5.3: Performance of state-of-the-art coreference systems on ACE

Apart from the pros and cons of each measure, the difficulty in comparing the performance of different coreference resolution systems is compounded by other factors, such as the use of true or system mentions and the use of different test sets (Stoyanov et al., 2009). Some systems in Table 5.3 are not directly comparable, since testing on a different set of mentions or on a different data set is likely to affect scoring. Ng (2009) did not use true but system mentions, and Luo and Zitouni (2005) had access to the entire ACE-2004 formal test sets, while the remaining systems, due to licensing restrictions, were evaluated on only a portion of the ACE-2004 training set.

5.3.2 Desiderata for a coreference evaluation measure

Coreference is a type of clustering task, but it is special in that each item in a cluster bears the same relationship, referential identity, with all other items in the same cluster, plus the fact that a large number of clusters are singletons. Thus, only two of the four formal constraints for clustering evaluation metrics pointed out by Amigó et al. (2009) apply to coreference. The formal constraints of Amigó et al. (2009) are: (1) cluster homogeneity, i.e., clusters should not mix items belonging to different categories; (2) cluster completeness, i.e., items belonging to the same category should be grouped in the same cluster; (3) rag bag, i.e., it is preferable to have clean clusters plus a cluster with miscellaneous items over having clusters with a dominant category plus additional noise; and (4) cluster size versus quantity, i.e., a small error in a large cluster is preferable to a large number of small errors in small clusters. While the first two constraints undoubtedly hold for coreference resolution, the last two do not necessarily.

What makes coreference resolution special with respect to other clustering tasks is the propagation of relations within an entity caused by the transitive property of coreference. That is to say, unlike regular clustering, where assigning a new item to a cluster is a mere question of classifying that item into a specific category, in coreference resolution assigning a new mention to an entity implies that the mention is coreferent with all the other mentions that have been assigned to that same entity. Thus, the larger an entity is, the more coreferent links will be asserted for each new mention that is added.

GOLD = {{Barack Obama, the president, Obama}, {Sarkozy}, {Berlin}, {the UN}, {today}}
S1 = {{Barack Obama, the president, Obama, Sarkozy}, {Berlin}, {the UN}, {today}}
S2 = {{Barack Obama, the president, Obama}, {Sarkozy, Berlin, the UN, today}}

Figure 5.4: An example not satisfying constraint (3): The output S2 with a rag-bag cluster is equally preferable to S1.

GOLD = {{Barack Obama, the president, Obama}, {the French capital, Paris}, {the Democrats, the Democrats}}
S1 = {{Barack Obama, the president, Obama}, {the French capital}, {Paris}, {the Democrats}, {the Democrats}}
S2 = {{Barack Obama, the president}, {Obama}, {the French capital, Paris}, {the Democrats, the Democrats}}

Figure 5.5: An example not satisfying constraint (4): The output S2 with a small error in a large cluster is equally preferable to S1.

To illustrate: to us, given the GOLD in Fig. 5.4, the output produced by system S2 is not better than that produced by system S1, as would follow from constraint (3). In fact, if the rag-bag entity contained more singletons, including an additional wrong singleton would make S2 even worse than S1. Similarly, in Fig. 5.5, S2 is not better than S1, as constraint (4) suggests.

Amigó et al. (2009) show that whereas B3 satisfies all four constraints, measures based on counting pairs, such as the Rand index, satisfy only constraints (1) and (2). This is a reason why Rand is a good starting point for developing the BLANC measure for coreference resolution in Section 5.4. As described in Section 5.3.1, the three most important points that remain unsolved by the current coreference metrics are:

1. Singletons. Since including a mention in the wrong chain hurts P, a correct decision to NOT link a mention should be rewarded as well. Rewarding correctly identified singletons, however, needs to be moderate, leaving enough margin for the analysis of correctly identified multi-mention entities.

2. Boundary cases. Special attention needs to be paid to the behavior of the evaluation measure when a system outputs (1) all singletons, or (2) one entity (i.e., all mentions are linked).

3. Number of mentions. The longer the entity chain, the more coreferent mentions it contains, each mention inheriting the information predicated of the other mentions. Thus, a correct large entity should be rewarded more than a correct small entity, and a wrong large entity should be penalized more than a wrong small entity.

We suggest that a good coreference evaluation measure should conform to the following desiderata:

1. Range from 0 for poor performance to 1 for perfect performance.

2. Be monotonic: Solutions that are obviously better should obtain higher scores.

3. Reward P more than R: Stating that two mentions are coreferent when they are not is more harmful than missing a correct coreference link.5 Hence, the score should move closer to 1 as

• more correct coreference links are found,
• more correct singletons are found,
• fewer wrong coreference links are made.

4. Provide sufficiently fine scoring granularity to allow detailed discrimination between systems across the whole range [0, 1].

5. As nearly as possible, maintain the same degree of scoring granularity throughout the whole range [0, 1].

5.4 BLANC: BiLateral Assessment of Noun-phrase Coreference

In order to facilitate future research, we propose BLANC, a measure obtained by applying the Rand index (Rand, 1971) to coreference and taking into account the above-mentioned problems and desiderata. The class-based methods suffer from the essential problem that they reward each link to a class equally no matter how large the class is; assigning a mention to a small class is scored the same as assigning it to a large one. But in principle, assigning it to a large one involves making a larger number of pairwise decisions, each of which is equally important. Also, correctly identified singletons are rewarded like correct full multi-mention entities. In addition, the MUC metric suffers from the essential problem that it does not explicitly reward correctly identified singletons, yet penalizes singletons when they are incorrectly included as part of a chain, while it is too lenient in penalizing wrong coreference links.

5Although this is debatable, as it might depend on the application for which the coreference output is used, it is a widespread belief among researchers that P matters more than R in coreference resolution.

5.4.1 Implementing the Rand index for coreference evaluation

From what has been said in Section 5.3, the Rand index seems to be especially adequate for evaluating coreference since it allows us to measure 'non-coreference' as well as coreference links. This makes it possible to correctly handle singletons as well as to reward correct coreference chains commensurately with their length.6 The interesting property of implementing Rand for coreference is that the sum of all coreference and non-coreference links together is constant for a given set of N mentions, namely the triangular number N(N − 1)/2. By interpreting a system's output as linking each mention to all other mentions as either coreferent or non-coreferent, we can observe the relative distributions within this constant total of coreference and non-coreference links against the gold standard.
The Rand index (5.1) uses N00 (i.e., the number of mention pairs that are in the same entity in both GOLD and SYS) and N11 (i.e., the number of mention pairs that are in different entities in both GOLD and SYS) as agreement indicators between the two partitions GOLD and SYS. The value of Rand lies between 0 and 1, with 0 indicating that the two partitions do not agree on any pair of mentions and 1 indicating that the partitions are identical.

Rand = (N00 + N11) / (N(N − 1)/2)    (5.1)
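Equation (5.1) can be computed directly by enumerating mention pairs. The following sketch assumes both partitions are lists of sets over the same mentions; the helper names are ours.

    from itertools import combinations

    def rand_index(gold, system):
        """The Rand index of equation (5.1): the proportion of mention pairs on
        which GOLD and SYS agree, either as coreferent (N00) or as
        non-coreferent (N11), out of all N(N-1)/2 pairs."""
        def entity_ids(partition):
            return {m: i for i, entity in enumerate(partition) for m in entity}

        g, s = entity_ids(gold), entity_ids(system)
        n00 = n11 = 0
        for m1, m2 in combinations(list(g), 2):
            same_gold, same_sys = g[m1] == g[m2], s[m1] == s[m2]
            if same_gold and same_sys:
                n00 += 1
            elif not same_gold and not same_sys:
                n11 += 1
        n = len(g)
        return (n00 + n11) / (n * (n - 1) / 2)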

BLANC borrows the 'bilateral' nature of Rand to take into consideration both coreference links (N00) and non-coreference links (N11), but modifies it such that every decision of coreferentiality is assigned equal importance. Thus, BLANC models coreference resolution better by addressing the significant imbalance between the number of coreferent mentions and singletons observed in real data. Further, whereas class-based metrics need to address the fact that GOLD and SYS might not contain the same number of entities, and the MUC metric focuses on comparing a possibly unequal number of coreference links, BLANC is grounded in the fact that the total number of links remains constant across GOLD and SYS.

5.4.1.1 Coreference and non-coreference links

BLANC is best explained considering two kinds of decisions:

6We define a non-coreference link to hold between every two mentions that are deemed to NOT corefer.


                              SYS
                       Coreference   Non-coreference   Sums
GOLD  Coreference          rc              wn          rc + wn
      Non-coreference      wc              rn          wc + rn
      Sums               rc + wc         wn + rn          L

Table 5.4: The BLANC confusion matrix

1. The coreference decisions (made by the coreference system)

(a) A coreference link (c) holds between every two mentions that corefer.
(b) A non-coreference link (n) holds between every two mentions that do not corefer.

2. The correctness decisions (made by the evaluator)

(a) A right link (r) has the same value (coreference or non-coreference) in GOLD and SYS (i.e., when the system is correct).
(b) A wrong link (w) does not have the same value (coreference or non-coreference) in GOLD and SYS (i.e., when the system is wrong).

Table 5.4 shows the 2×2 confusion matrix obtained by contrasting the system's coreference decisions against the gold standard decisions. All cells outside the diagonal contain errors of one class being mistaken for the other. BLANC resembles Pairwise F1 as far as coreference links are concerned, but it adds the dimension of non-coreference links.
Let N be the total number of mentions in a document d, and let L be the total number of mention pairs (i.e., pairwise links) in d, including both coreference and non-coreference links; then

L = N(N − 1)/2

The total number of links in the SYS partition of d is the sum of the four possible types of links, and it equals L:

rc + wc + rn + wn = L

where rc is the number of right coreference links, wc the number of wrong coreference links, rn the number of right non-coreference links, and wn the number of wrong non-coreference links.

The confusion matrix for the example in Fig. 5.1 is shown in Table 5.5. As the text has fourteen mentions, the total number of links is ninety-one. The system correctly identifies two coreference links (m5–m12, m7–m9), and wrongly identifies another three (m4–m6, m7–m14, m9–m14). Every right coreference link that is missed by the system necessarily produces a wrong non-coreference link (m5–m14, m12–m14). The rest are eighty-four right non-coreference links. The confusion matrix shows the balance between coreference and non-coreference links with respect to the gold partition.

                              SYS
                       Coreference   Non-coreference   Sums
GOLD  Coreference           2               2             4
      Non-coreference       3              84            87
      Sums                  5              86            91

Table 5.5: The BLANC confusion matrix for the example in Fig. 5.1

The singleton problem pointed out in Section 5.3 becomes evident in Table 5.5: the number of non-coreference links is much higher than the number of coreference links. Because of this class imbalance, if the Rand index is applied as originally defined by Rand (1971), it concentrates in a small interval near 1 with hardly any discriminatory power. A chance-corrected Rand index has been proposed (Hubert and Arabie, 1985), but it is of no use for the coreference problem, given that the computation of expectation only depends on the number of pairs in the same cluster, thus ignoring singletons.

In order to take into account the under-representation of coreference links in the final BLANC score, we compute P, R, and F separately for the two types of link (coreference and non-coreference) and then average them for the final score. The definition of BLANC is shown in Table 5.6. In BLANC, both coreference and non-coreference links contribute to the final score, but neither contributes more than 50%. BLANC-P and BLANC-R correspond to the average of the two P and R scores, respectively. The final BLANC score corresponds to the average of the two F-scores. In applying the Rand index, the novelty of BLANC resides in putting equal emphasis on coreference and non-coreference links.

Score   Coreference                Non-coreference
P       Pc = rc / (rc + wc)        Pn = rn / (rn + wn)        BLANC-P = (Pc + Pn) / 2
R       Rc = rc / (rc + wn)        Rn = rn / (rn + wc)        BLANC-R = (Rc + Rn) / 2
F       Fc = 2PcRc / (Pc + Rc)     Fn = 2PnRn / (Pn + Rn)     BLANC = (Fc + Fn) / 2

Table 5.6: Definition of BLANC

Table 5.7 shows the different measures under discussion for the example in Fig. 5.1.


MUC-F   B3-F    CEAF    BLANC
57.14   86.76   85.71   70.78

Table 5.7: Performance of the example in Fig. 5.1
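The following sketch shows how the four link counts and the formulas of Table 5.6 can be computed from two partitions over the same mentions; the function names are illustrative, and the boundary cases of Section 5.4.1.2 are not handled here. For the counts of Table 5.5, it reproduces the BLANC value of 70.78 reported in Table 5.7.

    from itertools import combinations

    def blanc_link_counts(gold, system):
        """Derive (rc, wc, rn, wn) from two partitions over the same mentions."""
        def entity_ids(partition):
            return {m: i for i, entity in enumerate(partition) for m in entity}

        g, s = entity_ids(gold), entity_ids(system)
        rc = wc = rn = wn = 0
        for m1, m2 in combinations(list(g), 2):
            gold_coref, sys_coref = g[m1] == g[m2], s[m1] == s[m2]
            if sys_coref and gold_coref:
                rc += 1          # right coreference link
            elif sys_coref:
                wc += 1          # wrong coreference link
            elif gold_coref:
                wn += 1          # wrong non-coreference link
            else:
                rn += 1          # right non-coreference link
        return rc, wc, rn, wn

    def blanc(rc, wc, rn, wn):
        """The default (unweighted) BLANC of Table 5.6."""
        def f_score(p, r):
            return 2 * p * r / (p + r) if p + r else 0.0

        p_c = rc / (rc + wc) if rc + wc else 0.0
        r_c = rc / (rc + wn) if rc + wn else 0.0
        p_n = rn / (rn + wn) if rn + wn else 0.0
        r_n = rn / (rn + wc) if rn + wc else 0.0
        return (f_score(p_c, r_c) + f_score(p_n, r_n)) / 2

    # For the counts of Table 5.5, blanc(2, 3, 84, 2) returns ~0.7078,
    # matching the BLANC value of 70.78 in Table 5.7.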

5.4.1.2 Boundary cases

In boundary cases (when, for example, SYS or GOLD contains only singletons or only a single set), either Pc or Pn and/or either Rc or Rn is undefined, as one or more denominators will be 0. For these cases we define small variations of the general formula for BLANC shown in Table 5.6.

• If SYS contains a single entity, then it only produces coreference links. If GOLD coincides with SYS, BLANC scores equal 100. If GOLD is fully the dual (i.e., it contains only singletons), BLANC scores equal 0. Finally, if GOLD contains links of both types, Pn, Rn, and Fn equal 0.

• If SYS contains only singletons, then it only produces non-coreference links. If GOLD coincides with SYS, BLANC scores equal 100. If GOLD is fully the dual (i.e., it contains a single entity), BLANC scores equal 0. Finally, if GOLD contains links of both types, Pc, Rc, and Fc equal 0.

• If GOLD includes links of both types but SYS contains no right coreference link, then Pc, Rc, and Fc equal 0. Instead, if SYS contains no right non-coreference link, then Pn, Rn, and Fn equal 0.

• If SYS contains links of both types but GOLD contains a single entity, BLANC scores equal Pc, Rc, and Fc. Instead, if GOLD contains only singletons, BLANC scores equal Pn, Rn, and Fn.
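Assuming the blanc_link_counts helper sketched earlier is available, the boundary-case rules above can be folded into the score roughly as follows; this is a sketch of our reading of the rules, not a reference implementation.

    def blanc_with_boundary_cases(gold, system):
        """If GOLD contains links of only one type, the score is the F of that
        link type alone; otherwise the two F-scores are averaged, treating an
        undefined F (no right links of that type) as 0."""
        rc, wc, rn, wn = blanc_link_counts(gold, system)

        def f_of(right, p_errors, r_errors):
            if right == 0:
                return 0.0
            p = right / (right + p_errors)
            r = right / (right + r_errors)
            return 2 * p * r / (p + r)

        f_c = f_of(rc, wc, wn)   # F over coreference links
        f_n = f_of(rn, wn, wc)   # F over non-coreference links
        gold_has_coref = rc + wn > 0
        gold_has_noncoref = rn + wc > 0
        if gold_has_coref and not gold_has_noncoref:
            return f_c           # GOLD is a single entity
        if gold_has_noncoref and not gold_has_coref:
            return f_n           # GOLD contains only singletons
        return (f_c + f_n) / 2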

A near-boundary case reveals the main weakness of BLANC. This is the case in which all links but one are non-coreference links and the system outputs only non-coreference links. The fact that BLANC places as much importance on the one coreference link as on all the remaining links together then leads to an overly severe penalization: the BLANC score will never be higher than 50. One can either simply accept this as a quirk of BLANC or, following the β parameter used in the F-score, introduce a parameter that enables the user to change the relative weights given to coreference and non-coreference links. We provide details in the following section.

5.4.1.3 The α parameter

After analyzing several coreferentially annotated corpora, we found that the average text contains between 60% and 80% singletons (depending on the coding scheme). Thus, simply averaging the coreference and non-coreference scores seems to be the best decision. However, given extraordinary cases like the one presented at the end of Section 5.4.1.2, or for those researchers who consider it convenient, we present the weighted version of BLANC:

BLANCα = αFc + (1 − α)Fn

BLANCα lets users choose the weights they want to put on coreference and non-coreference links. In the default version of BLANC (Table 5.6), α=0.5. Setting α closer to 1 will give a larger weight to coreference links, while setting α closer to 0 will have the opposite effect. For the problematic near-boundary case in which all links but one are non-coreferent in GOLD, evaluating with BLANCα=0.1 will be much less severe than evaluating with the default BLANC.
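The weighted variant is a direct transcription of the formula; in the sketch below, f_c and f_n are assumed to be the two F-scores of Table 5.6, and alpha=0.5 recovers the default BLANC.

    def blanc_alpha(f_c, f_n, alpha=0.5):
        """Weighted BLANC: alpha weights the coreference F-score and
        (1 - alpha) the non-coreference F-score."""
        return alpha * f_c + (1 - alpha) * f_n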

5.4.2 Identification of mentions

An additional drawback that has been pointed out for class-based metrics like B3 and CEAF is their assumption of working with true mentions, ignoring the problem of evaluating end-to-end systems, where some mentions in SYS might not be correct, i.e., might not be mapped onto any mention in GOLD, and vice versa. These are called 'twinless' mentions by Stoyanov et al. (2009). Bengtson and Roth (2008) simply discard twinless mentions, and Rahman and Ng (2009) limit themselves to removing only those twinless system mentions that are singletons, as in these cases no penalty should be applied. Recently, Cai and Strube (2010) have proposed two variants of B3 and CEAF that put twinless gold mentions into SYS as singletons and discard singleton twinless system mentions. To calculate P, wrongly resolved twinless system mentions are put into GOLD; to calculate R, only the gold entities are considered.
We agree that proper evaluation of a coreference system should take into account true versus system mentions. However, the mention identification task strictly belongs to syntax, as it is closely related to the problem of identifying noun-phrase boundaries, followed by a filtering step in which only referential noun phrases are retained. It is clearly distinct from coreference resolution, whose goal is to link those noun phrases that refer to the same entity. A single metric giving the overall result for the two tasks together is obscure in that it is not informative as to whether a system is very good at identifying coreference links but poor at identifying mention boundaries, or vice versa. Therefore, instead of merging the two tasks, we propose to consider mention identification as its own task and separate its evaluation from that of coreference resolution (Popescu-Belis et al., 2004). In brief, a measure for each problem is as follows:

• Mention identification. This evaluation computes the correctness of the mentions that are being resolved, regardless of the structure of coreference links. Standard P and R are computed to compare the sets of mentions of GOLD and SYS. P is defined as the number of common mentions between GOLD and SYS divided by the number of system mentions; R is defined as the number of common mentions between GOLD and SYS divided by the number of true mentions. Two versions of the matching module are possible:


– Strict matching. A system mention is considered to be correctly identified when it exactly matches the corresponding gold mention.
– Lenient matching. A system mention is considered to be correctly identified when it matches at least the head of the corresponding gold mention (and does not include any tokens outside the gold mention).7

• Correctness of coreference. This evaluation computes the correctness of the coreference links predicted between the mentions shared by GOLD and SYS. The BLANC measure is applied to this set of correctly recognized mentions.

In this way, it might be possible to improve under-performing systems by combining, for instance, the strengths of a system that obtains a high coreference score but a low mention-identification score with the strengths of a system that performs badly in coreference resolution but successfully in the identification of mentions. Similarly, one should not be led to believe that improving the set of coreference features will necessarily result in higher scores, as the system's mention-identification score might reveal that the underlying problem is a poor detection of true mentions.
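A sketch of the strict-matching mention identification score follows, assuming mentions are represented as (start, end) token spans; lenient matching would relax the exact-span test to head containment within the gold boundaries. The names are illustrative.

    def mention_identification_strict(gold_mentions, system_mentions):
        """Strict matching: a system mention counts as correct only if its
        (start, end) span occurs in the gold mention set."""
        gold_set, sys_set = set(gold_mentions), set(system_mentions)
        correct = len(gold_set & sys_set)
        p = correct / len(sys_set) if sys_set else 0.0
        r = correct / len(gold_set) if gold_set else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f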

5.5 Discriminative power

This section empirically demonstrates the power of BLANC by comparing its scores with those of MUC, B3, CEAF, and the Rand index on both artificial and real gold/system partitions. The insight provided by BLANC is free of the problems noted in Section 5.3. This being said, we need to draw attention to the difficulty of agreeing on what 'correctness' means in coreference resolution. People's intuitions about the extreme boundary cases largely coincide, but those about intermediate cases, which are harder to evaluate, might differ considerably due to the complex trade-off between P and R. Thus, the discussion that follows is based on what we believe to be the best ranking of system responses according to our intuitions and to our experience in coreference annotation and resolution.

5.5.1 Results on artificial data

We take the gold partition in the first row of Table 5.8 as a working example. It is representative of a real case: it contains seventy mentions, 95% singleton entities, a two-mention entity, a three-mention entity, and a four-mention entity. Each number represents a different mention; parentheses identify entities (i.e., they group mentions that corefer); and multi-mention entities are highlighted in bold. Table 5.8 also contains eight sample responses—output by different hypothetical coreference resolution systems—that contain different types of errors. See the decomposition into BLANC's four types of links in Table 5.9, a quantitative representation of the quality of the systems given in Table 5.8.

7Lenient matching is equivalent to the MIN attribute used in the MUC guidelines (Hirschman and Chinchor, 1997) to indicate the minimum string that the system under evaluation must include.


Response Output

Gold1 (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (28) (29) (30) (31) (32) (33) (34) (35) (36) (37) (38) (39) (40) (41) (42) (43) (44) (45) (46) (47) (48) (49) (50) (51) (52) (53) (54) (55) (56) (57) (58) (59) (60) (61) (62,63,64,65) (66,67,68) (69,70)

System A (1,2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (28) (29) (30) (31) (32) (33) (34) (35) (36) (37) (38) (39) (40) (41) (42) (43) (44) (45) (46) (47) (48) (49) (50) (51) (52) (53) (54) (55) (56) (57) (58) (59) (60) (61) (62,63,64,65) (66,67,68) (69,70)

System B (1,62,63,64,65) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (28) (29) (30) (31) (32) (33) (34) (35) (36) (37) (38) (39) (40) (41) (42) (43) (44) (45) (46) (47) (48) (49) (50) (51) (52) (53) (54) (55) (56) (57) (58) (59) (60) (61) (66,67,68) (69,70)

System C (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (28) (29) (30) (31) (32) (33) (34) (35) (36) (37) (38) (39) (40) (41) (42) (43) (44) (45) (46) (47) (48) (49) (50) (51) (52) (53) (54) (55) (56) (57) (58) (59) (60) (61) (62,63,64,65) (66) (67) (68) (69,70)

System D (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (28) (29) (30) (31) (32) (33) (34) (35) (36) (37) (38) (39) (40) (41) (42) (43) (44) (45) (46) (47) (48) (49) (50) (51) (52) (53) (54) (55) (56) (57) (58) (59) (60) (61) (62,63,64,65,66,67,68) (69,70)

System E (1,62,63) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (28,64,65) (29) (30) (31) (32) (33) (34) (35) (36) (37) (38) (39) (40) (41) (42) (43) (44) (45) (46) (47) (48) (49) (50) (51) (52) (53) (54) (55) (56) (57) (58) (59) (60) (61) (66,67,68) (69,70)

System F (1,62) (2) (3) (4,63) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (28,64) (29) (30) (31) (32) (33) (34) (35) (36) (37) (38) (39) (40) (41) (42) (43) (44) (45) (46) (47) (48) (49) (50) (51) (52) (53) (54) (55) (56) (57,65) (58) (59) (60) (61) (66,67,68) (69,70)

System G All singletons

System H One entity

Table 5.8: Different system responses for a gold standard Gold1


System   #entities   #singletons   rc    rn      wc      wn
A        63          59            10    2,404   1       0
B        63          60            10    2,401   4       0
C        66          64            7     2,405   0       3
D        63          61            10    2,393   12      0
E        63          59            6     2,401   4       4
F        63          57            4     2,401   4       6
G        70          70            0     2,405   0       10
H        1           0             10    0       2,405   0

Table 5.9: Decomposition of the system responses in Table 5.8

System   MUC-F   B3-F    CEAF    RAND    BLANC
A        92.31   99.28   98.57   99.96   97.61
B        92.31   98.84   98.57   99.83   91.63
C        80.00   98.55   97.14   99.88   91.15
D        92.31   97.49   95.71   99.50   81.12
E        76.92   96.66   95.71   99.67   79.92
F        46.15   94.99   94.29   99.59   72.12
G        —       95.52   91.43   99.59   49.90
H        16.00   3.61    5.71    0.41    0.41

Table 5.10: Performance of the systems in Table 5.8

The responses are ranked in order of quality, from the most accurate response to the least (response A is better than response B, B is better than C, and so on, according to our intuitions8). System A commits only one P error by linking two non-coreferent mentions; system B looks similar to A but is worse in that a singleton is clustered in a four-mention entity, thus producing not one but four P errors. System C exhibits no P errors but is weak in terms of R, as it fails to identify a three-mention entity. Although system D is clean in terms of R, it suffers from a severe P problem due to the fusion of the three- and four-mention entities into one large entity. System E is worse than the previous responses in that it shows both P and R errors: the four-mention entity is split into two and a singleton is added to both of them. System F worsens the previous output by completely failing to identify the four-mention entity and creating four incorrect two-mention entities. Finally, systems G and H represent the two boundary cases, the former being preferable to the latter, since at least it gets the large number of singletons right, while the latter has a serious problem in P. The performance of these system responses according to different measures is given in Tables 5.10 and 5.11.

8Readers and reviewers of this section frequently comment that this ranking is not clearly apparent; other variations seem equally good. We concede this readily. We argue that in cases when several rankings seem intuitively equivalent to people, one can accept the ranking of a metric, as long as it assigns relatively close scores to the equivalent cases.


         MUC               B3                CEAF     BLANC
System   P        R        P        R        P/R      P       R
A        85.71    100.00   98.57    100.00   98.57    95.45   99.98
B        85.71    100.00   97.71    100.00   98.57    85.71   99.92
C        100.00   66.67    100.00   97.14    97.14    99.94   85.00
D        85.71    100.00   95.10    100.00   95.71    72.73   99.75
E        71.43    83.33    96.19    97.14    95.71    79.92   79.92
F        42.86    50.00    94.29    95.71    94.29    74.88   69.92
G        —        —        100.00   91.43    91.43    49.79   50.00
H        8.70     100.00   1.84     100.00   5.71     0.21    50.00

Table 5.11: P and R scores for the systems in Table 5.8

In Tables 5.10 and 5.11, we can see how BLANC addresses the three problems noted in Section 5.3.2:

1. Singletons. The BLANC score decreases as the response quality decreases. It successfully captures the desired ranking, as does CEAF (although with fewer distinctions; see the 'number of mentions' problem below), and so does B3 if we leave aside the boundary responses G and H. BLANC, however, shows a much wider interval (from 97.61% to 49.90%) than CEAF (from 98.57% to 91.43%) and B3 (from 99.28% to 94.99%), thus providing a larger margin of variation and a finer granularity. The singleton problem is solved by rewarding the total number of correct singletons as much as the total number of correct mentions in multi-mention entities. Note that the original Rand index makes it impossible to discriminate between systems and does not even rank them as intuitively expected.

2. Boundary cases. MUC fails to capture the fact that the all-singletons response G is better than the one-entity response H. On the other hand, B3 and CEAF give a score close to 0% for H, yet close to 100% for G. It is counterintuitive that a coreference resolution system that outputs as many entities as mentions—meaning that it is doing nothing—gets such a high score. BLANC successfully handles the boundary responses by setting an upper bound of 50% on R.

3. Number of mentions. The fact that MUC and CEAF give the same score to responses A and B shows their failure to distinguish that the latter is more harmful than the former, as it creates more false coreference links. Namely, the information predicated of mention 1 is extended to mentions 62, 63, 64, and 65, and reciprocally mention 1 gets all the information predicated of mentions 62, 63, 64, and 65. Similarly, CEAF does not distinguish response D from E. In contrast, BLANC can discriminate between these responses because its reward of multi-mention entities is correlated with the number of coreference links contained in them.


Response Output

Gold2      (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17,18)
System A   (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18)
System B   (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16,17) (18)
System C   (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15,16) (17) (18)
System D   (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15,16) (17,18)

Table 5.12: Different system responses for a gold standard Gold2

System   MUC-F   B3-F    CEAF    BLANCα=0.5   BLANCα=0.2   BLANCα=0.1
A        —       97.14   94.44   49.84        79.74        89.70
B        0.00    94.44   94.44   49.67        79.47        89.41
C        0.00    94.44   88.89   49.67        79.47        89.41
D        66.67   97.14   94.44   83.17        93.07        96.37

Table 5.13: Performance for the systems in Table 5.12

The constructed example in Table 5.12 serves to illustrate BLANC's major weakness, which we discussed at the end of Section 5.4.1.2. Performance is presented in Table 5.13. Notice the enormous leap between the BLANCα=0.5 score for system D and the other three. This is due to the fact that partitions A, B, and C contain no right coreference link, and so BLANC is equal to the correctness of non-coreference links divided by two. The α parameter introduced in Section 5.4.1.3 is especially adequate for this type of case. The difference between the scores for D and the rest of the systems diminishes when α=0.2 or α=0.1 (the last two columns).
This same example, in fact, reveals weaknesses of all the measures. Owing to the fact that the MUC score does not reward correctly identified singletons, it is not able to score the first three responses, thus showing an even larger rise for response D. The B3 and CEAF measures score responses A and D the same, but only the latter succeeds in identifying the only coreference link that exists in the truth—a very relevant fact given that the ultimate goal of a coreference resolution system is not outputting only singletons (as system A does) but solving coreference. Finally, it is puzzling that CEAF considers response B to be appreciably better than response C—they are scored the same by B3 and BLANC. This is a weakness due to CEAF's one-to-one alignment: in B, the three final entities find a counterpart in the gold standard, whereas in C, only one of the two final entities gets mapped.

5.5.2 Results on real data

In order not to reach conclusions solely derived from constructed toy examples, we run a prototype learning-based coreference resolution system—inspired by Soon et al. (2001), Ng and Cardie (2002b), and Luo et al. (2004)—on 33 documents of the ACE-2004 corpus. A total of five different resolution models are tried to enable a richer analysis and comparison between the different evaluation metrics.


Resolution model             MUC-F   B3-F    CEAF    BLANC
A. All-singletons baseline   —       67.51   50.96   48.61
B. Head-match baseline       52.93   76.60   66.46   66.35
C. Strong match              64.69   75.56   70.63   73.76
D. Best match                61.60   76.76   69.19   71.98
E. Weak match                70.34   70.24   64.00   66.50

Table 5.14: Different coreference resolution models run on ACE-2004

Resolution model             #entities   #singletons   rc      rn       wc      wn
A. All-singletons baseline   1,464       1,464         0       39,672   0       2,272
B. Head-match baseline       1,124       921           506     39,560   112     1,766
C. Strong match              735         400           1,058   38,783   889     1,214
D. Best match                867         577           870     39,069   603     1,402
E. Weak match                550         347           1,757   34,919   4,753   515

Table 5.15: Decomposition of the system responses in Table 5.14

The results are presented in Table 5.14. For a detailed analysis, we refer the reader to Recasens and Hovy (2010). The first two models are baselines that involve no learning: model A is the all-singletons baseline, and B clusters into the same entity all the mentions that share the same head. In C, D, and E, a pairwise coreference classifier is learnt (i.e., given two mentions, it classifies them as either coreferent or non-coreferent). In C and D, whenever the classifier considers two mentions to be coreferent and one of them has already been clustered in a multi-mention entity, the new mention is only clustered into that same entity if the pairwise classifications with all the other mentions of the entity are also coreferent. The difference between C and D lies in the initial mention pairs that form the basis for the subsequent process: C takes the first mention in textual order that is classified as coreferent with the mention under consideration, while D takes the mention that shows the highest confidence among the previous ones. E is a simplified version of C that performs no additional pairwise checks.
The best way to judge the quality of each response is to look at the actual data, but space limitations make this impossible. However, we can gain an approximation by looking at Table 5.15, which shows the number of entities output by each system and how many of them are singletons, as well as the number of correct and incorrect links of each type. Note that high numbers in the wc column indicate poor P, whereas high numbers in the wn column indicate poor R. Although the trade-off between P and R makes it hard to reach a conclusion as to whether C or D should be ranked first, the low quality of A, and especially E, is an easier conclusion to reach. The head-match baseline achieves high P but low R.
If we go back to Table 5.14, we can see that no two measures produce the same ranking of systems.


System   MUC   B3   CEAF   ACE-value   BLANC
ACE-2
All-singletons baseline — 55.9 38.8 47.8
One-entity baseline 76.5 17.3 21.7 7.8
Luo et al. (2004) 80.7 77.0 73.2 89.8 77.2
ACE-2004
All-singletons baseline — 59.0 41.8 48.1
One-entity baseline 74.4 17.8 21.4 7.0
Luo and Zitouni (2005) 86.0 83.7 82.0 91.6 81.4
Bengtson and Roth (2008) 75.8 80.8 75.0 75.6

Table 5.16: Performance of state-of-the-art systems on ACE according to BLANC

The severe problems behind the MUC score are again manifested: it ranks model E first because it identifies a high number of coreference links, despite containing many incorrect ones. This model produces an output that is not satisfactory because it tends to overmerge. The fact that B3 ranks D and B first indicates its focus on P rather than R. Thus, B3 tends to score best those models that are more conservative and that output a large number of singletons. Finally, CEAF and BLANC agree in ranking C the best. An analysis of the data also supports the idea that strong match achieves the best trade-off between P and R.
Similar problems with the currently used evaluation metrics were also shown by the six systems that participated in the SemEval-2010 Task 1 on 'Coreference Resolution in Multiple Languages' (Recasens et al., 2010b), where the BLANC measure was publicly used for the first time. Unlike ACE, mentions were not restricted to any semantic type, and the B3 and CEAF scores for the all-singletons baseline were hard to beat even by the highest performing systems. The BLANC scores, in contrast, tended to stay low regardless of the number of singletons in the corpus. However, it was not possible to draw definite conclusions from the SemEval shared task because each measure ranked the participating systems in a different order.
Finally, in Table 5.16 we reproduce Table 5.3, adding the BLANC score for the performance of state-of-the-art systems and for the all-singletons and one-entity baselines. We can only include the results for those systems whose output responses were provided to us by the authors. It is worth noting that BLANC is closer to B3 on the ACE-2 corpus but closer to CEAF on the ACE-2004 corpus, which is probably due to the different distribution of singletons and multi-mention entities in each corpus. Knowing the state of the art in terms of BLANC will enable future researchers on coreference resolution to compare their performance against these results.


Figure 5.6: The BLANC score curve as the number of right coreference links increases

Figure 5.7: The BLANC score surface as a function of right coreference and right non-coreference links, for data from Table 5.8

5.5.3 Plots

A graph plotting the BLANC slope as the percentage of correct coreference links (rc) increases is depicted in Fig. 5.6, where the slopes of B3, CEAF, and MUC are also plotted. The curve slope for BLANC gradually increases, and stays between the other measures, higher than MUC but lower than B3 and CEAF, which show an almost flat straight line. The 'pinching' of scores close to 100% by B3 and CEAF is clearly apparent. A coreference resolution system can obtain very high B3 and CEAF scores (due to the high number of singletons that are present in the gold partition), leaving too small a margin for the evaluation of coreference proper.
We illustrate in Fig. 5.7 the dependency of the BLANC score on degrees of coreference and non-coreference. Fig. 5.7 plots the scores for the example in Table 5.8. The left rear face of the cube—where the right non-coreference (i.e., rn) level is a constant 1 and right coreference (rc) ranges from zero to 1—displays the BLANC curve from Fig. 5.6. The front face of the cube shows how—for a constant right coreference of 1—the BLANC score ranges from near zero to 0.5 as right non-coreference ranges from zero to 1. The bend in the surface occurs due to the asymmetry in the number of true coreference links: the smaller the proportion of coreference links to non-coreference links, the sharper the bend and the closer it is to the left face. Systems must achieve correctness of almost all coreference and non-coreference links to approach the steep curve.

5.6 Conclusion

This article seeks to shed light on the problem of coreference resolution evaluation by providing desiderata for coreference evaluation measures, pointing out the strong and weak points of the main measures that have been used, and proposing the BLANC metric, an implementation of the Rand index for coreference, to provide some further insight into a system's performance. The decomposition into four types of links gives an informative analysis of a system. BLANC fulfills the five desiderata and addresses to some degree the reported shortcomings of the existing measures. Despite its own shortcomings, discussed in Sections 5.4.1.2 and 5.5.1, it overcomes the problem of singletons, which we illustrate here for the first time.
The simplicity of the BLANC measure derives from the fact that the sum of the coreference and non-coreference links in the gold and system partitions is the same. Unlike the Rand index, BLANC is the average of two F-scores, one for the coreference links and the other for the non-coreference links. Being harmonic means, each F-score is lower than the arithmetic average of P and R—unless both are high. As a result, a coreference resolution system has to get both P and R for both coreference and non-coreference correct simultaneously to score well under BLANC. Although coreference and non-coreference are duals, ignoring one of the two halves means that some portion of the full link set remains unconsidered by the existing measures.
Tests on artificial and real data show that no evaluation measure is free of weaknesses, and so at least two scoring measures should be used when evaluating a system. We argue that BLANC is consistent and achieves a good compromise between P and R. Its discriminative power—higher with respect to currently used metrics like MUC and B3—facilitates comparisons between coreference resolution systems.
Finally, this article illustrates the need for a fuller comparison of all the evaluation measures, considering corrections required for chance variation, typical variances of scores under different conditions and data sizes, etc. Such a study has not yet been done for any of the measures, and could make a major contribution to the growing understanding of evaluation in the various branches of natural language engineering in general.


Acknowledgments

We would like to thank the anonymous reviewers for their helpful questions and comments. We are also indebted to Aron Culotta, Hal Daumé III, Jenny Finkel, Aria Haghighi, Xiaoqiang Luo, Andrew McCallum, Vincent Ng, Hoifung Poon, and Nicholas Rizzolo for answering our request to recompute the performance of their coreference resolution systems with other metrics and/or providing us their system responses. Many thanks to Edgar Gonzàlez for implementation solutions. This research was partially supported by the Spanish Ministry of Education through an FPU scholarship (AP2006-00994) and the TEXT-MESS 2.0 Project (TIN2009-13391-C04-04).

CHAPTER 6

SemEval-2010 Task 1: Coreference Resolution in Multiple Languages

Marta Recasens (University of Barcelona), Lluís Màrquez (Technical University of Catalonia), Emili Sapena (Technical University of Catalonia), M. Antònia Martí (University of Barcelona), Mariona Taulé (University of Barcelona), Véronique Hoste (University College Ghent), Massimo Poesio (University of Essex / University of Trento), and Yannick Versley (University of Tübingen)

Published in Proceedings of the ACL 5th International Workshop on Semantic Evaluation (SemEval 2010), pages 1–8, Uppsala, Sweden

Abstract

This paper presents the SemEval-2010 task on Coreference Resolution in Multiple Languages. The goal was to evaluate and compare automatic coreference resolution systems for six different languages (Catalan, Dutch, English, German, Italian, and Spanish) in four evaluation settings and using four different metrics. Such a rich scenario had the potential to provide insight into key issues concerning coreference resolution: (i) the portability of systems across languages, (ii) the relevance of different levels of linguistic information, and (iii) the behavior of scoring metrics.

6.1 Introduction

The task of coreference resolution, defined as the identification of the expressions in a text that refer to the same discourse entity (1), has attracted considerable attention within the NLP community.


(1) Major League Baseball sent its head of security to Chicago to review the second incident of an on-field fan attack in the last seven months. The league is reviewing security at all ballparks to crack down on spectator violence.

Using coreference information has been shown to be beneficial in a number of NLP applications including Information Extraction (McCarthy and Lehnert, 1995), Text Summarization (Steinberger et al., 2007), Question Answering (Morton, 1999), and Machine Translation. There have been a few evaluation campaigns on coreference resolution in the past, namely MUC (Hirschman and Chinchor, 1997), ACE (Doddington et al., 2004), and ARE (Orasan et al., 2008), yet many questions remain open:

• To what extent is it possible to implement a general coreference resolution system portable to different languages? How much language-specific tuning is necessary?

• How helpful are morphology, syntax and semantics for solving coreference relations? How much preprocessing is needed? Does its quality (perfect linguistic input versus noisy automatic input) really matter?

• How (dis)similar are different coreference evaluation metrics—MUC, B3, CEAF and BLANC? Do they all provide the same ranking? Are they correlated?

Our goal was to address these questions in a shared task. Given six datasets in Catalan, Dutch, English, German, Italian, and Spanish, the task we present involved automatically detecting full coreference chains—composed of named entities (NEs), pronouns, and full noun phrases—in four different scenarios. For more information, the reader is referred to the task website.1 The rest of the paper is organized as follows. Section 6.2 presents the corpora from which the task datasets were extracted, and the automatic tools used to preprocess them. In Section 6.3, we describe the task by providing information about the data format, evaluation settings, and evaluation metrics. Participating systems are described in Section 6.4, and their results are analyzed and compared in Section 6.5. Finally, Section 6.6 concludes.

6.2 Linguistic resources

In this section, we first present the sources of the data used in the task. We then describe the automatic tools that predicted input annotations for the coreference resolution systems.

1http://stel.ub.edu/semeval2010-coref


6.2.1 Source corpora

Catalan and Spanish The AnCora corpora (Recasens and Martí, 2010) consist of a Catalan and a Spanish treebank of 500k words each, mainly from newspapers and news agencies (El Periódico, EFE, ACN). Manual annotation exists for arguments and thematic roles, predicate semantic classes, NEs, WordNet nominal senses, and coreference relations. AnCora are freely available for research purposes.

Dutch The KNACK-2002 corpus (Hoste and De Pauw, 2006) contains 267 documents from the Flemish weekly magazine Knack. They were manually annotated with coreference information on top of semi-automatically annotated PoS tags, phrase chunks, and NEs.

English The OntoNotes Release 2.0 corpus (Pradhan et al., 2007a) covers newswire and broadcast news data: 300k words from The Wall Street Journal, and 200k words from the TDT-4 collection, respectively. OntoNotes builds on the Penn Treebank for syntactic annotation and on the Penn PropBank for predicate argument structures. Semantic annotations include NEs, word senses (linked to an ontology), and coreference information. The OntoNotes corpus is distributed by the Linguistic Data Consortium.2

German The TüBa-D/Z corpus (Hinrichs et al., 2005) is a newspaper treebank based on data taken from the daily issues of "die tageszeitung" (taz). It currently comprises 794k words manually annotated with semantic and coreference information. Due to licensing restrictions of the original texts, a taz-DVD must be purchased to obtain a license.2

Italian The LiveMemories corpus (Rodríguez et al., 2010) will include texts from the Italian Wikipedia, blogs, news articles, and dialogues (MapTask). They are being annotated according to the ARRAU annotation scheme with coreference, agreement, and NE information on top of automatically parsed data. The task dataset included Wikipedia texts already annotated.

The datasets that were used in the task were extracted from the above-mentioned corpora. Table 6.1 summarizes the number of documents (docs), sentences (sents), and tokens in the training, development and test sets.3

2Free user license agreements for the English and German task datasets were issued to the task participants.
3The German and Dutch training datasets were not completely stable during the competition period due to a few errors. Revised versions were released on March 2 and 20, respectively. As to the test datasets, the Dutch and Italian documents with formatting errors were corrected after the evaluation period, with no variations in the ranking order of systems.


           Training                     Development                 Test
           #docs  #sents  #tokens       #docs  #sents  #tokens      #docs  #sents  #tokens
Catalan    829    8,709   253,513       142    1,445   42,072       167    1,698   49,260
Dutch      145    2,544   46,894        23     496     9,165        72     2,410   48,007
English    229    3,648   79,060        39     741     17,044       85     1,141   24,206
German     900    19,233  331,614       199    4,129   73,145       136    2,736   50,287
Italian    80     2,951   81,400        17     551     16,904       46     1,494   41,586
Spanish    875    9,022   284,179       140    1,419   44,460       168    1,705   51,040

Table 6.1: Size of the task datasets

6.2.2 Preprocessing systems

Catalan, Spanish, English Predicted lemmas and PoS were generated using FreeLing4 for Catalan/Spanish and SVMTagger5 for English. Dependency information and predicate semantic roles were generated with JointParser, a syntactic-semantic parser.6

Dutch Lemmas, PoS and NEs were automatically provided by the memory-based shallow parser for Dutch (Daelemans et al., 1999), and dependency information by the Alpino parser (van Noord et al., 2006).

German Lemmas were predicted by TreeTagger (Schmid, 1995), PoS and morphology by RFTagger (Schmid and Laws, 2008), and dependency information by MaltParser (Hall and Nivre, 2008).

Italian Lemmas and PoS were provided by TextPro,7 and dependency information by MaltParser.8

6.3 Task description

Participants were asked to develop an automatic system capable of assigning a discourse entity to every mention,9 thus identifying all the NP mentions of every discourse entity. As there is no standard annotation scheme for coreference and the source corpora differed in certain aspects, the coreference information of the task datasets was produced according to three criteria:

• Only NP constituents and possessive determiners can be mentions.

4http://www.lsi.upc.es/ nlp/freeling
5http://www.lsi.upc.edu/ nlp/SVMTool
6http://www.lsi.upc.edu// xlluis/?x=cat:5
7http://textpro.fbk.eu
8http://maltparser.org
9Following the terminology of the ACE program, a mention is defined as an instance of reference to an object, and an entity is the collection of mentions referring to the same object in a document.


• Mentions must be referential expressions, thus ruling out nominal predicates, appositives, expletive NPs, attributive NPs, NPs within idioms, etc.

• Singletons are also considered as entities (i.e., entities with a single mention).

To help participants build their systems, the task datasets also contained both gold-standard and automatically predicted linguistic annotations at the morphological, syntactic and semantic levels. Considerable effort was devoted to providing participants with a common and relatively simple data representation for the six languages.

6.3.1 Data format

The task datasets as well as the participants' answers were displayed in a uniform column-based format, similar to the style used in previous CoNLL shared tasks on syntactic and semantic dependencies (2008/2009).10 Each dataset was provided as a single file per language. Since coreference is a linguistic relation at the discourse level, documents constitute the basic unit, and are delimited by "#begin document ID" and "#end document ID" comment lines. Within a document, the information of each sentence is organized vertically with one token per line, and a blank line after the last token of each sentence. The information associated with each token is described in several columns (separated by "\t" characters) representing the following layers of linguistic annotation.

ID (column 1). Token identifiers in the sentence.
Token (column 2). Word forms.
Lemma (column 3). Token lemmas.
PoS (column 5). Coarse PoS.
Feat (column 7). Morphological features (PoS type, number, gender, case, tense, aspect, etc.) separated by a pipe character.
Head (column 9). ID of the syntactic head ("0" if the token is the tree root).
DepRel (column 11). Dependency relations corresponding to the dependencies described in the Head column ("sentence" if the token is the tree root).
NE (column 13). NE types in open-close notation.
Pred (column 15). Predicate semantic class.
APreds (column 17 and subsequent ones). For each predicate in the Pred column, its semantic roles/dependencies.
Coref (last column). Coreference relations in open-close notation.

The above-mentioned columns are "gold-standard columns," whereas columns 4, 6, 8, 10, 12, 14, 16 and the penultimate contain the same information as the respective previous column but automatically predicted—using the preprocessing systems listed in Section 6.2.2. Neither all layers of linguistic annotation nor all gold-standard and predicted columns were available for all six languages (underscore characters indicate missing information). The coreference column follows an open-close notation with an entity number in parentheses (see Table 6.2). Every entity has an ID number, and every mention is marked with the ID of the entity it refers to: an opening parenthesis shows the beginning of the mention (first token), while a closing parenthesis shows the end of the mention (last token). For tokens belonging to more than one mention, a pipe character is used to separate multiple entity IDs. The resulting annotation is a well-formed nested structure (CF language).


ID  Token      Intermediate columns  Coref
1   Major      . . .                 (1
2   League     . . .
3   Baseball   . . .                 1)
4   sent       . . .
5   its        . . .                 (1)|(2
6   head       . . .
7   of         . . .
8   security   . . .                 (3)|2)
9   to         . . .
...
27  The        . . .                 (1
28  league     . . .                 1)
29  is         . . .

Table 6.2: Format of the coreference annotations (corresponding to example (1) in Section 6.1)
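To make the open-close notation concrete, the following minimal Python sketch turns a sequence of Coref cells like those in Table 6.2 into (entity, start, end) mention spans. It is an illustration only, not the official task scorer; the function name is ours, and it assumes that tokens outside any mention carry a placeholder cell (here "_").

    from collections import defaultdict

    def parse_coref_column(coref_cells):
        """Parse the open-close Coref column into (entity_id, start, end)
        mention spans (0-based token offsets, end inclusive).
        coref_cells holds one string per token, e.g. ["(1", "_", "1)", ...]."""
        open_mentions = defaultdict(list)   # entity id -> stack of open positions
        mentions = []
        for i, cell in enumerate(coref_cells):
            if cell == "_":
                continue
            for part in cell.split("|"):
                entity = int(part.strip("()"))
                if part.startswith("("):       # mention of this entity opens here
                    open_mentions[entity].append(i)
                if part.endswith(")"):         # mention of this entity closes here
                    start = open_mentions[entity].pop()
                    mentions.append((entity, start, i))
        return mentions

    # Coref cells of tokens 1-9 in Table 6.2, with "_" for unmarked tokens
    cells = ["(1", "_", "1)", "_", "(1)|(2", "_", "_", "(3)|2)", "_"]
    print(parse_coref_column(cells))
    # -> [(1, 0, 2), (1, 4, 4), (3, 7, 7), (2, 4, 7)]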

6.3.2 Evaluation settings

In order to address our goal of studying the effect of different levels of linguistic information (preprocessing) on solving coreference relations, the test was divided into four evaluation settings that differed along two dimensions.

Gold-standard versus Regular setting. Only in the gold-standard setting were participants allowed to use the gold-standard columns, including the last one (of the test dataset) with true mention boundaries. In the regular setting, they were allowed to use only the automatically predicted columns. Obtaining better results in the gold setting would provide evidence for the relevance of using high-quality preprocessing information. Since not all columns were available for all six languages, the gold setting was only possible for Catalan, English, German, and Spanish.

10http://www.cnts.ua.ac.be/conll2008


Closed versus Open setting. In the closed setting, systems had to be built strictly with the information provided in the task datasets. In contrast, there was no restriction on the resources that participants could utilize in the open setting: systems could be developed using any external tools and resources to predict the preprocessing information, e.g., WordNet, Wikipedia, etc. The only requirement was to use tools that had not been developed with the annotations of the test set. This setting provided an open door into tools or resources that improve performance.

6.3.3 Evaluation metrics

Since there is no agreement at present on a standard measure for coreference resolution evaluation, one of our goals was to compare the rankings produced by four different measures. The task scorer provides results in the two mention-based metrics B3 (Bagga and Baldwin, 1998) and CEAF-φ3 (Luo, 2005), and the two link-based metrics MUC (Vilain et al., 1995) and BLANC (Recasens and Hovy, to appear). The first three measures have been widely used, while BLANC is a proposal of a new measure interesting to test. The mention detection subtask is measured with recall, precision, and F1. Mentions are rewarded with 1 point if their boundaries coincide with those of the gold NP, with 0.5 points if their boundaries are within the gold NP including its head, and with 0 otherwise.
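As a concrete reading of this partial-credit rule, the sketch below assigns 1, 0.5 or 0 points to a predicted mention given the gold NP span and the position of its head. The function name and the span representation are ours, chosen for illustration; they do not come from the official scorer.

    def mention_credit(sys_span, gold_span, gold_head):
        """sys_span and gold_span are (start, end) token offsets, end inclusive;
        gold_head is the token offset of the gold NP's head.
        1.0 -> boundaries coincide exactly with the gold NP
        0.5 -> predicted span lies within the gold NP and still covers its head
        0.0 -> anything else"""
        if sys_span == gold_span:
            return 1.0
        s_start, s_end = sys_span
        g_start, g_end = gold_span
        within_gold = g_start <= s_start and s_end <= g_end
        covers_head = s_start <= gold_head <= s_end
        return 0.5 if within_gold and covers_head else 0.0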

6.4 Participating systems

A total of twenty-two participants registered for the task and downloaded the training materials. From these, sixteen downloaded the test set but only six (two of which were task organizers) submitted valid results (corresponding to eight system runs or variants). These numbers show that the task raised considerable interest but that the final participation rate was comparatively low (slightly below 30%). The participating systems differed in terms of architecture, machine learning method, etc. Table 6.3 summarizes their main properties. Systems like BART and Corry support several machine learners, but Table 6.3 indicates the one used for the SemEval run. The last column indicates the external resources that were employed in the open setting, thus it is empty for systems that participated only in the closed setting. For more specific details we refer the reader to the system description papers in Erk and Strapparava (2010).

6.5 Results and evaluation

Table 6.4 shows the results obtained by two naive baseline systems: (i) SINGLETONS considers each mention as a separate entity, and (ii) ALL-IN-ONE groups all the mentions in a document into a single entity. These simple baselines reveal limitations of the evaluation metrics, like the high scores of CEAF and B3 for SINGLETONS.


BART (Broscheit et al., 2010). Architecture: Closest-first with entity-mention model (English), Closest-first model (German, Italian). ML methods: MaxEnt (English, German), Decision trees (Italian). External resources: GermaNet & gazetteers (German), I-Cab gazetteers (Italian), Berkeley parser, Stanford NER, WordNet, Wikipedia name list, U.S. census data (English).

Corry (Uryupina, 2010). Architecture: ILP, Pairwise model. ML methods: SVM. External resources: Stanford parser & NER, WordNet, U.S. census data.

RelaxCor (Sapena et al., 2010). Architecture: Graph partitioning (solved by relaxation labeling). ML methods: Decision trees, Rules. External resources: WordNet.

SUCRE (Kobdani and Schütze, 2010). Architecture: Best-first clustering, Relational database model, Regular feature definition language. ML methods: Decision trees, Naive Bayes, SVM, MaxEnt. External resources: —

TANL-1 (Attardi et al., 2010). Architecture: Highest entity-mention similarity. ML methods: MaxEnt. External resources: PoS tagger (Italian).

UBIU (Zhekova and Kübler, 2010). Architecture: Pairwise model. ML methods: MBL. External resources: —

Table 6.3: Main characteristics of the participating systems

Interestingly enough, the naive baseline scores turn out to be hard to beat by the participating systems, as Table 6.5 shows. Similarly, ALL-IN-ONE obtains high scores in terms of MUC. Table 6.4 also reveals differences between the distribution of entities in the datasets. Dutch is clearly the most divergent corpus, mainly due to the fact that it only contains singletons for NEs. Table 6.5 displays the results of all systems for all languages and settings in the four evaluation metrics (the best scores in each setting are highlighted in bold). Results are presented sequentially by language and setting, and participating systems are ordered alphabetically. The participation of systems across languages and settings is rather irregular,11 thus making it difficult to draw firm conclusions about the aims initially pursued by the task. In the following, we summarize the most relevant outcomes of the evaluation.

Regarding languages, English concentrates the most participants (fifteen entries), followed by German (eight), Catalan and Spanish (seven each), Italian (five), and Dutch (three). The number of languages addressed by each system ranges from one (Corry) to six (UBIU and SUCRE); BART and RelaxCor addressed three languages, and TANL-1 five. The best overall results are obtained for English, followed by German, then Catalan, Spanish and Italian, and finally Dutch. Apart from differences between corpora, there are other factors that might explain this ranking:

11Only 45 entries in Table 6.5 out of 192 potential cases.


Columns per language: CEAF (R P F1) | MUC (R P F1) | B3 (R P F1) | BLANC (R P Blanc)

SINGLETONS: Each mention forms a separate entity.
  Catalan  61.2 61.2 61.2 | 0.0 0.0 0.0 | 61.2 100 75.9 | 50.0 48.7 49.3
  Dutch    34.5 34.5 34.5 | 0.0 0.0 0.0 | 34.5 100 51.3 | 50.0 46.7 48.3
  English  71.2 71.2 71.2 | 0.0 0.0 0.0 | 71.2 100 83.2 | 50.0 49.2 49.6
  German   75.5 75.5 75.5 | 0.0 0.0 0.0 | 75.5 100 86.0 | 50.0 49.4 49.7
  Italian  71.1 71.1 71.1 | 0.0 0.0 0.0 | 71.1 100 83.1 | 50.0 49.2 49.6
  Spanish  62.2 62.2 62.2 | 0.0 0.0 0.0 | 62.2 100 76.7 | 50.0 48.8 49.4

ALL-IN-ONE: All mentions are grouped into a single entity.
  Catalan  11.8 11.8 11.8 | 100 39.3 56.4 | 100 4.0 7.7  | 50.0 1.3 2.6
  Dutch    19.7 19.7 19.7 | 100 66.3 79.8 | 100 8.0 14.9 | 50.0 3.2 6.2
  English  10.5 10.5 10.5 | 100 29.2 45.2 | 100 3.5 6.7  | 50.0 0.8 1.6
  German   8.2 8.2 8.2    | 100 24.8 39.7 | 100 2.4 4.7  | 50.0 0.6 1.1
  Italian  11.4 11.4 11.4 | 100 29.0 45.0 | 100 2.1 4.1  | 50.0 0.8 1.5
  Spanish  11.9 11.9 11.9 | 100 38.3 55.4 | 100 3.9 7.6  | 50.0 1.2 2.4

Table 6.4: Baseline scores

(i) the fact that most of the systems were originally developed for English, and (ii) differences in corpus size (German having the largest corpus, and Dutch the smallest).

Regarding systems, there are no clear "winners." Note that no language-setting combination was addressed by all six systems. The BART system, for instance, is either on its own or competing against a single system. It emerges from partial comparisons that SUCRE performs the best in closed×regular for English, German, and Italian, although it never outperforms the CEAF or B3 singleton baseline. While SUCRE always obtains the best scores according to MUC and BLANC, RelaxCor and TANL-1 usually win based on CEAF and B3. The Corry system presents three variants optimized for CEAF (Corry-C), MUC (Corry-M), and BLANC (Corry-B). Their results are consistent with the bias introduced in the optimization (see English:open×gold). Depending on the evaluation metric then, the rankings of systems vary with considerable score differences. There is a significant positive correlation between CEAF and B3 (Pearson's r = 0.91, p < 0.01), and a significant lack of correlation between CEAF and MUC in terms of recall (Pearson's r = 0.44, p < 0.01). This fact stresses the importance of defining appropriate metrics (or a combination of them) for coreference evaluation.

Finally, regarding evaluation settings, the results in the gold setting are significantly better than those in the regular setting. However, this might be a direct effect of the mention recognition task. Mention recognition in the regular setting falls more than 20 F1 points with respect to the gold setting (where correct mention boundaries were given). As for the open versus closed setting, there is only one system, RelaxCor for English, that addressed the two. As expected, results show a slight improvement from closed×gold to open×gold.
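As an illustration of how such correlations between metrics can be checked, the snippet below computes Pearson's r over the CEAF and B3 F1 columns for just the four Catalan closed×gold entries of Table 6.5; the figures reported above are of course computed over all 45 entries, not over this toy subset, and the use of scipy here is our own choice.

    from scipy.stats import pearsonr

    # CEAF F1 and B3 F1 for the Catalan closed x gold entries of Table 6.5
    # (RelaxCor, SUCRE, TANL-1, UBIU)
    ceaf_f1 = [70.5, 68.7, 64.9, 52.3]
    b3_f1   = [79.9, 77.0, 76.2, 58.8]

    r, p = pearsonr(ceaf_f1, b3_f1)
    print(f"Pearson r = {r:.2f}, p = {p:.3f}")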


Columns per system: Mention detection (R P F1) | CEAF (R P F1) | MUC (R P F1) | B3 (R P F1) | BLANC (R P Blanc)

Catalan, closed×gold
  RelaxCor  100 100 100   | 70.5 70.5 70.5 | 29.3 77.3 42.5 | 68.6 95.8 79.9 | 56.0 81.8 59.7
  SUCRE     100 100 100   | 68.7 68.7 68.7 | 54.1 58.4 56.2 | 76.6 77.4 77.0 | 72.4 60.2 63.6
  TANL-1    100 96.8 98.4 | 66.0 63.9 64.9 | 17.2 57.7 26.5 | 64.4 93.3 76.2 | 52.8 79.8 54.4
  UBIU      75.1 96.3 84.4 | 46.6 59.6 52.3 | 8.8 17.1 11.7 | 47.8 76.3 58.8 | 51.6 57.9 52.2
Catalan, closed×regular
  SUCRE     75.9 64.5 69.7 | 51.3 43.6 47.2 | 44.1 32.3 37.3 | 59.6 44.7 51.1 | 53.9 55.2 54.2
  TANL-1    83.3 82.0 82.7 | 57.5 56.6 57.1 | 15.2 46.9 22.9 | 55.8 76.6 64.6 | 51.3 76.2 51.0
  UBIU      51.4 70.9 59.6 | 33.2 45.7 38.4 | 6.5 12.6 8.6 | 32.4 55.7 40.9 | 50.2 53.7 47.8
Catalan, open×gold: (no entries)
Catalan, open×regular: (no entries)

Dutch, closed×gold
  SUCRE     100 100 100   | 58.8 58.8 58.8 | 65.7 74.4 69.8 | 65.0 69.2 67.0 | 69.5 62.9 65.3
Dutch, closed×regular
  SUCRE     78.0 29.0 42.3 | 29.4 10.9 15.9 | 62.0 19.5 29.7 | 59.1 6.5 11.7 | 46.9 46.9 46.9
  UBIU      41.5 29.9 34.7 | 20.5 14.6 17.0 | 6.7 11.0 8.3 | 13.3 23.4 17.0 | 50.0 52.4 32.3
Dutch, open×gold: (no entries)
Dutch, open×regular: (no entries)

English, closed×gold
  RelaxCor  100 100 100   | 75.6 75.6 75.6 | 21.9 72.4 33.7 | 74.8 97.0 84.5 | 57.0 83.4 61.3
  SUCRE     100 100 100   | 74.3 74.3 74.3 | 68.1 54.9 60.8 | 86.7 78.5 82.4 | 77.3 67.0 70.8
  TANL-1    99.8 81.7 89.8 | 75.0 61.4 67.6 | 23.7 24.4 24.0 | 74.6 72.1 73.4 | 51.8 68.8 52.1
  UBIU      92.5 99.5 95.9 | 63.4 68.2 65.7 | 17.2 25.5 20.5 | 67.8 83.5 74.8 | 52.6 60.8 54.0
English, closed×regular
  SUCRE     78.4 83.0 80.7 | 61.0 64.5 62.7 | 57.7 48.1 52.5 | 68.3 65.9 67.1 | 58.9 65.7 61.2
  TANL-1    79.6 68.9 73.9 | 61.7 53.4 57.3 | 23.8 25.5 24.6 | 62.1 60.5 61.3 | 50.9 68.0 49.3
  UBIU      66.7 83.6 74.2 | 48.2 60.4 53.6 | 11.6 18.4 14.2 | 50.9 69.2 58.7 | 50.9 56.3 51.0
English, open×gold
  Corry-B   100 100 100   | 77.5 77.5 77.5 | 56.1 57.5 56.8 | 82.6 85.7 84.1 | 69.3 75.3 71.8
  Corry-C   100 100 100   | 77.7 77.7 77.7 | 57.4 58.3 57.9 | 83.1 84.7 83.9 | 71.3 71.6 71.5
  Corry-M   100 100 100   | 73.8 73.8 73.8 | 62.5 56.2 59.2 | 85.5 78.6 81.9 | 76.2 58.8 62.7
  RelaxCor  100 100 100   | 75.8 75.8 75.8 | 22.6 70.5 34.2 | 75.2 96.7 84.6 | 58.0 83.8 62.7
English, open×regular
  BART      76.1 69.8 72.8 | 70.1 64.3 67.1 | 62.8 52.4 57.1 | 74.9 67.7 71.1 | 55.3 73.2 57.7
  Corry-B   79.8 76.4 78.1 | 70.4 67.4 68.9 | 55.0 54.2 54.6 | 73.7 74.1 73.9 | 57.1 75.7 60.6
  Corry-C   79.8 76.4 78.1 | 70.9 67.9 69.4 | 54.7 55.5 55.1 | 73.8 73.1 73.5 | 57.4 63.8 59.4
  Corry-M   79.8 76.4 78.1 | 66.3 63.5 64.8 | 61.5 53.4 57.2 | 76.8 66.5 71.3 | 58.5 56.2 57.1

German, closed×gold
  SUCRE     100 100 100   | 72.9 72.9 72.9 | 74.4 48.1 58.4 | 90.4 73.6 81.1 | 78.2 61.8 66.4
  TANL-1    100 100 100   | 77.7 77.7 77.7 | 16.4 60.6 25.9 | 77.2 96.7 85.9 | 54.4 75.1 57.4
  UBIU      92.6 95.5 94.0 | 67.4 68.9 68.2 | 22.1 21.7 21.9 | 73.7 77.9 75.7 | 60.0 77.2 64.5
German, closed×regular
  SUCRE     79.3 77.5 78.4 | 60.6 59.2 59.9 | 49.3 35.0 40.9 | 69.1 60.1 64.3 | 52.7 59.3 53.6
  TANL-1    60.9 57.7 59.2 | 50.9 48.2 49.5 | 10.2 31.5 15.4 | 47.2 54.9 50.7 | 50.2 63.0 44.7
  UBIU      50.6 66.8 57.6 | 39.4 51.9 44.8 | 9.5 11.4 10.4 | 41.2 53.7 46.6 | 50.2 54.4 48.0
German, open×gold
  BART      94.3 93.7 94.0 | 67.1 66.7 66.9 | 70.5 40.1 51.1 | 85.3 64.4 73.4 | 65.5 61.0 62.8
German, open×regular
  BART      82.5 82.3 82.4 | 61.4 61.2 61.3 | 61.4 36.1 45.5 | 75.3 58.3 65.7 | 55.9 60.3 57.3

Italian, closed×gold
  SUCRE     98.4 98.4 98.4 | 66.0 66.0 66.0 | 48.1 42.3 45.0 | 76.7 76.9 76.8 | 54.8 63.5 56.9
Italian, closed×regular
  SUCRE     84.6 98.1 90.8 | 57.1 66.2 61.3 | 50.1 50.7 50.4 | 63.6 79.2 70.6 | 55.2 68.3 57.7
  UBIU      46.8 35.9 40.6 | 37.9 29.0 32.9 | 2.9 4.6 3.6 | 38.4 31.9 34.8 | 50.0 46.6 37.2
Italian, open×gold: (no entries)
Italian, open×regular
  BART      42.8 80.7 55.9 | 35.0 66.1 45.8 | 35.3 54.0 42.7 | 34.6 70.6 46.4 | 57.1 68.1 59.6
  TANL-1    90.5 73.8 81.3 | 62.2 50.7 55.9 | 37.2 28.3 32.1 | 66.8 56.5 61.2 | 50.7 69.3 48.5

Spanish, closed×gold
  RelaxCor  100 100 100   | 66.6 66.6 66.6 | 14.8 73.8 24.7 | 65.3 97.5 78.2 | 53.4 81.8 55.6
  SUCRE     100 100 100   | 69.8 69.8 69.8 | 52.7 58.3 55.3 | 75.8 79.0 77.4 | 67.3 62.5 64.5
  TANL-1    100 96.8 98.4 | 66.9 64.7 65.8 | 16.6 56.5 25.7 | 65.2 93.4 76.8 | 52.5 79.0 54.1
  UBIU      73.8 96.4 83.6 | 45.7 59.6 51.7 | 9.6 18.8 12.7 | 46.8 77.1 58.3 | 52.9 63.9 54.3
Spanish, closed×regular
  SUCRE     74.9 66.3 70.3 | 56.3 49.9 52.9 | 35.8 36.8 36.3 | 56.6 54.6 55.6 | 52.1 61.2 51.4
  TANL-1    82.2 84.1 83.1 | 58.6 60.0 59.3 | 14.0 48.4 21.7 | 56.6 79.0 66.0 | 51.4 74.7 51.4
  UBIU      51.1 72.7 60.0 | 33.6 47.6 39.4 | 7.6 14.4 10.0 | 32.8 57.1 41.6 | 50.4 54.6 48.4
Spanish, open×gold: (no entries)
Spanish, open×regular: (no entries)

Table 6.5: Official results of the participating systems for all languages, settings, and metrics

6.6 Conclusions

This paper has introduced the main features of the SemEval-2010 task on coreference resolution. The goal of the task was to evaluate and compare automatic coreference resolution systems for six different languages in four evaluation settings and using four different metrics. This complex scenario aimed at providing insight into several aspects of coreference resolution, including portability across languages, relevance of linguistic information at different levels, and behavior of alternative scoring metrics.

The task attracted considerable attention from a number of researchers, but only six teams submitted their final results. Participants did not run their systems for all the languages and evaluation settings, thus making direct comparisons between them very difficult. Nonetheless, we were able to observe some interesting aspects from the empirical evaluation. An important conclusion was the confirmation that different evaluation metrics provide different system rankings and the scores are not commensurate. Attention thus needs to be paid to coreference evaluation. The behavior and applicability of the scoring metrics requires further investigation in order to guarantee a fair evaluation when comparing systems in the future. We hope to have the opportunity to thoroughly discuss this and the rest of the interesting questions raised by the task during the SemEval workshop at ACL 2010.

An additional valuable benefit is the set of resources developed throughout the task. As task organizers, we intend to facilitate the sharing of datasets, scorers, and documentation by keeping them available for future research use. We believe that these resources will help to set future benchmarks for the research community and will contribute positively to the progress of the state of the art in coreference resolution. We will maintain and update the task website with post-SemEval contributions.

Acknowledgments

We would like to thank the following people who contributed to the preparation of the task datasets: Manuel Bertran (UB), Oriol Borrega (UB), Orphée De Clercq (U. Ghent), Francesca Delogu (U. Trento), Jesús Giménez (UPC), Eduard Hovy (ISI-USC), Richard Johansson (U. Trento), Xavier Lluís (UPC), Montse Nofre (UB), Lluís Padró (UPC), Kepa Joseba Rodríguez (U. Trento), Mihai Surdeanu (Stanford), Olga Uryupina (U. Trento), Lente Van Leuven (UB), and Rita Zaragoza (UB). We would also like to thank LDC and die tageszeitung for distributing freely the English and German datasets. This work was funded in part by the Spanish Ministry of Science and Innovation through the projects TEXT-MESS 2.0 (TIN2009-13391-C04-04), OpenMT-2 (TIN2009-14675-C03), and KNOW2 (TIN2009-14715-C04-04), and an FPU doctoral scholarship (AP2006-00994) held by M. Recasens. It also received financial support from the Seventh Framework Programme of the EU (FP7/2007-2013) under GA 247762 (FAUST), from the STEVIN program of the Nederlandse Taalunie through the COREA and SoNaR projects, and from the Provincia Autonoma di Trento through the LiveMemories project.


Part III

COREFERENCE THEORY


CHAPTER 7

On Paraphrase and Coreference

Marta Recasens and Marta Vila

University of Barcelona

To appear as a squib in Computational Linguistics, 36(4):639–647

Abstract

By providing a better understanding of paraphrase and coreference in terms of similarities and differences in their linguistic nature, this article delimits what the focus of paraphrase extraction and coreference resolution tasks should be, and to what extent they can help each other. We argue for the relevance of this discussion to Natural Language Processing.

7.1 Introduction

Paraphrase extraction1 and coreference resolution have applications in Question Answering, Information Extraction, Machine Translation, and so forth. Paraphrase pairs might be coreferential, and coreference relations are sometimes paraphrases. The two overlap considerably (Hirst, 1981), but their definitions make them significantly different in essence: Paraphrasing concerns meaning, whereas coreference is about discourse referents. Thus, they do not always coincide. In the following example, b and d are both coreferent and paraphrastic, whereas a, c, e, f, and h are coreferent but not paraphrastic, and g and i are paraphrastic but not coreferent.

1Recognition, extraction, and generation are all paraphrase-related tasks. We will center ourselves on paraphrase extraction, as this is the task in which paraphrase and coreference resolution mainly overlap.


(1) [Tony]a went to see [the ophthalmologist]b and got [his]c eyes checked. [The eye doctor]d told [him]e that [his]f [cataracts]g were getting worse. [His]h mother also suffered from [cloudy vision]i.

The discourse model built for Example (1) contains six entities (i.e., Tony, the eye doctor, Tony's eyes, Tony's cataracts, Tony's mother, cataracts). Because a, c, e, f and h all point to Tony, we say that they are coreferent. In contrast, in paraphrasing, we do not need to build a discourse entity to state that g and i are paraphrase pairs; we restrict ourselves to semantic content and this is why we check for sameness of meaning between cataracts and cloudy vision alone, regardless of whether they are a referential unit in a discourse. Despite the differences, it is possible for paraphrasing and coreference to co-occur, as in the case of b and d. NLP components dealing with paraphrasing and coreference seem to have great potential to improve understanding and generation systems. As a result, they have been the focus of a large amount of work in the past couple of decades (see the surveys by Androutsopoulos and Malakasiotis [2010], Madnani and Dorr [2010], Ng [2010], and Poesio et al. [forthcoming]). Before computational linguistics, coreference had not been studied on its own from a purely linguistic perspective but was indirectly mentioned in the study of pronouns. Although there have been some linguistic works that consider paraphrasing, they do not fully respond to the needs of paraphrasing from a computational perspective. This article discusses the similarities between paraphrase and coreference in order to point out the distinguishing factors that make paraphrase extraction and coreference resolution two separate yet related tasks. This is illustrated with examples extracted/adapted from different sources (Dras, 1999; Doddington et al., 2004; Dolan et al., 2005; Recasens and Martí, 2010; Vila et al., 2010) and our own. Apart from providing a better understanding of these tasks, we point out ways in which they can mutually benefit, which can shed light on future research.

7.2 Converging and diverging points

This section explores the overlapping relationship between paraphrase and coreference, highlighting the most relevant aspects that they have in common as well as those that distinguish them. They are both sameness relations (Section 7.2.2), but one is between meanings and the other between referents (Section 7.2.1). In terms of linguistic units, coreference is mainly restricted to noun phrases (NPs), whereas paraphrasing goes beyond and includes word-, phrase- and sentence-level expressions (Section 7.2.3). One final diverging point is the role they (might) play in discourse (Section 7.2.4).

7.2.1 Meaning and reference

The two dimensions that are the focus of paraphrasing and coreference are meaning and reference, respectively. Traditionally, paraphrase is defined as the relation between two expressions that have the same meaning (i.e., they evoke the same mental concept), whereas coreference is defined as the relation between two expressions that have the same referent in the discourse (i.e., they point to the same entity). We follow Karttunen (1976) and talk of "discourse referents" instead of "real-world referents."


Columns: paraphrase (1 = yes, 2 = no); rows: coreference (1 = yes, 2 = no).

(1,1) Tony went to see the ophthalmologist and got his eyes checked. The eye doctor told him . . .
(1,2) Tony went to see the ophthalmologist and got his eyes checked.
(2,1) ophthalmologist / eye doctor
(2,2) His cataracts were getting worse. His mother also suffered from cloudy vision.

Table 7.1: Paraphrase–coreference matrix

In Table 7.1, the italicized pairs in cells (1,1) and (2,1) are both paraphrastic but they only corefer in (1,1). We cannot decide on (non-)coreference in (2,1) as we need a discourse to first assign a referent. In contrast, we can make paraphrasing judgments without taking discourse into consideration. Pairs like the one in cell (1,2) are only coreferent but not paraphrases because the proper noun Tony and the pronoun his have reference but no meaning. Lastly, neither phenomenon is observed in cell (2,2).

7.2.2 Sameness

Paraphrasing and coreference are usually defined as sameness relations: Two expressions that have the same meaning are paraphrastic, and two expressions that refer to the same entity in a discourse are coreferent. The concept of sameness is usually taken for granted and left unexplained, but establishing sameness is not straightforward. A strict interpretation of the concept makes sameness relations only possible in logic and mathematics, whereas a sloppy interpretation makes the definition too vague. In paraphrasing, if the loss of at the city in Example (2-b) is not considered to be relevant, Examples (2-a) and (2-b) are paraphrases; but if it is considered to be relevant, then they are not. It depends on where we draw the boundaries of what is accepted as the "same" meaning.

(2) a. The waterlogged conditions that ruled out play yesterday still prevailed at the city this morning. b. The waterlogged conditions that ruled out play yesterday still prevailed this morning.


(3) On homecoming night Postville feels like Hometown, USA . . . For those who prefer the old Postville, Mayor John Hyman has a simple answer.

Similarly, with respect to coreference, whether Postville and the old Postville in Example (3) are or are not the same entity depends on the granularity of the discourse. On a sloppy reading, one can assume that because Postville refers to the same spatial coordinates, it is the same town. On a strict reading, in contrast, drawing a distinction between the town as it was at two different moments in time results in two different entities: the old Postville versus the present-day Postville. They are not the same in that features have changed from the former to the latter. The concept of sameness in paraphrasing has been questioned on many occasions. If we understood "same meaning" in the strictest sense, a large number of paraphrases would be ruled out. Thus, some authors argue for a looser definition of paraphrasing. Bhagat (2009), for instance, talks about "quasi-paraphrases" as "sentences or phrases that convey approximately the same meaning." Milićević (2007) draws a distinction between "exact" and "approximate" paraphrases. Finally, Fuchs (1994) prefers to use the notion of "equivalence" to "identity" on the grounds that the former allows for the existence of some semantic differences between the paraphrase pairs. The concept of identity in coreference, however, has hardly been questioned, as prototypical examples appear to be straightforward (e.g., Barack Obama and Obama and he). Only recently have Recasens et al. (2010a) pointed out the need for talking about "near-identity" relations in order to account for cases such as Example (3), proposing a typology of such relations.

7.2.3 Linguistic units

Another axis of comparison between paraphrase and coreference concerns the types of linguistic units involved in each relation. Paraphrase can hold between different linguistic units, from morphemes to full texts, although the most attention has been paid to word-level paraphrase (kid and child in Example (4)), phrase-level paraphrase (cried and burst into tears in Example (4)), and sentence-level paraphrase (the two sentences in Example (4)).

(4) a. The kid cried. b. The child burst into tears.

In contrast, coreference is more restricted in that the majority of relations occur at the phrasal level, especially between NPs. This explains why this has been the largest focus so far, although prepositional and adverbial phrases are also possible yet less frequent, as well as clauses or sentences. Coreference relations occur indistinctively between pronouns, proper nouns, and full NPs that are referential, namely, that have discourse referents. For this reason, pleonastic pronouns, nominal predicates, and appositives cannot enter into coreference relations. The first do not refer to any entity but are syntactically required; the last two express properties of an entity rather than introduce a new one. But this is an issue ignored by the corpora annotated for the MUC and ACE programs (Hirschman and Chinchor, 1997; Doddington et al., 2004), hence the criticism by van Deemter and Kibble (2000).

In the case of paraphrasing, it is linguistic expressions that lack meaning (i.e., pronouns and proper nouns) that should not be treated as members of a paraphrase pair on their own (Example (5-a)) because paraphrase is only possible between meaningful units. This issue, however, takes on another dimension when seen at the sentence level. The sentences in Example (5-b) can be said to be paraphrases because they themselves contain the antecedent of the pronouns I and he.

(5) a. (i) A. Jiménez (ii) I b. (i) The Atlético de Madrid goalkeeper, A. Jiménez, yesterday realized one of his dreams by defeating Barcelona: "I had never beaten Barcelona." (ii) The Atlético de Madrid goalkeeper, A. Jiménez, yesterday realized one of his dreams by defeating Barcelona, and said that he had never beaten Barcelona.

In Example (5-b), A. Jiménez and I/he continue not being paraphrastic in isolation. Polysemic, underspecified and metaphoric words show a slightly different behavior. It is not possible to establish paraphrase between them when they are deprived of context (Callison-Burch, 2007, Chapter 4). In Example (6-a), police officers could be patrol police officers, and investigators could be university researchers. However, once they are embedded in a disambiguating context that fills them semantically, as in Example (6-b), then paraphrase can be established between police officers and investigators.

(6) a. (i) Police officers (ii) Investigators b. (i) Police officers searched 11 stores in Barcelona. (ii) The investigators conducted numerous interviews with the victim.

As a final remark, and in accordance with the approach by Fuchs (1994), we consider Example (7)-like paraphrases that Fujita (2005) and Milićević (2007) call, respectively, "referential" and "cognitive" to be best treated as coreference rather than paraphrase, because they only rely on referential identity in a discourse.

(7) a. They got married last year. b. They got married in 2004.

7.2.4 Discourse function

A further difference between paraphrasing and coreference concerns their degree of dependency on discourse. Given that coreference establishes sameness relations between the entities that populate a discourse (i.e., discourse referents), it is a linguistic phenomenon whose dependency on discourse is much stronger than paraphrasing. Thus, the latter can be approached from a discursive or a non-discursive perspective, which in turn allows for a distinction between reformulative paraphrasing (Example (8)) and non-reformulative paraphrasing (Example (9)).

(8) Speaker 1: Then they also diagnosed a hemolytic–uremic syndrome. Speaker 2: What’s that? Speaker 1: Renal insufficiency, in the kidneys. (9) a. X wrote Y. b. X is the author of Y.

Reformulative paraphrasing occurs in a reformulation context when a rewording of a previously expressed content is added for discursive reasons, such as emphasis, correction or clarification. Non-reformulative paraphrasing does not consider the role that paraphrasing plays in discourse. Reformulative paraphrase pairs have to be extracted from a single piece of discourse; non-reformulative paraphrase pairs can be extracted—each member of the pair on its own—from different discourse pieces. The reformulation in the third utterance in Example (8) gives an explanation in a language less technical than that in the first utterance; whereas Example (9-a) and Example (9-b) are simply two alternative ways of expressing an authorship relation. The strong discourse dependency of coreference explains the major role it plays in terms of cohesion. Being such a cohesive device, it follows that intra-document coreference, which takes place within a single discourse unit (or across a collection of documents linked by topic), is the most primary. Cross-document coreference, on the other hand, constitutes a task on its own in NLP but falls beyond the scope of linguistic coreference due to the lack of a common universe of discourse. The assumption behind cross-document coreference is that there is an underlying global discourse that enables various documents to be treated as a single macro-document. Despite the differences, the discourse function of reformulative paraphrasing brings it close to coreference in the sense that they both contribute to the cohesion and development of discourse.

7.3 Mutual benefits

Both paraphrase extraction and coreference resolution are complex tasks far from being solved at present, and we believe that there could be improvements in performance if researchers on each side paid attention to the others. The similarities (i.e., relations of sameness, relations between NPs) allow for mutual collaboration, whereas the differences (i.e., focus on either meaning or reference) allow for resorting to either paraphrase or coreference to solve the other. In general, the greatest benefits come for cases in which either paraphrase or coreference are especially difficult to detect automatically. More specifically, we see direct mutual benefits when both phenomena occur either in the same expression or in neighboring expressions.

For pairs of linguistic expressions that show both relations, we can hypothesize paraphrasing relationships between NPs for which coreference is easier to detect. For instance, coreference between the two NPs in Example (10) is very likely given that they have the same head, head match being one of the most successful features in coreference resolution (Haghighi and Klein, 2009). In contrast, deciding on paraphrase would be hard due to the difficulty of matching the modifiers of the two NPs.

(10) a. The director of a multinational with huge profits. b. The director of a solvent company with headquarters in many countries.

In the opposite direction, we can hypothesize coreference links between NPs for which paraphrasing can be recognized with considerable ease (Example (11)). Light elements (e.g., fact), for instance, are normally taken into account in paraphrasing—but not in coreference resolution—as their addition or deletion does not involve a significant change in meaning.

(11) a. The creation of a company. b. The fact of creating a company.

By neighboring expressions, we mean two parallel structures each containing a coreferent mention of the same entity next to a member of the same paraphrase pair. Note that the coreferent expressions in the following examples are printed in italics and the paraphrase units are printed in bold. If a resolution module identifies the coreferent pairs in Example (12), then these can function as two anchor points, X and Y, to infer that the text between them is paraphrastic: X complained today before Y, and X is formulating the corresponding complaint to Y.

(12) a. ArgentinaX complained today before the British GovernmentY about the violation of the air space of this South American country. b. This ChancellorshipX is formulating the corresponding complaint to the British GovernmentY for this violation of the Argentinian air space.

Some authors have already used coreference resolution in their paraphrasing systems in a similar way to the examples herein. Shinyama and Sekine (2003) benefit from the fact that a single event can be reported in more than one newspaper article in different ways, keeping certain kinds of NPs such as names, dates, and numbers unchanged. Thus, these can behave as anchor points for paraphrase extraction. Their system uses coreference resolution to find anchors which refer to the same entity.

Conversely, knowing that a stretch of text next to an NP paraphrases another stretch of text next to another NP helps to identify a coreference link between the two NPs, as shown by Example (13), where two diction verbs are easily detected as a paraphrase and thus their subjects can be hypothesized to corefer. If the paraphrase system identifies the mapping between the indirect speech in Example (13-a) and the direct speech in Example (13-b), the coreference relation between the subjects is corroborated. Another difficult coreference link that can be detected with the help of paraphrasing is Example (14): If the predicates are recognized as paraphrases, then the subjects are likely to corefer.

(13) a. The trainer of the Cuban athlete Sotomayor said that the world record holder is in a fit state to win the Games in Sydney. b. "The record holder is in a fit state to win the Olympic Games," explained De la Torre. (14) a. Police officers searched 11 stores in Barcelona. b. The investigators carried out 11 searches in stores in the center of Barcelona.

Taking this idea one step further, new coreference resolution strategies can be developed with the aid of shallow paraphrasing techniques. A two-step process for coreference resolution might consist of hypothesizing first sentence-level paraphrases via n-gram or named-entity overlapping, aligning phrases that are (possible) paraphrases, and hypothesizing that they corefer. Second, a coreference module can act as a filter and provide a second classification. Such a procedure could be successful for the cases exemplified in Examples (12) to (14); a minimal sketch of this two-step idea is given after Example (16) below. This strategy reverses the tacit assumption that coreference is solved before sentence-level paraphrasing. Meaning alone does not make it possible to state that the two pairs in Example (5-b), repeated in Example (15), or the two pairs in Example (16) are paraphrases without first solving the coreference relations.

(15) a. The Atlético de Madrid goalkeeper, A. Jiménez, yesterday realized one of his dreams by defeating Barcelona: "I had never beaten Barcelona." b. The Atlético de Madrid goalkeeper, A. Jiménez, yesterday realized one of his dreams by defeating Barcelona, and said that he had never beaten Barcelona. (16) a. Secretary of State Colin Powell last week ruled out a non-aggression treaty. b. But Secretary of State Colin Powell brushed off this possibility.
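The following minimal Python sketch illustrates the two-step strategy mentioned above under strong simplifying assumptions: sentences are already tokenized, candidate mentions come with a pre-identified head word, paraphrase candidates are detected by plain n-gram overlap, and coreference is hypothesized between mentions sharing a head. All names, the threshold and the head-based alignment heuristic are ours, for illustration only; they are not a worked-out system.

    def ngram_overlap(sent_a, sent_b, n=2):
        """Jaccard overlap of word n-grams between two tokenized sentences."""
        def grams(tokens):
            return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        a, b = grams(sent_a), grams(sent_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def hypothesize_coreference(sent_a, mentions_a, sent_b, mentions_b,
                                threshold=0.3):
        """Step 1: treat the sentence pair as a paraphrase candidate if its
        n-gram overlap exceeds a threshold. Step 2: hypothesize coreference
        between mentions of the two sentences that share a head word.
        Mentions are (head_word, mention_text) pairs. A proper coreference
        module would then filter these hypotheses."""
        if ngram_overlap(sent_a, sent_b) < threshold:
            return []
        return [(text_a, text_b)
                for head_a, text_a in mentions_a
                for head_b, text_b in mentions_b
                if head_a.lower() == head_b.lower()]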

However, cooperative work between paraphrasing and coreference is not always possible, and it is harder if neither of the two can be detected by means of widely used strategies. In other cases, cooperation can even be misleading. In Example (17), the two bold phrases are paraphrases, but their subjects do not corefer. The detection of words like another (Example (17-b)) gives a key to help to prevent this kind of error.

(17) a. A total of 26 Cuban citizens remain in the police station of the airport of Barajas after requesting political asylum. b. Another three Cubans requested political asylum.

On the basis of these various examples, we claim that a full understanding of both the similarities and disparities will enable fruitful collaboration between researchers working on paraphrasing and those working on coreference. Even more importantly, our main claim is that such an understanding of the fundamental linguistic issues is a prerequisite for building paraphrase and coreference systems that do not lack linguistic rigor. In brief, we call for the return of linguistics to automatic paraphrasing and coreference applications, as well as to NLP in general, adhering to the call by Wintner (2009:643), who cites examples that demonstrate "what computational linguistics can achieve when it is backed up and informed by linguistic theory."

Acknowledgements

We are grateful to Eduard Hovy, M. Antònia Martí, Horacio Rodríguez, and Mariona Taulé for their helpful advice as experienced researchers. We would also like to express our gratitude to the three anonymous reviewers for their suggestions to improve this article. This work was partly supported by FPU Grants AP2006-00994 and AP2008-02185 from the Spanish Ministry of Education, and Project TEXT-MESS 2.0 (TIN2009-13391-C04-04).


CHAPTER 8

Identity, Non-identity, and Near-identity: Addressing the complexity of coreference

Marta Recasens (University of Barcelona), Eduard Hovy (USC Information Sciences Institute), and M. Antònia Martí (University of Barcelona)

Submitted to Lingua

Abstract

This article examines the mainstream categorical definition of coreference as 'identity of the real-world referents.' It argues that coreference is best handled when identity is treated as a continuum, ranging from full identity to non-identity, with room for near-identity relations to explain currently problematic cases. This middle ground is needed because in real text, linguistic expressions often stand in relations that are neither full coreference nor non-coreference, a situation that has led to contradictory treatment of cases in previous coreference annotation efforts. We discuss key issues for coreference such as conceptual categorization, individuation, criteria of identity, and the discourse model construct. We define coreference as a scalar relation between two (or more) linguistic expressions that refer to discourse entities considered to be at the same granularity level relevant to the pragmatic purpose. We present a typology of coreference relations, including various types of near-identity, that is developed and validated in a series of annotation exercises. We describe the operation of the coreference relations in terms of Fauconnier's mental space theory.

Keywords Coreference · Discourse · Categorization · Near-identity · Specification · Refocusing · Neutralization


8.1 Introduction

Coreference phenomena have been treated by theoretical linguists who study the relation between pronouns or definite descriptions and their antecedents, by discourse analysts who research factors contributing to coherence, by psycholinguists interested in the knowledge intervening in the interpretation of coreferent expressions, by logicians and language philosophers who analyze in terms of existence and truth conditions, and by computational linguists who attempt to build coreference resolution systems that automatically identify coreferent expressions in a text. Despite the varied interests, common to all of them is the understanding of coreference as 'identity of reference,' namely a relation holding between linguistic expressions that refer to the same entity. This apparently straightforward definition, however, hides a number of unexamined assumptions about reference and identity that we set out to explore in this article.

The shortcomings of the current definition become especially apparent when real corpora are annotated with coreference information (Versley, 2008; Poesio and Artstein, 2005), since the low levels of inter-annotator agreement usually obtained seem to go against the simplicity of the definition. Compare the two annotations for (1) and (2), where coreferent noun phrases (NPs) are printed in italics, and (a) and (b) are drawn from the ACE (Doddington et al., 2004) and OntoNotes (Pradhan et al., 2007a) corpora, respectively.

(1) a. On homecoming night Postville feels like Hometown, USA, but a look around this town of 2,000 shows it's become a miniature Ellis Island. This was an all-white, all-Christian community . . . For those who prefer the old Postville, Mayor John Hyman has a simple answer. b. On homecoming night Postville feels like Hometown, USA, but a look around this town of 2,000 shows it's become a miniature Ellis Island. This was an all-white, all-Christian community . . . For those who prefer the old Postville, Mayor John Hyman has a simple answer. (2) a. Last night in Tel Aviv, Jews attacked a restaurant that employs Palestinians. "We want war," the crowd chanted. b. Last night in Tel Aviv, Jews attacked a restaurant that employs Palestinians. "We want war," the crowd chanted.

The complexity exemplified by (1) and (2) arises when two references denote 'almost' the same thing, either for a single individual—Postville and the old Postville (1)—or across two groups—Jews, we, and the crowd (2). Such cases are indicative that the predominant categorical distinction between coreference (identity) and non-coreference (non-identity) is too limited—assuming that categorization in discourse is a pre-fixed process instead of a dynamic one—and so fails when confronted with the full range of natural language phenomena. Rather, coreference is best viewed as a continuum ranging from identity to non-identity, with room for near-identity relations to handle currently problematic cases that do not fall neatly into either full coreference or non-coreference.

The goal of this article is to develop a richer, more detailed understanding of coreference phenomena that explains under what circumstances linguistic expressions are interpreted as coreferent, or quasi-coreferent. To this end, we propose a novel framework that draws on insights from Jackendoff (1983, 2002), Fauconnier (1985, 1997), Geach (1967), Hobbs (1985), Nunberg (1984) and Barker (2010) among others. The framework resulted from reviewing key issues at the basis of coreference such as conceptual categorization, individuation, criteria of identity, and the role of pragmatics. In brief, we redefine coreference as a scalar relation between discourse entities (DEs, henceforth) conceived of as the same at the granularity level relevant to the pragmatic purpose. This leads us to propose a continuum from identity to non-identity through near-identity, which occurs when entities share most but not all feature values. We present a typology of those features whose change elicits a near-identity relation, which we account for in terms of three cognitive operations of categorization: specification, refocusing and neutralization. The former two create new indexical features, while the latter neutralizes potential indexical features. Such an understanding has consequences for the various branches of linguistics, from theoretical to psycho- and computational linguistics.

8.2 Background

Since coreference touches on subjects such as reference, categorization, and identity about which an extensive philosophical and linguistic literature exists, we can partly build on previous research. Only partly, however, because, as this section will reveal, there is a gap between real data and much previous theoretical work—which mostly uses prefabricated examples—that makes it unable to account for the problems exhibited by naturally occurring data.1 In this section, we discuss the main drawbacks of existing accounts while reviewing the main ideas from previous work that are relevant to our account of coreference, which will be fully presented in the next section. Throughout we make explicit the assumptions and commitments underlying our approach. At the risk of getting into deeply philosophical discussions, we will limit ourselves to the key ideas that serve as the basis to develop our coreference framework. In order to make it easier for the reader to follow the thread of this section, Fig. 8.1 is a concise diagram that connects the topics we will discuss. The shaded ovals indicate the relevant sections in this paper. Inside the box, the bottom sequence should be understood as one of the dimensions contained by the top sequence, i.e., language as part of our cognitive apparatus. We will start by defining the projected world in opposition to what we call ‘the world’ to then explore the elements and processes involved in the construction of the projected world. Entering the domain of language, we will consider the language-specific counterparts

1Appendix B includes the kind of real data that traditional models have failed to explain and that we build on for this article.


Figure 8.1: Language and cognition

to concepts—i.e., DEs—and to the projected world—i.e., the discourse model. Finally, we will get to our main subject of interest: identity relations and coreference, which play a key role in organizing DEs in the discourse model.

8.2.1 What reference is about

The realist theory that views reference as about the real world has underlain traditional theories of meaning from the theory of mediated reference (Frege, 1892), where a distinction is drawn between sense (intension) and reference (extension), to the theory of direct reference (Russell, 1905), where meaning is equated with reference. Common to them is the assumption that the target of linguistic reference is the objective, real world, whether directly or mediated by a sense. It was not until the advent of cognitive linguistics in the 1980s that this view began to be questioned in semantics.2 Drawing upon empirical findings from the school of Gestalt psychology, Jackendoff (1983) argues for a conceptualist theory of reference according to which the information conveyed by language refers to entities in the world as conceptualized by the language user. This world he calls the projected world. Since a number of mental processes participate in processing our environmental input, “one cannot perceive the real world as it is.” Rather, the projected world is the world as we experience it, i.e., constructed by our perceptual systems in response to whatever is “out there.” Following Jackendoff (1983), we need to distinguish between the real world as the source of environmental input and the projected world as the experienced world. In fact, the study of language does not need to take the real world into account but only the projected world, as direct access to the former is barred to us and so our linguistic expressions must necessarily refer to the latter. An immediate corollary is that language is necessarily subjective. That does not however imply unprincipled variability. The fact that the processes by which we construct the

2Before, in the 18th century, the philosopher Kant had distinguished the noumenal world (the real world) from the phenomenal world (the world we perceive).

projected world are universal makes our projections compatible to a major extent, thus enabling communication. Lakoff’s (1987) experientialism, while acknowledging the existence of the real world, is also based on the idea that all our perceptions are filtered by the body and so we cannot access any but the world as processed by our cognitive apparatus. He emphasizes the crucial consequences that the embodiment of thought entails, i.e., our understanding of the world is largely determined by the form of the human body. From this perspective, recurring patterns of understanding and reasoning such as metaphor and metonymic mappings that condition our perceptions of the world are in turn formed from our bodily interactions. By dissociating our account of coreference from real-world referents, we can abandon the real world and thus the requirements imposed by identity judgments in terms of an objective, unique world that often result in dead-end contradictions. Instead, the way entities are built in language is closely tied to our cognitive apparatus rather than to intrinsic properties of the entities themselves. The discourse model parallels the projected world.

8.2.2 Categorizing the projected world

Once we have replaced the real world with the projected world, we need to consider what forms and provides structure and regular behavior to the projected world, which immediately brings us to mental information, conceptual structures, categories, and the like, and at this point we start treading on thin ice for much remains unknown when it comes to the brain. As we will see in Section 8.3, the theories that most help explain the coreference facts come from Jackendoff’s (1983) conceptual semantics and Fauconnier’s (1985) mental space theory. Concepts and categories are closely intertwined, the former referring to all the knowledge that one has about categories—collections of instances which are treated as if they were the same. By arguing against the classical Aristotelian view that categories are defined by necessary and sufficient conditions—Wittgenstein (1953) being a precedent—Jackendoff (1983) claims that categories in the projected world are determined by complex perceptual and cognitive principles. Entities are not given by the external physical world, but it is the human cognitive apparatus that carves up the projected world into seemingly distinct and distinguishable categories, thus making divisions where there are none in the world. Jackendoff (1983) argues that for an entity to be projected there must be a corresponding conceptual constituent. We construct entities from the environmental input according to the concepts that we have experienced, learned, and structured in terms of prototypes and basic-level concepts (Lakoff, 1987). The situation itself, our previous experience, our intentions or needs, can make certain features more salient than others and lead us towards a particular individuation. A very important point in the categorization process is that it is graded rather than categorical. We are born with an “ability to conceptualize the world at different granularities and to switch among these granularities” (Hobbs, 1985). Thus, a mountain is categorized

differently depending on the situation: we will think of it as a very large hill when talking to a child; as a steep slope when going skiing; or as a volume that can be excavated when doing geology. Taking this flexibility into account will be key to understanding how coreference works. Fauconnier’s (1985; 1997) mental space theory is especially interesting for the present work as it was originally developed to address problems of indirect reference and referential opacity, although it has become useful to explain language phenomena and cognition in general. To this end, it provides, for purposes of local understanding, a frame of abstract mental structures known as mental spaces that are constructed while we think and talk, and in which referential structure can be projected. Mental spaces organize the unconscious processes that take place behind the scenes and that are highly responsible for meaning construction. The details of how they are set up and configured will become evident in Section 8.3.2.

8.2.3 Building DEs

It is by connecting to conceptual structures that language acquires meaning, and there can be no reference without conceptualization: “A language user cannot refer to an entity without having some conceptualization of it” (Jackendoff, 2002). Note, however, that being in the real world is not a necessary condition for reference, and an entity’s being in the world is not sufficient for reference either. The crucial feature for linguistic reference is to have what Jackendoff (2002) calls an indexical feature established by our mind in response to a perceptual input. An indexical feature brings about the construction of a discourse referent (Karttunen, 1976) or a DE (Webber, 1979). These are the instances we talk about by means of referring expressions, believing that they are objects “out there.” As a discourse evolves, DEs grow in number and populate the discourse model, which is a temporary mental ‘toy’ replica of the projected world built by language users specifically for interpreting a particular discourse. Thus, categorization in discourse occurs dynamically rather than statically. Apart from the collection of DEs, the discourse model includes the information that is said about them, i.e., their properties and the relations they participate in, and this information accumulates as the discourse progresses. Properties may validly be changed or introduced in the discourse that are clearly untrue of the original ‘real-world’ referents. Coreference relations thus occur not between ‘actual’ referents but between DEs. Like any other construct, DEs are subjective in that they ultimately depend on a language user’s specific discourse model. However, as is the case with the projected world, there tends to be a high degree of similarity between the discourse models built by different language users, at least within the same culture, and within the same discourse. This notwithstanding, misunderstandings might be caused by relevant differences between different models. Various formal representations of the discourse model have been suggested such as Kamp’s (1981) Discourse Representation Theory or Heim’s (1983) File Change Semantics. According to the view of language for which we argue (Fig. 8.1),

these are too restricted to the language level and largely ignore general cognitive processes that are not language-specific. In contrast, a more flexible representation integrating both language and cognition is provided by mental spaces (Fauconnier, 1985).3 A final point to be made in relation to DEs concerns their ontological type. Each type has its own characteristic conditions of identity and individuation, which has consequences for coreference decisions. Fraurud (1996) links ontological type with form of referring expression and suggests three main types. The most individuated entities are Individuals, i.e., “entities that are conceived of in their own right and that are directly identifiable, generally by means of a proper name.” A proper name has an indexical in its associated concept. Individuals are opposed on the one hand to Functionals, entities that “are conceived of only in relation to other entities” (e.g., a person’s nose), and on the other hand to Instances, entities that “are merely conceived of as instantiations of types,” such as a glass of wine. A different kind of knowledge is involved in interpreting each class: token or referent knowledge for Individuals; relational type knowledge for Functionals; and sortal type knowledge for Instances. However, whether a certain entity is conceived of as one or the other class is not a categorical question, but a matter of degree of individuation—or granularity.

8.2.4 Identity in the discourse model

Identity judgments between DEs become coreference judgments. As already hinted, we view coreference as the relation between expressions that refer to the same DE in the discourse model. Our approach to identity—and ‘sameness’—lies within the domain of discourse and distances itself from logical or philosophical ones, where applying an absolute notion of identity to the ever-changing physical world results in a number of paradoxes (Theseus’s Ship, Heraclitus’ river, Chrysippus’ Paradox, the Statue and the Clay, etc.). As pointed out by Fauconnier (1997, pg. 51), “a natural-language sentence is a completely different kind of thing from a sentence in a logical calculus.” Mathematical formulas give structural information explicitly and unambiguously. In contrast, language expressions do not have a meaning in themselves but only a meaning potential. Thus, natural-language sentences are best seen as “a set of (underspecified) instructions for cognitive construction” that allow for producing meaning within a discourse and a context. The so-called Leibniz’s Law4 fails in opaque contexts as exemplified by (3), where James Bond, the top British spy, has been introduced to Ursula as Earl Grey, the wealthiest tea importer. If the wealthiest tea importer is actually the very ugly Lord Lipton, then (3-a) is true, whereas (3-b) is

3A preliminary application of mental space theory to complex coreference phenomena occurs in Versley (2008).
4Leibniz’s Law and the Principle of the Identity of Indiscernibles state, respectively, that for all x and y, if x = y, then x and y have the same properties; and that for all x and y, if x and y have the same properties, then x = y.

false. Note that although the two names/descriptions are true of the same referent, one cannot be substituted for the other salva veritate due to their being embedded in Ursula’s beliefs.

(3) a. Ursula thinks the wealthiest tea importer is handsome. b. Ursula thinks Lord Lipton is handsome.

In response to the notion of absolute identity, Geach (1967) argues that there is only relative identity.5 An identity judgment must always be accompanied by some particular standard of sameness. That in accordance with which we judge corresponds to Geach’s (1962:39) criterion of identity, which he identifies as a common noun A—a sortal concept—typically understood from the context of utterance: “x is the same A as y but x and y are different Gs.” Reprising example (1) from Section 8.1, for which a notion of absolute identity produces two contradictory annotations, we find in Geach’s relative identity a satisfactory explanation: the old and the new Postville both refer to the ‘same city’ but to two different temporal instances: the city of Postville at time1 (a white, Christian community) and the city of Postville at time2 (with 2,000 citizens from varied nationalities). This case exemplifies the general problem of change and identity, i.e., how identity is preserved over time, for which two major philosophical theories exist. Endurantism views entities as wholly present at every moment of their existence. On the other hand, perdurantism claims that entities are four-dimensional—the fourth dimension being time—and that they have temporal parts. For perdurantists we can talk about entities not only in a temporal way (e.g., the old Postville versus the new Postville), but also in an atemporal way taking in all times at once (e.g., the city of Postville). In a similar vein, Barker (2010) points out that some sentences are theoretically—but not pragmatically—ambiguous between two readings: an individual-level or type reading, and a stage-level or token reading. The former results in a hypo-individuation while the latter in a hyper-individuation. It is also from this perspective that the identity between ‘coreferent’ discourse referents that evolve through discourse is considered by Charolles and Schnedecker (1993). The different granularity levels at which we categorize—and thus at which DEs can be construed—make it possible for us to conceive of identity relations at different degrees, more or less coarse. The degree of individuation is largely determined by the context of the discourse. In Hobbs’s (1985) words, “we look at the world under various grain sizes and abstract from it only those things that serve our present interests.” Linguistic studies that elaborate on the use of the words same and different (Nunberg, 1984; Baker, 2003; Barker, 2010; Lasersohn, 2000) coincide in that identity judgments take into consideration only those properties that are relevant to the pragmatic purpose, that is, “when we say that a and b are the same, we mean simply that they are the same for purposes of argument” (Nunberg, 1984:207).

5We still believe, however, that absolute identity exists at least as a mental concept relative to which the more useful notion of relative identity is understood.


In terms of Fauconnier’s (1997) mental space theory, a sentence in itself has no fixed number of readings, and different space configurations result in different construals of DEs. The choice between the formally possible configurations is partly resolved by pragmatic considerations such as relevance, noncontradiction, prototypicality, etc. These should be taken into account in a full-fledged description of coreference as they have consequences in determining the criteria of identity used for establishing coreference relations. The range of identity criteria explains why coreference is best approached as a continuum.

8.2.5 Summary

We can conclude this section with the following major assumptions:

(i) There is no unique physical world to which referring expressions all point, but a host of individual worlds projected by our minds.

(ii) DEs are constructed based on the concepts and categories responsible for building the projected world, thus with the same potential range of individuation.

(iii) The discourse model is the mental space dynamically constructed for discourse understanding, and so is the space where coreference takes place.

(iv) Coreference relations between DEs depend on criteria of identity determined by the communicative purposes.

8.3 Coreference along a continuum

The different elements presented in the previous section are integrated here into a single framework with the aim of reducing the gap between theoretical claims and empirical data. We start by revisiting the definition of coreference as it is currently understood and redefining it, followed by a description of the mental space framework and formal notation. This will provide the tools to present our continuum model for coreference as well as the operations of specification, refocusing and neutralization that we use to account for coreference in real data.

8.3.1 Definition

The mainstream definition of coreference can be phrased as

Coreference is a relation holding between two (or more) linguistic expressions that refer to the same entity in the real world.

This definition presents two major problems: its assumption that ‘sameness’ is a straightforward relation, and its commitment to the ‘real world’ as the domain of entities to which language refers. We propose the following alternative definition that forms the basis of our coreference model:


Coreference is a scalar relation holding between two (or more) linguistic expressions that refer to DEs considered to be at the same granularity level relevant to the pragmatic purpose.

Note that there are three keywords in this new definition. First, we no longer allude to the real world; rather, we place the coreference phenomenon within the discourse model, thus ‘DEs.’ Second, these entities are constructs of conceptualization mechanisms and, since there are degrees of individuation, the identity relation only holds at a certain ‘granularity level.’ Last, the granularity level is set at the value that is ‘relevant’ to the pragmatics of the particular discourse.

8.3.2 Fauconnier’s mental spaces

The general structure of our framework draws on Fauconnier’s (1985, 1997) mental space theory. Its value lies in the tools it provides for making explicit the construction of meaning from the (underspecified) forms of language, as these themselves contain little of what goes into meaning construction. By operating at the conceptual level and unlike truth-conditional approaches, mental spaces allow for a broad range of potential meanings that are narrowed down conveniently as a function of the discourse context. Our two main focuses will be mental space elements—high-order mental entities corresponding to DEs that are named by NPs—and the connections between them. Showing how connections are established, and how this affects coreference judgments, constitutes a major contribution of this article. Following usual notational conventions, we use circles to diagram mental spaces—the cognitive domains between which mappings and links are automatically established as we think and talk. They contain elements (represented by lower case letters), connectors (represented by lines) that relate elements across spaces based on identity, analogy, representation, etc., and links (represented by straight dashed lines) between mental spaces. The starting point for any mental space configuration is the base space, and subordinate mental spaces are set up in the presence of space builders, i.e., language forms that point to conceptual domains (perspectives) like time, beliefs, wishes, movies, or pictures. Counterparts of elements created in other spaces are represented by the same letter with a subscript number. The Access Principle defines a general procedure for accessing elements: “If two elements a and a1 are linked by a connector F (a1 = F(a)), then element a1 can be identified by naming, describing, or pointing to its counterpart a.” Example (4), shown in Fig. 8.2, is borrowed from Fauconnier (1985) and provides a succinct explanation of how mental space configurations are built up.

(4) In the movie Orson Welles played Hitchcock, who played a man at the bus stop.

Figure 8.2: Mental space configuration of (4)

The base is always placed at the top and linked to its child spaces by a subordination relation. In this case, the base represents the reality space with the two DEs introduced by Orson Welles and Hitchcock. In addition, the two characters played by these two actors appear in the movie space, giving rise to two additional DEs. Then, Orson Welles-the-person is linked with Hitchcock-the-character (Connector 1), and Hitchcock-the-person is linked with the man at the bus stop (Connector 2). The two connectors exemplify actor-character relations. Note that we could add a third connector linking Hitchcock-the-person (b1) with Hitchcock-the-character (b2), as this is a link—of the representation type—that we would make for a coherent discourse. With such a framework then, the different granularity levels at which DEs can be conceived can be easily represented by adding subordinate mental spaces with counterparts to DEs in a previous space (i.e., DEs constructed earlier in the ongoing linguistic exchange). By setting up a movie space, the discourse context in (4) turns the granularity level of person versus representation into a relevant one. In the diagrams we only show the mental spaces that are activated according to the discourse interests. That is to say, the same elements placed in another discourse could give rise to a different mental space configuration.
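To make the notational machinery concrete, the following minimal Python sketch models the configuration that Fig. 8.2 assigns to (4). It is an illustrative rendering only: the class names, attributes and the access helper are our own choices, not part of Fauconnier’s formalism.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    label: str         # e.g., "a", "b1", "b2"
    description: str   # the NP that names the element

@dataclass
class MentalSpace:
    name: str
    parent: "MentalSpace | None" = None        # the base space has no parent
    elements: list = field(default_factory=list)

@dataclass
class Connector:
    kind: str          # e.g., "actor-character", "representation"
    source: Element
    target: Element

# Example (4): "In the movie Orson Welles played Hitchcock,
# who played a man at the bus stop."
base = MentalSpace("base (reality)")
movie = MentalSpace("movie", parent=base)      # space builder: "In the movie"

a = Element("a", "Orson Welles")
b1 = Element("b1", "Hitchcock (the person)")
b2 = Element("b2", "Hitchcock (the character)")
c = Element("c", "a man at the bus stop")
base.elements += [a, b1]
movie.elements += [b2, c]

connectors = [
    Connector("actor-character", a, b2),       # Connector 1
    Connector("actor-character", b1, c),       # Connector 2
    Connector("representation", b1, b2),       # the optional third connector
]

def counterparts(element, connectors):
    """Access Principle: an element can be identified via its linked counterparts."""
    return [k.target for k in connectors if k.source is element] + \
           [k.source for k in connectors if k.target is element]

print([e.description for e in counterparts(b1, connectors)])
```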

8.3.3 Continuum

A mental space, representing a coherent perspective on some portion of the (possibly partly imaginary) world, contains the entities (and events, etc.) present in that portion. Each entity is conceptualized by discourse participants with a set of associated features with specific values characteristic of the particular space. According to the traditional definition of coreference, entities with the same feature values are coreferent (Fig. 8.3(a)) while entities with different feature values are not (Fig. 8.3(e)). There is, however, a third, in-between possibility, namely that entities share most but not all feature values, and this is our main concern in this article and the reason for assuming a continuum model of coreference. One arrives at this middle-ground domain of near-identity by exclusion: if a relation does not


Figure 8.3: Mental space configurations of the coreference continuum

fall into either identity or non-identity, then we are confronted by a near-identity relation (Fig. 8.3(b)-(d)). We claim that three different cognitive operations of categorization (specification, refocusing and neutralization) underlie near-identity relations. They are presented in Section 8.3.4. Throughout a discourse, some DEs are mentioned multiple times and new features might be introduced, old features might be omitted or their values changed, etc. The speaker states a series of feature–value pairs that the hearer is able to recognize or know as (supposedly) true of the DE (at that time), enough to pick it out uniquely. The problem of coreference is determining whether a new expression in a discourse refers to the ‘same’ prior DE, or whether it introduces a new (albeit possibly very closely related) one. Rephrased in terms of the mental space framework, the problem of coreference is determining the (in)compatibility between the feature values of the various elements and counterparts in other spaces. As we will

show, feature values can be different but only potentially incompatible, where the decision is a contextual one. In our continuum model, the configuration of mental spaces is guided by two main principles:

1. Linguistic expressions (e.g., temporal phrases) that involve a change in a feature value (e.g., time) function as space builders.

2. The pragmatic context suggests a preference for feature value compatibility (or not), and hence for identity, near-identity, or non-identity of reference.

A taxonomy of the types of features that most frequently require the building of a new subordinate space when their value is changed is presented in Section 8.4. As we will show, most of the features are typical of the entities typified as Individuals by Fraurud (1996). Being the entities that are conceived of in their own right and of which we possess token knowledge, Individuals are prone to be construed at different granularities. How two different values for the same feature compare is constrained by the pragmatic context, which can either emphasize or collapse the value difference. Explaining coreference by the co-existence of different mental spaces and their complex interplay as discourse unfolds overcomes the shortcomings of the traditional categorical definition of coreference in naturally occurring data.
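As a toy illustration of this feature-value view, one can picture the comparison of two DEs as a check over the feature values they both specify. The representation and the cutoffs below are hypothetical simplifications, since the real decision is pragmatic and contextual rather than a fixed threshold.

```python
def classify_relation(features_a: dict, features_b: dict) -> str:
    """Toy placement of a pair of DEs on the coreference continuum.

    Only features specified for both DEs are compared, mirroring the idea
    that underspecified values do not by themselves create clashes.
    """
    shared = set(features_a) & set(features_b)
    if not shared:
        return "underspecified"      # nothing to compare yet
    clashes = [f for f in shared if features_a[f] != features_b[f]]
    if not clashes:
        return "identity"            # Fig. 8.3(a): all compared values match
    if len(clashes) < len(shared):
        return "near-identity"       # Fig. 8.3(b)-(d): most but not all values match
    return "non-identity"            # Fig. 8.3(e): the compared values diverge

# Example (5): Postville vs. the old Postville
postville = {"name": "Postville", "time": "present"}
old_postville = {"name": "Postville", "time": "past"}
print(classify_relation(postville, old_postville))   # near-identity
```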

8.3.4 Specification, refocusing and neutralization

Specification and neutralization are two operations of categorization that work in opposite directions. Specification, which adds features, is a shift towards greater granularity, while neutralization, which removes them, is a shift towards less granularity. The former generates from an entity one or more finer-grained entities by adding features (that are however consistent with the original), thereby creating new indexical features. Neutralization, on the other hand, blends and conflates two or more similar entities into a more general or vague category by removing features, thereby neutralizing potential indexical features. Finally, the refocusing operation, similar to specification, adds more features, but ones whose values override the original’s existing (assumed) values in ways that are not consistent, thereby creating new indexical features that may or may not be more specified than the original. These three operations are best illustrated with the Postville and Jews examples from Section 8.1, repeated in (5) and (6).

(5) On homecoming night Postville feels like Hometown, USA, but a look around this town of 2,000 shows it’s become a miniature Ellis Island. This was an all-white, all-Christian community . . . For those who prefer the old Postville, Mayor John Hyman has a simple answer. (6) Last night in Tel Aviv, Jews attacked a restaurant that employs Palestinians. “We want war,” the crowd chanted.


Figure 8.4: Mental space configurations of (5) and (6)

In (5), one entity is Postville, whose name feature carries the value ‘Postville’ (Fig. 8.4(a)). The second mention (this town of 2,000) predicates a new property of an existing entity. Since mental spaces are defined as a particular (value-defined) perspective over the constituent entities, etc., it is in the nature of the theory of mental spaces that when one introduces a new value for a feature, one must, by construction, generate a new subordinate space. The citizens number feature specifies detail that is consistent with the existing DE as defined so far. This value augmentation is what we call ‘specification.’ The past time value of the third mention (the old Postville), however, clashes with the implicit time feature of the previous DE, which carries the value ‘the present.’ This value replacement occurs with ‘refocusing.’ Changing the time value from ‘the present’ to ‘the past’ for the Postville entity automatically brings into existence the new-Postville space that contains the updated Postville entity. One aspect of entities and features makes the operation of mental spaces more complex. Some features may be underspecified or take multiple values, as occurs with the Jews example (6). The introductory entity ‘Jews’ is a conceptual set and hence has a members feature with values {person_1, person_2, . . . , person_n}. The subsequent mentions we and the crowd also have a members feature, but the key issue here is not whether every member of the collection is present in all three values, but rather the set itself. For the purposes of this paper, it is irrelevant whether those who chanted are a subgroup or all of those who attacked the restaurant, or whether one of the individuals who chanted was not Jewish. Thus, we say that the three mentions have been ‘neutralized’ by losing a distinctive value (Fig. 8.4(b)). These two examples serve to illustrate the role of context. When the feature value changes for communicative purposes, like in (5), where the city of Postville is split into temporal slices to draw a distinction between the old and the new city, then we are faced with a refocusing shift between a1, a2, etc. (Fig. 8.4(a)). In

contrast, a neutralization shift occurs when the change in value has no goal other than to present a new perspective or subsection of the old one in such a way that the feature ceases to be distinctive, and a, a1, a2, etc., blend together (Fig. 8.4(b)).
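The three shifts can also be read as edits over the feature–value sets attached to DEs. The sketch below is only a schematic rendering of the walk-through above; the dict representation and the function names are our illustrative choices, not part of the model itself.

```python
# Toy rendering of the three categorization operations over feature-value sets.

def specification(entity: dict, added: dict) -> dict:
    """Value augmentation: add detail consistent with the existing DE."""
    assert all(entity.get(k, v) == v for k, v in added.items()), "must stay consistent"
    return {**entity, **added}

def refocusing(entity: dict, replaced: dict) -> dict:
    """Value replacement: override existing values in inconsistent ways."""
    return {**entity, **replaced}

def neutralization(*entities: dict, dropped: str) -> dict:
    """Value loss: blend entities by removing the feature that kept them distinct."""
    blended = {}
    for e in entities:
        blended.update({k: v for k, v in e.items() if k != dropped})
    return blended

# (5) Postville: specification (citizens number), then refocusing of the time value,
# which corresponds to building the new-Postville space.
postville = {"name": "Postville", "time": "present"}
postville = specification(postville, {"citizens": 2000})
old_postville = refocusing(postville, {"time": "past"})

# (6) Jews / we / the crowd: the members feature ceases to be distinctive.
jews = {"type": "group", "members": "Jews who attacked the restaurant"}
crowd = {"type": "group", "members": "those who chanted"}
print(neutralization(jews, crowd, dropped="members"))   # {'type': 'group'}
```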

8.4 Types of (near-)identity relations

In this section we describe a study that identifies the (types of) features, organized into a hierarchy, that require mental space shifts when their values are changed (Recasens et al., 2010a). We distinguish ten different features that give rise to specification, refocusing or neutralization depending on whether the context implies a value augmentation, a value replacement, or a value loss.6 Although this is not an exhaustive typology, it does provide the main features whose change of value results in a near-identity relation (a compact summary is sketched after the typology below). We arrived at this typology by a bottom-up process, first extracting problematic coreference relations7 from real data, and then comparing inter-annotator agreement and readjusting the classes.

1. Name metonymy. Proper nouns naming complex Individuals are space builders when used metonymically, as they have at least one feature that can take different values depending on the facet(s) from which the DE is seen. For example, a company produces a product, is headquartered in a location, employs a president, etc. Under Name metonymy, a proper noun places an element in the base space, and one or more subsequent NPs refer(s) to facet(s) of the DE. Since the specific facets available depend on the type of entity under consideration, there are a great many possibilities. Nonetheless, certain metonymies occur frequently enough that we name here a few subtypes.

1a. ROLE. A specific role or function performed by a human, animal or object makes it possible to split an entity along the role feature. This can be professional (paid, as in teacher), non-professional (unpaid, as in comforter), or kinship (as in aunt). In (7-a), the actor (a1) and father (a2) pertain to two different roles of the same individual Gassman (a). The opposition expressed in the citation pertains to the typical activities of Gassman (actor-like actions versus father-like ones) and so causes a complete value replacement. The refocusing relation is displayed in Fig. 8.5(a). In contrast, the context presented in (7-b), which does not make the Gassman-the-actor alteration relevant but simply adds more detail, results in the mental space configuration of Fig. 8.5(b), where a and a1 are only related by specification and no third mental space needs to be introduced.

6To avoid confusion, note that we are not listing ISA classes, but the types of the near-identity classes. These are conceptually different things.
7By problematic we mean those cases that involved disagreements between the annotators or that could be argued either way—coreferent or non-coreferent—according to the authors.


Figure 8.5: Mental space configurations of (7-a) and (7-b)

(7) a. “Your father was the greatest, but he was also one of us,” commented an anonymous old lady while she was shaking Alessandro’s hand—[Gassman]a’s best known son. “I will miss [the actor]a1, but I will be lacking [my father]a2 especially,” he said.
b. Hollywood beckoned and [Gassman]a was put under contract at MGM but the studio didn’t know how best to exploit [the actor]a1’s capabilities.

1b. LOCATION. As a meta-concept, the name of a location triggers a feature that can be filled with facet(s) like the physical place, the political organization, the population, the ruling government, an affiliated organization (e.g., a sport team), an event celebrated at that location, or a product manufactured at that location. In (8), the first mention of Russia (a) can be metonymic for the political organization, the government, etc., whereas a1 presents a more specified mention that explicitly refers to the government (Fig. 8.5(b)-like).

(8) Yugoslav opposition leaders sharply criticized both the United States and [Russia]a today as a general strike against President Slobodan Milosevic gained momentum . . . Kostunica accused [the Russian government]a1 of indecision.

1c. ORGANIZATION. As a meta-concept, the name of a company or other social organization triggers a feature that can be filled with facet(s) like the legal organization itself, the facility that houses it, its shares on the stock market, its people or employees, a product that it manufactures, or an associated event like a scandal. Note that near-identity is what licenses in (9) the use of a pronoun referring to the drink despite the fact that its antecedent refers to the company. The irreconcilable features of a2 result in a refocusing relation (Fig. 8.5(a)-like).


(9) [Coca-Cola]a1 went out of business, which John thought was a shame, as he really enjoyed drinking [it]a2.

1d. INFORMATIONAL REALIZATION. An Individual corresponding to an informational object (e.g., story, law, review, etc.) contains a format feature that specifies the format in which the information is presented or manifested (e.g., book, movie, speech, etc.). In (10), refocusing explains the near-identity relation between the movie (a1) and the book (a2), which clash in their format value but are identical in their content, the story (Fig. 8.5(a)-like).

(10) She hasn’t seen [Gone with the Wind]a1, but she’s read [it]a2.

2. REPRESENTATION. Representational objects (pictures, statues, toy replicas, characters, maps, etc.) have a real/image feature as they generate, for an entity X, two mental spaces containing respectively Real-X and Image-X. For Image-X to be a representation of Real-X, Jackendoff (1983, pg. 221) points out two preference rules: (i) dubbing, by which the creator of the image has stipulated the entity in question as an Image-X, and (ii) resemblance, by which Image-X must somehow look like Real-X. There can be more than one Image-X, like in (11), where a2 replaces the image value of a1 (Fig. 8.5(a)-like). The representation can also be of a more abstract kind, like one’s mental conceptualization of an object.

(11) We stand staring at two paintings of [Queen Elizabeth]a. In the one on the left, [she]a1 is dressed as Empress of India. In the one on the right, [she]a2 is dressed in an elegant blue gown.

3. Meronymy. The different value of the constitution feature (e.g., parts, composition, members) between meronyms and holonyms can be neutralized in a near-identity relation. Inspired by Chaffin et al. (1988), we distinguish the following three main subtypes.

3a. PART·WHOLE. It is possible for an entity whose parts feature value carries a functionally relevant part of another entity to neutralize with the latter. In (12), President Clinton (a) is seen as a functioning part of the entire US government (a1). By neutralizing them we drop those features of Clinton that make him a person and keep only those that make him a government functionary (Fig. 8.6).

(12) Bangladesh Prime Minister Hasina and [President Clinton]a expressed the hope that this trend will continue . . . Both [the US government]a1 and American businesses welcomed the willingness of Bangladesh to embrace innovative approaches towards sustainable economic growth.


Figure 8.6: Mental space configuration of (12)

3b. STUFF·OBJECT. It is possible for a DE to neutralize with another DE if the composition feature value of one carries the main constituent material of the other. Unlike components, the stuff of which a thing is made cannot be separated from the object. Given that the most relevant component of alcoholic drinks is alcohol, the two can be neutralized, as in (13), to refer to the ‘same’ (Fig. 8.6-like).

(13) The City Council approved legislation prohibiting selling [alcoholic drinks]a during night hours . . . Bars not officially categorized as bars will not be allowed to sell [alcohol]a1.

3c. OVERLAP. When two DEs denote two overlapping (possibly unbounded) sets, discourse participants intuitively neutralize the members feature as near-identical even though they might not correspond to exactly the same collection of individuals. Unlike PART·WHOLE (3a), the collection consists of repeated, closely similar, members, and the members are not required to perform a particular function distinct from one another. The Jews example above (6) as well as (14) belong here (Fig. 8.6-like).

(14) [An international team]a is developing a vaccine against Alzheimer’s disease and [they]a1 are trying it out on a new and improved mouse model of the disease.

4. Spatio-temporal function. Temporal and locative phrases change the space or time feature of an entity: it is the ‘same’ entity or event but realized in another location or time. Accordingly, we differentiate the following two subtypes.

4a. PLACE. If a DE is instantiated in a particular physical location, it generates a more specified DE with a specific place feature value. It is possible for the fine-grained copies to coexist but not in the same place. Although the two NPs in (15) refer to the same celebration, the first place feature carries the value ‘New York’ while the second refocuses the value to ‘Southern Hemisphere’ (Fig. 8.5(a)-like).


(15) [New York’s New Year’s Eve]a1 is one of the most widely attended parties in the world . . . Celebrating [it]a2 in the Southern Hemisphere is always memorable, especially for those of us in the Northern Hemisphere.

4b. TIME. Different subordinate DEs are created due to a change in the time feature value, which is underspecified in the base DE. Seeing an object as a set of temporal slices, each subordinate DE represents a slice of the object’s history, like the Postville example (5). It is not possible for the temporally-different DEs to coexist. Note that the night in (16) can be replaced with this year’s New Year’s Eve (Fig. 8.5(a)-like).

(16) After the extravagance of [last year’s New Year’s Eve]a1, many restaurants are toning things down this year, opting for a la carte menus and reservations throughout [the night]a2.

Spatio-temporal near-identity typically results from a numerical function (17-a) or a role function (17-b). Subordinate DEs refer to either the same function (e.g., price, age, rate, etc.) or the same role (e.g., president, director, etc.) as the base DE, but have different numerical values or are filled by a different person due to a change in time, space or both.

(17) a. At 8, [the temperature]a1 rose to 99º. This morning [it]a2 was 85º.
b. In France, [the president]a1 is elected for a term of seven years, while in the United States [he]a2 is elected for a term of four years.
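For reference, the typology above can be summarized compactly. The following sketch merely restates Section 8.4 as a data structure; the grouping and feature names come from the typology itself, while the representation is an illustrative choice.

```python
# Compact restatement of the Section 8.4 typology: for each subtype, the
# top-level relation type and the feature whose value change triggers the shift.
NEAR_IDENTITY_FEATURES = {
    "ROLE":                      ("Name metonymy",            "role"),
    "LOCATION":                  ("Name metonymy",            "facet of a location"),
    "ORGANIZATION":              ("Name metonymy",            "facet of an organization"),
    "INFORMATIONAL REALIZATION": ("Name metonymy",            "format"),
    "REPRESENTATION":            ("Representation",           "real/image"),
    "PART·WHOLE":                ("Meronymy",                 "parts"),
    "STUFF·OBJECT":              ("Meronymy",                 "composition"),
    "OVERLAP":                   ("Meronymy",                 "members"),
    "PLACE":                     ("Spatio-temporal function", "place"),
    "TIME":                      ("Spatio-temporal function", "time"),
}

assert len(NEAR_IDENTITY_FEATURES) == 10   # the ten features of Section 8.4
```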

8.5 Stability study

As part of the bottom-up process of establishing the most frequent types of near-identity relations (Section 8.4), we carried out three annotation experiments on a sample of naturally occurring data. They helped identify weaknesses in the typology and secure stability of the theoretical model. The last experiment established inter-annotator agreement at acceptable levels: κ = 0.58 overall, and up to κ = 0.65 and κ = 0.84 leaving out one and two outliers, respectively. We briefly summarize these previous experiments and discuss the results, as they led us to the idea that different values for the same feature relate not only in a near-identity way but also in either a specification, refocusing or neutralization direction. Most of the remaining disagreements were explainable in these terms. The study as a whole provided evidence that coreference is best approached as a continuum.


8.5.1 Method

8.5.1.1 Participants

Six paid subjects participated in the experiments: four undergraduate students and two authors of this paper. Although the undergraduates were not linguistics students, they were familiar with annotation tasks requiring semantic awareness, but had not worked on coreference before.

8.5.1.2 Materials

A total of 60 text excerpts were selected from three electronic corpora—ACE (Doddington et al., 2004), OntoNotes (Pradhan et al., 2007a) and AnCora (Recasens and Martí, 2010)—as well as from the Web, a television show, and real conversation. The excerpts were divided into three groups of 20, each including examples of the different coreference types in different proportions so that annotators could not reason by elimination or the like. To the same effect, each round varied the proportions, with a mean of 27% identity, 67% near-identity, and 6% non-identity cases. The largest number of examples was always near-identity because this was our main interest. In each excerpt, two or more NPs were marked with square brackets and were given a subscript ID number. Apart from the set of 20 excerpts, annotators were given an answer sheet where all the possible combinatorial pairs between the marked NPs were listed. The first 20 excerpts included 78 pairs to be analyzed; the second, 53, and the third, 43. The excerpts that were used are collected in Appendix B.8

8.5.1.3 Procedure

The task required coders to read the annotation guidelines and classify the selected pairs of NPs in each excerpt according to the (near-)identity relation(s) that obtained between them by filling in the answer sheet. They had to assign one or more, but at least one, class to each pair of NPs, indicating the corresponding type and subtype identifiers. They were asked to specify all the possible (sub)types for underspecified pronouns and genuinely ambiguous NPs that accepted multiple interpretations, and to make a note of comments, doubts or remarks they had. The three groups of 20 excerpts were annotated in three separate experiments, spread out over a span of four weeks. In each experiment, a different version of the annotation guidelines was used, since the typology underwent substantial revision—to a decreasing extent—after completing each round.

8We include the entire collection of selected texts in the appendices as they make evident the limitations of a categorical definition of coreference as well as the difficulty of the task.


8.5.2 Results and discussion

Inter-coder agreement was measured with Fleiss’s kappa (Fleiss, 1981), as it can assess the agreement between more than two raters, unlike other kappas such as Cohen’s kappa. The measure calculates the degree of agreement in classification over that which would be expected by chance and its values range between -1 and 1, where 1 signifies perfect agreement, 0 signifies no difference from chance agreement, and negative values signify that agreement is weaker than expected by chance. Typically, a kappa value of at least 0.60 is required. For the cases in which a coder gave multiple relations as an answer, the one showing the highest agreement was used for computing kappa. Kappa was computed with the R package irr, version 0.82 (Gamer et al., 2009). Statistical significance was tested with a kappa z-test provided by the same package.
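For readers who wish to reproduce the statistic outside R, the computation behind Fleiss’s kappa is small enough to restate directly. The sketch below is a plain Python reimplementation of the standard formula (it is not the irr code used in the study), and the toy ratings matrix is invented purely for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss's kappa. counts[i][j] = number of raters assigning item i to category j.

    Every item is assumed to be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    total = n_items * n_raters

    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement: based on the overall proportion of each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 NP pairs, 6 coders, 3 labels (identity / near-identity / non-identity).
ratings = [
    [6, 0, 0],
    [4, 2, 0],
    [1, 4, 1],
    [0, 3, 3],
]
print(round(fleiss_kappa(ratings), 2))
```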

8.5.2.1 Experiment 1

The 20 texts used in this first experiment, which served as a practice round, are included in Appendix B.1. After counting the number of times a type was assigned to each pair of NPs, we obtained only an overall κ = 0.32. More importantly, this first experiment revealed serious shortcomings of the first version of the typology. In this regard, the comments and notes included by the coders in the answer sheet were very helpful. Very few cases obtained high agreement. We were surprised by pairs such as (4, 2-3) and (15, 1-2)9 for which coders selected four—or even five—different types. At this early stage, we addressed most of the disagreements by including additional types, removing broad ones without identifying force, or restricting the scope of existing ones. We also improved the definitions in the guidelines, as they generally lacked criteria for choosing between the different types. Interestingly, we observed that most relations with two-type answers included a near-identity type and either IDENTITY (6, 1-3) or NON-IDENTITY (6, 2-3). Apart from supporting the continuum view, this was later the inspiration for distinguishing two main directions within near-identity: relations perceived on the borderline with identity would fall into either specification or neutralization, whereas those perceived on the borderline with non-identity would fall into refocusing. On the other hand, some relations with varied answers were indicative of the multiplicity of interpretation—and thus the difficulty of a categorical classification task. It is in this same regard that we interpreted multiple answers given by the same annotator, the highest number of types being three. In (14), the types given to the NP pairs got swapped between coders: coder a interpreted (14, 1-2) as an IDENTITY relation and coder b as a TIME relation, but vice versa for (14, 1-3). It was mostly an effect of the underspecified nature of the pronoun. Note that this disagreement can be better accounted for in the light of neutralization.

9References to the excerpts in Appendix B are as follows: first the excerpt number, and then the ID numbers of the two NPs whose relation is under analysis.


8.5.2.2 Experiment 2

As a result of revising the guidelines after Experiment 1, the agreement of the second set of 20 texts (Appendix B.2) reached κ = 0.54. In contrast with Experiment 1, the answers were not so spread over different types. To address the low agreement obtained by a few types, a solution was found in setting clear preferences in the guidelines for cases when it was possible for two near-identity classes to co-occur, as more than one feature value can change and still be perceived as near-identity, e.g., LOCATION and PART·WHOLE (29, 1-2). Again, some of the pairs with two- or three-type answers manifested different mental space configurations compatible with the same discourse, as some cases accepted more than a single viewpoint, e.g., ROLE and REPRESENTATION (27, 1-2). Similarly, some of the isolated (5-to-1) answers revealed yet another—though less frequent—interpretation (40, 1-2). A large number of isolated answers, however, made us consider the possible presence of outliers, and we detected two. If agreement was computed between the other five coders, we obtained a considerable improvement resulting in κ = 0.63; between the other four coders, κ = 0.71.

8.5.2.3 Experiment 3

The final set of 20 texts (Appendix B.3) obtained a further improvement in agreement, κ = 0.58, as shown by the kappa scores in Table 8.1, and up to κ = 0.65 and κ = 0.84 leaving out the one and two outliers, respectively. The changes introduced in the typology after Experiment 2 were small compared with the revision we undertook after Experiment 1. Basically, we improved the guidelines by adding some clarifications and commenting all the examples. Nevertheless, the (near-)identity task is difficult and requires a mature sensitivity to language that not all coders had, as revealed by the presence of outliers. The fact that this third experiment showed a lower number of many-type-answer relations, an insignificant number of relations with both IDENTITY and NON-IDENTITY answers, but still a high number of two-type-answer relations, most of them including a near-identity type and either IDENTITY or NON-IDENTITY, led us to conclude that disagreements were mainly due to the fuzziness between borderline identity types rather than to the typology of near-identity types. It emerged that the feature values were not always interpreted uniformly by all coders: near-identical for some, and simply identical or non-identical for others. At this point we took the decision of dividing the middle ground of the coreference continuum into three directions—specification, neutralization and refocusing—in order to have three umbrella terms for such borderline cases. The limitations of categorical approaches were manifested again by cases accepting multiple interpretations, which is in accordance with the predictions of mental space theory. One feature type tends to prevail, as shown by the large number of isolated answers, but there were a few 50%–50% cases. For instance, (59, 1-2) included four OVERLAP answers, four TIME, and one NON-IDENTITY.


Relation          Type                         Subtype                      Kappa   z       p-value
1. Non-Identity                                                             0.89    22.64   0.00
2. Identity                                                                 0.30    7.55    0.00
3. Near-Identity  A. Name metonymy             a. Role                      -0.00   -0.10   0.92
                                               b. Location                  0.87    22.01   0.00
                                               c. Organization              0.48    12.09   0.00
                                               d. Information realization   0.49    12.54   0.00
                                               e. Representation            0.59    15.08   0.00
                                               f. Other                     0.59    15.08   0.00
                  B. Incidental meronymy       a. Part·Whole                -0.00   -0.10   0.92
                                               b. Stuff·Object              0.80    20.22   0.00
                                               c. Overlap                   0.73    18.44   0.00
                  C. Class                     a. More specific             0.39    9.80    0.00
                                               b. More general              0.38    9.61    0.00
                  D. Spatio-temporal function  a. Place                     0.67    16.90   0.00
                                               b. Time                      0.70    17.70   0.00
                                               c. Numerical function
                                               d. Role function             -0.01   -0.20   0.84
Total                                                                       0.58    39.50   0.00

Table 8.1: Results of Experiment 3

It revealed the fact that discourse participants do not always conceptualize entities in the same way: while the people and they could be two groups with overlapping members, they could also have two different time features (the people from the past versus the people from today). The general conclusion we drew was that regarding coreference in terms of a continuum is certainly the most explanatory approach: there are prototypical examples illustrating clearly each identity type but also a wide range of intermediate cases accepting interpretations from varied angles, depending on the dimension that is felt as dominant. The typology presented in Section 8.4 is a compact version that does away with the overly specific or redundant types of Table 8.1.

8.6 Conclusion

We discussed the shortcomings of a categorical understanding of coreference as it is too limited to take into account the role of cognitive processes in the dynamic interpretation of discourse, and hence leads to contradictory analyses and annotation. It fails when confronted with the full range of natural language phenomena. The complexity of coreference becomes apparent once we reject the naive view of linguistic expressions as mere pointers to a unique objective world, and acknowledge that the categories and concepts of our mental apparatus rely on a projected world. Discourse constructs its own model with its own entities, which language users conceptualize at a coarser or more fine-grained granularity depending on

the communicative purpose. In discourse, identity behaves in a fashion different from mathematical or logical identity. Accordingly, we argued for a continuum approach to coreference that contemplates middle-ground relations of near-identity, which make complete sense in the framework of Fauconnier’s mental space theory. Near-identity appears to be key to describing those relations between elements of different spaces that share most but not all feature values. Three inter-annotator agreement studies provided further evidence for a continuum approach to coreference and led us to distinguish the main types of features that typically result in near-identity relations when their value differs. In addition, we identified three major cognitive operations of categorization depending on whether there is an expansion of a feature value (specification shift), a complete value replacement (refocusing shift), or a loss of a distinctive value (neutralization shift). The fact that it is possible for the same relation to be explained by a change in different feature types is a direct reflection of the rich and varied categorization process that underlies discourse interpretation, thus suggesting that any effort to limit it to one single feature type is likely to fail. Rather, our framework is best viewed as a set of directions and tendencies that help interpret how coreference phenomena occur in discourse under the understanding that there are no absolute and universal rules.

Acknowledgements We are grateful to Jerry Hobbs for his valuable insights, and to the annotators: David Halpern, Peggy Ho, Justin James, and Rita Zaragoza. This work was supported in part by the Spanish Ministry of Education through an FPU scholarship (AP2006-00994) and the TEXT-MESS 2.0 Project (TIN2009-13391-C04-04).



CHAPTER 9

Conclusions and Future Directions

In this final chapter, I look back at what has been accomplished in this thesis. The main accomplishment has been to expand the understanding of coreference by taking a broader view of the problem and uncovering various barriers that currently hinder the successful performance of coreference resolution systems. This thesis has advanced the understanding of noteworthy issues associated with the theoretical approach, corpus annotation, computational treatment, and evaluation of the coreference problem. I have recast the problem in slightly different terms to bring it closer to the linguistic reality. The chapter begins by briefly highlighting the main contributions of this work and assessing the lessons I have gained (Section 9.1), and then presents some interesting and challenging ideas that emerged from my analysis and that need to be tackled in future research (Section 9.2).

9.1 Conclusions

This thesis offers a number of contributions beyond previous work to different facets of the coreference problem. Broken down into facets, I have:

• Annotation

– Developed coreference annotation guidelines for Catalan and Spanish data.
– Built AnCora-CO, the largest Catalan and Spanish corpora annotated with coreference relations.1

1http://clic.ub.edu/corpus/en/ancora

• Resolution

– Gathered over forty-five learning features for coreference resolution in Spanish and analyzed their contribution in a pairwise model.
– Presented CISTELL, a state-of-the-art coreference resolution system that allows for using discourse and background knowledge as well as cluster-level features (a toy sketch of the underlying “basket” idea follows this list).
– Organized and developed the resources for the first SemEval-2010 shared task on “Coreference Resolution in Multiple Languages.”2

• Evaluation

– Carried out a broad comparative analysis between different evaluation metrics for coreference: MUC, B3, CEAF, ACE-Value, Pairwise F1, Mutual information, and the Rand index.
– Developed the BLANC measure for coreference evaluation that provides a solution to the problem of singletons exhibited by the other measures. It adapts the Rand index to reward coreference and non-coreference links equally.

• Theory

– Showed the similarities and differences between the concepts of coreference and paraphrase.
– Argued for the need to introduce the concept of near-identity in the usual discrete analysis of coreference.
– Built the first corpus of real texts containing cases of near-identity, defined a typology of near-identity relations, and conducted an inter-annotator agreement study.
– Presented a novel model of coreference based on a continuum and three cognitive operations of categorization: specification, neutralization and refocusing.
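To make the entity-based view more concrete, the following toy Python sketch illustrates the “basket” idea behind a system such as CISTELL: the information gathered for an entity is accumulated per feature and compared at the entity level rather than mention pair by mention pair. The class, feature names and graded membership values are hypothetical illustrations (they anticipate the probabilistic representation discussed in Section 9.2) and are not CISTELL’s actual implementation.

class Basket:
    """Hypothetical entity 'basket' with graded feature-value membership.

    features maps a feature name to a dictionary of candidate values and
    their degree of membership in [0, 1], rather than to a single hard value.
    """
    def __init__(self, features):
        self.features = features

    def merge(self, other, weight=0.5):
        # Accumulate evidence from a newly attached mention: values supported
        # by both baskets keep their weight, values seen in only one are discounted.
        merged = {}
        for name in set(self.features) | set(other.features):
            a = self.features.get(name, {})
            b = other.features.get(name, {})
            merged[name] = {v: (1 - weight) * a.get(v, 0.0) + weight * b.get(v, 0.0)
                            for v in set(a) | set(b)}
        return Basket(merged)

    def match(self, other):
        # Soft compatibility: average, over shared features, of the overlap
        # between the two membership distributions.
        shared = set(self.features) & set(other.features)
        if not shared:
            return 0.0
        overlaps = []
        for name in shared:
            a, b = self.features[name], other.features[name]
            overlaps.append(sum(min(a.get(v, 0.0), b.get(v, 0.0))
                                for v in set(a) | set(b)))
        return sum(overlaps) / len(overlaps)

# Toy example: "the committee" construed as an organization vs. its members.
committee = Basket({"type": {"organization": 0.8, "group-of-people": 0.2},
                    "number": {"singular": 0.9, "plural": 0.1}})
members = Basket({"type": {"group-of-people": 0.9, "organization": 0.1},
                  "number": {"plural": 0.8, "singular": 0.2}})
print(round(committee.match(members), 2))  # partial overlap, suggestive of near-identity

Under such a representation, attaching a new mention to an entity becomes a soft compatibility test over accumulated evidence rather than a hard equality check on isolated pairwise features.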

The contributions of this thesis are valuable in that they shed light on assumptions underlying the vast majority of past research and clarify the issues to be resolved. As far as annotation is concerned, I have argued that it needs to be backed up by a more comprehensive theory of coreference. We need a theory that explains the complex patterns encountered in naturally occurring data, such as relations that do not fall neatly into either coreference or non-coreference, and mentions that lie at the boundary between referentiality and non-referentiality. The continuum model of coreference that I have presented offers an appropriate theoretical framework for considering the non-discrete nature of coreference and, to some extent, of referentiality as well. At present, the lack of a true gold-standard corpus has serious

2http://stel.ub.edu/semeval2010-coref/

implications for the development of coreference resolution systems and, most importantly, for their comparison. I have advocated for annotation efforts that identify the set of mentions with the entire set of referential NPs—thus excluding attributive and predicative phrases—and that annotate multi-mention entities as well as singletons.

A second limitation of existing approaches is associated with the shortcomings of the currently available learning features, which turn out to be hardly generalizable and to lack pragmatic and background knowledge. Encoding this knowledge in some form that a coreference resolution system can use is key to substantially advancing the state of the art. Given the wide range of environments in which coreference relations can occur, taking into account the context rather than the mentions in isolation is also essential. The CISTELL system that I have developed makes it possible to store, and carry along the resolution process, not only the information about a mention provided “inside” the text, but also background and world knowledge from “outside” the text. Only in this way can we succeed in ensuring that the system is given enough information to decide, for each individual mention, whether it does or does not refer to a specific entity. Although learning-based methods can be of great help during this process, they should be manually engineered rather than blindly applied.

Finally, the major problems with the current coreference evaluation practices result from biases in the scoring metrics widely in use today (B3, CEAF, MUC). As a consequence, for instance, the all-singletons and head-match baselines are difficult to beat when systems are tested on a corpus annotated with the entire set of mentions. This is aggravated by the fact that state-of-the-art systems are rarely evaluated qualitatively, nor are their outputs made available. The measure that I have proposed, BLANC, aims to find a trade-off between the large number of singletons and the relatively small number of multi-mention entities. In addition, it is argued that evaluating mention identification and coreference resolution with a single score can be misleading in some situations. Instead, I have suggested using true mentions and treating mention identification as a task in itself. Last but not least, scores alone do not suffice to measure the performance of a system, and making system outputs publicly available, as has been done with the SemEval participating systems, should become common practice.

Overall, it is hoped that the insights provided in this thesis will ultimately help find more successful ways to approach the coreference resolution task. I have already taken some steps in this direction and have ideas for further work that I would like to pursue in the future. This is the focus of the next section.

9.2 Future directions

As a result of the discussions held with several researchers and of the analysis carried out over the course of this project, a number of directions in which to extend the work presented here have emerged. This section outlines the most significant

ones, some of which have already been mentioned throughout the body of the dissertation. Two questions stand out as the largest challenges facing coreference today: (1) the world of discourse entities, their representation and behavior; and (2) the nature and operation of a dynamic and non-discrete coreference system.

Regarding the world of discourse entities, the continuum model of coreference has many advantages, including coverage and robustness. Further work is needed in determining which features and dimensions play an important role, as governed by the semantics and the context, as well as in determining the feature values from the text, the context and background knowledge (e.g., the Web, Wikipedia, databases, etc.). This is the information that should be captured in the baskets used by the CISTELL system in order to include the kind of background and world knowledge that coreference systems are lacking today. As Ng (2010) points out in his survey, unsupervised approaches rival their supervised counterparts, and this casts doubt on “whether supervised resolvers are making effective use of the available labeled data.” The problem arises from the fact that texts do not make explicit all the information that is required for their understanding (but that people recover effortlessly). A reliable way to get this information is by drawing on recent work on extracting knowledge from the Web (Markert and Nissim, 2005; Kozareva and Hovy, 2010) and on machine reading (Peñas and Hovy, 2010). Although extracting and integrating all information from a single text is still beyond current capabilities, it should be thought of as a long-term goal.

From a linguistic point of view, there is much room for debate as to what exactly a context is. The notion is usually parameterized according to empirical or theoretical aims. For example, Bach (2005) explains that “what is loosely called ‘context’ is the conversational setting broadly construed. It is the mutual cognitive context, or salient common ground. It includes the current state of the conversation (what has just been said, what has just been referred to, etc.), the physical setting (if the conversants are face to face), salient mutual knowledge between the conversants, and relevant broader common knowledge.” Therefore, another interesting long-term program of research is toward a comprehensive theory of context.

Further insight into the continuum model of coreference and the near-identity idea can be gained by carrying out psycholinguistic experiments: Does near-identity have an effect on processing time? To what extent does the context determine one reading or another? Do the operations of specification, neutralization and refocusing have any psycholinguistic reality? As both a support to and an extension of psycholinguistic research, developing a large corpus annotated with near-identity relations is a short-term goal that will make it possible to study specific issues such as the interaction of near-identity with the temporal structure and directionality of discourse, or the interaction of near-identity with the entity’s ontological type. In addition, it will provide training and test data for developing more refined coreference resolution systems, and encourage other researchers to work in this same direction. The typology presented in Chapter 8 can serve as a starting point to formulate the annotation guidelines.

Fully integrating the near-identity continuum and the CISTELL system leads

to the other large question of building a dynamic and non-discrete coreference resolution system. A rich vein of research in this realm lies in establishing the principles that should allow a coreference engine to make its decisions. This requires fleshing out the mental space diagrams of Chapter 8 in more objective and measurable terms, and formalizing the operations of neutralization, refocusing, etc. Typed feature structures and unification may prove helpful for this purpose.

After fifteen years of learning-based coreference research, it has become clear that the mention-pair model is weak and that global models offer better performance (Ng, 2010), but it is still unclear what the best approach is for designing cluster-level features and combining their information. Developing a system that handles baskets not statically but dynamically is a logical starting point. Eventually we should arrive at a system that is able to use contextual and world knowledge as well as the interlocutors’ intentions to automatically infer which features are potentially important, and so which feature values to propagate to new mental spaces and which not to propagate. To address this, baskets should be represented not as a fixed feature set with values in or out, but as one with probabilistic feature-value membership that supports much more nuanced matching.

The way and order in which text, context and world knowledge should be exploited in building baskets is closely related to the way in which the contents encoded in different baskets should be compared. Ideally, then, work on the choice and value definition of basket features and work on basket matching should be collaborative and, to the extent possible, coordinated.

Finally, regarding evaluation issues, the use of the BLANC measure in future studies to report coreference scores will, over time, result in further refinements such as the proper adjustment of the alpha parameter. I have shown BLANC’s main strengths and weaknesses with respect to the current measures, but a number of issues associated with the definition of the formulas of all measures used in coreference resolution, such as considering whether corrections for chance are needed (Vinh et al., 2009) or examining typical variances of scores under different conditions and data sizes, await further study. Research in this field will be greatly enhanced if the coreference community adopts a standard evaluation metric in the immediate future.
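For concreteness, the following minimal Python sketch shows one way a BLANC-style score can be computed over a fixed set of true mentions from the four link counts (right and wrong coreference links, right and wrong non-coreference links), with alpha weighting the coreference-link F-score against the non-coreference-link F-score. The function names and the toy example are illustrative only; this is not the reference implementation described in Recasens and Hovy (To appear).

from itertools import combinations

def blanc_score(gold, system, alpha=0.5):
    """Illustrative BLANC-style score over a fixed set of (true) mentions.

    gold, system: lists of entities, each entity given as a set of mention ids.
    alpha: weight of the coreference-link F-score against the
    non-coreference-link F-score (0.5 gives the plain average).
    """
    def link_sets(partition):
        mentions = set().union(*partition)
        coref = {frozenset(pair) for entity in partition
                 for pair in combinations(sorted(entity), 2)}
        all_links = {frozenset(pair) for pair in combinations(sorted(mentions), 2)}
        return coref, all_links - coref

    gold_c, gold_n = link_sets(gold)
    sys_c, sys_n = link_sets(system)

    rc = len(gold_c & sys_c)   # coreference links in both gold and system
    wc = len(sys_c - gold_c)   # system says coreferent, gold says non-coreferent
    rn = len(gold_n & sys_n)   # non-coreference links in both
    wn = len(sys_n - gold_n)   # system says non-coreferent, gold says coreferent

    def f1(right, wrong_p, wrong_r):
        p = right / (right + wrong_p) if right + wrong_p else 0.0
        r = right / (right + wrong_r) if right + wrong_r else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    f_coref = f1(rc, wc, wn)      # precision penalized by wc, recall by wn
    f_noncoref = f1(rn, wn, wc)   # precision penalized by wn, recall by wc
    return alpha * f_coref + (1 - alpha) * f_noncoref

# Toy example: gold has a two-mention entity {1, 2} plus the singleton {3};
# the system wrongly attaches mention 3 to the entity.
print(blanc_score([{1, 2}, {3}], [{1, 2, 3}]))

On this toy example the coreference-link F-score is 0.5 and the non-coreference-link F-score is 0, so the sketch returns 0.25 with the default alpha of 0.5, reflecting how an over-clustering error is penalized on both link types.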

The work presented in this thesis ends here. I have answered a number of questions, but also raised new ones that should stimulate further research in the fascinating area of coreference.

Bibliography

Abad, A., Bentivogli, L., Dagan, I., Giampiccolo, D., Mirkin, S., Pianta, E., and Stern, A. (2010). A resource for investigating the impact of anaphora and coref- erence on inference. In Proceedings of LREC 2010, pages 128–135, Valletta, Malta.

Alshawi, H. (1990). Resolving quasi logical forms. Computational Linguistics, 16(3):133–144.

Amigó, E., Gonzalo, J., Artiles, J., and Verdejo, F. (2009). A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486.

Androutsopoulos, I. and Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38:135– 187.

Aone, C. and Bennett, S. W. (1995). Evaluating automated and manual acquisition of anaphora resolution strategies. In Proceedings of ACL 1995, pages 122–129.

Appelt, D., Hobbs, J., Bear, J., Israel, D., Kameyama, M., and Tyson, M. (1995). SRI International FASTUS System MUC-6 Results and Analysis. In Proceed- ings of MUC-6, Columbia, Maryland.

Ariel, M. (1988). Referring and accessibility. Journal of Linguistics, 24(1):65–87.

Ariel, M. (2001). Accessibility Theory: An overview. In Sanders, T., Schliperoord, J., and Spooren, W., editors, Text Representation, pages 29–87. John Benjamins, Amsterdam.

Arnold, J. E., Eisenband, J. G., Brown-Schmidt, S., and Trueswell, J. C. (2000). The rapid use of gender information: Evidence of the time course of pronoun resolution from eyetracking. Cognition, 76:B13–B26.

Arnold, J. E. and Griffin, Z. M. (2007). The effect of additional characters on choice of referring expression: Everyone counts. Journal of Memory and Lan- guage, 56:521–536.

Artstein, R. and Poesio, M. (2005). Bias decreases in proportion to the number of annotators. In Proceedings of FG-MoL 2005, pages 141–150, Edinburgh, UK.

Artstein, R. and Poesio, M. (2008). Inter-coder agreement for computational lin- guistics. Computational Linguistics, 34(4):555–596.

Attardi, G., Rossi, S. D., and Simi, M. (2010). TANL-1: Coreference resolution by parse analysis and similarity clustering. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval 2010), pages 108–111, Uppsala, Sweden.

Azzam, S., Humphreys, K., and Gaizauskas, R. (1999). Using coreference chains for text summarization. In Proceedings of the ACL Workshop on Coreference and its Applications, pages 77–84, Baltimore, Maryland.

Bach, K. (2005). Context ex machina. In Szabó, Z. G., editor, Semantics versus Pragmatics, pages 15–44. Clarendon, Oxford.

Bagga, A. and Baldwin, B. (1998). Algorithms for scoring coreference chains. In Proceedings of the LREC Workshop on Linguistic Coreference, pages 563–566, Granada, Spain.

Baker, M. C. (2003). Lexical Categories. Cambridge University Press, Cambridge.

Baldwin, B. (1997). CogNIAC: High precision coreference with limited knowl- edge and linguistic resources. In Proceedings of the ACL-EACL Workshop on Operational Factors in Practical, Robust Anaphor Resolution for Unrestricted Texts, pages 38–45, Madrid.

Barbu, C., Evans, R., and Mitkov, R. (2002). A corpus based analysis of mor- phological disagreement in anaphora resolution. In Proceedings of LREC 2002, pages 1995–1999, Las Palmas de Gran Canaria, Spain.

Barker, C. (2010). Nominals don’t provide criteria of identity. In Rathert, M. and Alexiadou, A., editors, The Semantics of Nominalizations across Languages and Frameworks, pages 9–24. Mouton de Gruyter, Berlin.

Bean, D. L. and Riloff, E. (1999). Corpus-based identification of non-anaphoric noun phrases. In Proceedings of ACL 1999, pages 373–380, College Park, Mary- land.


Bengtson, E. and Roth, D. (2008). Understanding the value of features for coref- erence resolution. In Proceedings of EMNLP 2008, pages 294–303, Honolulu, Hawaii.

Berger, A., Pietra, S. D., and Pietra, V. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Bergsma, S., Lin, D., and Goebel, R. (2008). Distributional identification of non- referential pronouns. In Proceedings of ACL-HLT 2008, pages 10–18, Colum- bus, Ohio.

Bertran, M., Borrega, O., Recasens, M., and Soriano, B. (2008). AnCoraPipe: A tool for multilevel annotation. Procesamiento del Lenguaje Natural, 41:291– 292.

Bhagat, R. (2009). Learning Paraphrases from Text. PhD thesis, University of Southern California, Los Angeles, California.

Blackwell, S. (2003). Implicatures in Discourse: The Case of Spanish NP Anaphora. John Benjamins, Amsterdam.

Borrega, O., Taulé, M., and Martí, M. A. (2007). What do we mean when we talk about named entities? In Proceedings of the 4th Conference, Birmingham, UK.

Bosque, I. and Demonte, V., editors (1999). Gramática descriptiva de la lengua española. Real Academia Española / Espasa Calpe, Madrid.

Boyd, A., Gegg-Harrison, W., and Byron, D. (2005). Identifying non-referential it: a machine learning approach incorporating linguistically motivated features. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learn- ing in NLP, pages 40–47, Ann Arbor, Michigan.

Brants, T. (2000). TnT – A statistical part-of-speech tagger. In Proceedings of ANLP 2000, Seattle, Washington.

Broscheit, S., Poesio, M., Ponzetto, S. P., Rodríguez, K. J., Romano, L., Uryupina, O., Versley, Y., and Zanoli, R. (2010). BART: A multilingual anaphora resolution system. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval 2010), pages 104–107, Uppsala, Sweden.

Bybee, J. (2010). Language, Usage and Cognition. Cambridge University Press, New York.

Byron, D. K. (2001). The uncommon denominator: A proposal for consistent reporting of pronoun resolution results. Computational Linguistics, 27(4):569– 578.

Byron, D. K. and Gegg-Harrison, W. (2004). Eliminating non-referring noun phrases from coreference resolution. In Proceedings of DAARC 2004, pages 21–26, Azores, Portugal.

Cai, J. and Strube, M. (2010). Evaluation metrics for end-to-end coreference reso- lution systems. In Proceedings of SIGDIAL, pages 28–36, University of Tokyo, Japan.

Calhoun, S., Carletta, J., Brenier, J., Mayo, N., Jurafsky, D., Steedman, M., and Beaver, D. (2010). The NXT-format Switchboard Corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Lan- guage Resources and Evaluation, DOI:10.1007/s10579-010-9120-1.

Callison-Burch, C. (2007). Paraphrasing and Translation. PhD thesis, University of Edinburgh, Edinburgh, UK.

Carbonell, J. and Brown, R. G. (1988). Anaphora resolution: a multi-strategy approach. In Proceedings of COLING 1988, pages 96–101, Budapest.

Cardie, C. and Wagstaff, K. (1999). Noun phrase coreference as clustering. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natu- ral Language Processing and Very Large Corpora, pages 82–89, College Park, Maryland.

Carletta, J. (1996). Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.

Carreiras, M. and Gernsbacher, M. A. (1992). Comprehending conceptual anaphors in Spanish. Language and Cognitive Processes, 7(3&4):281–299.

Chaffin, R., Herrmann, D. J., and Winston, M. (1988). An empirical taxonomy of part-whole relations: Effects of part-whole relation type on relation identifica- tion. Language and Cognitive Processes, 3(1):17–48.

Chambers, N. and Jurafsky, D. (2008). Unsupervised learning of narrative event chains. In Proceedings of ACL 2008, pages 789–797, Columbus, Ohio.

Charolles, M. and Schnedecker, C. (1993). Coréférence et identité: le problème des référents évolutifs. Langages, 112:106–126.

Choi, Y. and Cardie, C. (2007). Structured local training and biased potential func- tions for conditional random fields with application to coreference resolution. In Proceedings of HLT-NAACL 2007, pages 65–72, Rochester, New York.

Clark, H. H. (1977). Bridging. In Johnson-Laird, P. and P.C.Wason, editors, Thinking: Readings in Cognitive Science, pages 411–420. Cambridge Univer- sity Press, Cambridge.


Connolly, D., Burger, J. D., and Day, D. S. (1994). A machine learning approach to anaphoric reference. In Proceedings of the International Conference on New Methods in Language Processing (NeMLaP-1), pages 255–261, Manchester, UK.

Crawley, R., Stevenson, R., and Kleinman, D. (1990). The use of heuristic strate- gies in the interpretation of pronouns. Journal of Psycholinguistic Research, 4:245–264.

Culotta, A., Wick, M., Hall, R., and McCallum, A. (2007). First-order probabilistic models for coreference resolution. In Proceedings of HLT-NAACL 2007, pages 81–88, Rochester, New York.

Daelemans, W. and Bosch, A. V. (2005). Memory-Based Language Processing. Cambridge University Press, Cambridge.

Daelemans, W., Buchholz, S., and Veenstra, J. (1999). Memory-based shallow parsing. In Proceedings of CoNLL 1999, Bergen, Norway.

Daumé III, H. and Marcu, D. (2005). A large-scale exploration of effective global features for a joint entity detection and tracking model. In Proceedings of HLT-EMNLP 2005, pages 97–104, Vancouver, Canada.

Davies, S., Poesio, M., Bruneseaux, F., and Romary, L. (1998). Annotating coreference in dialogues: Proposal for a scheme for MATE. http://www.ims.uni-stuttgart.de/projekte/mate/mdag.

Denis, P. (2007). New learning models for robust reference resolution. PhD thesis, University of Texas at Austin, Austin, Texas.

Denis, P. and Baldridge, J. (2007). Joint determination of anaphoricity and coreference resolution using integer programming. In Proceedings of NAACL- HLT 2007, Rochester, New York.

Denis, P. and Baldridge, J. (2008). Specialized models and ranking for coreference resolution. In Proceedings of EMNLP 2008, pages 660–669, Honolulu, Hawaii.

Denis, P. and Baldridge, J. (2009). Global joint models for coreference resolution and named entity classification. Procesamiento del Lenguaje Natural, 42:87–96.

Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., and Weischedel, R. (2004). The Automatic Content Extraction (ACE) program – Tasks, data, and evaluation. In Proceedings of LREC 2004, pages 837–840, Lisbon.

Dolan, B., Brockett, C., and Quirk, C. (2005). README file included in the Microsoft Research Paraphrase Corpus. Redmond, Washington.

Dras, M. (1999). Tree Adjoining Grammar and the Reluctant Paraphrasing of Text. PhD thesis, Macquarie University, Sydney, Australia.

Eckert, M. and Strube, M. (2000). Dialogue acts, synchronising units and anaphora resolution. Journal of Semantics, 17(1):51–89.

Elsner, M. and Charniak, E. (2010). The same-head heuristic for coreference. In Proceedings of ACL 2010 Short Papers, pages 33–37, Uppsala, Sweden.

Erk, K. and Strapparava, C., editors (2010). Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval 2010), Uppsala, Sweden.

Evans, R. (2000). A comparison of rule-based and machine learning methods for identifying non-nominal it. In Proceedings of NLP 2000, volume 1835/2000 of LNAI, pages 233–241, Berlin. Springer-Verlag.

Fauconnier, G. (1985). Mental Spaces: Aspects of Meaning Construction in Natu- ral Language. MIT Press, Cambridge.

Fauconnier, G. (1997). Mappings in Thought and Language. Cambridge University Press, Cambridge.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press, Cam- bridge.

Ferrández, A., Palomar, M., and Moreno, L. (1999). An empirical approach to Spanish anaphora resolution. Machine Translation, 14:191–216.

Finkel, J. R. and Manning, C. D. (2008). Enforcing transitivity in coreference resolution. In Proceedings of ACL-HLT 2008, pages 45–48, Columbus, Ohio.

Fleiss, J. (1981). Statistical Methods for Rates and Proportions. John Wiley & Sons, New York, 2nd edition.

Frampton, M., Fernández, R., Ehlen, P., Christoudias, M., Darrell, T., and Peters, S. (2009). Who is “you”? Combining linguistic and gaze features to resolve second-person references in dialogue. In Proceedings of EACL 2009, pages 273–281, Athens.

Fraurud, K. (1990). Definiteness and the processing of NPs in natural discourse. Journal of Semantics, 7:395–433.

Fraurud, K. (1992). Processing Noun Phrases in Natural Discourse. PhD thesis, Stockholm University, Stockholm.

Fraurud, K. (1996). Cognitive ontology and NP form. In Fretheim, T. and Gundel, J. K., editors, Reference and Referent Accessibility, pages 65–87. John Ben- jamins, Amsterdam.


Frege, G. (1892). On sense and reference. In Geach, P. and Black, M., editors, Translations from the Philosophical Writings of Gottlob Frege, pages 56–78. Basil Blackwell (1952), Oxford.

Fuchs, C. (1994). Paraphrase et énonciation. Modélisation de la paraphrase langagière. Ophrys, Paris.

Fujita, A. (2005). Automatic Generation of Syntactically Well-formed and Seman- tically Appropriate Paraphrases. PhD thesis, Nara Institute of Science and Tech- nology, Ikoma, Nara, Japan.

Gaizauskas, R., Wakao, T., Humphreys, K., Cunningham, H., and Wilks, Y. (1995). Description of the LaSIE system as used for MUC-6. In Proceedings of MUC-6, pages 207–220.

Gamer, M., Lemon, J., and Fellows, I. (2009). irr: Various Coefficients of Inter- rater Reliability and Agreement. R package version 0.82.

Garigliano, R., Urbanowicz, A., and Nettleton, D. J. (1997). University of Durham: Description of the LOLITA system as used in MUC-7. In Proceedings of MUC-7.

Geach, P. (1962). Reference and Generality. Cornell University Press, Ithaca.

Geach, P. (1967). Identity. Review of Metaphysics, 21:3–12.

Gerber, M. and Chai, J. Y. (2010). Beyond NomBank: A study of implicit argu- ments for nominal predicates. In Proceedings of ACL 2010, pages 1583–1592.

Gordon, P. C., Grosz, B. J., and Gilliom, L. A. (1993). Pronouns, names, and the centering of attention in discourse. Cognitive Science, 17:311–347.

Grishman, R. and Sundheim, B. (1996). Message Understanding Conference-6: a brief history. In Proceedings of COLING 1996, pages 466–471, Copenhagen.

Grosz, B. J., Joshi, A. K., and Weinstein, S. (1995). Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):202–225.

Grosz, B. J. and Sidner, C. L. (1986). Attention, intention, and the structure of discourse. Computational Linguistics, 12(3):175–204.

Gundel, J., Hedberg, N., and Zacharski, R. (1993). Cognitive status and the form of referring expressions in discourse. Language, 69(2):274–307.

GuoDong, Z. and Fang, K. (2009). Global learning of noun phrase anaphoricity in coreference resolution via label propagation. In Proceedings of EMNLP 2009, pages 978–986, Suntec, Singapore.

Haghighi, A. and Klein, D. (2007). Unsupervised coreference resolution in a nonparametric Bayesian model. In Proceedings of ACL 2007, pages 848–855, Prague.

Haghighi, A. and Klein, D. (2009). Simple coreference resolution with rich syntactic and semantic features. In Proceedings of EMNLP 2009, pages 1152–1161, Suntec, Singapore.

Haghighi, A. and Klein, D. (2010). Coreference resolution in a modular, entity-centered model. In Proceedings of HLT-NAACL 2010, pages 385–393, Los Angeles, California.

Hall, J., Nilsson, J., Nivre, J., Eryigit, G., Megyesi, B., Nilsson, M., and Saers, M. (2007). Single malt or blended? A study in multilingual parser optimization. In Proceedings of the shared task session of CoNLL 2007, pages 933–939, Prague.

Hall, J. and Nivre, J. (2008). A dependency-driven parser for German dependency and constituency representations. In Proceedings of the ACL Workshop on Parsing German (PaGe 2008), pages 47–54, Columbus, Ohio.

Halliday, M. A. and Hasan, R. (1976). Cohesion in English. Longman, London.

Hammami, S., Belguith, L., and Hamadou, A. B. (2009). Arabic anaphora resolution: Corpora annotation with coreferential links. The International Arab Journal of Information Technology, 6(5):481–489.

Harabagiu, S., Bunescu, R., and Maiorano, S. (2001). Text and knowledge mining for coreference resolution. In Proceedings of NAACL 2001, pages 55–62.

Hasler, L., Orasan, C., and Naumann, K. (2006). NPs for events: Experiments in coreference annotation. In Proceedings of LREC 2006, pages 1167–1172, Genoa, Italy.

Hayes, A. F. and Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77–89.

Heim, I. (1983). File change semantics and the familiarity theory of definiteness. In Bäuerle, R., Schwarze, C., and von Stechow, A., editors, Meaning, Use, and Interpretation of Language, pages 164–189. Mouton de Gruyter, Berlin.

Hendrickx, I., Bouma, G., Coppens, F., Daelemans, W., Hoste, V., Kloosterman, G., Mineur, A.-M., Vloet, J. V. D., and Verschelde, J.-L. (2008). A coreference corpus and resolution system for Dutch. In Proceedings of LREC 2008, Marrakech, Morocco.

Hervás, R. and Finlayson, M. (2010). The prevalence of descriptive referring expressions in news and narrative. In Proceedings of ACL 2010 Short Papers, pages 49–54, Uppsala, Sweden.


Hinrichs, E., Kübler, S., Naumann, K., Telljohann, H., and Trushkina, J. (2004). Recent developments in Linguistic Annotations of the TüBa-D/Z Treebank. In Proceedings of TLT 2004, Tübingen, Germany.

Hinrichs, E. W., Filippova, K., and Wunsch, H. (2007). A data-driven approach to pronominal anaphora resolution in German. In Nicolov, N., Bontcheva, K., Angelova, G., and Mitkov, R., editors, Recent Advances in Natural Language Processing IV: Selected Papers from RANLP 2005, pages 115–124. John Benjamins, Amsterdam.

Hinrichs, E. W., Kübler, S., and Naumann, K. (2005). A unified representation for morphological, syntactic, semantic, and referential annotations. In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, pages 13–20, Ann Arbor, Michigan.

Hirschman, L. and Chinchor, N. (1997). MUC-7 Coreference Task Definition – Version 3.0. In Proceedings of MUC-7.

Hirst, G. J. (1981). Anaphora in natural language understanding: a survey. Springer-Verlag, Berlin.

Hobbs, J. (1985). Granularity. In Proceedings of the 9th International Joint Conference on Artificial Intelligence (IJCAI 1985), pages 432–435, Los Angeles, California.

Hobbs, J. R. (1978). Resolving pronoun references. Lingua, 44:311–338.

Hoste, V. (2005). Optimization Issues in Machine Learning of Coreference Resolution. PhD thesis, University of Antwerp, Antwerp, Belgium.

Hoste, V. and De Pauw, G. (2006). KNACK-2002: A richly annotated corpus of Dutch written text. In Proceedings of LREC 2006, pages 1432–1437, Genoa, Italy.

Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2006). OntoNotes: the 90% solution. In Proceedings of HLT-NAACL 2006, pages 57–60, New York.

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.

Ide, N. (2000). Searching annotated language resources in XML: A statement of the problem. In Proceedings of the ACM SIGIR Workshop on XML and Information Retrieval, Athens.

Iida, R., Inui, K., Takamura, H., and Matsumoto, Y. (2003). Incorporating contextual cues in trainable models for coreference resolution. In Proceedings of the EACL Workshop on the Computational Treatment of Anaphora, pages 23–30, Budapest.

Iida, R., Komachi, M., Inui, K., and Matsumoto, Y. (2007). Annotating a Japanese text corpus with predicate-argument and coreference relations. In Proceedings of the ACL Workshop on Linguistic Annotation, pages 132–139, Prague.

Jackendoff, R. (1983). Semantics and Cognition. MIT Press, Cambridge.

Jackendoff, R. (2002). Foundations of Language: Brain, Meaning, Grammar, Evo- lution. Oxford University Press, Oxford.

Kabadjov, M. A. (2007). A comprehensive evaluation of anaphora resolution and discourse-new classification. PhD thesis, University of Essex, Colchester, UK.

Kameyama, M. (1998). Intrasentential centering: A case study. In Walker, M. A., Joshi, A. K., and Prince, E. F., editors, Centering Theory in Discourse, pages 89–112. Oxford University Press, Oxford.

Kamp, H. (1981). A theory of truth and semantic representation. In Groenendijk, J., Janssen, T., and Stokhof, M., editors, Formal Methods in the Study of Language, pages 277–322. Mathematical Centre Tracts 135, Amsterdam.

Karttunen, L. (1976). Discourse referents. In McCawley, J., editor, Syntax and Semantics, volume 7, pages 363–385. Academic Press, New York.

Kehler, A. (1997). Probabilistic coreference in information extraction. In Proceed- ings of EMNLP 1997, pages 163–173, Providence, Rhode Island.

Kehler, A., Appelt, D., Taylor, L., and Simma, A. (2004). The (non)utility of predicate-argument frequencies for pronoun interpretation. In Proceedings of NAACL 2004, pages 289–296, Boston, Massachusetts.

Kehler, A., Kertz, L., Rohde, H., and Elman, J. (2008). Coherence and coreference revisited. Journal of Semantics, 25(1):1–44.

Kennedy, C. and Boguraev, B. (1996). Anaphora for everyone: Pronominal anaphora resolution without a parser. In Proceedings of COLING 1996, pages 113–118, Copenhagen.

Kilgarriff, A. (1999). 95% Replicability for manual word sense tagging. In Pro- ceedings of EACL 1999, pages 277–278, Bergen, Norway.

Klenner, M. and Ailloud, É. (2009). Optimization in coreference resolution is not needed: A nearly-optimal algorithm with intensional constraints. In Proceedings of EACL 2009, pages 442–450, Athens.

Kobdani, H. and Schütze, H. (2010). SUCRE: A modular system for coreference resolution. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval 2010), pages 92–95, Uppsala, Sweden.


Kozareva, Z. and Hovy, E. (2010). Learning arguments and supertypes of semantic relations using recursive patterns. In Proceedings of ACL 2010, pages 1482– 1491, Uppsala, Sweden.

Krahmer, E. (2010). What computational linguists can learn from psychologists (and vice versa). Computational Linguistics, 36(2):285–294.

Kripke, S. (1977). Speaker’s reference and semantic reference. Midwest Studies in Philosophy, 2:255–276.

Krippendorff, K. (2004 [1980]). Content Analysis: An Introduction to its Method- ology. Sage, Thousand Oaks, California, second edition. Chapter 11.

Kudoh, T. and Matsumoto, Y. (2000). Use of support vector learning for chunk identification. In Proceedings of CoNLL 2000 and LLL 2000, pages 142–144, Lisbon.

Kučová, L. and Hajičová, E. (2004). Coreferential relations in the Prague Dependency Treebank. In Proceedings of DAARC 2004, pages 97–102, Azores, Portugal.

Lakoff, G. (1987). Women, Fire, and Dangerous Things. University of Chicago Press, Chicago.

Lappin, S. and Leass, H. J. (1994). An algorithm for pronominal anaphora resolu- tion. Computational Linguistics, 20(4):535–561.

Lasersohn, P. (2000). Same, models and representation. In Jackson, B. and Math- ews, T., editors, Proceedings of Semantics and Linguistic Theory 10, pages 83– 97, Cornell, New York. CLC Publications.

Luo, X. (2005). On coreference resolution performance metrics. In Proceedings of HLT-EMNLP 2005, pages 25–32, Vancouver, Canada.

Luo, X. (2007). Coreference or not: A twin model for coreference resolution. In Proceedings of HLT-NAACL 2007, pages 73–80, Rochester, New York.

Luo, X., Ittycheriah, A., Jing, H., Kambhatla, N., and Roukos, S. (2004). A mention-synchronous coreference resolution algorithm based on the Bell tree. In Proceedings of ACL 2004, pages 21–26, Barcelona.

Luo, X. and Zitouni, I. (2005). Multi-lingual coreference resolution with syntac- tic features. In Proceedings of HLT-EMNLP 2005, pages 660–667, Vancouver, Canada.

Madnani, N. and Dorr, B. J. (2010). Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3):341–387.

Magnini, B., Pianta, E., Girardi, C., Negri, M., Romano, L., Speranza, M., Lenzi, V. B., and Sprugnoli, R. (2006). I-CAB: the Italian Content Annotation Bank. In Proceedings of LREC 2006, pages 963–968, Genoa, Italy.

Markert, K. and Nissim, M. (2005). Comparing knowledge sources for nominal anaphora resolution. Computational Linguistics, 31(3):367–401.

Mayol, L. and Clark, R. (2010). Pronouns in Catalan: Games of partial information and the use of linguistic resources. Journal of Pragmatics, 42:781–799.

McCallum, A. and Wellner, B. (2005). Conditional models of identity uncertainty with application to noun coreference. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17, pages 905–912, Cambridge, Massachusetts. MIT Press.

McCarthy, J. F. and Lehnert, W. G. (1995). Using decision trees for coreference resolution. In Proceedings of IJCAI 1995, pages 1050–1055, Montreal, Canada.

Milićević, J. (2007). La paraphrase. Peter Lang, Bern.

Mirkin, S., Berant, J., Dagan, I., and Shnarch, E. (2010). Recognising entailment within discourse. In Proceedings of COLING 2010, Beijing.

Mitkov, R. (1998). Robust pronoun resolution with limited knowledge. In Pro- ceedings of COLING-ACL 1998, pages 869–875, Montreal, Canada.

Mitkov, R. (2002). Anaphora Resolution. Longman, London.

Mitkov, R., Evans, R., Orasan, C., Barbu, C., Jones, L., and Sotirova, V. (2000). Coreference and anaphora: Developing annotating tools, annotated resources and annotation strategies. In Proceedings of DAARC 2000, pages 49–58, Lan- caster, UK.

Mitkov, R. and Hallett, C. (2007). Comparing pronoun resolution algorithms. Com- putational Intelligence, 23(2):262–97.

Morton, T. S. (1999). Using coreference in question answering. In Proceedings of TREC-8, pages 85–89, Gaithersburg, Maryland.

Morton, T. S. (2000). Coreference for NLP applications. In Proceedings of ACL 2000, pages 173–180, Hong Kong.

Müller, C. (2007). Resolving it, this and that in unrestricted multi-party dialog. In Proceedings of ACL 2007, pages 816–823, Prague.

Müller, C., Rapp, S., and Strube, M. (2002). Applying co-training to reference resolution. In Proceedings of ACL 2002, pages 352–359, Philadelphia, Pennsylvania.


Müller, C. and Strube, M. (2006). Multi-level annotation of linguistic data with MMAX2. In Braun, S., Kohn, K., and Mukherjee, J., editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pages 197–214. Peter Lang, Frankfurt.

Navarretta, C. (2004). Resolving individual and abstract anaphora in texts and dia- logues. In Proceedings of COLING 2004, pages 233–239, Geneva, Switzerland.

Navarretta, C. (2007). A contrastive analysis of abstract anaphora in Danish, En- glish and Italian. In Proceedings of DAARC 2007, pages 103–109, Lagos, Por- tugal.

Navarretta, C. (2009a). Automatic recognition of the function of singular neuter pronouns in texts and spoken data. In Devi, S. L., Branco, A., and Mitkov, R., editors, Anaphora Processing and Applications (DAARC 2009), volume 5847 of LNAI, pages 15–28, Berlin / Heidelberg. Springer-Verlag.

Navarretta, C. (2009b). Co-referential chains and discourse topic shifts in parallel and comparable corpora. Procesamiento del Lenguaje Natural, 42:105–112.

Navarro, B. (2007). Metodología, construcción y explotación de corpus anotados semántica y anafóricamente. PhD thesis, University of Alicante, Alicante, Spain.

Ng, V. (2003). Machine learning for coreference resolution: Recent successes and future challenges. Technical report CUL.CIS/TR2003-1918, Cornell University, New York.

Ng, V. (2004). Learning noun phrase anaphoricity to improve coreference resolu- tion: Issues in representation and optimization. In Proceedings of ACL 2004, pages 151–158, Barcelona.

Ng, V. (2005). Machine learning for coreference resolution: from local classifica- tion to global ranking. In Proceedings of ACL 2005, pages 157–164, Ann Arbor, Michigan.

Ng, V. (2007). Shallow semantics for coreference resolution. In Proceedings of IJCAI 2007, pages 1689–1694, Hyderabad, India.

Ng, V. (2008). Unsupervised models for coreference resolution. In Proceedings of EMNLP 2008, pages 640–649, Honolulu, Hawaii.

Ng, V. (2009). Graph-cut-based anaphoricity determination for coreference resolu- tion. In Proceedings of NAACL-HLT 2009, pages 575–583, Boulder, Colorado.

Ng, V. (2010). Supervised noun phrase coreference research: The first fifteen years. In Proceedings of ACL 2010, pages 1396–1411, Uppsala, Sweden.

Ng, V. and Cardie, C. (2002a). Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In Proceedings of COLING 2002, pages 1–7, Taipei.

Ng, V. and Cardie, C. (2002b). Improving machine learning approaches to coreference resolution. In Proceedings of ACL 2002, pages 104–111, Philadelphia, Pennsylvania.

Nicolae, C. and Nicolae, G. (2006). BestCut: a graph algorithm for coreference resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 275–283, Sydney, Australia.

Nicolae, C., Nicolae, G., and Roberts, K. (2010). C-3: Coherence and coreference corpus. In Proceedings of LREC 2010, pages 136–143, Valletta, Malta.

Nicolov, N., Salvetti, F., and Ivanova, S. (2008). Sentiment analysis: Does coreference matter? In Proceedings of the Symposium on Affective Language in Human and Machine, Aberdeen, UK.

Nilsson, K. (2010). Hybrid Methods for Coreference Resolution in Swedish. PhD thesis, Stockholm University, Stockholm.

Nunberg, G. (1984). Individuation in context. In Proceedings of the 2nd West Coast Conference on Formal Linguistics (WCCFL 2), pages 203–217, Stanford, California.

Orasan, C. (2003). PALinkA: A highly customisable tool for discourse annotation. In Proceedings of the 4th SIGdial Workshop on Discourse and Dialogue, pages 39–43, Sapporo, Japan.

Orasan, C., Cristea, D., Mitkov, R., and Branco, A. (2008). Anaphora Resolution Exercise: An overview. In Proceedings of LREC 2008, pages 2801–2805, Marrakech, Morocco.

Palomar, M., Ferrández, A., Moreno, L., Martínez-Barco, P., Peral, J., Saiz-Noeda, M., and Muñoz, R. (2001). An algorithm for anaphora resolution in Spanish texts. Computational Linguistics, 27(4):545–567.

Passonneau, R. (2004). Computing reliability for coreference annotation. In Proceedings of LREC 2004, pages 1503–1506, Lisbon.

Passonneau, R. (2006). Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of LREC 2006, pages 831–836, Genoa, Italy.

Peñas, A. and Hovy, E. (2010). Semantic enrichment of text with background knowledge. In Proceedings of the NAACL First International Workshop on Formalisms and Methodology for Learning by Reading, pages 15–23, Los Angeles, California.


Poesio, M. (2004a). Discourse annotation and semantic annotation in the GNOME corpus. In Proceedings of the ACL Workshop on Discourse Annotation, pages 72–79, Barcelona.

Poesio, M. (2004b). The MATE/GNOME proposals for anaphoric annotation, revisited. In Proceedings of the 5th SIGdial Workshop at HLT-NAACL 2004, pages 154–162, Boston, Massachusetts.

Poesio, M. and Artstein, R. (2005). The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account. In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, pages 76–83, Ann Arbor, Michigan.

Poesio, M. and Artstein, R. (2008). Anaphoric annotation in the ARRAU corpus. In Proceedings of LREC 2008, Marrakech, Morocco.

Poesio, M., Delmonte, R., Bristot, A., Chiran, L., and Tonelli, S. (2004a). The VENEX corpus of anaphora and deixis in spoken and written Italian. Manuscript. Available online at http://cswww.essex.ac.uk/staff/poesio/publications/VENEX04.pdf.

Poesio, M., Kruschwitz, U., and Chamberlain, J. (2008). ANAWIKI: Creating anaphorically annotated resources through Web cooperation. In Proceedings of LREC 2008, pages 2352–2355, Marrakech, Morocco.

Poesio, M., Ponzetto, S. P., and Versley, Y. (forthcoming). Computational models of anaphora resolution: A survey. Linguistic Issues in Language Technology.

Poesio, M., Stevenson, R., Eugenio, B. D., and Hitzeman, J. (2004b). Centering: a parametric theory and its instantiations. Computational Linguistics, 30(3):309–363.

Poesio, M., Uryupina, O., and Versley, Y. (2010). Creating a coreference resolution system for Italian. In Proceedings of LREC 2010, pages 713–716, Valletta, Malta.

Poesio, M. and Vieira, R. (1998). A corpus-based investigation of definite description use. Computational Linguistics, 24(2):183–216.

Ponzetto, S. P. and Strube, M. (2006). Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings of HLT-NAACL 2006, pages 192–199, New York.

Poon, H., Christensen, J., Domingos, P., Etzioni, O., Hoffmann, R., Kiddon, C., Lin, T., Ling, X., Mausam, Ritter, A., Schoenmackers, S., Soderland, S., Weld, D., Wu, F., and Zhang, C. (2010). Machine Reading at the University of Washington. In Proceedings of the NAACL-HLT First International Workshop on Formalisms and Methodology for Learning by Reading, pages 87–95, Los Angeles, California.

Poon, H. and Domingos, P. (2008). Joint unsupervised coreference resolution with Markov logic. In Proceedings of EMNLP 2008, pages 650–659, Honolulu, Hawaii.

Popescu-Belis, A. (2000). Évaluation numérique de la résolution de la référence: critiques et propositions. T.A.L.: Traitement automatique de la langue, 40(2):117–146.

Popescu-Belis, A., Rigouste, L., Salmon-Alt, S., and Romary, L. (2004). Online evaluation of coreference resolution. In Proceedings of LREC 2004, pages 1507– 1510, Lisbon.

Popescu-Belis, A., Robba, I., and Sabah, G. (1998). Reference resolution beyond coreference: a conceptual frame and its application. In Proceedings of COLING- ACL 1998, pages 1046–1052, Montreal, Canada.

Pradhan, S. S., Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2007a). OntoNotes: A unified relational semantic representation. In Pro- ceedings of ICSC 2007, pages 517–526, Irvine, California.

Pradhan, S. S., Ramshaw, L., Weischedel, R., MacBride, J., and Micciulla, L. (2007b). Unrestricted coreference: Identifying entities and events in OntoNotes. In Proceedings of ICSC 2007, pages 446–453, Irvine, California.

Prince, E. F. (1981). Toward a taxonomy of given-new information. In Cole, P., editor, Radical Pragmatics, pages 223–256. Academic Press, New York.

Quinlan, R. (1993). C4.5: Program for Machine Learning. Morgan Kaufmann, San Francisco, California.

Rahman, A. and Ng, V. (2009). Supervised models for coreference resolution. In Proceedings of EMNLP 2009, pages 968–977, Suntec, Singapore.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.

Recasens, M. (2008). Discourse deixis and coreference: Evidence from AnCora. In Proceedings of the Second Workshop on Anaphora Resolution (WAR II), vol- ume 2 of NEALT Proceedings Series, pages 73–82, Bergen, Norway.

Recasens, M. (2009). A chain-starting classifier of definite NPs in Spanish. In Proceedings of the EACL Student Research Workshop (EACL 2009), pages 46– 53. Athens.

Recasens, M. and Hovy, E. (2009). A deeper look into features for coreference resolution. In Devi, S. L., Branco, A., and Mitkov, R., editors, Anaphora Pro- cessing and Applications (DAARC 2009), volume 5847 of LNAI, pages 29–42, Berlin. Springer-Verlag.


Recasens, M. and Hovy, E. (2010). Coreference resolution across corpora: Lan- guages, coding schemes, and preprocessing information. In Proceedings of ACL 2010, pages 1423–1432, Uppsala, Sweden.

Recasens, M. and Hovy, E. (To appear). BLANC: Implementing the Rand index for coreference evaluation. Natural Language Engineering.

Recasens, M., Hovy, E., and Martí, M. A. (2010a). A typology of near-identity relations for coreference (NIDENT). In Proceedings of LREC 2010, pages 149–156, Valletta, Malta.

Recasens, M., Màrquez, L., Sapena, E., Martí, M. A., Taulé, M., Hoste, V., Poesio, M., and Versley, Y. (2010b). SemEval-2010 Task 1: Coreference resolution in multiple languages. In Proceedings of the 5th International Workshop on Semantic Evaluations (SemEval-2010), pages 1–8, Uppsala, Sweden.

Recasens, M. and Martí, M. A. (2010). AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315–345.

Recasens, M., Martí, M. A., Taulé, M., Màrquez, L., and Sapena, E. (2009a). SemEval-2010 Task 1: Coreference resolution in multiple languages. In Proceedings of the NAACL HLT Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 70–75, Boulder, Colorado.

Recasens, M., Martí, M. A., and Taulé, M. (2009b). First-mention definites: More than exceptional cases. In Featherston, S. and Winkler, S., editors, The Fruits of Empirical Linguistics, volume 2, pages 217–237. Mouton de Gruyter, Berlin.

Rich, E. and LuperFoy, S. (1988). An architecture for anaphora resolution. In Proceedings of ANLP 1988, pages 18–24. Austin, Texas.

Rodr´ıguez, K. J., Delogu, F., Versley, Y., Stemle, E., and Poesio, M. (2010). Anaphoric annotation of Wikipedia and blogs in the Live Memories Corpus. In Proceedings of LREC 2010, pages 157–163, Valletta, Malta.

Ruppenhofer, J., Sporleder, C., and Morante, R. (2010). SemEval-2010 Task 10: Linking events and their participants in discourse. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval 2010), pages 45–50, Uppsala, Sweden.

Russell, B. (1905). On denoting. Mind, 15:479–493.

Sapena, E., Padró, L., and Turmo, J. (2010). RelaxCor: A global relaxation labeling approach to coreference resolution for the Semeval-2010 Coreference Task. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval 2010), pages 88–91, Uppsala, Sweden.

Schmid, H. (1995). Improvements in part-of-speech tagging with an application to German. In Proceedings of the EACL SIGDAT Workshop, pages 47–50, Dublin.

Schmid, H. and Laws, F. (2008). Estimation of conditional probabilities with de- cision trees and an application to fine-grained POS tagging. In Proceedings of COLING 2008, pages 777–784, Manchester, UK.

Shinyama, Y. and Sekine, S. (2003). Paraphrase acquisition for information extrac- tion. In Proceedings of the ACL 2nd International Workshop on Paraphrasing (IWP 2003), pages 65–71, Sapporo, Japan.

Siegel, S. and Castellan, N. J. (1988). Nonparametric Statistics for the Behavioral Sciences. McGraw Hill, New York, second edition. Chapter 9.8.

Solà, J., editor (2002). Gramàtica del català contemporani. Empúries, Barcelona.

Soon, W. M., Ng, H. T., and Lim, D. C. Y. (2001). A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521– 544.

Stede, M. (2004). The Potsdam Commentary Corpus. In Proceedings of the ACL Workshop on Discourse Annotation, pages 96–102, Barcelona.

Steinberger, J., Poesio, M., Kabadjov, M. A., and Ježek, K. (2007). Two uses of anaphora resolution in summarization. Information Processing and Management: an International Journal, 43(6):1663–1680.

Stevenson, R., Crawley, R., and Kleinman, D. (1994). Thematic roles, focus and the representation of events. Language and Cognitive Processes, 9:519–548.

Stoyanov, V., Cardie, C., Gilbert, N., Riloff, E., Buttler, D., and Hysom, D. (2010). Coreference resolution with Reconcile. In Proceedings of ACL 2010 Short Pa- pers, pages 156–161, Uppsala, Sweden.

Stoyanov, V., Gilbert, N., Cardie, C., and Riloff, E. (2009). Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In Proceed- ings of ACL-IJCNLP 2009, pages 656–664, Suntec, Singapore.

Strube, M. and Müller, C. (2003). A machine learning approach to pronoun resolution in spoken dialogue. In Proceedings of ACL 2003, pages 168–175, Sapporo, Japan.

Strube, M., Rapp, S., and Müller, C. (2002). The influence of minimum edit distance on reference resolution. In Proceedings of ACL-EMNLP 2002, pages 312–319.

Taboada, M. (2008). Reference, centers and transitions in spoken Spanish. In Gundel, J. and Hedberg, N., editors, Reference and Reference Processing, pages 176–215. Oxford University Press, Oxford.


Taulé, M., Martí, M. A., and Recasens, M. (2008). AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of LREC 2008, pages 96–101, Marrakech, Morocco.

Tetreault, J. (1999). Analysis of syntax-based pronoun resolution methods. In Proceedings of ACL 1999, pages 602–605, College Park, Maryland.

Tetreault, J. (2001). A corpus-based evaluation of centering and pronoun resolu- tion. Computational Linguistics, 27(4):507–520.

Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition. In Daelemans, W. and Osborne, M., editors, Proceedings of CoNLL 2003, pages 142–147. Ed- monton, Canada.

Uryupina, O. (2003). High-precision identification of discourse-new and unique noun phrases. In Proceedings of the ACL 2003 Student Workshop, pages 80–86, Sapporo, Japan.

Uryupina, O. (2004). Linguistically motivated sample selection for coreference resolution. In Proceedings of DAARC 2004, Azores, Portugal.

Uryupina, O. (2006). Coreference resolution with and without linguistic knowl- edge. In Proceedings of LREC 2006, pages 893–898, Genoa, Italy.

Uryupina, O. (2007). Knowledge Acquisition for Coreference Resolution. PhD thesis, Saarland University, Saarbrücken, Germany.

Uryupina, O. (2008). Error analysis for learning-based coreference resolution. In Proceedings of LREC 2008, pages 1914–1919, Marrakech, Morocco.

Uryupina, O. (2010). Corry: A system for coreference resolution. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval 2010), pages 100–103, Uppsala, Sweden.

van Deemter, K. and Kibble, R. (2000). On coreferring: Coreference in MUC and related annotation schemes. Computational Linguistics, 26(4):629–637.

van Noord, G., Schuurman, I., and Vandeghinste, V. (2006). Syntactic annotation of large corpora in STEVIN. In Proceedings of LREC 2006, pages 1811–1814, Genoa, Italy.

Versley, Y. (2007). Antecedent selection techniques for high-recall coreference resolution. In Proceedings of EMNLP-CoNLL 2007, pages 496–505, Prague.

Versley, Y. (2008). Vagueness and referential ambiguity in a large-scale annotated corpus. Research on Language and Computation, 6:333–353.

Versley, Y., Ponzetto, S., Poesio, M., Eidelman, V., Jern, A., Smith, J., Yang, X., and Moschitti, A. (2008). BART: A modular toolkit for coreference resolution. In Proceedings of LREC 2008, pages 962–965, Marrakech, Morocco.

Vicedo, J. L. and Ferrández, A. (2006). Coreference in Q&A. In Strzalkowski, T. and Harabagiu, S., editors, Advances in Open Domain Question Answering, volume 32 of Text, Speech and Language Technology, pages 71–96. Springer-Verlag, Berlin.

Vieira, R. and Poesio, M. (2000). An empirically-based system for processing definite descriptions. Computational Linguistics, 26(4):539–593.

Vila, M., González, S., Martí, M. A., Llisterri, J., and Machuca, M. J. (2010). ClInt: a bilingual Spanish-Catalan spoken corpus of clinical interviews. Procesamiento del Lenguaje Natural, 45:105–111.

Vilain, M., Burger, J., Aberdeen, J., Connolly, D., and Hirschman, L. (1995). A model-theoretic coreference scoring scheme. In Proceedings of MUC-6, pages 45–52.

Vinh, N. X., Epps, J., and Bailey, J. (2009). Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In Proceedings of ICML 2009, pages 577–584, Montreal, Canada.

Webber, B. L. (1979). A Formal Approach to Discourse Anaphora. Garland Press, New York.

Webber, B. L. (1988). Discourse deixis: reference to discourse segments. In Pro- ceedings of ACL 1988, pages 113–122, Buffalo, New York.

Wick, M., Culotta, A., Rohanimanesh, K., and McCallum, A. (2009). An entity based model for coreference resolution. In Proceedings of SDM 2009, pages 365–376, Reno, Nevada.

Wick, M. and McCallum, A. (2009). Advances in learning and inference for partition-wise models of coreference resolution. Technical Report UM-CS- 2009-028, University of Massachusetts, Amherst, Massachusetts.

Wintner, S. (2009). What science underlies Natural Language Engineering? Com- putational Linguistics, 35(4):641–644.

Wittgenstein, L. (1953). Philosophical Investigations. Blackwell, Oxford.

Yang, X. and Su, J. (2007). Coreference resolution using semantic relatedness in- formation from automatically discovered patterns. In Proceedings of ACL 2007, pages 525–535, Prague.


Yang, X., Su, J., Lang, J., Tan, C. L., Liu, T., and Li, S. (2008). An entity-mention model for coreference resolution with inductive logic programming. In Proceed- ings of ACL-HLT 2008, pages 843–851, Columbus, Ohio.

Yang, X., Su, J., Zhou, G., and Tan, C. L. (2004). Improving pronoun resolu- tion by incorporating coreferential information of candidates. In Proceedings of ACL 2004, pages 127–134, Barcelona.

Yang, X., Zhou, G., Su, J., and Tan, C. L. (2003). Coreference resolution using competition learning approach. In Proceedings of ACL 2003, pages 176–183, Sapporo, Japan.

Zaenen, A. (2006). Mark-up barking up the wrong tree. Computational Linguistics, 32(4):577–580.

Zhekova, D. and Kübler, S. (2010). UBIU: A language-independent system for coreference resolution. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval 2010), pages 96–99, Uppsala, Sweden.

APPENDICES

APPENDIX A

System Outputs

This appendix provides, for two OntoNotes documents, the coreference outputs produced by different runs of CISTELL (i.e., STRONG MATCH, SUPER STRONG MATCH, BEST MATCH, WEAK MATCH; as explained in Section 4.3.2) as well as by the ALL-SINGLETONS and HEAD-MATCH baselines. In addition, the outputs produced by the six systems that participated in SemEval (Section 6.4) are included for the first file. Coreferent mentions share the same subscript number.
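
To make the notation in the listings below explicit: every opening bracket starts a mention, and every closing bracket followed by a subscript closes the most recently opened mention and assigns it to the entity bearing that number, so mentions sharing a subscript form one entity cluster. The following minimal sketch (in Python; it is not part of CISTELL, and the function name and the simplifying assumptions that brackets are well nested and always immediately followed by their subscript are mine) shows one way to recover the clusters from a fragment written in this format.

```python
import re
from collections import defaultdict

def parse_clusters(annotated_text):
    """Recover entity clusters from text in the appendix notation, where '['
    opens a mention and ']<n>' closes the most recently opened mention and
    assigns it to entity n. Returns {entity_id: [mention, ...]}.
    Simplifying assumptions: brackets are well nested and every closing
    bracket is immediately followed by its entity subscript."""
    clusters = defaultdict(list)
    open_starts = []   # token offsets where currently open mentions began
    tokens = []        # words seen so far (brackets and subscripts stripped)
    for piece in re.finditer(r"\[|\]\s*(\d+)|[^\[\]]+", annotated_text):
        text = piece.group(0)
        if text == "[":
            open_starts.append(len(tokens))
        elif text.startswith("]"):
            entity_id = int(piece.group(1))
            start = open_starts.pop()
            clusters[entity_id].append(" ".join(tokens[start:]))
        else:
            tokens.extend(text.split())
    return dict(clusters)

fragment = ("[ [The nation's]1 highest court]0 will take up [the case]2 "
            "[next week]3.")
for entity_id, mentions in sorted(parse_clusters(fragment).items()):
    print(entity_id, mentions)
# 0 ["The nation's highest court"]
# 1 ["The nation's"]
# 2 ['the case']
# 3 ['next week']
```

A stack of open positions suffices because nested mentions, such as a possessive inside a larger NP, close in the reverse order in which they open.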

A.1 OntoNotes file nbc 0030

The nation's highest court will take up the case next week. That development may not be as significant as it seems. Joining me now is law professor Rick Pildes, a consultant to NBC News. Could a decision from the U.S. Supreme Court settle this case once and for all? At this stage, any decision from the U.S. Supreme Court is almost certainly not going to provide a final resolution of this election dispute. Indeed, the issue is so narrow now before the Supreme Court that whichever way the court rules, it will likely have only the most marginal impact on what's going on in Florida. Even if the Bush campaign prevails before the Supreme Court, it simply means we will move more quickly into the contest phase of the litigation or the next stage of the litigation. But you believe the fact that the U.S. Supreme Court just decided to hear this case is a partial victory for both Bush and Gore. It is a partial victory for both sides. For the last two weeks, the central constitutional argument the Bush campaign has been making to the federal courts is, stop these manual recounts now, they violate the Constitution. The U.S. Supreme Court refused to hear that part of the case, agreeing with all the other federal judges who have unanimously held that this is not the proper time for federal court intervention. So in that sense, a victory for the Gore campaign. For the Bush campaign, a victory in the willingness of the Supreme Court to play some role in overseeing the Florida system and the Florida judicial decision making process. Whatever the Supreme Court decides this time, you say this case could come back before the U.S. Supreme Court again? John, if the Supreme Court of the United States is to play a final and decisive role in this dispute, that role is going to come at the end of the Florida judicial process, not at this stage. Law professor Rick Pildes, thank you.

1. GOLD [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]4 seems. Joining [me]5 now is [law professor Rick Pildes, a consultant to [NBC News]7]6. Could [a decision from [the U.S. Supreme Court]0]8 settle [this case]2 once and for [all]9? At [this stage]10, [any decision from [the U.S. Supreme Court]0]11 is almost certainly not going to provide [a final resolution of [this election dispute]13]12. Indeed, [the issue]14 is so narrow now before [the Supreme Court]0 that whichever way [the court]0 rules, [it]15 will likely have [only the most marginal impact on what’s going on in [Florida]17]16. Even if [the [Bush]19 campaign]18 prevails before [the Supreme Court]0, [it]20 simply means [we]21 will move more quickly into [the contest phase of [the litigation]23 or [the next stage of [the litigation]23]24]22. But [you]6 believe [the fact that [the U.S. Supreme Court]0 just decided to hear [this case]2]25 is [a partial victory for [both [Bush]19 and [Gore]28]27]26. [It]25 is [a partial victory for [both sides]30]29. For [the last two weeks]31, [the central constitutional argument [the [Bush]19 campaign]18 has been making to [the federal courts]33]32 is, stop [these manual recounts]34 now, [they]34 violate [the Constitution]35. [The U.S. Supreme Court]0 re- fused to hear [that part of [the case]2]32, agreeing with [all the other fed- eral judges who have unanimously held that [this]37 is not [the proper time for [federal court intervention]39]38]36. So in [that sense]40, [a victory for [the [Gore]28 campaign]42]41. For [the [Bush]19 campaign]18, [a victory in [the willingness of [the Supreme Court]0 to play [some role in over- seeing [the [Florida]17 system and [the [Florida]17 judicial decision mak- ing process]47]46]45]44]43. Whatever [the Supreme Court]0 de- cides [this time]48, [you]6 say [this case]2 could come back before [the U.S. Supreme Court]0 again? [John]5, if [the Supreme Court of [the United States]1]0 is to play [a final and decisive role in [this dispute]13]49, [that role]49 is going to come at [the end of [the [Florida]17 judicial

process]47]50, not at [this stage]10. [Law professor Rick Pildes]6, thank [you]6.

2. ALL-SINGLETONS BASELINE [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]5 seems. Joining [me]6 now is [law professor Rick Pildes, a consultant to [NBC News]8]7. Could [a de- cision from [the U.S. Supreme Court]10]9 settle [this case]11 once and for [all]12? At [this stage]13, [any decision from [the U.S. Supreme Court]15]14 is almost certainly not going to provide [a final resolution of [this election dispute]17]16. Indeed, [the issue]18 is so narrow now before [the Supreme Court]19 that whichever way [the court]20 rules, [it]21 will likely have [only the most marginal impact on what’s going on in [Florida]23]22. Even if [the [Bush]25 campaign]24 prevails before [the Supreme Court]26, [it]27 simply means [we]28 will move more quickly into [the contest phase of [the litigation]30 or [the next stage of [the litigation]32]31]29. But [you]33 believe [the fact that [the U.S. Supreme Court]35 just decided to hear [this case]36]34 is [a partial victory for [both [Bush]39 and [Gore]40]38]37. [It]41 is [a partial victory for [both sides]43]42. For [the last two weeks]44, [the central constitutional argument [the [Bush]47 campaign]46 has been making to [the federal courts]48]45 is, stop [these manual recounts]49 now, [they]50 violate [the Constitution]51. [The U.S. Supreme Court]52 re- fused to hear [that part of [the case]54]53, agreeing with [all the other fed- eral judges who have unanimously held that [this]56 is not [the proper time for [federal court intervention]58]57]55. So in [that sense]59, [a victory for [the [Gore]62 campaign]61]60. For [the [Bush]64 campaign]63, [a victory in [the willingness of [the Supreme Court]67 to play [some role in oversee- ing [the [Florida]70 system and [the [Florida]72 judicial decision making process]71]69]68]66]65. Whatever [the Supreme Court]73 decides [this time]74, [you]75 say [this case]76 could come back before [the U.S. Supreme Court]77 again? [John]78, if [the Supreme Court of [the United States]80]79 is to play [a final and decisive role in [this dispute]82]81, [that role]83 is going to come at [the end of [the [Florida]86 judicial process]85]84, not at [this stage]87. [Law professor Rick Pildes]88, thank [you]89.

3. HEAD-MATCH BASELINE [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]5 seems. Joining [me]6 now is [law professor Rick Pildes, a consultant to [NBC News]8]7. Could [a de- cision from [the U.S. Supreme Court]10]9 settle [this case]2 once and for [all]11? At [this stage]12, [any decision from [the U.S. Supreme Court]10]9 is almost certainly not going to provide [a final resolution of [this election dispute]14]13. Indeed, [the issue]15 is so narrow now before [the

Supreme Court]10 that whichever way [the court]0 rules, [it]16 will likely have [only the most marginal impact on what’s going on in [Florida]51]17. Even if [the Bush campaign]18 prevails before [the Supreme Court]10, [it]19 simply means [we]20 will move more quickly into [the contest phase of [the litigation]22 or [the next stage of [the litigation]22]23]21. But [you]24 believe [the fact that [the U.S. Supreme Court]10 just decided to hear [this case]2]25 is [a partial victory for [both Bush and Gore]27]26. [It]28 is [a partial victory for [both sides]29]26. For [the last two weeks]30, [the central constitutional argument [the Bush campaign]18 has been making to [the federal courts]32]31 is, stop [these manual recounts]33 now, [they]34 violate [the Constitution]35. [The U.S. Supreme Court]10 re- fused to hear [that part of [the case]2]36, agreeing with [all the other federal judges who have unanimously held that [this]38 is not [the proper time for [federal court intervention]40]39]37. So in [that sense]41, [a victory for [the Gore campaign]18]26. For [the Bush campaign]18, [a victory in [the willing- ness of [the Supreme Court]10 to play [some role in overseeing [the Florida system and [the Florida judicial decision making process]45]44]43]42]26. Whatever [the Supreme Court]10 decides [this time]30, [you]46 say [this case]2 could come back before [the U.S. Supreme Court]10 again? [John]47, if [the Supreme Court of [the United States]48]10 is to play [a final and decisive role in [this dispute]14]43, [that role]43 is going to come at [the end of [the Florida judicial process]45]49, not at [this stage]13. [Law professor Rick Pildes]7, thank [you]50.

4. STRONG MATCH [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]2 seems. Joining [me]5 now is [law professor Rick Pildes, a consultant to [NBC News]7]6. Could [a deci- sion from [the U.S. Supreme Court]0]5 settle [this case]2 once and for [all]8? At [this stage]9, [any decision from [the U.S. Supreme Court]0]5 is almost certainly not going to provide [a final resolution of [this election dispute]11]10. Indeed, [the issue]12 is so narrow now before [the Supreme Court]0 that whichever way [the court]0 rules, [it]2 will likely have [only the most marginal impact on what’s going on in [Florida]14]13. Even if [the Bush campaign]15 prevails before [the Supreme Court]0, [it]2 simply means [we]16 will move more quickly into [the contest phase of [the litigation]18 or [the next stage of [the litigation]18]9]17. But [you]14 believe [the fact that [the U.S. Supreme Court]0 just decided to hear [this case]19]7 is [a partial victory for [both Bush and Gore]21]20. [It]22 is [a partial victory for [both sides]23]20. For [the last two weeks]24, [the central constitutional argument [the Bush campaign]15 has been making to [the federal courts]26]25 is, stop [these manual recounts]27 now, [they]27 violate [the Constitution]28. [The U.S. Supreme Court]0 refused to hear [that part of [the case]2]29, agree- ing with [all the other federal judges who have unanimously held that [this]19

is not [the proper time for [federal court intervention]32]31]30. So in [that sense]33, [a victory for [the Gore campaign]15]34. For [the Bush campaign]15, [a victory in [the willingness of [the Supreme Court]0 to play [some role in overseeing [the Florida system and [the Florida judicial decision mak- ing process]38]37]36]35]34. Whatever [the Supreme Court]0 de- cides [this time]31, [you]14 say [this case]2 could come back before [the U.S. Supreme Court]22 again? [John]39, if [the Supreme Court of [the United States]1]0 is to play [a final and decisive role in [this dispute]11]40, [that role]36 is going to come at [the end of [the Florida judicial process]38]41, not at [this stage]9. [Law professor Rick Pildes]6, thank [you]6.

5. SUPER STRONG MATCH [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]2 seems. Joining [me]5 now is [law professor Rick Pildes, a consultant to [NBC News]7]6. Could [a decision from [the U.S. Supreme Court]8]5 settle [this case]9 once and for [all]10? At [this stage]11, [any decision from [the U.S. Supreme Court]13]12 is almost certainly not going to provide [a final resolution of [this election dispute]15]14. Indeed, [the issue]16 is so narrow now before [the Supreme Court]8 that whichever way [the court]0 rules, [it]2 will likely have [only the most marginal impact on what’s going on in [Florida]18]17. Even if [the Bush campaign]19 prevails before [the Supreme Court]20, [it]2 simply means [we]21 will move more quickly into [the contest phase of [the litigation]23 or [the next stage of [the litigation]23]11]22. But [you]18 believe [the fact that [the U.S. Supreme Court]24 just decided to hear [this case]25]7 is [a partial victory for [both Bush and Gore]27]26. [It]2 is [a partial victory for [both sides]29]28. For [the last two weeks]30, [the central constitutional argument [the Bush campaign]32 has been making to [the federal courts]33]31 is, stop [these manual recounts]34 now, [they]34 violate [the Constitution]35. [The U.S. Supreme Court]36 re- fused to hear [that part of [the case]2]37, agreeing with [all the other federal judges who have unanimously held that [this]9 is not [the proper time for [federal court intervention]40]39]38. So in [that sense]41, [a victory for [the Gore campaign]19]42. For [the Bush campaign]43, [a victory in [the willing- ness of [the Supreme Court]46 to play [some role in overseeing [the Florida system and [the Florida judicial decision making process]49]48]47]45]44. Whatever [the Supreme Court]50 decides [this time]51, [you]18 say [this case]52 could come back before [the U.S. Supreme Court]20 again? [John]53, if [the Supreme Court of [the United States]1]54 is to play [a final and decisive role in [this dispute]15]55, [that role]56 is going to come at [the end of [the Florida judicial process]58]57, not at [this stage]59. [Law professor Rick Pildes]6, thank [you]6.

6. BEST MATCH [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]2 seems. Joining [me]5 now is [law professor Rick Pildes, a consultant to [NBC News]7]6. Could [a decision from [the U.S. Supreme Court]8]5 settle [this case]2 once and for [all]9? At [this stage]10, [any decision from [the U.S. Supreme Court]11]5 is almost certainly not going to provide [a final resolution of [this election dispute]13]12. Indeed, [the issue]14 is so narrow now before [the Supreme Court]15 that whichever way [the court]16 rules, [it]2 will likely have [only the most marginal impact on what’s going on in [Florida]18]17. Even if [the Bush campaign]19 prevails before [the Supreme Court]20, [it]2 simply means [we]21 will move more quickly into [the contest phase of [the litigation]23 or [the next stage of [the litigation]23]10]22. But [you]24 believe [the fact that [the U.S. Supreme Court]25 just decided to hear [this case]26]19 is [a partial victory for [both Bush and Gore]28]27. [It]29 is [a partial victory for [both sides]30]27. For [the last two weeks]31, [the central constitutional argument [the Bush campaign]33 has been making to [the federal courts]34]32 is, stop [these manual recounts]35 now, [they]36 violate [the Constitution]37. [The U.S. Supreme Court]29 re- fused to hear [that part of [the case]2]38, agreeing with [all the other federal judges who have unanimously held that [this]40 is not [the proper time for [federal court intervention]42]31]41. So in [that sense]43, [a victory for [the Gore campaign]45]44. For [the Bush campaign]46, [a victory in [the willing- ness of [the Supreme Court]49 to play [some role in overseeing [the Florida system and [the Florida judicial decision making process]52]51]50]48]47. Whatever [the Supreme Court]53 decides [this time]41, [you]24 say [this case]2 could come back before [the U.S. Supreme Court]54 again? [John]55, if [the Supreme Court of [the United States]1]56 is to play [a final and decisive role in [this dispute]13]57, [that role]58 is going to come at [the end of [the Florida judicial process]52]59, not at [this stage]60. [Law professor Rick Pildes]61, thank [you]24.

7. WEAK MATCH [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]2 seems. Joining [me]5 now is [law professor Rick Pildes, a consultant to [NBC News]7]6. Could [a deci- sion from [the U.S. Supreme Court]0]5 settle [this case]2 once and for [all]8? At [this stage]2, [any decision from [the U.S. Supreme Court]0]0 is almost certainly not going to provide [a final resolution of [this election dispute]10]9. Indeed, [the issue]11 is so narrow now before [the Supreme Court]0 that whichever way [the court]0 rules, [it]0 will likely have [only the most marginal impact on what’s going on in [Florida]13]12. Even if [the Bush campaign]0 prevails before [the Supreme Court]0, [it]0 simply means [we]14 will move more quickly into [the contest phase of [the litigation]16 or [the

next stage of [the litigation]16]2]15. But [you]5 believe [the fact that [the U.S. Supreme Court]0 just decided to hear [this case]0]0 is [a partial victory for [both Bush and Gore]18]17. [It]0 is [a partial victory for [both sides]19]17. For [the last two weeks]20, [the central constitutional argument [the Bush campaign]0 has been making to [the federal courts]22]21 is, stop [these manual recounts]23 now, [they]2 violate [the Constitution]24. [The U.S. Supreme Court]0 refused to hear [that part of [the case]0]25, agree- ing with [all the other federal judges who have unanimously held that [this]0 is not [the proper time for [federal court intervention]27]0]26. So in [that sense]28, [a victory for [the Gore campaign]0]29. For [the Bush campaign]0, [a victory in [the willingness of [the Supreme Court]0 to play [some role in overseeing [the Florida system and [the Florida judicial decision mak- ing process]33]32]31]30]17. Whatever [the Supreme Court]0 de- cides [this time]0, [you]5 say [this case]0 could come back before [the U.S. Supreme Court]0 again? [John]34, if [the Supreme Court of [the United States]1]0 is to play [a final and decisive role in [this dispute]10]35, [that role]0 is going to come at [the end of [the Florida judicial process]33]36, not at [this stage]0. [Law professor Rick Pildes]5, thank [you]0.

8. SemEval-2010 RELAXCOR (Sapena et al., 2010) [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]4 seems. Joining [me]5 now is [law professor Rick Pildes, a consultant to [NBC News]7]6. Could [a de- cision from [the U.S. Supreme Court]9]8 settle [this case]10 once and for [all]11? At [this stage]12, [any decision from [the U.S. Supreme Court]14]13 is almost certainly not going to provide [a final resolution of [this election dispute]16]15. Indeed, [the issue]17 is so narrow now before [the Supreme Court]18 that whichever way [the court]0 rules, [it]17 will likely have [only the most marginal impact on what’s going on in [Florida]20]19. Even if [the [Bush]22 campaign]21 prevails before [the Supreme Court]23, [it]17 simply means [we]24 will move more quickly into [the contest phase of [the litigation]26 or [the next stage of [the litigation]26]27]25. But [you]28 believe [the fact that [the U.S. Supreme Court]30 just decided to hear [this case]31]29 is [a partial victory for [both [Bush]34 and [Gore]35]33]32. [It]29 is [a partial victory for [both sides]37]36. For [the last two weeks]38, [the central constitutional argument [the [Bush]40 campaign]21 has been making to [the federal courts]41]39 is, stop [these manual recounts]42 now, [they]43 violate [the Constitution]44. [The U.S. Supreme Court]45 re- fused to hear [that part of [the case]47]46, agreeing with [all the other fed- eral judges who have unanimously held that [this]49 is not [the proper time for [federal court intervention]51]50]48. So in [that sense]52, [a victory for [the [Gore]54 campaign]21]53. For [the [Bush]55 campaign]21, [a victory in [the willingness of [the Supreme Court]58 to play [some role in oversee- ing [the [Florida]61 system and [the [Florida]63 judicial decision making

process]62]60]59]57]56. Whatever [the Supreme Court]64 decides [this time]50, [you]65 say [this case]66 could come back before [the U.S. Supreme Court]67 again? [John]68, if [the Supreme Court of [the United States]70]69 is to play [a final and decisive role in [this dispute]72]71, [that role]73 is going to come at [the end of [the [Florida]76 judicial process]75]74, not at [this stage]77. [Law professor Rick Pildes]78, thank [you]79.

9. SemEval-2010 SUCRE (Kobdani and Schutze,¨ 2010) [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]4 seems. Joining [me]4 now is [law professor Rick Pildes, a consultant to [NBC News]5]4. Could [a deci- sion from [the U.S. Supreme Court]0]6 settle [this case]2 once and for [all]7? At [this stage]8, [any decision from [the U.S. Supreme Court]0]9 is almost certainly not going to provide [a final resolution of [this election dispute]11]10. Indeed, [the issue]0 is so narrow now before [the Supreme Court]0 that whichever way [the court]0 rules, [it]0 will likely have [only the most marginal impact on what’s going on in [Florida]13]12. Even if [the [Bush]14 campaign]0 prevails before [the Supreme Court]0, [it]0 sim- ply means [we]15 will move more quickly into [the contest phase of [the litigation]17 or [the next stage of [the litigation]17]8]16. But [you]0 believe [the fact that [the U.S. Supreme Court]0 just decided to hear [this case]2]18 is [a partial victory for [both [Bush]14 and [Gore]20]14]19. [It]0 is [a partial victory for [both sides]22]21. For [the last two weeks]23, [the central constitutional argument [the [Bush]14 campaign]0 has been making to [the federal courts]25]24 is, stop [these manual recounts]26 now, [they]27 violate [the Constitution]28. [The U.S. Supreme Court]0 re- fused to hear [that part of [the case]2]29, agreeing with [all the other fed- eral judges who have unanimously held that [this]31 is not [the proper time for [federal court intervention]33]32]30. So in [that sense]34, [a victory for [the [Gore]20 campaign]0]21. For [the [Bush]14 campaign]0, [a victory in [the willingness of [the Supreme Court]0 to play [some role in oversee- ing [the [Florida]13 system and [the [Florida]13 judicial decision making process]38]37]36]35]21. Whatever [the Supreme Court]0 decides [this time]32, [you]0 say [this case]2 could come back before [the U.S. Supreme Court]0 again? [John]39, if [the Supreme Court of [the United States]40]0 is to play [a final and decisive role in [this dispute]11]36, [that role]36 is going to come at [the end of [the [Florida]13 judicial process]38]41, not at [this stage]8. [Law professor Rick Pildes]4, thank [you]4.

10. SemEval-2010 TANL-1 (Attardi et al., 2010) [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]5 seems. Joining [me]6 now is

[law professor [Rick Pildes]7, a consultant to [NBC News]8]7. Could [a de- cision from [the [U.S. Supreme Court]11]10]9 settle [this case]11 once and for [all]12? At [this stage]13, [any decision from [the [U.S. Supreme Court]11]10]14 is almost certainly not going to provide [a final resolution of [this election dispute]16]15. Indeed, [the issue]17 is so narrow now before [the [Supreme Court]11]10 that whichever way [the court]18 rules, [it]19 will likely have [only the most marginal impact on what’s going on in [Florida]21]20. Even if [the [Bush]23 campaign]22 prevails before [the [Supreme Court]11]10, [it]24 simply means [we]25 will move more quickly into [the contest phase of [the litigation]27 or [the next stage of [the litigation]29]28]26. But [you]30 believe [the fact that [the [U.S. Supreme Court]11]10 just de- cided to hear [this case]32]31 is [a partial victory for [both [Bush]35 and [Gore]36]34]33. [It]37 is [a partial victory for [both sides]39]38. For [the last two weeks]40, [the central constitutional argument [the [Bush]35 campaign]42 has been making to [the federal courts]43]41 is, stop [these manual recounts]44 now, [they]45 violate [the Constitution]46. [The [U.S. Supreme Court]11]10 refused to hear [that part of [the case]48]47, agreeing with [all the other federal judges who have unanimously held that [this]50 is not [the proper time for [federal court intervention]52]51]49. So in [that sense]53, [a victory for [the [Gore]36 campaign]55]54. For [the [Bush]35 campaign]56, [a victory in [the willingness of [the [Supreme Court]11]59 to play [some role in overseeing [the [Florida]62 system and [the [Florida]62 judicial decision making process]63]61]60]58]57. Whatever [the [Supreme Court]11]59 decides [this time]64, [you]65 say [this case]66 could come back before [the [U.S. Supreme Court]68]67 again? [John]69, if [the [Supreme Court]11 of [the [United States]71]70]67 is to play [a final and decisive role in [this dispute]73]72, [that role]74 is going to come at [the end of [the [Florida]77 judicial process]76]75, not at [this stage]78. [Law professor [Rick Pildes]7]79, thank [you]80.

11. SemEval-2010 UBIU (Zhekova and Kubler,¨ 2010) [The nation’s highest court]0 will take up [the case]1 [next week]2. [That development]3 may not be as significant as [it]3 seems. Joining [me]0 now is [law professor Rick Pildes, a consultant to NBC News]4. Could [a de- cision from [the U.S. Supreme Court]5]4 settle [this case]5 once and for [all]6? At [this stage]5, [any decision from [the U.S. Supreme Court]5]7 is almost certainly not going to provide [a final resolution of [this election dispute]9]8. Indeed, [the issue]10 is so narrow now before [the Supreme Court]5 that [whichever way the court]11 rules, [it]11 will likely have [only the most marginal impact on what’s going on in Florida]12. Even if [the [Bush]14 campaign]13 prevails before [the Supreme Court]15, [it]11 simply means [we]5 will move more quickly into [the contest phase of [the litigation]17 or [the next stage of [the litigation]19]18]16. But [you]20 believe [the fact that [the U.S. Supreme Court]15 just decided to hear

[this case]22]21 is [a partial victory for [both [Bush]14 and [Gore]25]24]23. [It]26 is [a partial victory for [both sides]28]27. For [the last two weeks]29, [the central constitutional argument [the [Bush]14 campaign]31 has been making to [the federal courts]32]30 is, stop [these manual recounts]33 now, [they]20 violate [the Constitution]34. [The U.S. Supreme Court]35 re- fused to hear [that part of [the case]36]35, agreeing with [all the other fed- eral judges who have unanimously held that [this]38 is not [the proper time for [federal court intervention]40]39]37. So in [that sense]41, [a victory for [the [Gore]43 campaign]41]42. For [the [Bush]45 campaign]44, [a victory in [the willingness of [the Supreme Court]48 to play [some role in oversee- ing [the [Florida]51 system and [the [Florida]53 judicial decision making process]52]50]49]47]46. Whatever [the Supreme Court]48 decides [this time]49, [you]48 say [this case]50 could come back before [the U.S. Supreme Court]51 again? [John]52, if [the Supreme Court of [the United States]54]53 is to play [a final and decisive role in this dispute]55, [that role]56 is going to come at [the end of the [Florida]58 judicial process]57, not at [this stage]59. [Law professor Rick Pildes]60, thank [you]48.

12. SemEval-2010 CORRY-C (Uryupina, 2010) [ [The nation’s]1 highest court]0 will take up [the case]2 [next week]3. [That development]4 may not be as significant as [it]5 seems. Joining [me]4 now is [law professor Rick Pildes, a consultant to [NBC News]7]6. Could [a decision from [the U.S. Supreme Court]9]8 settle [this case]2 once and for [all]10? At [this stage]11, [any decision from [the U.S. Supreme Court]9]8 is almost certainly not going to provide [a final resolution of [this election dispute]13]12. Indeed, [the issue]14 is so narrow now before [the Supreme Court]9 that whichever way [the court]15 rules, [it]14 will likely have [only the most marginal impact on what’s going on in [Florida]17]16. Even if [the [Bush]19 campaign]18 prevails before [the Supreme Court]9, [it]14 simply means [we]20 will move more quickly into [the contest phase of [the litigation]22 or [the next stage of [the litigation]22]23]21. But [you]24 believe [the fact that [the U.S. Supreme Court]9 just decided to hear [this case]2]25 is [a partial victory for [both [Bush]19 and [Gore]28]27]26. [It]14 is [a partial victory for [both sides]30]29. For [the last two weeks]31, [the central constitutional argument [the [Bush]19 campaign]18 has been making to [the federal courts]33]32 is, stop [these manual recounts]34 now, [they]31 violate [the Constitution]35. [The U.S. Supreme Court]9 re- fused to hear [that part of [the case]2]36, agreeing with [all the other fed- eral judges who have unanimously held that [this]38 is not [the proper time for [federal court intervention]40]39]37. So in [that sense]41, [a victory for [the [Gore]28 campaign]43]42. For [the [Bush]41 campaign]19, [a victory in [the willingness of [the Supreme Court]9 to play [some role in over- seeing [the [Florida]17 system and [the [Florida]47 judicial decision mak- ing process]17]46]45]44]18. Whatever [the Supreme Court]9 de-

cides [this time]48, [you]24 say [this case]2 could come back before [the U.S. Supreme Court]9 again? [John]49, if [the Supreme Court of [the United States]50]9 is to play [a final and decisive role in [this dispute]52]51, [that role]45 is going to come at [the end of [the [Florida]17 judicial process]54]53, not at [this stage]11. [Law professor Rick Pildes]6, thank [you]24. 13. SemEval-2010 BART (Broscheit et al., 2010) [The nation’s]0 highest court will take up [the case]1 [next week]2. [That development]3 may not be as significant as [it]4 seems. Joining [me]5 now is [ [law]7 professor Rick Pildes]6, [a consultant]8 to [NBC News]9. Could [a decision]10 from the U.S. [Supreme Court]11 settle [this case]1 once and for all? At [this stage]12, [any decision]10 from the [U.S. Supreme Court]11 is almost certainly not going to provide [a final resolution]13 of [this [election]15 dispute]14. Indeed, [the issue]16 is so narrow now before the [Supreme Court]17 that whichever way [the court]18 rules, [it]4 will likely have [only the most marginal impact]19 on what’s going on in [Florida]20. Even if [the [Bush]22 campaign]21 prevails before the [Supreme Court]17, [it]4 simply means [we]23 will move more quickly into [the [contest]25 phase]24 of [the litigation]26 or [the next stage of [the litigation]26]12. But [you]27 believe the fact that the [U.S. Supreme Court]11 just decided to hear [this case]1 is [a partial victory]28 for [both [Bush]22 and [Gore]30]29. [It]11 is [a partial victory]28 for [both sides]31. For [the last two weeks]32, [the central constitutional argument]33 [the [Bush]22 campaign]21 has been making to [the federal courts]34 is, stop [these manual recounts]35 now, [they]36 violate [the Constitution]37. The [U.S. Supreme Court]11 refused to hear that [part]38 of [the case]1, agreeing with [all the other federal judges]39 who have unanimously held that [this]40 is not [the proper time]41 for [federal [court]18 intervention]42. So in [that sense]43, [a victory]44 for [the [Gore]30 campaign]21. For [the [Bush]22 campaign]21, [a victory]44 in [the willingness]45 of the [Supreme Court]17 to play [some role]46 in overseeing [the [Florida]20 system]47 and [the [ [Florida]20 ju- dicial decision]49 making process]48. Whatever the [Supreme Court]17 decides [this time]41, [you]27 say [this case]1 could come back be- fore the [U.S. Supreme Court]11 again? [John]50, if the [Supreme Court]17 of [the United States]51 is to play [a final and decisive role]52 in [this dispute]14, [that role]46 is going to come at [the end]53 of [the [Florida]20 ju- dicial process]48, not at [this stage]12. [ [Law]7 professor Rick Pildes]6, thank [you]27.

A.2 OntoNotes file voa 0207

Cuban leader Fidel Castro is setting up a lavish extravaganza on the island nation to welcome the new millennium, one year late for much of the rest of the world. Many experts contend most of the world was at least technically wrong by bringing in the new millennium with massive celebrations last year. These experts point out that the Gregorian calendar started in 1 AD and therefore, centuries’ millennia start with a one, not a zero. They say this makes 2001 the first year of the third millennium. For those observing the start of 2001 as a true dawn of the twenty- first century, the parties and fireworks are fewer and less elaborate than the 2000 celebrations. In Cuba though, where President Castro had his country sit out last year’s revelry, they’ll be making up for it as major festivities are set.

1. GOLD [Cuban leader Fidel Castro]0 is setting up [a lavish extravaganza on [the island nation]2]1 to welcome [the new millennium]3, [one year]4 late for [much of [the rest of [the world]7]6]5. [Many experts]8 contend [most of [the world]7]9 was at least technically wrong by bringing in [the new millennium]3 with [massive celebrations]10 [last year]11. [These experts]8 point out that [the Gregorian calendar]12 started in [1 AD]13 and therefore, [ [centuries’]15 millennia]14 start with [a one, not [a zero]17 ]16. [They]8 say [this]18 makes [2001]19 [the first year of [the third millennium]3]20. For [those observing [the start of [2001]19]22 as [a true dawn of [the twenty-first century]24 ]23 ]21, [the parties and fireworks]25 are fewer and less elaborate than [the 2000 celebrations]26. In [Cuba]2 though, where [President Castro]0 had [ [his]0 country]2 sit out [ [last year’s]11 revelry]26, [they]2’ll be making up for [it]27 as [major festivities]28 are set.

2. ALL-SINGLETONS BASELINE [Cuban leader Fidel Castro]0 is setting up [a lavish extravaganza on [the island nation]2]1 to welcome [the new millennium]3, [one year]4 late for [much of [the rest of [the world]7]6]5. [Many experts]8 contend [most of [the world]10]9 was at least technically wrong by bringing in [the new millennium]11 with [massive celebrations]12 [last year]13. [These experts]14 point out that [the Gregorian calendar]15 started in [1 AD]16 and therefore, [ [centuries’]18 millennia]17 start with [a one, not [a zero]20 ]19. [They]21 say [this]22 makes [2001]23 [the first year of [the third millennium]25]24. For [those observing [the start of [2001]28]27 as [a true dawn of [the twenty- first century]30 ]29 ]26, [the parties and fireworks]31 are fewer and less elab- orate than [the 2000 celebrations]32. In [Cuba]33 though, where [President Castro]34 had [ [his]36 country]35 sit out [ [last year’s]38 revelry]37, [they]39’ll be making up for [it]40 as [major festivities]41 are set.

3. HEAD-MATCH BASELINE [Cuban leader Fidel Castro]0 is setting up [a lavish extravaganza on [the island nation]2]1 to welcome [the new millennium]3, [one year]4 late for [much of [the rest of [the world]7]6]5. [Many experts]8 contend [most of [the world]7]9 was at least technically wrong by bringing in [the new millennium]3 with [massive celebrations]10 [last year]4. [These experts]8 point out that [the Gregorian calendar]11 started in [1 AD]12 and therefore, [ [centuries’]14 millennia]13 start with [a one, not [a zero]16 ]15. [They]17 say [this]18 makes [2001]19 [the first year of [the third millennium]3]4. For [those observing [the start of [2001]19]21 as [a true dawn of [the twenty-first century]23 ]22 ]20, [the parties and fireworks]24 are fewer and less elaborate than [the 2000 celebrations]10. In [Cuba]25 though, where [President Castro]0 had [ [his]27 country]26 sit out [ [last year’s]4 revelry]28, [they]29’ll be making up for [it]30 as [major festivities]31 are set.

4. STRONG MATCH [Cuban leader Fidel Castro]0 is setting up [a lavish extravaganza on [the island nation]2]1 to welcome [the new millennium]3, [one year]3 late for [much of [the rest of [the world]6]5]4. [Many experts]7 contend [most of [the world]6]8 was at least technically wrong by bringing in [the new millennium]3 with [massive celebrations]9 [last year]3. [These experts]10 point out that [the Gregorian calendar]11 started in [1 AD]12 and therefore, [ [centuries’]14 millennia]13 start with [a one, not [a zero]16 ]15. [They]10 say [this]17 makes [2001]18 [the first year of [the third millennium]3]3. For [those observing [the start of [2001]18]20 as [a true dawn of [the twenty-first century]13 ]21 ]19, [the parties and fireworks]22 are fewer and less elaborate than [the 2000 celebrations]9. In [Cuba]23 though, where [President Castro]0 had [ [his]0 country]24 sit out [ [last year’s]26 revelry]25, [they]7’ll be making up for [it]19 as [major festivities]27 are set.

5. SUPER STRONG MATCH [Cuban leader Fidel Castro]0 is setting up [a lavish extravaganza on [the island nation]2]1 to welcome [the new millennium]3, [one year]4 late for [much of [the rest of [the world]7]6]5. [Many experts]8 contend [most of [the world]7]9 was at least technically wrong by bringing in [the new millennium]3 with [massive celebrations]10 [last year]11. [These experts]12 point out that [the Gregorian calendar]13 started in [1 AD]14 and therefore, [ [centuries’]16 millennia]15 start with [a one, not [a zero]18 ]17. [They]12 say [this]19 makes [2001]20 [the first year of [the third millennium]3]21. For [those observing [the start of [2001]20]23 as [a true dawn of [the twenty-first century]15 ]24 ]22, [the parties and fireworks]25 are fewer and less elaborate than [the 2000 celebrations]10. In [Cuba]26 though, where [President Castro]0 had [ [his]0 country]27 sit out [ [last year’s]29 revelry]28, [they]8’ll be making up for [it]21 as [major festivities]30 are set.

6. BEST MATCH [Cuban leader Fidel Castro]0 is setting up [a lavish extravaganza on [the island nation]2]1 to welcome [the new millennium]3, [one year]3 late for [much of [the rest of [the world]6]5]4. [Many experts]7 contend [most of [the world]6]8 was at least technically wrong by bringing in [the new millennium]3 with [massive celebrations]9 [last year]3. [These experts]10 point out that [the Gregorian calendar]11 started in [1 AD]12 and therefore, [ [centuries’]14 millennia]13 start with [a one, not [a zero]16 ]15. [They]10 say [this]17 makes [2001]18 [the first year of [the third millennium]3]3. For [those observing [the start of [2001]18]20 as [a true dawn of [the twenty-first century]13 ]21 ]19, [the parties and fireworks]22 are fewer and less elaborate than [the 2000 celebrations]9. In [Cuba]23 though, where [President Castro]0 had [ [his]0 country]24 sit out [ [last year’s]26 revelry]25, [they]10’ll be making up for [it]19 as [major festivities]27 are set.

7. WEAK MATCH [Cuban leader Fidel Castro]0 is setting up [a lavish extravaganza on [the island nation]2]1 to welcome [the new millennium]3, [one year]3 late for [much of [the rest of [the world]6]5]4. [Many experts]7 contend [most of [the world]6]8 was at least technically wrong by bringing in [the new millennium]3 with [massive celebrations]9 [last year]3. [These experts]10 point out that [the Gregorian calendar]11 started in [1 AD]12 and therefore, [ [centuries’]14 millennia]13 start with [a one, not [a zero]16 ]15. [They]10 say [this]17 makes [2001]18 [the first year of [the third millennium]3]3. For [those observing [the start of [2001]18]20 as [a true dawn of [the twenty-first century]3 ]21 ]19, [the parties and fireworks]10 are fewer and less elaborate than [the 2000 celebrations]9. In [Cuba]22 though, where [President Castro]0 had [ [his]0 country]23 sit out [ [last year’s]3 revelry]24, [they]7’ll be making up for [it]3 as [major festivities]25 are set.

APPENDIX B

Near-identity Excerpts

This appendix includes the corpus of 60 excerpts that were used in groups of 20 in the three experiments described in Section 8.5. The excerpts were extracted from three electronic corpora—ACE (Doddington et al., 2004), OntoNotes (Pradhan et al., 2007a) and AnCora (Recasens and Martí, 2010)—as well as from the Web, a television show, and real conversation.

The task required coders to classify the selected pairs of NPs in each excerpt according to the (near-)identity relation(s) that obtained between them (Section 8.4). They had to assign at least one class, and possibly more, to each pair of NPs. The answers are summarized in Appendix C.

B.1 Experiment 1

(1) [Firestone]1 chairman John Lampe, on a telephone conference call with reporters this afternoon . . . I see the concern in people’s faces. And they’re very apprehensive about purchasing [Firestones]2.

(2) Hoddle does not resign after his opinion about [the disabled]1. The Times had published some declarations of the English manager in which he said that “[the physically and mentally disabled]2 pay for the sins they committed in a previous life.”

(3) [A beloved American holiday story]1 comes to the big screen in [a Universal Pictures comic fantasy starring Jim Carey]2. Alan Silverman has a look at the first feature film adaptation of Dr. Seuss's How the Grinch Stole Christmas . . . [it]3's the whimsical story of the Grinch . . . Director Ron Howard set out to film [the fantasy]4, not as a cartoon, but with actors in costumes and settings in the spirit of [the book]5.

(4) Juan Carlos Ferrero and Francisco Clavet, the two last hopes of Spanish male tennis in [the Australian Open]1, were eliminated today in the third round . . . Ferrero had become one of the revelations of [the tournament]2 . . . It is his best performance in the [Australian Open, where he had never progressed past the second round]3.

(5) As the sun rises over [Mt. Popo]1 tonight, the only hint of the fire storm inside, whiffs of smoke . . . [The fourth largest mountain in North America, nearly 18,000 feet high]2, erupting this week with its most violent outburst in 1,200 years.

(6) [US]1 victims of terrorism have been able to sue foreign governments since 1996. But under legislation passed this month, many victims will actually get their money. The money, at least at first, will come from the US treasury. [The government]2 expects to get it back from frozen Iranian assets held in [this country]3.

(7) It’s the whimsical story of the Grinch, a mean spirited hairy green creature who menaces the holiday loving Hus until an innocent child Mary Lu Hu teaches him to find the joy in life . . . [Starter Jim Carey]1 says the Grinch is more than just a cold hearted character. [He]2 is the outcast . . . [Carey]3 performs covered head to toe in that green-haired costume . . . Oh, you will recognize [me]4.

(8) The gigantic international auction house Sotheby’s pleaded guilty to price- fixing with Christie’s—its only real competition in an industry that does $4 billion in business every year . . . [The cartel]1 consisted of [Sotheby’s and Christie’s]2. [Arch rivals for nearly three centuries, the two auction houses]3 agreed to fix prices on what [they]4 charged the buyers and sellers of high-priced art . . . [Sotheby’s and Christie’s]5 are all about money.

(9) In France, [the president]1 is elected for a term of seven years, while in the United States [he]2 is elected for a term of four years.

(10) Fishermen on this Canadian island province have shared tales of their catch. Lobster in recent years. But not too long ago, [another delicacy

– salmon]1. Oh, yeah, we used to get [salmon]2 in the spring, but we don't see [it]3 anymore. I think [they]4 are pretty well wiped out . . . it's important people know if creating supersalmon to feed human appetites could threaten [normal salmon]5.

(11) Montse Aguer claimed that there is an image of [Dalí]1, which is the easiest one: [the provocative Dalí]2, whose most popular works are known.

(12) Juan Antonio Samaranch asked the Australian city to provide [certain information]1 . . . President Samaranch sent a letter to Sydney in which he asked for [information]2.

(13) —has in the world, one in the Middle East is all too obvious, and as of is in broadcast tonight, the Clinton administration is not making much progress getting Palestinians and Israelis to lay off each other and talk about it. The other is [North Korea]1 . . . We’ll get to [Korea]2 in a minute.

(14) On homecoming night [Postville]1 feels like Hometown, USA, but a look around this town of 2,000 shows [it]2's become a miniature Ellis Island. [This]3 was an all-white, all-christian community that all the sudden was taken over – not taken over, that's a very bad choice of words, but invaded by, perhaps, different groups . . . [Postville]4 now has 22 different nationalities . . . For those who prefer [the old Postville]5, Mayor John Hyman has a simple answer.

(15) A study of nearly 300 former British professional soccer players finds that [nearly half]1 suffered the chronic joint disease “osteoarthritis” often as early as age 40. Most have the disease in two or more joints . . . The Coventry University researchers who report the findings in the British journal of sports medicine say anxiety and depression are common among [those so injured]2.

(16) In many cities, [angry crowds]1 roam the streets, [Jews]2 and Palestinians looking for confrontation. Last night in Tel Aviv, [Jews]3 attacked a restaurant that employs Palestinians “[we]4 want war,” [the crowd]5 chanted.

(17) [The trial thrust chief prosecutor Marcia Clark]1 into the spotlight. [Clark]2 graduated from UCLA in 1974, earning her law degree five years later . . . Clark gained reputation for her expertise in forensic evidence, handling at least 60 jury trials, 20 involving murder . . . the Simpson trial and the jury’s findings marked a turning point in the career of [the twice-divorced mother of two]3.

(18) US Energy Secretary, Bill Richardson, has extended [an emergency order]1 to keep electricity flowing to California. [The measure]2 will require Western suppliers to sell power to the State for at least another week.

(19) The rate of increase of [the December 2000 CPI in entire Spain]1 stayed at the 2.9 per cent . . . Regarding Catalonia, [the CPI]2 stays at the 3.5 per cent.

(20) We begin tonight with [the huge federal surplus]1. Both Al Gore and Bush have different ideas on how to spend [that extra money]2. The last time presidential candidates had [that luxury]3 was in 1960.

B.2 Experiment 2

(21) [Egypt]1 needs more than 250 million dollars to eliminate the mine camps that are found in different areas of [this country]2.

(22) Half of children under the age of 5 get [the flu]1. While unvaccinated kids bring [it]2 home and infect brothers and sisters, a vaccinated child helps reduce the risk by 80%.

(23) That’s according to [a new study from the secret service on school violence]1. [It]2 shows that attackers, like the two who killed 13 peo- ple at Columbine High School last year in Colorado, come from a variety of family and ethnic backgrounds. Academic performance ranged from excellent to failure . . . [It]3’s really a fact-based report. And with [these facts]4, a school can move out and actually do prevention.

(24) Patricia Ferreira makes progress making thriller films with [her second feature film, The Impatient Alchemist, presented yesterday in the competition section of the Spanish Film Festival]1. [The film, based on [the novel of the same title by Lorenzo Silva]2]3, is a thriller . . . [It]4 has different readings, an original plot and the portrait of a society, which is ours.

(25) [An International team]1 is developing a vaccine against Alzheimer's disease and [they]2 are trying it out on a new and improved mouse model of the onus . . . [Scientists working on a vaccine against Alzheimer's]3 give a progress report this week in the journal Nature.

(26) The Barcelona Chamber of Commerce has marked [the Catalan GDP growth]1 during last year in 3.7 per cent . . . Regarding the growth of the

economy during last year’s last three months, [the GDP growth figures]2 reached 3.9 per cent, three tenths over [that obtained in the previous months]3.

(27)( Halle Berry speaking) I am in the supermarket and I was just on the cover of Bazaar magazine. At an early age my daughter would recognize [me]1 in the photo. . . so I’ve got on my sunglasses and I’m in the market, I’m putting on my groceries . . . and she’s over my shoulder and I hear her say [“Mama, mama”]2, and I knew “Oh, she saw that cover, that’s cute.” And this woman behind her was sort of cooing with her, and I heard the woman say “Oh, no, honey, [that]3’s not your mama, that’s Halle Berry.” “Mama, mama.” And the lady sort of got like indignant about it: “No, honey, that’s not your mama, that’s Halle Berry.” And I couldn’t take it any longer: “No, [I]4 am her mother and [I]5 am Halle Berry, and she knows what she’s talking about.”

(28) The Catalan Corporation of Radio and Television joined today [the Year of Dalí 2004]1 as a participating institution. The general director of the Catalan Corporation of Radio and Television stated that talking about Dalí in [2004]2 does not require much effort.

(29) The trade union representing performers and the agents of [Hollywood]1 continue their conversations . . . They ask for a 5% increase and [the studios]2 offer a 3.55%.

(30) We’re joined by NBC news correspondent Campbell Brown Ho who’s traveling with [the Bush effort]1 . . . Bush’s central message on this bus trip across central Florida today was to his diehard supporters telling them go out, tell your friends still on the fence why they need to vote for me. And it’s a message [the campaign]2 hopes [it]3 was able to convey today. Because while Florida is a must win, [they]4 also cannot ignore the other battleground states.

(31) The strategy has been a popular one for [McDonalds]1, as a sample poll of lunchtime customers outside a restaurant in South Delhi shows . . . Here, you know, it’s like it’s American and as well as Indian taste. It’s a very wise move on for them because if they would have [only just original McDonalds]2, I don’t think they would have done so great.

(32) The Prime Minister, José María Aznar, said today that the twenty-five years of reign of Juan Carlos I “have been successful and extraordinarily important for [Spain]1” . . . According to Aznar, Parliamentary Monarchy “is not only the expression of [the modern Spain]2, but it is also a symbol

of stability and permanence.”

(33) [Two populations with different backgrounds]1 work as specialist doctors. One, MIR, which follows a government-regulated education . . . The other one, the turkey oak . . . followed heterogeneous education and training formulas . . . The comparative study of [two cohorts with these characteristics]2 was the object of my PhD thesis.

(34) It’s acquiring [more pressure]1. And eventually [this pressure]2 will be released in the – in the future days.

(35) Juan Antonio Samaranch did not order starting an investigation about [Sydney-2000]1, but asked [the Australian city]2 to provide certain information.

(36) The figure of Dalí was born in [a Catalan cultural context]1. We are simply remembering that Dalí was born from [the Catalan cultural context]2.

(37) Nader condemns corporations, drug companies, pesticide manufacturers, banks, landlords, the media. [His supporters]1 say [they]2 don’t care that he has no chance to become President.

(38) Tony Blair lamented the declarations of [the English manager]1 and he showed his preference for him to abandon [his position]2.

(39) If [the United States]1 has officially restored diplomatic relations with Yugoslavia, [President Clinton]2 announced the move during his visit to Vietnam . . . [The White House]3 said [the United States]4 will provide 45 million dollars in food aid to Yugoslavia.

(40) The ex Real Madrid player is the only change in the list, comprised of [18 soccer players]1. Eto’o said that [the team]2 should not be too confident because of the result of the first leg of Copa del Rey.

B.3 Experiment 3

(41) Five years ago [today]1, the O.J. Simpson trial ended with an acquittal . . . On [this day in 1995]2, O.J. Simpson was acquitted of the 1994 murders of his ex-wife Nicole and her friend Ron Goldman.

(42) [The Denver Broncos]1 assure their Super Bowl title. [Denver]2 was led by a great John Elway.

(43) Meanwhile, at the Sun Ball in El Paso Texas, the University of Wisconsin Badgers held off [the University of California at Los Angeles]1 21-20. The Badger’s coach Barry Averett says that [his seniors]2 showed leadership in making their last game one of their best. [We]3 were soft after that first drop. Sometimes when it comes too easy you can get soft but I liked the way [they]4 responded.

(44)[ When we see each other]1 is the title of the last record of the band Bars. [It]2 contains songs from their six records.

(45) But only two miles away, [Atlantic salmon]1 are thriving, and that’s an understatement. [These experimental salmon]2 are on the cutting edge of the debate over genetically engineered food . . . it’s important people know if creating [supersalmon to feed human appetites]3 could threaten normal salmon. We have shown [they]4 have a tremendous potential to upset the balance of nature.

(46) The strategy has been a popular one for [McDonalds]1, as a sample poll of lunchtime customers outside a restaurant in South Delhi shows . . . Here, you know, it’s like it’s American and as well as Indian taste. It’s a very wise move on for [them]2.

(47) The US government is warning American citizens living and traveling abroad to be on alert as [violence]1 continues in the Mideast. [The confrontations]2 are casting a shadow over Mideast peace talks in Paris . . . He wants the Israelis to end [the fighting]3.

(48) [Britain’s Millennium Dome]1 will close down this coming Monday after a year of mishaps . . . Problems riddled [the Dome]2 even before its grand opening last New Year’s Eve . . . Dome officials had to seek an additional 265 million dollars to complete [the structure]3.

(49) The director of the Catalan Corporation of Radio and Television stated that talking about [Dalí]1 in 2004 does not require much effort . . . it is the moment when we must define [the figure of Dalí]2.

(50) Yugoslav opposition leaders criticized [the United States]1 and Russia today as a strike against President Slobodan Milosevic gained momentum across the country . . . Kostunica accused the Russian government of indecision and said [Washington]2 was indirectly helping Milosevic's cause.

(51) Textbooks provide students with an equilibrated view of [the history of Spain]1. A report from the Real Academy of History released this week accused the autonomous communities of “distorting” the teaching of [this subject]2.

(52) Wednesday, Energy Secretary Bill Richardson suggested [a new price cap]1 for electricity throughout the Western States . . . Energy suppliers oppose [a cap]2, saying instead they need incentives to build more generating stations.

(53) [Credit-card]1 issuers have given consumers plenty of reasons in this eco- nomic crisis not to pay with [plastic]2.

(54) The Theatre of Palamós will stage next Sunday at 7 PM [the concert entitled “The shawm beyond the cobla”]1. [This concert]2 was created in 2000. Since 2000, [this show]3 has visited different places in Catalonia. The price of seats for [the Sunday concert]4 is 3.01 euros.

(55) But the state defends it as a way to get mothers off drugs, reducing the risk of having [unhealthy babies]1. I went over and looked at [these babies]2 when this case started.

(56) If the United States has officially restored diplomatic relations with [Yugoslavia]1, President Clinton announced the move during his visit to Vietnam, calling the changes in [Yugoslavia]2 remarkable, following the democratic election of President Vojislav Kostunica and the ouster of Slobodan Milosevic.

(57) As a comedian, [Rodney Dangerfield]1 often says [he]2 gets no respect.

(58) [The plant]1 colonized the South of France, from where [it]2 entered Catalonia in the 80s, spreading quickly . . . Also, [it]3 presents an important population in the high basin of the Segre River.

(59) For centuries here, [the people]1 have had almost a mystical relationship with Popo, believing the volcano is a god. Tonight, [they]2 fear it will turn vengeful.

(60) The Venezuelan pugilist Antonio Cermeño was stripped of [the super bantamweight interim champion title of the World Boxing Association]1 as he did not meet the requirement of competing for [this crown]2 within the established timeframe . . . another Venezuelan, Yober Ortega, will compete for [the vacant crown]3 against the Japanese Kozo Ishii.

APPENDIX C

Answers to the Near-identity Task

This appendix reports the answers given by the six annotators in the near-identity task of Section 8.5, which was based on the excerpts included in Appendix B. The task required coders to classify the selected pairs of NPs in each excerpt according to the (near-)identity relation(s) holding between them (Section 8.4).

In the following three tables, the first column shows the excerpt number in parentheses followed by the ID numbers of the two NPs whose relation is under analysis. The remaining columns show the number of times each (near-)identity type (see the key at the beginning of each section) was assigned to that pair of NPs; counts are stated explicitly only when they differ from 0. Note that rows whose counts sum to more than six correspond to cases in which one or more coders gave multiple answers.
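
To make the bookkeeping behind these tables concrete, the sketch below shows one way such per-pair counts could be tallied from raw coder answers. It is a minimal Python illustration under assumed data: the pair label, the answer sets and the tally helper are hypothetical and do not reproduce the actual annotations.

from collections import Counter, defaultdict

# Hypothetical illustration of how each table cell is obtained: every coder
# assigns one or more (near-)identity labels to a pair of NPs, and the table
# reports, for each pair, how many coders chose each label. The pair label
# and the answers below are invented; they do not reproduce the real data.
answers = {
    "(3)1-3": [{"2"}, {"2"}, {"2"}, {"3Ia"}, {"3Ia"}, {"3Ia", "3Ib"}],
}

def tally(coder_answers):
    """Count how many times each label was assigned to each NP pair."""
    counts = defaultdict(Counter)
    for pair, per_coder_labels in coder_answers.items():
        for labels in per_coder_labels:
            counts[pair].update(labels)  # a coder may give several labels
    return counts

for pair, label_counts in tally(answers).items():
    # Zero counts are simply absent, mirroring the tables, and a row can sum
    # to more than six when some coder gave multiple answers.
    print(pair, dict(sorted(label_counts.items())))

With this invented input, the row for pair (3)1-3 would show three counts for Identity, three for Class–More specific and one for Class–More general; the row sums to seven because the last coder gave two answers.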

C.1 Experiment 1

KEY 1 Non-identity; 2 Identity; 3 Near-identity; 3A Role; 3B Location·Agency; 3C Product·Producer; 3D Informational realization; 3E Numerical function; 3F Representation; 3Ga Meronymy–Part·Whole; 3Gb Meronymy–Set·Members; 3Gc Meronymy–Portion·Mass; 3Gd Meronymy–Stuff·Object; 3Ha Interpretation–Selection; 3Hb Interpretation–Viewpoint; 3Ia Class–More specific; 3Ib Class–More general; 3Ja Spatio-temporal func.–Place; 3Jb Spatio-temporal func.–Time


Pair 1 2 3A 3B 3C 3D 3E 3F 3Ga 3Gb 3Gc 3Gd 3Ha 3Hb 3Ia 3Ib 3Ja 3Jb

(1)1-2 5 1
(2)1-2 6 2 1
(3)1-2 6 1
(3)1-3 3 3 1
(3)1-4 3 2 1
(3)1-5 1 5 1
(3)2-3 3 3
(3)2-4 2 4
(3)2-5 6
(3)3-4 1 5
(3)3-5 6
(3)4-5 2 5
(4)1-2 5 1 1
(4)1-3 3 1 3
(4)2-3 3 1 1 2
(5)1-2 4 3
(6)1-2 1 1 4
(6)1-3 5 1
(6)2-3 1 5
(7)1-2 4 2
(7)1-3 6
(7)1-4 5 1
(7)2-3 4 2
(7)2-4 4 2
(7)3-4 5 1
(8)1-2 1 1 5
(8)1-3 1 1 4
(8)1-4 1 1 4
(8)1-5 1 1 5
(8)2-3 6 1
(8)2-4 6 1
(8)2-5 6
(8)3-4 6
(8)3-5 6 1
(8)4-5 6 1
(9)1-2 5 3 1
(10)1-2 4 1 1 1
(10)1-3 3 1 1 1 1
(10)1-4 4 1 1 1
(10)1-5 3 1 2
(10)2-3 5 1
(10)2-4 3 1 1 1
(10)2-5 2 2 1 1
(10)3-4 4 1 1
(10)3-5 3 2 1
(10)4-5 2 1 2 1
(11)1-2 2 1 2 4
(12)1-2 6 2
(13)1-2 5 1 2
(14)1-2 4 1 1
(14)1-3 2 1 3
(14)1-4 5 1
(14)1-5 2 1 3
(14)2-3 2 4
(14)2-4 5 1
(14)2-5 1 5
(14)3-4 1 1 4
(14)3-5 4 1 1
(14)4-5 1 5
(15)1-2 2 1 1 1 2
(16)1-2 2 4
(16)1-3 1 1 4
(16)1-4 1 2 3
(16)1-5 1 2 3
(16)2-3 1 2 2 2
(16)2-4 1 1 3 1
(16)2-5 1 1 3 1
(16)3-4 3 1 1 1
(16)3-5 4 1 1
(16)4-5 5 1 1
(17)1-2 4 3
(17)1-3 1 4 2
(17)2-3 3 3 2
(18)1-2 6
(19)1-2 2 3 2 2
(20)1-2 5 1
(20)1-3 2 2 1 2
(20)2-3 2 2 1 2

Table C.1: Coders’ answers to Experiment 1


C.2 Experiment 2

KEY 1 Non-identity; 2 Identity; 3 Near-identity; 3Aa Metonymy–Role; 3Ab Metonymy–Location; 3Ac Metonymy–Organization; 3Ad Metonymy–Informational realization; 3Ae Metonymy–Representation; 3Af Metonymy–Other; 3Ba Meronymy–Part·Whole; 3Bb Meronymy–Set·Members; 3Bc Meronymy–Stuff·Object; 3Bd Meronymy–Overlap; 3Ca Class–More specific; 3Cb Class–More general; 3Da Spatio-temporal func.–Place; 3Db Spatio-temporal func.–Time; 3Dc Spatio-temporal func.–Numerical func.; 3Dd Spatio-temporal func.–Role func.

Pair 1 2 3Aa 3Ab 3Ac 3Ad 3Ae 3Af 3Ba 3Bb 3Bc 3Bd 3Ca 3Cb 3Da 3Db 3Dc 3Dd

(21)1-2 2 5
(22)1-2 2 4
(23)1-2 6
(23)1-3 6
(23)1-4 2 3 2
(23)2-3 6
(23)2-4 2 3 2
(23)3-4 2 3 2
(24)1-2 6
(24)1-3 6
(24)1-4 6
(24)2-3 6
(24)2-4 6
(24)3-4 6
(25)1-2 1 1 4 1
(25)1-3 3 2 2
(25)2-3 5 1
(26)1-2 1 1 2 3
(26)1-3 2 4
(26)2-3 2 4
(27)1-2 1 4 2
(27)1-3 1 5
(27)1-4 5 1
(27)1-5 5 1
(27)2-3 3 3
(27)2-4 4 3
(27)2-5 4 3
(27)3-4 6
(27)3-5 6
(27)4-5 6
(28)1-2 3 2 1
(29)1-2 4 2
(30)1-2 1 5 2
(30)1-3 1 5 2
(30)1-4 1 2 2 2
(30)2-3 6
(30)2-4 2 1 4
(30)3-4 2 1 4
(31)1-2 4 1 1
(32)1-2 1 5
(33)1-2 2 2 3
(34)1-2 6 1 1 1
(35)1-2 6
(36)1-2 2 1 1 2
(37)1-2 2 4
(38)1-2 3 1 2
(39)1-2 1 5
(39)1-3 3 3
(39)1-4 6
(39)2-3 1 1 3 1
(39)2-4 1 5
(39)3-4 3 4
(40)1-2 1 1 5

Table C.2: Coders’ answers to Experiment 2


C.3 Experiment 3

KEY 1 Non-identity; 2 Identity; 3 Near-identity; 3Aa Metonymy–Role; 3Ab Metonymy–Location; 3Ac Metonymy–Organization; 3Ad Metonymy–Informational realization; 3Ae Metonymy–Representation; 3Af Metonymy–Other; 3Ba Meronymy–Part·Whole; 3Bb Meronymy–Stuff·Object; 3Bc Meronymy–Overlap; 3Ca Class–More specific; 3Cb Class–More general; 3Da Spatio-temporal func.–Place; 3Db Spatio-temporal func.–Time; 3Dc Spatio-temporal func.–Numerical func.; 3Dd Spatio-temporal func.–Role func.

Pair 1 2 3Aa 3Ab 3Ac 3Ad 3Ae 3Af 3Ba 3Bb 3Bc 3Ca 3Cb 3Da 3Db 3Dc 3Dd

(41)1-2 1 6
(42)1-2 6
(43)1-2 6
(43)1-3 6
(43)1-4 6
(43)2-3 1 6
(43)2-4 3 3
(43)3-4 1 6
(44)1-2 2 4
(45)1-2 2 3 1 1
(45)1-3 1 5
(45)1-4 1 5
(45)2-3 1 5
(45)2-4 1 5
(45)3-4 2 4
(46)1-2 1 5
(47)1-2 5 1
(47)1-3 6
(47)2-3 5 1
(48)1-2 3 3 1
(48)1-3 3 3 2
(48)2-3 3 3 2
(49)1-2 2 4 1
(50)1-2 5 1
(51)1-2 2 1 3
(52)1-2 3 4
(53)1-2 5 1
(54)1-2 5 1
(54)1-3 4 2
(54)1-4 2 4 2
(54)2-3 2 4
(54)2-4 1 1 1 5
(54)3-4 3 5
(55)1-2 1 1 1 4
(56)1-2 1 5
(57)1-2 6
(58)1-2 4 5
(58)1-3 3 5
(58)2-3 3 5
(59)1-2 1 4 4
(60)1-2 3 3
(60)1-3 1 4 1
(60)2-3 2 4 1

Table C.3: Coders’ answers to Experiment 3
