Resolving Coreferent Bridging in German Newspaper Text Von
Total Page:16
File Type:pdf, Size:1020Kb
Resolving Coreferent Bridging in German Newspaper Text von Yannick Versley Dissertation angenommen von der Philosophischen Fakultat¨ (alt: Neuphilologischen Fakultat)¨ der Universitat¨ Tubingen¨ am 19. Juli 2010 Tubingen,¨ 2011 2 Gedruckt mit Genehmigung der Philosophischen Fakultat¨ der Universitat¨ Tubingen¨ Hauptberichterstatter: Prof. Dr. Erhard Hinrichs Mitberichterstatterin: Prof. Dr. Sandra Kubler¨ Dekan (Zeitpunkt des Kolloquiums): Prof. Dr. Joachim Knape Dekan (Zeitpunkt der Drucklegung): Prof. Dr. Jurgen¨ Leonhardt Preface and Acknowledgements Writing a dissertation is a kind of Bildungsroman in itself, sketching the develop- ment of a single figure and written from this perspective – including the acceptance of a certain kind of formational ideal, interaction with the environment, moments of anguish and despair, and, finally, the cherished illusion of clarity. While writ- ten from a singular perspective, a dissertation (and the person who wrote it) owes much to, and cannot exist without, mentors, colleagues, and other kind people. First of all, the two principal investigators of the A1 project in the SFB 441 (“Reprasentation¨ und Erschließung linguistischer Daten”), Erhard Hinrichs and Sandra Kubler,¨ deserve thanks for offering me the opportunity to pursue a PhD in Tubingen,¨ and for playing the role of the dissertation committee. In particular, this dissertation would not have seen the light in 2010 without the benevolent and patient supervision of Erhard Hinrichs. It is due to his encouragement that this thesis is not just a cumulation of scientific facts but also offers an inviting narrative for the curious reader. Sandra Kubler,¨ in her time in Tubingen,¨ tirelessly listened to any problems and uncertainties that put themselves in the way, despite having lots of her own work, and always offered sensible and insightful advice. My time in Tubingen¨ would have been rather dull, or worse, without all the col- leagues, co-workers and friends: Starting from the far end of the alphabet, Heike Zinsmeister helped me to discover that, despite all the vagaries in general and aca- demic life, there is a purpose to it all that you should try and discover. Holger Wunsch, first my office mate and then my floor neighbour, helped me feel at home in Tubingen¨ early on with his cheerful demeanor. Piklu Gupta, in his own way contributed not only his perspective on things, but was also very adept in listening to, and understanding other perspectives, helping create a coherent picture. In the larger environment, lots of people have shared their perspectives on life, research, and sometimes mensa food – Kathrin Beck, Stephan Kepser, Monica Lau,˘ Lothar Lemnitzer, Daniela Marzo, Frank Muller-Witte,¨ Verena Rube, Ilona Steiner, Thomas Zastrow, Bettina Zeisler, among others. The members of the new SFB 833 project, namely Stefanie Simon, Sabrina Schulze and Anna Gastel, deserve thanks for enduring the anguish and brooding of a dissertation in its final stadium. With several people outside of Tubingen¨ – to name a few, Sabine Schulte im Walde, Stefan Evert, Marco Baroni, Massimo Poesio, Olga Uryupina, Josef van Genabith, Ines Rehbein – I had stimulating discussions, for which I am particularly grateful. 4 Contents 1 Introduction9 1.1 The Problem................................9 1.2 Scope of the thesis............................ 11 1.3 Coreference and Anaphora........................ 13 1.3.1 Notational Conventions..................... 16 1.4 Anaphora and Bridging......................... 17 1.5 Overview of the Thesis.......................... 21 2 Coreference and Definite Descriptions 23 2.1 A linguistic perspective on coreference................. 23 2.2 Discourse-old and Definite........................ 26 2.2.1 Definiteness as Uniqueness................... 27 2.2.2 Definiteness as Familiarity................... 29 2.2.3 Definites and Functional Concepts............... 32 2.3 Constraints on Antecedence....................... 34 2.3.1 Presupposition as Anaphora.................. 35 2.3.2 Interaction with Discourse Relations............. 37 2.3.3 Overspecific Noun Phrases................... 38 2.4 Cognitive status and the choice of NP form.............. 41 2.5 Instances of Referential Ambiguity................... 44 2.5.1 Co-reference of Vague Entities................. 45 2.5.2 Reference in Blended Spaces.................. 47 2.5.3 Incompatible refinements to a vague description....... 49 2.6 Summary................................. 53 3 Coreference Corpora and Evaluation 55 3.1 Available Corpora and their Annotation................ 56 3.1.1 The MUC scheme........................ 57 3.1.2 The MATE coreference scheme................ 58 3.1.3 The ACE coreference task................... 60 3.1.4 Coreference annotation in the TuBa-D/Z¨ corpus....... 61 3.1.5 Other corpora.......................... 63 3.2 Evaluation................................. 64 6 CONTENTS 3.2.1 Link-based measures...................... 66 3.2.2 Set-based measures....................... 67 3.2.3 Alignment-based measures................... 71 3.2.4 Properties of Evaluation Metrics................ 73 3.2.5 Evaluation setting........................ 74 3.3 Summary................................. 78 4 Approaches to Coreference Resolution 81 4.1 Towards Modern Coreference Resolution............... 84 4.1.1 Rule-based approaches to Coreference Resolution...... 87 4.1.2 Introducing Machine Learning................. 88 4.2 Refining Machine Learning Approaches................ 92 4.2.1 Approaches to Antecedent Selection............. 93 4.2.2 Global Models of Coreference................. 94 4.2.3 Anaphoricity determination.................. 97 4.3 Integration of Semantic Features.................... 100 4.3.1 Using WordNet......................... 102 4.3.2 Acquisition from Text...................... 105 4.4 The State of German Coreference Resolution............. 114 4.4.1 Pronoun Resolution....................... 115 4.4.2 Hartrumpf 2001......................... 116 4.4.3 Strube/Rapp/Muller¨ 2002.................... 116 4.4.4 Klenner and Ailloud 2008................... 117 4.5 Summary: Issues in Coreference Resolution.............. 118 5 Resources for German 121 5.1 Lexical information............................ 122 5.1.1 Lemmatizers and Morphological analyzers.......... 122 5.1.2 Semantic Lexicons....................... 123 5.1.3 Gazetteers............................ 126 5.2 Corpora.................................. 127 5.2.1 Manually annotated Corpora.................. 128 5.2.2 Corpora with automatic Annotation.............. 133 5.3 Processing tools.............................. 134 5.3.1 Part-of-speech tagging..................... 134 5.3.2 Shallow and Incremental Parsing Approaches........ 135 5.3.3 Parsing by Global Optimization................ 137 5.4 Summary................................. 145 6 Semantic Compatibility for Coreference 147 6.1 Corpus-based Similarity and Association Measures.......... 148 6.1.1 Distributional Similarity Measures.............. 149 6.1.2 Pattern-based approaches.................... 163 6.1.3 Cross-sentence associations.................. 167 CONTENTS 7 6.2 Corpus-based Semantic Features for German............. 168 6.2.1 Parsing the taz Corpus..................... 169 6.2.2 Extracting Grammatical Relations............... 181 6.2.3 Compound splitting....................... 187 6.2.4 An Efficient Implementation of Distributional Similarity.. 191 6.2.5 Similarity Measures Induced from the taz Corpus...... 192 6.3 Additional Semantic Features...................... 196 6.3.1 Pattern search for German................... 196 6.3.2 Semantic Classes........................ 198 6.4 Experiments on Antecedent Selection................. 204 6.5 Summary................................. 209 7 Description of the TUECOREF System 211 7.1 Coreference Resolution Framework.................. 211 7.1.1 Mention Identification..................... 212 7.1.2 System Architecture....................... 214 7.1.3 General setting and Same-head resolution.......... 215 7.1.4 Learning a ranking resolver.................. 222 7.2 Resolving coreferent bridging...................... 225 7.2.1 Beyond semantic similarity................... 231 7.2.2 Discussion of the Evaluation Results............. 234 8 Conclusions 237 8 CONTENTS Chapter 1 Introduction 1.1 The Problem Anaphoric descriptions (such as pronouns) and rigid designators (such as names) make it possible to connect the contents of multiple sentences in a discourse and therefore are important means for the structuring of the discourse beyond the level of single sentences. Reconstructing the sets of mentions that relate to the same entity in a discourse model by resolving anaphora and names in a text, so-called coreference resolution, has been proven to be useful for a wide range of higher- level tasks such as question answering (Morton, 2000), summarization (Steinberger et al., 2005) and information extraction (McCarthy and Lehnert, 1995). Consider the following stretch of discourse1: (1.1)[ 1 President Clinton] had planned weeks ago to devote yesterday to build- ing up public interest in next week’s State of the Union address. Instead, [1 he] spent [1 his] afternoon with a revolving door of reporters, in a campaign to keep [1 his] presidency from buckling under the force of allegations about [1 his] relationship with [2 a former White House intern]. In a remarkable series of three interviews in which [1 the president] was questioned bluntly and without apology about adultery and obstruction of justice alike, [1 Clinton] denied