A Machine Learning Approach to Anaphora Resolution Including Named Entity Recognition, PP Attachment Disambiguation, and Animacy Detection
Total Page:16
File Type:pdf, Size:1020Kb
A Machine Learning Approach to Anaphora Resolution Including Named Entity Recognition, PP Attachment Disambiguation, and Animacy Detection Anders Nøklestad May 7, 2009 2 For my parents, Randi and Hans Olaf Contents 1 Introduction 13 1.1 The main aim: Automatic anaphora resolution ............. 13 1.2 Support modules .............................. 14 1.2.1 Animacy detection ......................... 15 1.2.2 Named entity recognition ..................... 15 1.2.3 PP attachment disambiguation . ................. 16 1.3 Research questions ............................. 17 1.4 Main contributions ............................. 18 1.5 Some areas of application ......................... 19 1.5.1 Information retrieval ........................ 20 1.5.2 Information extraction ....................... 20 1.5.3 Question answering . ........................ 21 1.5.4 Machine translation ........................ 21 1.5.5 How the present work can be useful ............... 22 2 Applied Data-driven Methods 27 2.1 Overview .................................. 27 2.2 Memory-based learning .......................... 27 2.2.1 Vector representations ....................... 31 2.2.2 Feature weighting ......................... 33 2.2.3 Modified Value Difference Metric (MVDM) ........... 35 2.2.4 Jeffrey Divergence Metric ..................... 36 2.3 Maximum entropy modelling ....................... 36 2.4 Support vector machines .......................... 39 2.5 Advantages of memory-based learning .................. 42 2.5.1 Exceptions are not forgotten ................... 42 2.5.2 Local mapping functions ...................... 43 2.5.3 Leave-one-out testing . ..................... 43 2.5.4 Fast training (but slow application) ............... 44 3 4 CONTENTS 2.5.5 Implicit smoothing ......................... 44 2.5.6 Easy manual inspection ...................... 45 3 Tools and Resources 47 3.1 Introduction ................................. 47 3.2 The Oslo Corpus of Tagged Norwegian Texts .............. 48 3.3 The Oslo-Bergen tagger .......................... 51 3.4 TiMBL ................................... 53 3.5 Zhang Le’s maximum entropy modelling toolkit ............. 54 3.6 SVMlight .................................. 54 3.7 Paramsearch ................................ 55 3.8 Evaluation . ................................. 56 4 Named Entity Recognition 63 4.1 Introduction ................................. 63 4.2 Relevance for anaphora resolution ..................... 64 4.3 Earlier work ................................. 65 4.3.1 Scandinavian NER: The Nomen Nescio network ......... 67 4.4 Named entity categories .......................... 69 4.5 Information sources for named entity recognition ................................. 69 4.6 Training and test corpora ......................... 70 4.7 Overview of the system .......................... 72 4.7.1 The Oslo-Bergen tagger ...................... 72 4.7.2 Disambiguation using data-driven methods . .......... 73 4.7.3 Document-centred post-processing ................ 73 4.8 Experiments ................................. 75 4.9 Results and evaluation ........................... 78 4.9.1 Default settings ........................... 78 4.9.2 Manual parameter optimization .................. 81 4.9.3 Effects of different k-values .................... 82 4.9.4 Effects of individual features ................... 83 4.9.5 Automatic parameter optimization using Paramsearch ..... 84 4.9.6 Replacing MBL with MaxEnt . ................. 88 4.9.7 Varying the size of the training corpus .............. 90 4.10 Conclusions . .............................. 91 5 PP Attachment Disambiguation 97 5.1 Introduction ................................. 97 5.2 Relevance for anaphora resolution ..................... 99 CONTENTS 5 5.3 Disambiguation approach ......................... 99 5.4 Earlier work ................................. 99 5.5 Relevant constructions ........................... 102 5.5.1 Idiomatic expressions ....................... 103 5.5.2 Light verb constructions . ..................... 103 5.5.3 Simultaneous modification ..................... 105 5.6 Treatment of unclear attachment cases .................. 106 5.7 Training, development, and testing corpora ............... 108 5.8 Training and testing ............................ 110 5.8.1 Machine learning features ..................... 111 5.9 Compound analysis ............................. 112 5.10 Automatic parameter optimization .................... 114 5.11 Results . .................................. 116 5.12 Replacing MBL with MaxEnt and SVM ................. 118 5.12.1 Maximum entropy modelling . ................. 118 5.12.2 The support vector machine .................... 119 5.13 Discussion .................................. 120 5.13.1 The performance of human annotators .............. 122 5.14 Further extensions . ........................... 124 5.15 Conclusions ................................. 126 6 Finding Animate Nouns 129 6.1 Introduction ................................. 129 6.2 Earlier work ................................. 130 6.3 Mining the Web with search patterns . ................. 132 6.3.1 The general idea .......................... 133 6.3.2 Search patterns ........................... 133 6.3.3 Mining procedure .......................... 136 6.4 Offline queries with uninstantiated patterns ............... 137 6.4.1 Snippet analysis and noun extraction ............... 138 6.4.2 Some practical considerations ................... 139 6.4.3 Errors ................................ 141 6.5 (Pseudo-)online queries with instantiated patterns ........... 143 6.6 Evaluation .................................. 145 6.7 Conclusions ................................. 147 7 The Field of Pronominal Anaphora Resolution 149 7.1 Introduction ................................. 149 7.1.1 Anaphora .............................. 150 7.1.2 Antecedent types .......................... 151 6 CONTENTS 7.2 Previous work on anaphora resolution .................. 152 7.2.1 Knowledge-intensive vs. knowledge-poor approaches ...... 152 7.2.2 The major directions in knowledge-poor AR ........... 153 7.2.3 Syntax-based approaches ..................... 154 7.2.4 Centering Theory .......................... 155 7.2.5 Factor/indicator-based approaches ................ 164 7.2.6 Statistical approaches ....................... 167 7.2.7 Related work on Norwegian AR .................. 178 7.2.8 Other Scandinavian systems .................... 180 7.3 Conclusions . .............................. 181 8 Norwegian Anaphora Resolution 183 8.1 Introduction ................................. 183 8.2 Anaphors handled by the system ..................... 184 8.3 Corpus . ................................. 185 8.4 Overall architecture ............................ 186 8.5 The machine learning approach ...................... 187 8.5.1 Filters . .............................. 187 8.5.2 Dealing with reflexive and possessive antecedents . ....... 190 8.5.3 Pronoun-specific classifiers ..................... 190 8.5.4 Features ............................... 191 8.5.5 Feature percolation ........................ 205 8.6 Training and testing procedure . .................... 206 8.7 Results .................................... 207 8.7.1 Testing on the development corpus ................ 207 8.7.2 Cross-validation results . .................... 214 8.7.3 Testing features in isolation . ................. 215 8.7.4 Removing single features ..................... 216 8.7.5 The effect of the support modules ................ 217 8.7.6 Testing on a separate test corpus ................. 218 8.8 Problems with Soon et al. ......................... 227 8.8.1 A suggested alternative: Using Cf ranking to cluster classified instances .............................. 228 8.9 The Centering-based algorithm ...................... 229 8.9.1 Overview of the algorithm ..................... 229 8.9.2 The definition of utterance .................... 232 8.9.3 Results and error analysis ..................... 232 8.10 The factor-based approach . ....................... 234 8.10.1 Factors ............................... 235 8.10.2 Modifications to ARN ....................... 235 CONTENTS 7 8.10.3 Results ............................... 236 8.11 Some genre-specific challenges ....................... 238 8.12 Future work ................................. 242 8.12.1 General coreference resolution . .................. 243 8.12.2 Separating referential and pleonastic det ............. 243 8.12.3 Identifying constituents of subordinate clauses ......... 245 8.13 Conclusions ................................. 245 9 Conclusions 251 A Animate nouns found by mining the web 281 8 CONTENTS Preface My desire to write this thesis emerged from some of the needs that have surfaced in the course of my time as a programmer and systems engineer at the Text Laboratory located at the University of Oslo. At the Text Lab, we work on creating language resources and making them available to the community of linguistic researchers, with a main focus on resources for the Norwegian language. Being a relatively small language, Norwegian has not been blessed with the amount of tools and resources that exist for more widespread languages. Hence, I saw the need for many more Norwegian resources, and I thought that someone should take the time to develop some of those resources—and that person could be me. In doing this work, I have been able to combine the creation of much-needed Norwegian language resources with an exploration into a number of very exciting statistical methods for doing