Entity Linking to Wikipedia: Grounding Entity Mentions in Natural Language Text Using Thematic Context Distance and Collective Search
Dissertation for the award of the doctoral degree (Dr. rer. nat.) of the Mathematisch-Naturwissenschaftliche Fakultät of the Rheinische Friedrich-Wilhelms-Universität Bonn

Submitted by Anja Pilz from Chemnitz
Bonn, 2015

Prepared with the approval of the Mathematisch-Naturwissenschaftliche Fakultät of the Rheinische Friedrich-Wilhelms-Universität Bonn

First reviewer: Prof. Dr. Stefan Wrobel
Second reviewer: Prof. Dr. Kristian Kersting
Date of the doctoral examination: 21.09.2015
Year of publication: 2016

Anja Pilz
Fraunhofer Institut für Intelligente Analyse und Informationssysteme IAIS and Rheinische Friedrich-Wilhelms-Universität Bonn, Institut für Informatik III

Acknowledgements

First and foremost, I would like to thank Prof. Dr. Stefan Wrobel and Prof. Dr. Kristian Kersting for giving me the opportunity to work on my thesis in cooperation with the Computer Science Department at the University of Bonn and the Knowledge Discovery Department at Fraunhofer IAIS.

I would like to express my sincere gratitude to all the people who have helped, inspired and accompanied me during my PhD studies. This thesis would not have been possible without the support of many other people, each of whom contributed directly or indirectly to it.

I am very grateful for the chance to pursue my PhD studies in the Text Mining Group at Fraunhofer IAIS and have benefited greatly from the presence of all the supportive and involved people who made this an excellent working environment. These include all the former and current members of the Text Mining, STREAM and CAML groups, especially Melanie Knapp, Gerhard Paaß, Hannes Korte, Siehyun Strobel, Andre Bergholz, Florian Schulz, Kristian Kersting and Thomas Gärtner, who all offered ideas, advice and support in many different ways.
I am especially grateful to my co-author and mentor Gerhard Paaß, whose support, encouragement and also criticism finally led to this thesis.

I also want to thank all the fellow PhD students at Fraunhofer IAIS for many fruitful discussions, the exchange of ideas and glimpses into other interesting machine learning fields: Fabian Hadji, Babak Ahmadi, Marion Neumann, Mirwaes Wahabzada, Katrin Ullrich, Daniel Paurat, Michael Kamp, Thomas Liebig, Olana Missura, Pascal Welke, Mario Boley, and Ahmed Jawad. Even though most of our group is now scattered across the world, I sincerely hope our paths will cross many times in the near and distant future.

Most importantly, I would like to thank Fabian Hadji, Babak Ahmadi and Marion Neumann, not only for critical and helpful discussions but also for their friendly support, their backing and for being honest friends. Fabian’s invaluable help and encouragement in particular were indispensable for finishing this thesis.

Last but not least, I want to thank my family and friends, probably the hardest critics of all, for their ceaseless support and for asking the seemingly easy questions that turned out to be the most difficult and important to answer.

Abstract

This thesis proposes new methods for entity linking, the task of assigning entity mentions in unstructured natural language text to entries in the semi-structured encyclopedia Wikipedia. In doing so, entity linking grounds a mention to an encyclopedic entry in Wikipedia and embeds it into this Linked-Open-Data hub. This enables a higher-level view on single documents, provides hints for further reading and may be used to add details from other sources. Furthermore, enriching text documents with such links simultaneously resolves the ambiguity of entity names. This ambiguity is an unsolved challenge for many text mining applications: one entity may be designated by a multitude of names and every mention may denote a multitude of entities.
Resolving the ambiguity of entity names is thus a crucial step for entity-based retrieval, an open problem for most information retrieval and extraction tasks. For instance, search engines relying on heuristic string matches often retrieve irrelevant results because they cannot satisfactorily resolve ambiguity. Moreover, a huge number of entity mentions cannot be linked to Wikipedia since, despite its size, Wikipedia has restricted coverage. Earlier and current work has often ignored this and, consequently, all mentions of uncovered entities. Other approaches handle only entity mentions of specific types or focus on English as the target language. Apart from such restrictions, no method achieves perfect linking performance. These are the challenges addressed in this thesis. We introduce new methods for candidate entity retrieval and candidate entity consolidation, the key components for recall and precision, exploiting the vast amount of both structured and unstructured information stored in Wikipedia. First, we propose a new contextual similarity measure based on latent topic distributions inferred from unstructured natural language text. We show that this thematic distance between mention and candidate entity contexts yields a lower linking error rate than purely word-based distances. Being language independent, this method enables high-performance entity linking in previously neglected languages such as German and French. This approach is especially suitable for, albeit not restricted to, linking person names, the class of mentions with the highest ambiguity. We next propose a new candidate retrieval method to enable successful entity linking also for other entities that are not referenced canonically or do not exhibit the thematic coherence of persons. We introduce collective search, which uses the structured information encoded in Wikipedia’s hyperlink graph to arrive at sets of strongly related candidate entities.
This enables us to better handle synonymy, one of the hardest problems in entity linking and one not thoroughly treated in previous work. We emphasize general applicability and evaluate this method on a broad collection of benchmark corpora in both a supervised and an unsupervised setting. We show that candidate enhancement through collective search increases linking performance on nearly all of these corpora and that our method is the most stable compared to other state-of-the-art approaches. Presenting the first unification of diverse performance measures, we also take a step towards the comparability of entity linking methods. In conclusion, we provide state-of-the-art entity linking methods for nearly all of the current use cases. When it comes to fine-tuning, we note that entity linking has subjective aspects and adaptations may be necessary depending on the task at hand.

Contents

1 Introduction
1.1 Overview
1.2 Outline and Contributions
2 Entity Linking: Preliminaries
2.1 Notions from Natural Language Processing
2.2 Entity Linking to Wikipedia
2.2.1 Candidate Consolidation and Candidate Retrieval
2.3 Entities in Wikipedia
2.3.1 Textual Descriptions
2.3.2 Entity Categorization
2.3.3 Alias Names for Entities
2.4 Interlinkage of Entities in Wikipedia
2.4.1 Link-Based Relatedness of Entities
2.4.2 Priors derived from Links
2.5 An Overview of Related Work
2.5.1 Focus on Mention and Entity Types
2.5.2 Local and Global Methods
2.5.3 Types of Linking Models
2.5.4 Alternative Resources for Entity Linking
3 Topic Models for Person Linking
3.1 Person Name Discrimination
3.2 Contextual Similarity
3.3 Semantic Similarity
3.4 Latent Dirichlet Allocation
3.4.1 Topic Distributions as Context Descriptions
3.5 Semantic Labelling of Entities
3.5.1 Semantic Labels from Categories
3.5.2 Semantic Labels from Topics
3.5.3 Linking as a Ranking Problem
3.5.4 Evaluation
3.6 Thematic Context Distance
3.6.1 Measures for Thematic Context Distance
3.6.2 Linking as a Classification Problem
3.6.3 Wikipedia Reference Datasets
3.6.4 Evaluation
3.7 An Excursion into Named Entity Linking
3.8 Summary
4 Local and Global Search for Entity Linking
4.1 General Entity Linking
4.2 Related Work: General Entity Linking
4.3 Wikipedia in an Inverted Index
4.3.1 Lucene as Indexing Framework
4.3.2 Link Index
4.3.3 Entity Index
4.4 Overview: Entity Linking via Search and Ranking
4.5 Mention Enrichment
4.5.1 Name Expansion
4.5.2 Context Representation
4.6 Candidate Retrieval
4.6.1 Collective Search
4.6.2 Prioritized Candidate Retrieval
4.7 Candidate Consolidation
4.8 Evaluation
4.8.1 Benchmark Corpora
4.8.2 Evaluation of Candidate Retrieval
4.8.3 Training the Candidate Consolidation Model
4.8.4 Evaluation of Candidate Consolidation
4.9 Summary
5 Conclusion
5.1 Lessons Learned
5.2 Outlook
5.3 Applications
A Algorithm: Pseudo Code for Candidate Retrieval (Stage 1)
B Supplementary tables from experimental evaluation

List of Tables

1.1 Ambiguity of person names
2.1 Wikipedia categorization
2.2 Wikipedia redirects
3.1 Wikipedia evaluation datasets for English, German and French
3.2 Evaluation: WikiPersonsE
3.3 Evaluation: WikiMiscE
3.4 Evaluation: WTC on WikiPersonsE and WikiMiscE
3.5 Evaluation: WikiPersonsG and WikiPersonsF (CVI)
3.6 Evaluation: WikiPersonsG and WikiPersonsF (CVE)
4.1 Fields in the entity index IW
4.2 Features for supervised candidate consolidation
4.3 Benchmark corpora by ground truth annotations
4.4 Benchmark corpora by mention type
4.5 Evaluation: search coverage for unsupervised entity linking
4.6 Evaluation: cross coherence weights for unsupervised entity linking
4.7 Average cross coherence of ground truth entities
4.8 Evaluation: performance by search coverage with supervised entity linking