Separating the Signal from the Noise: Predicting the Correct Entities in Named-Entity Linking

Drew Perkins

Uppsala University
Department of Linguistics and Philology
Master Programme in Language Technology
Master’s Thesis in Language Technology, 30 ECTS credits
June 9, 2020

Supervisors:
Gongbo Tang, Uppsala University
Thorsten Jacobs, Seavus

Abstract

In this study, I constructed a named-entity linking system that maps between contextual word embeddings and knowledge graph embeddings to predict correct entities. To establish a named-entity linking system, I first applied named-entity recognition to identify the entities of interest. I then performed candidate generation via locality sensitivity hashing (LSH), where a candidate group of potential entities was created for each identified entity. Afterwards, my named-entity disambiguation component selected the most probable candidate. By concatenating contextual word embeddings and knowledge graph embeddings in my disambiguation component, I present a novel approach to named-entity linking. I conducted the experiments with the Kensho-Derived Wikimedia Dataset and the AIDA CoNLL-YAGO Dataset; the former dataset was used for deployment and the latter is a benchmark dataset for entity linking tasks. Three deep learning models were evaluated on the named-entity disambiguation component with different context embeddings. The evaluation was treated as a classification task, where I trained my models to select the correct entity from a list of candidates. By optimizing named-entity linking through this methodology, the entire system can be used in recommendation engines, reaching a high F1 of 86% on the former dataset. With the benchmark dataset, the proposed method achieves an F1 of 79%.

Contents

Acknowledgments
1. Introduction
  1.1. Purpose and Motivation
  1.2. Outline
2. Background
  2.1. Graph Theory and Concepts
    2.1.1. Algorithms
  2.2. Knowledge Graphs
    2.2.1. Knowledge Representation
    2.2.2. Knowledge Bases
  2.3. Named-Entity Linking Components
    2.3.1. Named-Entity Recognition
    2.3.2. Candidate Generation via Locality Sensitivity Hashing
    2.3.3. Named-Entity Disambiguation
  2.4. Feature Embeddings
    2.4.1. Word Embeddings
    2.4.2. Graph Embeddings
  2.5. Neural Networks and Deep Learning
    2.5.1. Long Short Term Memory (LSTM)
    2.5.2. Convolutional Neural Networks (CNN)
    2.5.3. Contextual Embeddings from Language Models (ELMo)
3. Methodology
  3.1. Named-Entity Linking
  3.2. Disambiguation Models
    3.2.1. Embeddings
    3.2.2. BiLSTM Model
    3.2.3. CNN-BiLSTM Model
    3.2.4. ELMo Model
    3.2.5. Feed-Forward Neural Network (FFNN)
4. Experiments
  4.1. Data
    4.1.1. Kensho-Derived Wikimedia Dataset (KDWD)
    4.1.2. AIDA CoNLL-YAGO Dataset (CoNLL03/AIDA)
  4.2. Settings
  4.3. Evaluation Metrics
    4.3.1. Precision and Recall
    4.3.2. Classification Metrics
  4.4. Results
    4.4.1. The Effect of Training Data Size
    4.4.2. Disambiguation Models
    4.4.3. Candidate List Accuracy
    4.4.4. Analysis and Discussion
5. Conclusion and Future Work
A. Named-Entity Linking Examples

Acknowledgments

I would like to thank my university supervisor Gongbo Tang for his help and guidance in the structure and pragmatics of my thesis. I am deeply indebted to the Seavus AB team, particularly my company supervisor Thorsten Jacobs, for their support, feedback, resources, ideas, and deadlines for this thesis. I would like to thank COVID-19 for destroying my social life in the months leading up to completing my thesis. Finally, I would like to thank my family, friends, and girlfriend for their ongoing support during my Master’s studies.

Seavus AB is an IT consulting firm that provides enterprise-wide business solutions across the world, mainly covering the US and European markets. I conducted this work in their artificial intelligence and machine learning division, located in Stockholm. Their current work includes but is not limited to chatbots, QA systems, and business intelligence.

1. Introduction

"I am convinced that the crux of the problem of learning is recognizing relationships and being able to use them"
Christopher Strachey in a letter to Alan Turing, 1954

Knowledge graphs have rapidly gained ground in recent years within natural language processing. Whether the task is natural language generation, question answering, or named-entity recognition and relation linking, combining common natural language tasks with knowledge graphs can yield improvements across tasks and domains. That is why I sought to construct a named-entity linking system in which ambiguities in the named entities can be detected and resolved with assistance from knowledge graphs. During some initial work on named-entity recognition over a corpus, I noticed that "Bush" came up several times with no indication of whether the mention referred to "George H.W. Bush" or "George W. Bush". For our purposes, this seemed like a glaring oversight, and one that we chose to investigate further to find a proper solution.
One clear method to solve this problem is named-entity linking. A named-entity linking system consists of three primary components: named-entity recognition, candidate generation, and named-entity disambiguation. With all three of these components, we identify entities, construct a list of possible candidates for each identified entity, and finally disambiguate each entity against its candidate list and link it to a distinct identifier within a knowledge graph. It is this final component, named-entity disambiguation, that I focused on for my research and evaluation. The data I originally trained the disambiguation component on was the Kensho-Derived Wikimedia Dataset, which includes Wikipedia text, links, and the Wikidata knowledge base. I then conducted further studies with the AIDA CoNLL-YAGO Dataset, a benchmark in named-entity linking. Furthermore, I performed named-entity linking with three disambiguation models that map between contextual word embeddings and knowledge graph embeddings. To optimize named-entity linking with deep learning, I treated the problem as a classification task; the models predict the correct entity among a series of candidates. With my best-performing model on the benchmark dataset, I achieved 87% recall, 73% precision, and an ROC-AUC of 83%.

The thesis project was conducted at Seavus AB, an IT consulting firm in Stockholm, Sweden. It was founded in Malmö, Sweden, and has offices throughout Northern and Southeastern Europe. Seavus AB offers state-of-the-art machine learning and artificial intelligence services to companies from around the world.

1.1. Purpose and Motivation

The purpose of this thesis is to examine and evaluate different ways to improve named-entity linking systems by mapping between context embeddings and knowledge graph embeddings for correct entities; there appears to be little research into the use of both of these embeddings together.
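As a rough illustration of combining the two kinds of embeddings, the sketch below concatenates one context embedding with each candidate's knowledge graph embedding and scores every pair with a single logistic unit. It is not the model used in this thesis: the vector sizes, the random vectors, the untrained weights, and the helper name score_candidate are all assumptions made for illustration. The disambiguation models actually evaluated are described in Chapter 3.

```python
import math
import random


def score_candidate(context_vec, graph_vec, weights, bias):
    """Probability that one candidate is the correct entity (hypothetical helper)."""
    # Concatenate the mention's context embedding with the candidate's
    # knowledge graph embedding into a single feature vector.
    features = context_vec + graph_vec
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


random.seed(0)
CONTEXT_DIM, GRAPH_DIM = 8, 4  # assumed sizes, not taken from the thesis

# The two possible referents of the mention "Bush" from the introduction.
candidates = ["George H.W. Bush", "George W. Bush"]

# Random stand-ins for real embeddings and for trained parameters.
context_vec = [random.gauss(0, 1) for _ in range(CONTEXT_DIM)]
graph_vecs = [[random.gauss(0, 1) for _ in range(GRAPH_DIM)] for _ in candidates]
weights = [random.gauss(0, 1) for _ in range(CONTEXT_DIM + GRAPH_DIM)]
bias = 0.0

# Score every (context, candidate) pair and keep the most probable candidate.
probs = [score_candidate(context_vec, g, weights, bias) for g in graph_vecs]
predicted = candidates[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Because the weights here are random, the predicted entity carries no meaning; the point is only the shape of the computation, in which each candidate is judged from the same context vector paired with its own graph vector.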
The motivation came from my initial findings of ambiguity with certain persons and places, such as in the "Bush" example. Particularly when we are working with news corpora, a last name can often be found alone, making this more than an isolated incident. Named-entity linking is a step toward more accurate semantic representation and synonym extraction, which remain a challenge in NLP research despite the robust ontology networks and lexicons widely available. Beyond that, the relations we make in how we think, what we watch, and who we connect with are all inextricably linked to how we have come to understand the world, and are fundamental to language technology.

There are several research questions I considered to help me move forward with my work. Can entities be adequately predicted through the concatenation of knowledge graph embeddings with contextual word embeddings? Most current disambiguation methods rely purely on the contextual information of documents. Can my system manage name variations, where the same entity appears under various naming conventions? This may be caused by aliases, spelling errors, or abbreviations. Can my system manage ambiguity, where the same mention is polysemous (i.e., has multiple meanings) depending on the specific context? Can my system manage incomplete information when there is a limited amount of knowledge? Will my system be able to fill the contextual gaps?

1.2. Outline

The thesis is organized as follows. In Chapter 2, I will introduce essential graph theory concepts and terminology that will be necessary to understand the rest of the thesis. This will be followed by knowledge graphs, the components used for the named-entity linking system,