Knowledge Acquisition in the Information Age
The Interplay between Lexicography and Natural Language Processing

Luis Espinosa-Anke

TESI DOCTORAL UPF / ANY 2017

DIRECTOR DE LA TESI
Horacio Saggion
Departament DTIC


To Carla,
La estrella que iluminó mi camino cuando más ciego estuve.


Acknowledgements

When I wrote the Acknowledgements section of my two-year Master's thesis, I had to cut it down considerably after I had finished: there were so many people I wanted to thank explicitly for helping me on that journey that I went well beyond two pages. Imagine, then, how hard it is to put these acknowledgements together for a four-year PhD program. Let's start from the beginning.

First and foremost, I would like to express my sincerest gratitude to Professor Horacio Saggion. I have always considered myself a lucky linguist who got into an NLP program thanks to Horacio being adventurous enough to agree to supervise me, and he has provided all the support I needed from the very beginning. I am now a better researcher and programmer, I write better, and I have stronger methodological skills than I would ever have imagined, and the seed for all of this was planted by Prof. Saggion. It has been an honour to be his PhD student, and I hope this is the beginning of many future collaborations.

I would also like to thank all those who have co-authored a paper with me over the last few years: starting with Horacio, but also Francesco Ronzano, Claudio Delli Bovi, Mohamed Sordo, Xavier Serra, Sara Rodríguez Fernández, Aonghus Lawlor, Ahmed AbuRa'ed, José Camacho-Collados, Francesco Barbieri, the guys from Savana (Jorge, Alberto, Nacho Medrano, Nacho Salcedo) and Alberto Pardo.

I would also like to explicitly thank Professor Roberto Navigli for hosting me in 2015 at the Linguistic Computing Laboratory of the Sapienza University of Rome. My research career and publication record would not be what they are today without this opportunity. I consider the BabelNet project a seminal work in knowledge representation, and I plan to continue contributing to it. By extension, I would also like to thank the European Network of e-Lexicography for funding my stay, specifically Professors Simon Krek, Carole Tiberius, Tanneke Schooenheim and Ilan Kernerman, with whom I have had very invigorating conversations.

Very special thanks also to Professor Leo Wanner, with whom I have had many discussions on language, lexicography and compositional meaning. I have always felt warmly welcomed in the TALN group, and Leo has made this possible with his expert feedback and by helping me with everything I needed.

Of course, a very special mention goes to Francesco Barbieri, Joan Soler, Sergio Oramas, Miguel Ballesteros, Alp Öktem and Roberto Carlini. These last years have been a lot of fun; without you guys it would probably not have been the same.

And finally, I wish to thank my wife Carla and my parents Luis and Vibeke for putting up with me during all these years, for being supportive, and for trying their best to motivate me when the light at the end of the tunnel could barely be seen. All the good things I have done in life are, for the most part, because you have given me love and taught me what really matters. I love you.
Abstract

Natural Language Processing (NLP) is the branch of Artificial Intelligence aimed at understanding and generating language as closely as possible to the way humans do. Today, NLP benefits substantially from large amounts of unannotated corpora, from which it derives state-of-the-art resources for text understanding such as vectorial representations or knowledge graphs. In addition, NLP also leverages structured and semi-structured information in the form of ontologies, knowledge bases (KBs), encyclopedias or dictionaries. In this dissertation, we present several improvements in NLP tasks such as Definition and Hypernym Extraction, Hypernym Discovery, Taxonomy Learning, and KB construction and completion, and in all of them we take advantage of knowledge repositories of various kinds, showing that these are essential enablers of text understanding. Conversely, we use NLP techniques to create, improve or extend existing repositories, and we release them, along with the associated code, for the use of the community.

Resumen

El Procesamiento del Lenguaje Natural (PLN) es la rama de la Inteligencia Artificial que se ocupa de la comprensión y la generación de lenguaje, tomando como referencia el lenguaje humano. Hoy, el PLN se basa en gran medida en la explotación de grandes cantidades de corpus sin anotar, a partir de los cuales se derivan representaciones de gran calidad para la comprensión automática de texto, tales como representaciones vectoriales o grafos de conocimiento. Además, el PLN también explota información estructurada y parcialmente estructurada como ontologías, bases de conocimiento (BCs), enciclopedias o diccionarios. En esta tesis presentamos varias mejoras del estado del arte en tareas de PLN tales como la extracción de definiciones e hiperónimos, descubrimiento de hiperónimos, inducción de taxonomías o construcción y enriquecimiento de BCs, y en todas ellas incorporamos repositorios de varios tipos, evaluando su contribución en diferentes áreas del PLN. Por otra parte, también usamos técnicas de PLN para crear, mejorar o extender repositorios ya existentes, y los publicamos junto con su código asociado con el fin de que sean de utilidad para la comunidad.

Contents

Figure Index . . . xvi
Table Index . . . xviii

1 PREAMBLE . . . 1
  1.1 Text is the Biggest of Data . . . 1
  1.2 Defining and Representing Knowledge . . . 4
    1.2.1 Structured Knowledge Resources . . . 4
      1.2.1.1 Lexicographic Resources . . . 5
      1.2.1.2 Lexical Databases and Thesauri . . . 7
      1.2.1.3 Knowledge Bases . . . 8
    1.2.2 Unstructured Resources . . . 13
      1.2.2.1 Distributed Semantic Models . . . 13
    1.2.3 Semi-Structured Resources . . . 14
      1.2.3.1 Wikipedia . . . 14
  1.3 Conclusion and a critical view . . . 14
    1.3.1 Current limitations . . . 14
    1.3.2 Research goals . . . 16
    1.3.3 Organization of the thesis . . . 17
  1.4 Our contribution . . . 18
    1.4.1 Publication Record . . . 19

2 RELATED WORK . . . 23
  2.1 Definition Extraction . . . 23
    2.1.1 Rule-Based approaches . . . 24
    2.1.2 Machine Learning Approaches . . . 26
      2.1.2.1 Supervised . . . 26
      2.1.2.2 Unsupervised . . . 29
    2.1.3 Conclusion . . . 30
  2.2 Hypernym Discovery . . . 30
    2.2.1 Pattern-based approaches . . . 31
    2.2.2 Distributional approaches . . . 33
    2.2.3 Combined approaches . . . 34
    2.2.4 Conclusion . . . 35
  2.3 Taxonomy Learning . . . 35
    2.3.1 Corpus-based Taxonomy Learning . . . 35
      2.3.1.1 SemEval Taxonomy Learning Tasks: 2015-16 . . . 38
    2.3.2 Knowledge-based Taxonomy Learning . . . 39
    2.3.3 Conclusion . . . 40

3 DEFINITION EXTRACTION . . . 43
  3.1 DependencyDE: Applying Dependency Relations to Definition Extraction . . . 44
    3.1.1 Data Modeling . . . 44
    3.1.2 Features . . . 46
    3.1.3 Evaluation . . . 49
    3.1.4 Conclusion . . . 50
  3.2 SemanticDE: Definition Extraction Using Sense-based Embeddings . . . 51
    3.2.1 Entity Linking . . . 51
    3.2.2 Sense-Based Distributed Representations of Definitions . . . 52
    3.2.3 Sense-based Features . . . 53
    3.2.4 Evaluation . . . 55
    3.2.5 Conclusion . . . 56
  3.3 SequentialDE: Description and Evaluation of a DE system for the Catalan language . . . 57
    3.3.1 Creating a Catalan corpus for DE . . . 57
    3.3.2 Data Modeling . . . 58
    3.3.3 Evaluation . . . 61
    3.3.4 Conclusion . . . 64
  3.4 WeakDE: Weakly Supervised Definition Extraction . . . 65
    3.4.1 Corpus compilation . . . 65
    3.4.2 Data modeling . . . 67
    3.4.3 Bootstrapping . . . 68
    3.4.4 Evaluation . . . 70
    3.4.5 Feature analysis . . . 72
    3.4.6 Conclusion . . . 73

4 HYPERNYM DISCOVERY . . . 75
  4.1 DefinitionHypernyms: Combining CRF and Dependency Grammar . . . 75
    4.1.1 Features . . . 76
    4.1.2 Recall-Boosting heuristics . . . 77
    4.1.3 Evaluation . . . 79
      4.1.3.1 Information Gain . . . 82
    4.1.4 Conclusions . . . 82
  4.2 TaxoEmbed: Supervised Distributional Hypernym Discovery via Domain Adaptation . . . 84
    4.2.1 Preliminaries . . . 84
    4.2.2 Training Data . . . 85
    4.2.3 TaxoEmbed Algorithm . . . 85
      4.2.3.1 Domain Clustering . . . 86
      4.2.3.2 Training Data Expansion . . . 87
      4.2.3.3 Learning a Hypernym Discovery Matrix . . . 87
    4.2.4 Evaluation . . . 88
      4.2.4.1 Experiment 1: Automatic Evaluation . . . 89
      4.2.4.2 Experiment 2: Extra-Coverage . . . 92
    4.2.5 Conclusion . . . 94

5 TAXONOMY LEARNING . . . 95
  5.1 ExTaSem! Extending, Taxonomizing and Semantifying Domain Terminologies . . . 95
    5.1.1 Domain Definition Harvesting . . . 96
    5.1.2 Hypernym Extraction . . . 99
    5.1.3 Fine-Graining Hyponym-Hypernym Pairs . . . 99
    5.1.4 Path Weighting and Taxonomy Induction . . . 100
    5.1.5 Evaluation . . . 101
      5.1.5.1 Reconstructing a Gold Standard . . . 103
      5.1.5.2 Taxonomy Quality . . . 105
    5.1.6 Conclusion . . . 107

6 CREATION, ENRICHMENT AND UNIFICATION OF KNOWLEDGE RESOURCES . . . 109
  6.1 MKB: Creating a Music Knowledge Base from Scratch . . . 109
    6.1.1 Background . . . 110
    6.1.2 Methodology