Building Knowledge Graphs: Processing Infrastructure and Named Entity Linking
Klang, Marcus
Building Knowledge Graphs: Processing Infrastructure and Named Entity Linking
Klang, Marcus

2019

Document Version: Publisher's PDF, also known as Version of record

Citation for published version (APA):
Klang, M. (2019). Building Knowledge Graphs: Processing Infrastructure and Named Entity Linking. Department of Computer Science, Lund University.

Total number of authors: 1

General rights
Unless other specific re-use rights are stated the following general rights apply:
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.
• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.
• You may freely distribute the URL identifying the publication in the public portal.

Read more about Creative commons licenses: https://creativecommons.org/licenses/

Take down policy
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.
LUND UNIVERSITY
PO Box 117
221 00 Lund
+46 46-222 00 00

Building Knowledge Graphs: Processing Infrastructure and Named Entity Linking

Marcus Klang

Doctoral Dissertation, 2019
Department of Computer Science
Lund University

ISBN 978-91-7895-286-1 (printed version)
ISBN 978-91-7895-287-8 (electronic version)
ISSN 1404-1219
LU-CS-DISS: 2019-04
Dissertation 64, 2019

Department of Computer Science
Lund University
Box 118
SE-221 00 Lund
Sweden

Email: [email protected]
WWW: http://cs.lth.se/marcus-klang

Typeset using LaTeX
Printed in Sweden by Tryckeriet i E-huset, Lund, 2019
© 2019 Marcus Klang

Abstract

Things such as organizations, persons, or locations are ubiquitous in all texts circulating on the internet, particularly in the news, forum posts, and social media. Today, there is more written material than any single person can read through during a typical lifespan. Automatic systems can help us amplify our abilities to find relevant information, where, ideally, a system would learn knowledge from our combined written legacy. Ultimately, this would enable us, one day, to build automatic systems that have reasoning capabilities and can answer any question in any human language.

In this work, I explore methods to represent linguistic structures in text, build processing infrastructures, and how they can be combined to process a comprehensive collection of documents. The goal is to extract knowledge from text via things, entities. As text, I focused on encyclopedic resources such as Wikipedia. As knowledge representation, I chose to use graphs, where the entities correspond to graph nodes. To populate such graphs, I created a named entity linker that can find entities in multiple languages such as English, Spanish, and Chinese, and associate them to unique identifiers.
In addition, I describe a published state-of-the-art Swedish named entity recognizer that finds mentions of entities in text that I evaluated on the four majority classes in the Stockholm-Umeå Corpus (SUC) 3.0.

To collect the text resources needed for the implementation of the algorithms and the training of the machine-learning models, I also describe a document representation, Docria, that consists of multiple layers of annotations: A model capable of representing structures found in Wikipedia and beyond. Finally, I describe how to construct processing pipelines for large-scale processing with Wikipedia using Docria.

Contents

Preface
Acknowledgements
Popular Science Summary in Swedish

Introduction
1 Introduction
2 Natural Language Processing (NLP)
3 Corpus
4 Infrastructure
5 Evaluation
6 Machine Learning
7 Data Representation for Machine Learning
8 Models
9 Document Database
10 Named Entity Recognition
11 Named Entity Linking
12 Conclusion
Bibliography
Paper I – Named Entity Disambiguation in a Question Answering System
Paper II – WIKIPARQ: A Tabulated Wikipedia Resource Using the Parquet Format
Paper III – Docforia: A Multilayer Document Model
Paper IV – Multilingual Supervision of Semantic Annotation
Paper V – Langforia: Language Pipelines for Annotating Large Collections of Documents
Paper VI – Overview of the Ugglan Entity Discovery and Linking System
Paper VII – Linking, Searching, and Visualizing Entities in Wikipedia
Paper VIII – Comparing LSTM and FOFE-based Architectures for Named Entity Recognition
Paper IX – Docria: Processing and Storing Linguistic Data with Wikipedia
Paper X – Hedwig: A Named Entity Linker

Preface

List of Included Publications

I Named entity disambiguation in a question answering system.
Marcus Klang and Pierre Nugues.
In Proceedings of the Fifth Swedish Language Technology Conference (SLTC 2014), Uppsala, November 13-14 2014.

II WIKIPARQ: A tabulated Wikipedia resource using the Parquet format.
Marcus Klang and Pierre Nugues.
In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016), pages 4141–4148, Portorož, Slovenia, May 2016.

III Docforia: A multilayer document model.
Marcus Klang and Pierre Nugues.
In Proceedings of The Sixth Swedish Language Technology Conference (SLTC 2016), Umeå, November 2016.

IV Multilingual supervision of semantic annotation.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1007–1017, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.

V Langforia: Language pipelines for annotating large collections of documents.
Marcus Klang and Pierre Nugues.
In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 74–78, Osaka, Japan, December 2016.
The COLING 2016 Organizing Committee.

VI Overview of the Ugglan entity discovery and linking system.
Marcus Klang, Firas Dib, and Pierre Nugues.
In Proceedings of the Tenth Text Analysis Conference (TAC 2017), Gaithersburg, Maryland, November 2017.

VII Linking, searching, and visualizing entities in Wikipedia.
Marcus Klang and Pierre Nugues.
In Nicoletta Calzolari (Conference chair), Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 3426–3432, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).

VIII Comparing LSTM and FOFE-based architectures for named entity recognition.
Marcus Klang and Pierre Nugues.
In Proceedings of the Seventh Swedish Language Technology Conference (SLTC 2018), pages 54–57, Stockholm, October 7-9 2018.

IX Docria: Processing and storing linguistic data with Wikipedia.
Marcus Klang and Pierre Nugues.
In Proceedings of the 22nd Nordic Conference on Computational Linguistics, Turku, October 2019.

X Hedwig: A Named Entity Linker.
Marcus Klang.
To be submitted.

Contribution Statement

Marcus Klang is the main contributor to all the papers included in this doctoral thesis when listed as first author. He was the main designer and implementor of the research experiments and responsible for most of the writing. In the paper Overview of the Ugglan Entity Discovery and Linking System, Firas Dib contributed the FOFE-based named entity recognizer that was used as part of the mention detection system.
Peter Exner was the main contributor in the papers A Distant Supervision Approach to Semantic Role Labeling and Multilingual Supervision of Semantic Annotation, with Marcus Klang contributing the named entity linker used to produce part of the input to the system. In the paper Linking, Searching, and Visualizing Entities for the Swedish Wikipedia, Marcus Klang contributed infrastructure tools and resources.

The supervisor, Prof. Pierre Nugues, contributed to the design of the experiments and the writing of the articles, and reviewed the content of the papers.

List of Additional Publications

The following papers are related to, but not included in, this thesis. Specifically, papers XI, XIII, and XIV were succeeded by paper IV.

XI Using distant supervision to build a proposition bank.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of the Fifth Swedish Language Technology Conference (SLTC 2014), Uppsala, November 13-14 2014.

XII A platform for named entity disambiguation.
Marcus Klang and Pierre Nugues.
In Proceedings of the workshop on semantic technologies for research in the humanities and social sciences (STRiX), Gothenburg, November 24-25 2014.

XIII A distant supervision approach to semantic role labeling.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of the Fourth Joint Conference on Lexical and Computational