Building Knowledge Graphs: Processing Infrastructure and Named Entity Linking
Marcus Klang
Doctoral Dissertation, 2019 Department of Computer Science Lund University ISBN 978-91-7895-286-1 (printed version) ISBN 978-91-7895-287-8 (electronic version) ISSN 1404-1219 LU-CS-DISS: 2019-04 Dissertation 64, 2019
Department of Computer Science Lund University Box 118 SE-221 00 Lund Sweden
Email: [email protected] WWW: http://cs.lth.se/marcus-klang
Typeset using LATEX Printed in Sweden by Tryckeriet i E-huset, Lund, 2019
© 2019 Marcus Klang

Abstract
Things such as organizations, persons, or locations are ubiquitous in all texts circulating on the internet, particularly in the news, forum posts, and social media. Today, there is more written material than any single person can read through during a typical lifespan. Automatic systems can help us amplify our abilities to find relevant information, where, ideally, a system would learn knowledge from our combined written legacy. Ultimately, this would enable us, one day, to build automatic systems that have reasoning capabilities and can answer any question in any human language.

In this work, I explore methods to represent linguistic structures in text, build processing infrastructures, and how they can be combined to process a comprehensive collection of documents. The goal is to extract knowledge from text via things, entities. As text, I focused on encyclopedic resources such as Wikipedia. As knowledge representation, I chose to use graphs, where the entities correspond to graph nodes. To populate such graphs, I created a named entity linker that can find entities in multiple languages such as English, Spanish, and Chinese, and associate them to unique identifiers. In addition, I describe a published state-of-the-art Swedish named entity recognizer that finds mentions of entities in text that I evaluated on the four majority classes in the Stockholm-Umeå Corpus (SUC) 3.0.

To collect the text resources needed for the implementation of the algorithms and the training of the machine-learning models, I also describe a document representation, Docria, that consists of multiple layers of annotations: a model capable of representing structures found in Wikipedia and beyond. Finally, I describe how to construct processing pipelines for large-scale processing with Wikipedia using Docria.
Contents
Preface
Acknowledgements
Popular Science Summary in Swedish
Introduction
    1 Introduction
    2 Natural Language Processing (NLP)
    3 Corpus
    4 Infrastructure
    5 Evaluation
    6 Machine Learning
    7 Data Representation for Machine Learning
    8 Models
    9 Document Database
    10 Named Entity Recognition
    11 Named Entity Linking
    12 Conclusion
    Bibliography
Paper I – Named Entity Disambiguation in a Question Answering System
Paper II – WIKIPARQ: A Tabulated Wikipedia Resource Using the Parquet Format
Paper III – Docforia: A Multilayer Document Model
Paper IV – Multilingual Supervision of Semantic Annotation
Paper V – Langforia: Language Pipelines for Annotating Large Collections of Documents
Paper VI – Overview of the Ugglan Entity Discovery and Linking System
Paper VII – Linking, Searching, and Visualizing Entities in Wikipedia
Paper VIII – Comparing LSTM and FOFE-based Architectures for Named Entity Recognition
Paper IX – Docria: Processing and Storing Linguistic Data with Wikipedia
Paper X – Hedwig: A Named Entity Linker

Preface
List of Included Publications
I Named entity disambiguation in a question answering system.
Marcus Klang and Pierre Nugues.
In Proceedings of the Fifth Swedish Language Technology Conference (SLTC 2014), Uppsala, November 13-14 2014.

II WIKIPARQ: A tabulated Wikipedia resource using the Parquet format.
Marcus Klang and Pierre Nugues.
In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016), pages 4141–4148, Portorož, Slovenia, May 2016.

III Docforia: A multilayer document model.
Marcus Klang and Pierre Nugues.
In Proceedings of The Sixth Swedish Language Technology Conference (SLTC 2016), Umeå, November 2016.

IV Multilingual supervision of semantic annotation.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1007–1017, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.

V Langforia: Language pipelines for annotating large collections of documents.
Marcus Klang and Pierre Nugues.
In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 74–78, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.

VI Overview of the Ugglan entity discovery and linking system.
Marcus Klang, Firas Dib, and Pierre Nugues.
In Proceedings of the Tenth Text Analysis Conference (TAC 2017), Gaithers- burg, Maryland, November 2017.
VII Linking, searching, and visualizing entities in Wikipedia.
Marcus Klang and Pierre Nugues.
In Nicoletta Calzolari (Conference chair), Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 3426–3432, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).
VIII Comparing LSTM and FOFE-based architectures for named entity recognition.
Marcus Klang and Pierre Nugues.
In Proceedings of the Seventh Swedish Language Technology Conference (SLTC 2018), pages 54–57, Stockholm, October 7-9 2018.
IX Docria: Processing and storing linguistic data with Wikipedia. Marcus Klang and Pierre Nugues. In Proceedings of the 22nd Nordic Conference on Computational Linguis- tics, Turku, October 2019.
X Hedwig: A named entity linker.
Marcus Klang.
To be submitted.
Contribution Statement
Marcus Klang is the main contributor to all the papers included in this doctoral thesis when listed as first author. He was the main designer and implementor of the research experiments and responsible for most of the writing.

In the paper Overview of the Ugglan Entity Discovery and Linking System, Firas Dib contributed the FOFE-based named entity recognizer that was used as part of the mention detection system. Peter Exner was the main contributor in the papers A Distant Supervision Approach to Semantic Role Labeling and Multilingual Supervision of Semantic Annotation, with Marcus Klang contributing the named entity linker used to produce part of the input to the system. In the paper Linking, Searching, and Visualizing Entities for the Swedish Wikipedia, Marcus Klang contributed infrastructure tools and resources.

The supervisor Prof. Pierre Nugues contributed to the design of the experiments, writing of articles, and reviewed the content of the papers.
List of Additional Publications
The following papers were related, but not included in this thesis. Specifically, papers XI, XIII, and XIV were succeeded by paper IV.
XI Using distant supervision to build a proposition bank.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of the Fifth Swedish Language Technology Conference (SLTC 2014), Uppsala, November 13-14 2014.

XII A platform for named entity disambiguation.
Marcus Klang and Pierre Nugues.
In Proceedings of the workshop on semantic technologies for research in the humanities and social sciences (STRiX), Gothenburg, November 24-25 2014.

XIII A distant supervision approach to semantic role labeling.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 239–248, Denver, Colorado, June 2015. Association for Computational Linguistics.

XIV Generating a Swedish semantic role labeler.
Peter Exner, Marcus Klang, and Pierre Nugues.
In Proceedings of The Sixth Swedish Language Technology Conference (SLTC 2016), Umeå, November 2016.
XV Linking, searching, and visualizing entities for the Swedish Wikipedia.
Anton Södergren, Marcus Klang, and Pierre Nugues.
In Proceedings of The Sixth Swedish Language Technology Conference (SLTC 2016), Umeå, November 2016.

XVI Pairing Wikipedia articles across languages.
Marcus Klang and Pierre Nugues.
In Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016), pages 72–76, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.
Acknowledgements
First, I would like to express my deepest gratitude to my supervisor, Professor Pierre Nugues, for his steadfast support, inspiration, patience, and outstanding ability to bring clarity when you need it the most. I am most grateful that he introduced me to the field of natural language processing.

I also would like to thank my colleagues in the Department of Computer Science for their inspiration, many interesting discussions, and their ability to sustain a wonderful, curious atmosphere that makes it an excellent workplace. Specifically, I want to thank Peter Möller, Lars Nilsson, Anders Bruce, and Tomas Richter for all their support, invaluable technical advice, and rewarding discussions.

Vetenskapsrådet, the Swedish Research Council, funded my research through the Det digitaliserade samhället program, grant number 340-2012-5738, and made this work materially possible. I also benefited from a gift from Nvidia in the form of a graphical processing unit.

Special thanks to Dr. Peter Exner and my roommate Dr. Dennis Medved for their invaluable advice, inspiration, and stimulating discussions.

Finally, I would like to profoundly thank my parents Carina and Tommy Klang for their unwavering support, optimism, warmth, and encouragement.
Marcus Klang Lund, September 2019
Popular Science Summary in Swedish

Att lära datorn hitta kunskap i text genom namngivna saker
(Teaching the computer to find knowledge in text through named things)
Marcus Klang, Institutionen för datavetenskap, Lunds universitet, Lund, Sverige, [email protected]
There is a massive amount of knowledge on the Internet today, and it grows constantly, far more than a single person can absorb during a lifetime. How do we teach the computer to find knowledge and, in this way, amplify human abilities?

1 Named things

In the Swedish language, we use names to refer to things. Take the sentence:

I bought coffee at LED-café in E-huset.

Here there are two named things: LED-café and E-huset. From the sentence, we can also infer that LED-café is a place where one can buy coffee. There is an indication of where this café is: in E-huset. But this description is not complete; the reader is expected to know where E-huset is. Suppose that we knew all places in the world and how they relate to each other, a network of things. Then we could look up E-huset and read off that it is located in Lund. Through this network, there is also information such as that the city of Lund is in Skåne, which belongs to the country Sweden on the planet Earth, and so on.

Knowledge about named things can thus be used to extend and complete information that arrives in pieces and, in this way, add more information than was there from the beginning. This can be used to help computers understand the content of and the connections in a text. People can also be helped when the computer can point out connections and add background to a post or news article that a reader possibly did not know about. Other applications include summarizing what a news article is about based on which named things are mentioned, such as persons, organizations, and places. A network of things can also be used to answer questions by connecting facts to things:

How many people live in Lund?

Named things act as entry points into a network that can be used to add facts and connections between different things.

1.1 An encyclopedia

Manually creating an encyclopedia and a network of named things takes a lot of time. My work is about doing this automatically from information that already exists, and it starts with teaching a computer to recognize named things in a text.

What is then needed is a reference work of named things. Wikipedia¹ is a free online encyclopedia with articles written by users around the world; it covers a large range of topics today and exists in 294 editions, mostly corresponding to real languages. Many articles on Wikipedia concern named things such as persons, places, organizations, created works, and much more. Through Wikipedia, we can create a basic catalog of named things. Articles on Wikipedia also contain links to other articles, which makes it possible for a computer to automatically find connections between things simply because they are often mentioned together.

Figure 1: Going from articles on Wikipedia to a network of things.

Together with Wikipedia, there is a sister project called Wikidata² whose goal is to create a network of things through manual work, a so-called knowledge graph. Wikidata gives unique names to articles on Wikipedia with something resembling a personal identity number, a so-called Q-number. In Wikidata, Sweden is named Q34 and Lund Q2167. Wikidata contains facts such as the number of people living in Lund.

Wikidata also connects articles written in different languages, which makes it possible to draw on knowledge from many different languages. Wikidata is an ongoing project that is not complete, and some information only exists as running text inside an article.

We use Wikidata as an encyclopedia of which things can be linked, and we use it to merge information from articles written in 9 languages of Wikipedia³. This merging made it possible for us to create a system that can find and link named things in languages such as English, Spanish, and Chinese.

A property that Wikipedia, Wikidata, and the world at large share is that there are names that can mean several things. For example, the city name "Lund" is a name that has been used for cities in Norway, England, Canada, and the USA. Finding which named thing fits a name is the core of my work and depends on the context in which the name is mentioned. To simplify the problem somewhat, the focus has been on linking names of selected categories: persons, organizations, places, buildings, airports, etc.

2 Finding names

Teaching a computer to find names can initially be done by writing a number of simple rules, at least for languages similar to Swedish. But sooner or later, the number of rules becomes so large that designing them takes a lot of time. As an alternative, this process can be automated by letting a computer find patterns and in this way come up with the rules itself. What is needed is examples of names and where they occur in a sentence. This is called machine learning. The method we applied is so-called artificial neural networks made for sequences⁴, suited to learning from words in order. With these neural networks, we can teach a computer to find the names that occur in ordinary sentences. In addition, the networks can tell whether the name reflects a person, place, organization, etc.

2.1 Linking named things

Once names have been identified, they are to be linked to our encyclopedia of Q-numbers in Wikidata. We do this by analyzing which things often occur together or how things are specifically named. In this way, we choose the Q-number that is most relevant among the available alternatives.

3 Infrastructure

By combining all 9 languages that we used, we obtain a large collection of written text. The English Wikipedia alone contains, at the time of writing, 5.9 million articles and over 3 billion words. To be able to collect relevant information from a text source like this, an infrastructure is required. An important aspect is how to represent these articles and how to add information such as names and links to things. Another important aspect when a lot of text is to be processed is that it requires much computing power, which means that several computers are needed to solve the task within a reasonable time.

3.1 Structure

Our language consists of words in sentences, which can be gathered into paragraphs and further levels until they, for example, make up a book. The words we use, and the order they come in, are not random; there is a structure. One can, for instance, add the part of speech, that is whether the word is a noun such as "Lund", a verb such as "hitta" (find), a preposition such as "under", a pronoun such as "hen", and so on. Beyond this, we can, for example, connect names such as "Anna" and "hon" (she), or "Björn" and "han" (he), that are mentioned repeatedly in a text.

These structures found in our language are something we can teach a computer to find. Unfortunately, it is a task that often requires several people and researchers to work together. Concretely, this means that we share programs and solutions with each other that are made to add different structures. The catch is that all these programs use different languages to say what a structure should look like. We have developed a method that can store these structures and, so to speak, unify the different languages, while handling the gradual addition of new structures, with a focus on large amounts of text. The idea is to convert the structure into layers. Think of sheets of tracing paper lying on top of each other: one sheet where you have underlined where the words are, with the parts of speech (noun, verb, etc.) underneath; another sheet contains where the names are, a dependency between words, and so on. This way of storing structure can be used to add important knowledge to text that can then be shared, changed, and extended with new information produced by other methods.

4 Knowledge from networks

We can finally use named things and identify that, for example, "Måns Zelmerlöw" is the son of "Birgitta Sahlén", that "Lund" had a population of 91,940 (2018), and that Sweden belongs to the EU. We have taught a computer to recognize this kind of information, which can in the future be used to answer questions.

5 Summary

By using named things in ordinary text, we can add knowledge and background through a network of things, a network that can help computers and people understand the content. With an infrastructure that can process the billions of words in Wikipedia, we can find names and link things to a network of things such as Wikidata.

¹ https://www.wikipedia.org/
² https://www.wikidata.org
³ Swedish, English, German, Spanish, Chinese, Russian, French, Danish, and Norwegian
⁴ Recurrent neural networks, specifically LSTMs

Introduction
1 Introduction
This thesis deals with the construction of repositories containing facts or things derived from collections of texts. In the past 30 years, the online availability of such collections, also called corpora (singular corpus, literally "body" in Latin), has increased considerably. Their style, genre, and quality vary from the free-form nature of tweets, comments, and forum posts to the stricter style of books, encyclopedias, and news articles. By reading and processing texts, we can gather a wealth of knowledge. In addition, using prior structured knowledge, in the form of graphs, allows us to reference known things and add new connections between them.

The focus of this thesis is to identify things in text, such as persons, organizations, locations, works of art, etc., and disambiguate them into unique references. We can then use these things, also called entities, as nodes in a graph to create a web of knowledge. Building automatic systems to carry this out requires first some prior structured facts on the entities and how they relate. Then, mapping entities in text to nodes in graphs requires knowledge from corpora. For open-domain identification, this entails that corpora must cover a large range of topics. This problem is easier to solve, in practice, when the topics or entities are well delimited in the text, i.e. when the texts are divided into sections which loosely correspond to an entity.

Automatic systems use machine learning: techniques allowing a machine, for instance, to create semantic models of entities and then identify them in a text. The machine learning models are trained to find patterns in natural language; they often rely on the manual annotation of structures in texts or on statistical patterns derived from large corpora. The size of the collected corpora is important as natural language evolves and changes. This thesis deals with contemporary language, and this enabled us to use pretrained language models and other existing resources.

Wikipedia is a free online encyclopedia with articles on a large range of topics, hosted by the Wikimedia Foundation. Wikipedia's content is written by users from all around the world and is divided into 294 active editions¹, where most of them correspond to distinct languages. Wikipedia is permissively licensed and can be downloaded in bulk, making it one of the largest available textual resources with few restrictions online. This kind of corpus matches the openness, contemporary language, size, and variation needed to serve as the primary knowledge source for this thesis. We used Wikipedia to bootstrap the identification of what an entity is and how entities relate. However, Wikipedia is not the gold standard of truth, and one must always consider the risk of untruth as it is user-driven with few restrictions on what can be added. We assumed, nonetheless, that Wikipedia has enough content, users, and proper moderation by the community itself, so that it is as close as we can get to the truth today in a freely accessible resource. Most importantly, it is a resource accessible and modifiable by anyone with open internet access.

Processing and transforming large corpora with millions of documents such as Wikipedia requires a practical methodology, a processing infrastructure. Although some frameworks exist, such as UIMA (Ferrucci & Lally, 2004), these frameworks introduce undesired complexity when used as an evolutionary and prototyping tool in research.
The same applies to the construction of processing pipelines, which introduces technical challenges in executing and combining the output from different tools within a pipeline. Therefore, to process the encyclopedic corpora and identify entities, I designed and built new infrastructure tools. In this thesis, I focused on three essential parts of such an infrastructure: document representations, efficient storage techniques, and methods to create processing pipelines operating in parallel, distributed across multiple computers.
1.1 Outline

In this part of my dissertation, I explain and connect the elements I designed to create graphs of entities from text. They consist of a machine learning pipeline for named entity recognition and entity linking. In addition, as the volume of data requires a nontrivial architecture, I describe how I built the infrastructure and document data representations. The outline of this introductory chapter is as follows:
• Section 2, Natural language processing, introduces the concept of a knowl- edge graph and the spectrum of structure complexity;
• Section 3, Corpus, describes the main text collections I used in this thesis;
• Section 4, Infrastructure, introduces frameworks I designed for large-scale computation and considerations when adding structure and processing text;
¹ https://en.wikipedia.org/wiki/List_of_Wikipedias
• Section 5, Evaluation, describes how the systems built in this work are evaluated;

• Section 6, Machine learning (ML), describes ML concepts in relation to this work;

• Section 7, Data representation, describes how to structure text for machine learning;

• Section 8, Models, introduces machine learning models I used in this work;

• Section 9, Document database, introduces a document indexing tool and an online frontend to annotate documents;

• Section 10, Named entity recognition, introduces methods to find names in arbitrary text;

• Section 11, Named entity linking, takes the recognized names and links them to an entity repository.
2 Natural Language Processing (NLP)
The Internet has profoundly changed the way ideas and information reach us from distant places. The steady increase of daily text production throughout the 21st century has made it more difficult to overview and navigate the available information. Its amount today is more extensive than any single human being can read in a typical lifespan. Machines can hopefully provide a way to amplify our intellectual abilities through the development of systems that understand natural language. It is already the case that these machines can process and transform text into structured resources that can serve as the basis for many applications.

Typical NLP applications include spell-checkers, word predictors, multilingual translation systems, grammar checkers, synonym generators, the automatic construction of knowledge bases, entity search, interactive assistants or agents such as Apple Siri, Amazon Alexa, Microsoft Cortana, or Google Assistant, and much more. Systems built using NLP often start from small to large collections of text from various sources such as news articles, the web, or books. In this thesis, I mostly used multilingual versions of Wikipedia (Klang, Dib, & Nugues, 2017; Klang & Nugues, 2016a, 2016b, 2016c, 2017, 2018b, 2019b).

In the next sections, I outline the building blocks of machine reading from the lowest level of textual representation in machines, to syntax, semantics, and finally discourse or top-level structures.
2.1 Knowledge Graph

In this section, I discuss how knowledge graphs can represent entities and knowledge about them.
Definition. There are multiple definitions of what a knowledge graph is in the literature. Ehrlinger and Wöß (2016) provide a selection of plausible definitions and an attempt at a single definition. In my thesis, I used Wikidata, a Wikimedia Foundation project, and I followed their definition of a knowledge graph.

Wikidata is a database describing things found in Wikipedia. These things frequently correspond to a single Wikipedia page as, for instance, the city of Lund. Each thing is called an item in the vocabulary of Wikidata. Figure 1 shows the various kinds of information that can be attached to these items: factual statements that describe a property such as where a person has been educated, a short description of the item, possible aliases, references to these property-value claims, etc.

In Wikidata, each thing is uniquely identified using a naming system made up of a single letter and a number, e.g. Earth is identified as Q2 and Sweden as Q34. Properties use a different letter: P, e.g. the mass property for Earth is P2067; the official name of Sweden is described by a statement having the type of P1448.
Figure 1: Wikidata data model, image source: https://commons.wikimedia.org/wiki/File:Datamodel_in_Wikidata.svg
Example. As a complete example of how a text is transcribed into a graph structure in Wikidata, the sentence:
Douglas Adams was educated at St. John’s College between 1971 and 1974.

is represented as a statement, see Figure 1. The statement consists of a property with a value and a list of references backing it up. In this example, the property educated at (P69) contains a value with a reference to another entity: St John’s College (Q691283). In addition, metadata such as the academic degree of Bachelor of Arts (Q1765120) is also attached. One of the references backing up this claim refers to the entity Encyclopædia Britannica Online (Q5375741).
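To make the shape of such a statement concrete, the sketch below writes the claim above as a plain Python dictionary. This is an illustrative simplification, not the actual JSON layout of Wikidata dumps (which uses snaks, ranks, and datavalue records); the qualifier keys are spelled out as English labels rather than P-identifiers.

```python
# Illustrative sketch of the statement above; the real Wikidata data model is
# more verbose (mainsnak, qualifiers, references, and ranks).
educated_at_statement = {
    "item": "Q42",            # Douglas Adams
    "property": "P69",        # educated at
    "value": "Q691283",       # St John's College
    "qualifiers": {
        "start time": 1971,
        "end time": 1974,
        "academic major": "English literature",
        "academic degree": "Q1765120",   # Bachelor of Arts
    },
    "references": [
        {"stated in": "Q5375741"},       # Encyclopædia Britannica Online
    ],
}
```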
Properties and entities. The properties themselves, such as educated at (P69), are part of the graph. This property has attached examples, references to other relevant properties, a short description of its meaning, allowed values, and more. This higher-level self-description of the graph structure is the ontology of Wikidata.

The key elements of a knowledge graph consist of uniquely identified entities, relationships between them, connected information such as facts, and a description of its structure: an ontology. In addition, a possible reasoning component may contain rules which accelerate queries and enable inference in the graph to extract usable information. All this is precisely the definition of Wikidata. I used it as a basis for the identification of entities over many languages and the classification of entity types (Exner, Klang, & Nugues, 2015, 2016; Klang et al., 2017; Klang & Nugues, 2019b).
2.2 Structure

Raw information can be categorized along a spectrum of structure levels. This section gives a background of the concepts used when discussing the structure of data sources such as the Wikipedia corpus and knowledge graphs such as Wikidata. I divide information into three distinct categories of structure:
Unstructured data lacks structure for an effective analysis and requires packaging and processing beforehand. This includes plain text, raw audio samples, etc.
Semi-structured data is unambiguously packaged, but may be ambiguously structured, requiring further processing for an analysis: e.g. XML, JSON.
Fully structured data is unambiguously structured information suitable for analysis, e.g. relational tables, typed messages, or column-based data.
Wikipedia articles are written as plain text using natural language with few restrictions. The lack of structure puts Wikipedia articles into the unstructured category. Nonetheless, Wikipedia articles may contain some structure defined by the user, such as sections, paragraphs, lists, and tables.

An example of a semi-structured resource is Wikidata, which is parsable in a language-independent way. However, its structure can vary, requiring further processing for a practical analysis; it is therefore not a fully structured resource. Knowledge in Wikidata is incomplete, and some information only exists as plain text in natural language in Wikipedia articles.

Adding structure to Wikipedia represents a large part of the corpus processing step, where the extraction of information requires a language-dependent analysis.
Adding structure to natural language. In the papers we published, we utilized three layers of existing structure (Exner et al., 2015, 2016; Klang et al., 2017; Klang & Nugues, 2019b):
• The logical structure of text;
• Entities in the text;
• Co-occurring entities per paragraph.
The logical structure of text. The text contains subdivisions such as sections and paragraphs in Wikipedia articles. These subdivisions are typically not independent, and information flows from one section to another with bidirectional references.
Entities in text. Wikipedia articles contain links referencing other pages within Wikipedia and often refer to entities. In this context, an entity in the text consists of two parts: the word sequence within a sentence, the mention of a thing, frequently consisting of a name, a noun, or a pronoun, and the actual entity, called the target. This has the form [[target|mention]] in the wiki syntax, or simply [[mention]], if the target and mention strings are equal. In a long article, some of these entities may be mentioned multiple times without being linked. Automatically finding chains of mentions of the same entity is the task of co-reference resolution. Expanding co-reference resolution into linkage over multiple documents with name references is the task of entity linking.
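As an illustration of how these links can be harvested, the sketch below pulls (target, mention) pairs out of raw wiki syntax with a regular expression. It is a deliberate simplification: real wikitext also contains nested templates, media links, and link trails that a production parser must handle.

```python
import re

# Matches [[target|mention]] and [[mention]]; a simplification that ignores
# nested templates, media links, and link trails found in real wikitext.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def extract_links(wikitext: str):
    """Return (target, mention) pairs from a wikitext fragment."""
    links = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        mention = (match.group(2) or target).strip()
        links.append((target, mention))
    return links

print(extract_links("[[Lund]] is a city in [[Scania|Skåne]], [[Sweden]]."))
# [('Lund', 'Lund'), ('Scania', 'Skåne'), ('Sweden', 'Sweden')]
```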
Finally, co-occurring entities. In paragraphs, sentences often connect entities, e.g.:
The Swedish Government sold the state-driven train company SJ.
Using this sentence as an example, something is sold by the entity The Swedish Government, and the thing sold is the train company SJ. This sentence fits the semantic frame of the verb sell, with three independent pieces of information:

• a predicate, sold,
• a seller, The Swedish Government, and
• a thing sold, the train company SJ.

The task of identifying and classifying these pieces into roles is the task of semantic role labeling (SRL), in which words are given a semantic role. In SRL, a frame consists of a predicate, usually a verb, and arguments referencing mentions. Co-occurrences of entities within a paragraph can be used to find relations between things that are often mentioned together. If links are resolved to Wikidata Q-numbers, this information can be combined from multiple Wikipedia editions representing different natural languages.
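One minimal way to picture the output of such an analysis is a frame object with a predicate and labeled arguments. The sketch below uses PropBank-style role labels (A0 for the seller, A1 for the thing sold), which is an assumption about the label inventory rather than something stated above, and the Q-numbers are left as placeholders.

```python
# Hypothetical SRL frame for the example sentence. The role labels follow the
# PropBank convention (A0 = agent/seller, A1 = thing sold); the entity ids are
# placeholders for the Wikidata Q-numbers a linker would fill in.
frame = {
    "predicate": {"lemma": "sell", "form": "sold"},
    "arguments": [
        {"role": "A0", "mention": "The Swedish Government", "entity": "Q…"},
        {"role": "A1", "mention": "the state-driven train company SJ", "entity": "Q…"},
    ],
}
```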
3 Corpus
This section briefly describes the corpora I used throughout my thesis and their particulars.
3.1 CoNLL03: Language Independent Named Entity Recognition

The CoNLL 2003 shared task on Language-Independent Named Entity Recognition (Tjong Kim Sang & De Meulder, 2003) was a competition designed to evaluate the performance of named entity recognition (NER) in English and German. As with the other CoNLL tasks, the competition was open to any participant and the organizers published the data.

The CoNLL 2003 dataset consists of two corpora annotated with named entities, divided into training, development, and test sets, and an evaluation script. The CoNLL 2003 corpora used news articles from the Reuters Corpus Volume I (RCV1) in English and the Frankfurter Rundschau in German. One year earlier, CoNLL02 (Tjong Kim Sang, 2002) ran a similar task, this time with Spanish and Dutch corpora.

Both the CoNLL 2003 and 2002 datasets are frequently cited resources for training and evaluating NER systems. These tasks define four classes of named entities²:
Person (PER): Name of real and fictional individuals;
Location (LOC): Regions (cities, countries, continents, etc.), public places (air- ports, markets, hospitals, etc.), natural locations (mountains, rivers, beaches, etc.);
Organizations (ORG): Institutions, organizations, companies, etc.
Miscellaneous (MISC): Events, languages, titles of works, etc.
The CoNLL02/03 corpora consist of sequences of tokenized sentences annotated with their named entities. Both corpora use a tabular plain text format, and the phrases mentioning the named entities, or chunks, are sequentially annotated with the IOBv1 or IOBv2 tagsets. These sets consist of three tags:
• The IOBv1 tags are inside (I), outside (O), and begin (B), where B is only used for a chunk that immediately follows another chunk of the same type, while

• IOBv2 redefines B as the start of every chunk.
² For in-depth information, refer to the official guidelines: https://www.clips.uantwerpen.be/conll2003/ner/
The CoNLL evaluation software is a Perl script that computes F1 scores with precision and recall with regard to the phrases.

The IOB tagset was further enhanced with the addition of two tags: E, meaning the end of a chunk, and S, for singleton chunks. The IOBES tagset has been shown to perform slightly better than the IOB tagset (Ratinov & Roth, 2009). A lossless transform can be applied to convert between IOBv1, IOBv2, and IOBES³.

A notable property of the annotation is that the outside O tag is assigned to the majority of tokens, reflecting the fact that most words are not part of a named entity. This corpus was used in two papers: Klang et al. (2017) and Klang and Nugues (2018a).
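The lossless nature of this transform is easy to see in code. The sketch below converts an IOBv2 tag sequence into IOBES; it assumes the input is already valid IOBv2, i.e. every chunk starts with a B tag.

```python
def iob2_to_iobes(tags):
    """Convert a valid IOBv2 tag sequence (e.g. B-PER, I-PER, O) to IOBES."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # A chunk ends here if the next tag does not continue it with I-label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        ends_here = next_tag != f"I-{label}"
        if prefix == "B":
            iobes.append(("S-" if ends_here else "B-") + label)
        else:  # prefix == "I"
            iobes.append(("E-" if ends_here else "I-") + label)
    return iobes

print(iob2_to_iobes(["B-PER", "I-PER", "O", "B-LOC"]))
# ['B-PER', 'E-PER', 'O', 'S-LOC']
```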
3.2 Text Analysis Conference EDL Data

The Text Analysis Conference (TAC), arranged by the National Institute of Standards and Technology (NIST), consists of evaluation tasks similar in their organization to CoNLL. TAC includes an entity discovery and linking (EDL) track. This track provides annotated data of newswires and discussion forum texts in three languages: English, Spanish, and Chinese. The TAC EDL goal is to first recognize named entities and then link them to unique identifiers.

The task has been recurring on an annual basis for several years, and the 2014-2017 editions provided usable training data. Training and evaluation data is provided as XML files combined with a standalone gold standard. While older versions provided tokenized data, newer versions are entirely based on referencing spans consisting of Unicode codepoint offsets⁴.
Named entities. For the recognition step, the EDL track defines five categories of named entities:
Person (PER) is identical to the equivalent category in CoNLL02/03 with the exclusion of fictional characters;
Organizations (ORG) matches the CoNLL02/03 category;
Location (LOC) contains only natural locations and non-administrative regions
Geo-political entities (GPE) corresponds mostly to named locations that are gov- erned by a political entity: cities, villages, states, countries, administrative regions, etc.
Facilities (FAC) corresponds to transportation infrastructure, man-made build- ings, hospitals, airports, etc.
³ Also called BILOU and BIOES in the literature.
⁴ Described in Sect. 4.3.
A notable difference with CoNLL02/03 is the absence of a MISC class, meaning that TAC2017 excludes named entities corresponding to events, languages, works of art, etc.

In addition to named categories, TAC EDL includes a nominal category of entities. This category was introduced in recent years and can be described as hypernyms or phrases consisting of common nouns that reference named mentions (Klang et al., 2017; Klang & Nugues, 2019b).
Entity linking. The TAC EDL mentions are linked in two ways: to a knowledge base or, if the entity is not in the base, with a system-dependent unique number prefixed with NIL, for instance NIL23 or NIL768.

TAC EDL was annotated and linked to the Freebase knowledge base. Freebase was Google's entity repository, which is similar to Wikidata. Google discontinued it and merged it with Wikidata. In the works I published, I used Wikidata as the underlying knowledge base and converted the identifiers from and to Freebase using a Google-provided conversion dataset (Klang et al., 2017; Klang & Nugues, 2019b).

The entities in the TAC EDL corpus not found in Freebase are encoded with a unique sequential number for each distinct annotated entity, for instance NIL768. In the linking step, different systems may produce different numbers, for instance NIL23, as long as the same distinct entity corresponds to the same unique number.
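As a toy illustration of this NIL convention, the sketch below links a stream of mentions against a miniature, made-up knowledge base: known mentions get their knowledge-base identifier, while unknown entities receive a sequential NIL number that is reused whenever the same unknown entity reappears. Real systems cluster mentions before this step; here, equal strings are simply assumed to denote the same entity.

```python
from itertools import count

def link_mentions(mentions, knowledge_base):
    """Map each mention to a KB identifier or a stable NIL id.

    `mentions` is a list of surface strings (already clustered so that equal
    strings denote the same entity); `knowledge_base` maps strings to ids.
    """
    nil_ids = {}
    nil_counter = count(1)
    links = []
    for mention in mentions:
        if mention in knowledge_base:
            links.append(knowledge_base[mention])
        else:
            if mention not in nil_ids:
                nil_ids[mention] = f"NIL{next(nil_counter)}"
            links.append(nil_ids[mention])
    return links

kb = {"Lund": "Q2167", "Sweden": "Q34"}   # made-up miniature knowledge base
print(link_mentions(["Lund", "LED-café", "Sweden", "LED-café"], kb))
# ['Q2167', 'NIL1', 'Q34', 'NIL1']
```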
Notable properties. A noteworthy difference, compared to CoNLL03, is the practice of multilingual end-to-end evaluation, i.e. the system is given the raw XML text without any sentence and token segmentation.

In the dataset of forum discussions, this raw XML contains the text of the discussion as well as metadata such as the author and timestamp; quotes from earlier posts and links are structurally marked in the discussion forum set. The newswire dataset similarly contains metadata, with the addition of paragraphs being structurally annotated in the newswire text (Klang et al., 2017; Klang & Nugues, 2019b).

The TAC EDL corpus is also noisy: The discussions contain incorrect spelling variations and unconventional abbreviations. This makes entity linking more challenging as the mentions do not align with those found in Wikipedia. Arguably, the TAC EDL corpus combines the best of both worlds: clean news articles with noisy posts in discussion forums. This is what entity linking systems will ultimately face when used on arbitrary web content in practice.
3.3 The Stockholm-Umeå Corpus (SUC)

The Stockholm-Umeå corpus (SUC) is a joint effort by the Department of Linguistics at Stockholm University and the Department of Linguistics at Umeå University to annotate approximately one million words in Swedish.
SUC is a balanced corpus, i.e. text of varying styles collected from various sources. The corpus contains part-of-speech annotation, extended word features, and name annotations. To the best of my knowledge, SUC is the largest balanced, gold-annotated Swedish corpus freely accessible for research applications.

SUC is available in multiple formats: an original SGML variant, XML, and tab-separated values (TSV) similar to the basic format used by CoNLL with different fields. It has sentence separators and is fully tokenized. I used this corpus to build a named entity recognizer for Swedish (Klang & Nugues, 2018a).
3.4 Wikidata

Wikidata is a knowledge graph project initiated by Wikimedia, the organization behind Wikipedia. The goals of Wikidata for its first phase were:
to centralize interlanguage links across Wikimedia projects and to serve as a general knowledge base for the world at large5.
Practically, this is translated into items bound to a unique identifier: the Q-number. Items can be linked to zero or more Wikipedia language editions and contain a collection of statements, which consist of properties and values. For a person, common statements include instance-of, date of birth, father, mother, etc.

The statement values can be a date value, a URL, or, commonly, another item reference using its Q-number identifier. The statement values can also have properties such as start and end times, e.g. for political election terms. This results in a graph of connected knowledge. The set of available properties and their allowed values are determined by a continually evolving ontology. The common elements of the Wikidata ontology are the instance-of and subclass-of relations.
3.5 Wikipedia

Wikipedia is a large, freely accessible online encyclopedia; it is a project within the nonprofit Wikimedia Foundation. Wikipedia is written and moderated by users around the world. As of July 2019, there are 294 active editions⁶; most correspond to a natural language. Exceptions exist, such as simplified English and constructed languages: Esperanto and Volapük. Wikipedia received its first edit on 15 January 2001⁷.

Wikipedia consists of pages, each with a unique name called a label in the Wikipedia vocabulary. In addition, each page is associated with a namespace
which serves different purposes. The default namespace contains articles, redirection pages, disambiguation pages, etc. The namespace association is reflected in the URL of the page, which uses prefixes to identify the namespace: for instance, category pages have the prefix Category: in the English edition.

The article pages are written using the Wikitext markup, which is a plain text format that is automatically transformed into HTML. Wikitext has support for page inclusion, style formatting, common HTML structures such as paragraphs, tables, lists, and more. Wikitext can include source code written in a programming language such as Lua, commonly used to build templates. A template is a directive included in a Wikipedia page that is replaced by specific content when viewing the page. Examples of templates include infoboxes containing factual information and unit conversions. All the Wikipedia editions run on the same underlying software: MediaWiki.

Parsing Wikitext is challenging as it is a diverse markup language. Speculatively, the original authors wanted to minimize the technical barrier of entry. Wikimarkup is therefore an error-tolerant markup language with regard to parsing failures. This shifts the burden from the user to the implementer of a parser, resulting in complex software with many rules.

MediaWiki is open source and Wikimedia provides configuration settings to set up a mirror. However, for research projects, this is often a huge endeavor in terms of environment complexity and computation resources, both in time and space. Most research projects using Wikipedia are more interested in a processed, clean plain text version. As a result, an approximate conversion is often acceptable and many parsers exist to carry this out. The typical approach for such parsers is to apply heuristics which mimic the original MediaWiki parser implementation. However, corner cases frequently result in incorrect parsing.

The success of the Wikitext parsing approach varies with languages as each language has localized templates. For instance, English works relatively well with this approach. However, French uses templates more extensively, requiring more support from the parser to avoid losing too much content. To further complicate the implementation of approximate parsers, plug-ins are supported using custom tags.

As reproducing the Wikitext-to-HTML transform is difficult, MediaWiki provides a REST API⁸, which converts Wikitext into HTML. This API is suitable for volume access; downloading the English Wikipedia using it takes about three days. Transforming HTML to plain text can be done using commonly available DOM parsers combined with rules. As my approach, I used the JSOUP parser, which tries to mimic web browser behavior when the source document is malformed.

⁵ https://www.wikidata.org/wiki/Wikidata:Notability retrieved 2019-04-08
⁶ https://en.wikipedia.org/wiki/List_of_Wikipedias
⁷ https://en.wikipedia.org/wiki/History_of_Wikipedia retrieved 2019-08-05
⁸ https://en.wikipedia.org/api/rest_v1/
I then applied rules to flatten and filter the hierarchy. These rules produce a raw, clean text. However, information provided in hierarchical form, such as paragraphs, sections, and anchors, is important to entity linking. I reconstructed it as layers on top of the clean text (Klang & Nugues, 2016a, 2016b, 2016d, 2017, 2018b, 2019a).
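A minimal sketch of this conversion step is shown below. It fetches the rendered HTML of one page through the MediaWiki REST API and flattens it to plain text with Python's standard HTML parser instead of JSOUP; the endpoint path follows the public REST API documentation, and rate limiting, error handling, and the layer reconstruction described above are all omitted.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

class TextExtractor(HTMLParser):
    """Collect text content, skipping script and style elements."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

def fetch_plain_text(title: str, edition: str = "en") -> str:
    # Rendered HTML for one page via the MediaWiki REST API.
    url = f"https://{edition}.wikipedia.org/api/rest_v1/page/html/{title}"
    request = Request(url, headers={"User-Agent": "plain-text-example/0.1"})
    with urlopen(request) as response:
        html = response.read().decode("utf-8")
    extractor = TextExtractor()
    extractor.feed(html)
    # Collapse whitespace into a single flat string of clean text.
    return " ".join(" ".join(extractor.chunks).split())

print(fetch_plain_text("Lund")[:200])
```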
4 Infrastructure
A natural language processing infrastructure is, at its core, a set of tools and libraries used by applications to carry out large-scale processing of text. In this thesis, the infrastructure focuses on storage, document representation, and the construction of processing pipelines that can be applied to large corpora. It is a problem with many aspects that I broke down into the following subproblems:
• Understand the corpus and the data representations required by NLP algo- rithms;
• Design a combined document representation to unify the corpus and the data representations produced by the algorithms;
• Design reusable multilingual pipelines that can be applied and distributed in parallel across multiple machines;
• Implement interactive pipelines that can be used for prototyping.
4.1 Motivation

Creating knowledge graphs with the goal of covering encyclopedic knowledge, broad or complete, must be based on textual sources reflecting the varied nature of knowledge. Wikipedia is a choice that meets this requirement as Wikipedia's goal is to be an online encyclopedia that can be extended by anyone, hence potentially benefiting from the knowledge of all the individuals on earth. As a consequence, this also entails that over time Wikipedia has become larger, and processing it with resource-intensive algorithms requires considerable computing power.

An initial solution to this problem of scale is to reduce the size of the corpus. However, this reduction would affect the infrastructure design and results. Arguably, at some point, the algorithms must be tested at real scale. Otherwise, they would often yield different results and different performance characteristics, as implementations may have unintended weaknesses overlooked when working at small scale.

Accurate feedback is important to verify experimentally if hypothesized methods work when applied to large corpora. Also, there is a human element: Not all algorithms exist in ready-to-use software packages, and some may be technically difficult and inefficient to apply to large corpora. This forces researchers to implement or re-implement them. Usually, this leads to the following trade-off in complexity: quickly written or well-optimized software. In a research context, the ideal is to combine simple, quickly written software, while still maintaining the possibility to run it over a large corpus.

This is where distributed computing comes in as a solution. The facts that support this kind of computing are:
1. Single machines produce results, but with potentially long and, in some cases, prohibitively long iteration cycles;
2. Using a cluster consisting of multiple machines can linearly scale up computing power, reducing the iteration cycle. Clusters add constraints that increase complexity, but, combined with a functional programming model, this complexity can be reduced to acceptable levels, as when using Spark. See Sect. 4.2.
4.2 Distributed computing

When processing Wikipedia, typical applications run language pipelines, extract statistics, and aggregate information from documents. In this section, I describe methods for distributed computing.
Parallelization. Many frameworks have been developed with the promise of scaling software to run in parallel on multiple computers. One method with a standard and many implementations is the Message Passing Interface (MPI) (Clarke, Glendinning, & Hempel, 1994). It provides messaging and synchronization over many machines, enabling programmers to write programs that run in parallel on many computers. Another, more modern approach is the Akka toolkit⁹ that runs on the Java Virtual Machine (JVM). Akka, with its modules, provides abstractions to schedule, distribute, and run computations in parallel. Akka modules can be used on a single machine or distributed over multiple computers in a cluster.

Both of these approaches abstract away the underlying complexity of managing communication with many machines. These abstractions enable a developer to write software that is optimized for different characteristics such as low-latency processing of real-time information.
Hadoop and Map-Reduce. Dean and Ghemawat (2004) described a programming model that enables efficient processing and aggregation over many computers: the Map-Reduce model. Map-reduce is based on the use of keys and values, transforming them into lists of keys and values, ultimately sorting and grouping by key and transforming these grouped values into a list of values representing the final output. A notable property of the map-reduce programming model is that it enables the processing of datasets larger than the available working memory. It is only limited by the available storage space.

Hadoop was one of the first open-source Map-Reduce implementations designed to run on commodity hardware. Hadoop is written in Java and runs on the JVM.
⁹ https://akka.io/
Figure 2: Map-Reduce word counting using two nodes.
Map-Reduce, as implemented in Hadoop, consists of the following parts (see Figure 2; a code sketch follows the list):
Map function: a function that is given a key and value pair and transforms it into lists of key and value pairs. A map function, when computing word counts, transforms a document into a list of words as keys and an initial count of one as the value.
Reduce function: a function that is given a key and a list of values with the same key and transforms this information into the final output. Continuing with word counting, the reduce function computes the sum of occurrences from the list of values and emits a word and total count.
Mapper: The concrete worker or thread that executes the map-function over the data;
Reducer: The concrete worker or thread responsible for running the reduce func- tion;
Combiner function: An optional function for partial reduction. The combiner function is only an optimization to reduce partial results early, minimizing the work required by the final reducer. Using the word count example, a combiner would sum up partial lists of counts.
Partitioning function: An optional function that computes an output partition given a key and a number of output partitions. The default implementation uses the hash value of the key.
Shuffle operation: The distributed synchronization and sorting step. Key and values emitted by the map function are first optionally reduced through the combiner and then assigned to a destination partition. The destination par- tition is sorted by key and the results streamed through the reducer.
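To make these roles concrete, the sketch below simulates the word-count job of Figure 2 in plain, single-process Python: a map function emitting (word, 1) pairs, a shuffle that groups the pairs by key, and a reduce function that sums the counts. It illustrates the programming model only, not Hadoop's distributed execution.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the partial counts for one word."""
    yield word, sum(counts)

def run_job(documents, map_fn, reduce_fn):
    # Shuffle: group the mapped values by key before reduction.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            grouped[key].append(value)
    output = {}
    for key in sorted(grouped):
        for out_key, out_value in reduce_fn(key, grouped[key]):
            output[out_key] = out_value
    return output

docs = {"file1": "a b c d c c", "file2": "a d c d d b"}
print(run_job(docs, map_fn, reduce_fn))
# {'a': 2, 'b': 2, 'c': 4, 'd': 4}
```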
Hadoop can scale to a large number of nodes¹⁰ for massively parallel computation. However, writing software for the Hadoop ecosystem, or more specifically the classic Map-Reduce implementation, is verbose, repetitive, and often requires a deep knowledge of the internals to make sufficiently efficient software. In addition, many problems must be formulated as a sequence of map-reduce jobs, further adding to the amount of code required.

As in the original paper, the classic map-reduce implementation is not memory bound: the Hadoop map-reduce implementation streams key-value pairs and stores intermediate results on disk. Storing key-value pairs on disk requires serialization¹¹ and hence more storage. The consequence is reduced overall performance due to the increased volume of data to store.
Distributed file systems. In the context of this thesis, a typical cluster consists of ordinary computers (nodes) connected together with a standard Internet Protocol (IP)-based local network. Each machine consists of processors with multiple cores and independent storage systems.

Hadoop was designed to run on commodity hardware and, when using thousands of nodes, there is a high probability that one or more nodes fail during a long-running job. To avoid the loss of data, the solution is to have redundant storage and reschedule tasks to resume parts that failed. In addition, network bandwidth is limited and therefore mappers should ideally process what is available locally first before requesting data from other machines; increasing redundancy by duplication improves data locality and resilience to data loss, reducing overall network bandwidth requirements.

Hadoop includes a file system called the Hadoop Distributed Filesystem (HDFS). This file system transparently balances stored data over many machines, maintaining sufficient duplication for locality and resilience to node loss by redundancy.
Spark. The Apache Spark project introduced a higher-level programming model, addressing three primary weaknesses of the classic map-reduce implementation: inefficient round-trips to the storage layer, no automatic job pipelining to enable fusing of mappers in sequence, and the verbose nature of writing map-reduce jobs. Spark is written in Scala running on the JVM, but provides API bindings for Java, Python, and R. I used Spark for the large-scale processing tasks of Wikipedia (Klang & Nugues, 2016b, 2016d, 2017, 2019a).
¹⁰ Virtual or physical machines.
¹¹ Converting in-memory objects into a standalone representation that can be written to a storage medium.
Figure 3: Spark word counting with added filtering at the end
First, the round-trips are reduced by in-memory reduction and caching, using the available memory to speed up reduction steps. Pipelines are optimized by introducing a functional programming model that modifies a Resilient Distributed Dataset (RDD) representing all data divided into partitions distributed over the machines. This programming model enables a developer to define pipelines as a sequence of RDD transformations. The RDD is not actually transformed until an action, such as saving to disk, is triggered. This lazy evaluation enables Spark to compile a pipeline into a directed acyclic graph (DAG), fusing sequential operations together for optimal processing. This fusing reduces the number of shuffles required overall and lowers the number of round-trips to the storage layer by doing more in-memory. In addition, Spark is compatible with the data storage APIs found in the Hadoop ecosystem and makes use of them.

The map-reduce based version of counting words (see Figure 2) is transformed into two stages in Spark (see Figure 3), with each stage running sequentially. Spark also introduces the ability to share read-only data efficiently in the cluster and allows the developer to cache datasets in-memory for repeated processing to reduce the overhead of accessing the storage layer.

Spark has modules built on top of the core engine:
Dataframes: Structured columnar datasets with support for hierarchical structures. These dataframes, when used properly, reduce the memory footprint of a large dataset. In addition, they can make use of the Parquet format for saving and loading datasets efficiently;

SQL: A modified SQL query language applied to dataframes for scalable queries larger than memory;

Streaming: Real-time processing of data that is distributed over many machines;

Machine learning: Algorithms for various supervised machine learning models such as logistic regression and support vector machines, and unsupervised methods such as word embeddings (Word2vec), clustering via K-means, and dimensionality reduction via principal component analysis.
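As a concrete illustration of the pipeline in Figure 3, here is a minimal PySpark sketch; the input and output paths and the frequency threshold are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile("corpus.txt")               # input, split by line
                .flatMap(lambda line: line.split())   # divide lines into words
                .map(lambda word: (word, 1))          # convert to (word, 1) tuples
                .reduceByKey(lambda a, b: a + b)      # combiner + shuffle + reduction per word
                .filter(lambda kv: kv[1] >= 5))       # filter out low-frequency words
    counts.saveAsTextFile("counts")                   # action: triggers the lazy DAG

Nothing is computed until saveAsTextFile is called; Spark then fuses the transformations of each stage and executes them over the partitions.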
4.3 Document Representation
A core requirement for scalable text processing is a flexible and robust data model capable of representing the diverse sets of structures found in NLP applications. Moreover, good tools to extract, visualize, and inspect data are also important for understanding and verifying correct operation.
Text segmentation. Text is a sequence of characters. Before processing and transformation, software initially treats this sequence as a continuous stream of bytes with no apparent structure. However, many algorithms rely on a segmented stream of tokens and sentences. These tokens, i.e. the segmented units in a sentence, include the words, punctuation, numbers, symbols, emoticons, etc. Performing segmentation using simple rules works well for languages such as English or Swedish. However, logographic languages such as Chinese make minimal use of whitespace or punctuation for word segmentation.
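For illustration, a naive rule-based tokenizer of the kind that works reasonably for English or Swedish can be written with a single regular expression; this is only a sketch and would not segment Chinese:

    import re

    # Match runs of word characters, or any single non-space, non-word character
    # (punctuation, symbols, etc.).
    TOKEN = re.compile(r"\w+|[^\w\s]", re.UNICODE)

    def tokenize(text):
        return TOKEN.findall(text)

    print(tokenize("Lund is a city."))  # ['Lund', 'is', 'a', 'city', '.']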
Grammatical processing. From segmented text, we can apply methods which use token sequences instead of raw characters. These include part-of-speech tagging, which assigns a category to each token, e.g. noun, verb, preposition, or punctuation. In addition, dependency parsing can assign labeled dependencies between words. Although token sequences are useful, depending on the language, some semantic details can only be found by analyzing the morphology of the word. The simplest example is the change from singular to plural mentions, e.g. car and cars. More complex changes can also occur: run, runs, running, and ran are all variants of the canonical form, or lemma, run.
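As an example, the spaCy toolkit (discussed later among the pipeline frameworks) exposes part-of-speech tags, lemmas, and dependency relations per token; the sketch below assumes the small English model has been installed with `python -m spacy download en_core_web_sm`:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("She ran through the countryside.")
    for token in doc:
        # surface form, part-of-speech tag, lemma, dependency label, and head word
        print(token.text, token.pos_, token.lemma_, token.dep_, token.head.text)

For instance, the token ran is tagged as a verb with the lemma run, matching the lemmatization example above.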
Text representation in machines. The byte stream discussed previously represents a sequence of characters. A single byte12 could represent 255 characters, with 0 reserved as a special delimiter, which is more than enough for English. However, the writing systems used by natural languages around the world are varied. They include the Arabic abjad, Devanagari with Hindi, Han characters with Mandarin Chinese, Cyrillic with Russian, etc. One of the first attempts to extend support was to use code pages, which divided the byte into an upper and a lower region: the lower range 0-127 was fixed, and the upper range 128-255 could be changed depending on region or application. Remnants of this solution can still be found in common software such as Windows; for instance, Windows-1252 is the code page for Western Europe used in Sweden.
12 Unsigned 8-bit integer with 256 possible values.
With the introduction of the internet and the rise of multilingual communication, this quickly became unfeasible, as scripts such as Han used with Chinese do not fit into a single code page, requiring specialized software support. To provide a common code, a standardized international character set was developed: Unicode. Unicode is also an international consortium, founded in 1991, that comprises various international technology companies, universities, and governments. Concretely, the Unicode standard consists of a list of code points, integers, where each code point is associated with a character name and its visual representation, as well as a set of properties. The specific graphical rendering of the characters is determined by the font used. Unicode is frequently implemented at the binary level using one of several predefined encodings, one popular choice being UTF-8, a variable-length encoding method. In particular, UTF-8 supports the fixed lower code-page region used by English without any special encoding. Today, UTF-8 is widely supported in web browsers, operating systems, and application frameworks. However, it is not always the default option; web pages that use the Hypertext Markup Language (HTML) must explicitly be set to use UTF-8, or web servers must be configured to use UTF-8 by default for text transmission. Wikipedia uses Unicode as the character set for text, as it allows multilingual text. However, Unicode contains ambiguous characters for certain classes such as letters, punctuation (including dashes and periods), and numbers. This becomes an issue when the visual representation differs from the actual code-point sequence used, requiring normalization to unify text over many languages.
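The following Python sketch illustrates the difference between code points and UTF-8 bytes, and the kind of normalization needed to unify visually identical strings:

    import unicodedata

    s = "Lund – Skåne"                # en dash and non-ASCII letters
    utf8 = s.encode("utf-8")          # variable-length encoding; ASCII bytes unchanged
    print(len(s), len(utf8))          # fewer code points than bytes

    # Visually identical strings can differ at the code-point level;
    # normalization (here NFC) unifies composed and decomposed forms.
    decomposed = "Ska\u030ane"        # 'a' followed by COMBINING RING ABOVE
    print(decomposed == "Skåne")                                 # False
    print(unicodedata.normalize("NFC", decomposed) == "Skåne")   # True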
Requirements. Over time, NLP research has produced a large number of quality tools with verified results. Reusing them reduces the effort in new projects, but it also makes the research community dependent on the work of others. Locking longer-term research into these works is inherently risky, as it can be difficult to change a particular tool. For small-scale and domain-specific research, this is typically an acceptable risk, as this kind of research has few moving parts. However, when attempting to run a complex set of tools over a large number of documents, the number of moving parts and sources of issues increases. A possible strategy to mitigate this is to use pipelines, where multiple independent annotators are run in sequence, as shown in Fig. 4. These tools are often research projects in themselves and move too quickly to be thoroughly tested. This requires a high degree of tolerance to variability in the output and to software issues. One of the key issues is to unify the output produced by these tools into a single representation with few hard dependencies on the original software. The general aspects that arguably cover most tool output are summarized in Figure 5. In this figure, we introduce concepts such as annotation, property, connection, span, and offset. No model will be perfect for all uses, as there is a balance between ease of use, from a researcher's perspective, and the rigidity imposed by the use of well-defined storage formats such as FoLiA (van Gompel & Reynaert, 2013).
Figure 4: Annotation pipeline (input text passed through modules such as named entity recognition and coreference analysis to produce annotated output text)

Figure 5: NLP data model (layers such as tokens, sentences, named entities, and dependency relations, with annotations carrying properties and connections, and spans defined by character offsets into the text)
Rigid type definitions add a layer of complexity, which is prohibitive for smaller, typically student, projects: steep learning curves, forced run-time environments, and resource requirements. In this thesis, I designed Docria (Klang & Nugues, 2019a), which tries to create a middle ground between rigidity and ease of use by defining data structures and a storage API in at least two languages: Java and Python. By applying inversion of control, most libraries and tools can be wrapped and adapted to fit this model. In addition, Docria has few hard dependencies. To put the Docria model in context, we need to compare its representation to what is described in the literature. I have identified three families of formats: plain text, XML, and a variable representation family, of which Docria is part. A major class divider for these formats is the support for stand-off annotations (Thompson & McKelvie, 1997). The stand-off idea is to use offset references instead of modifying the text to fit the representation (a small illustration follows the list below). Stand-off is primarily needed to:
• Deal with copyright issues when sharing original data;
• Support overlapping annotations; and
• Preserve the original text.
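The core of stand-off annotation can be illustrated with a few lines of Python; this is only a schematic example, not the Docria API:

    from dataclasses import dataclass

    # Annotations refer to character offsets in the untouched original text,
    # may overlap freely, and carry arbitrary properties.
    @dataclass
    class Annotation:
        layer: str
        start: int
        end: int
        properties: dict

    text = "Lund is a city."
    annotations = [
        Annotation("token", 0, 4, {"pos": "PROPN"}),
        Annotation("named_entity", 0, 4, {"class": "LOC"}),
        Annotation("sentence", 0, 15, {}),
    ]
    for a in annotations:
        print(a.layer, repr(text[a.start:a.end]), a.properties)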
Plain-text formats. The flat text format is used extensively by the CoNLL tasks, arguably for its simplicity in reading. However, these formats lack standardization and vary from task to task (or user to user). Recurring conventions are double newlines for sentence separation, where each line is a separated list of fields; the fields usually correspond to what is needed for training or processing in a particular task. The oldest widespread use of this format seems to be the CoNLL99 task on chunking (Osborne, 1999). Currently, the plain-text format has one popular variant in the form of CoNLL-U, used to annotate the Universal Dependencies (Nivre et al., 2018). It is a variation of the format used in CoNLL-X (Buchholz & Marsi, 2006). CoNLL-U defines extensions for stand-off reconstruction and subwords. Plain-text formats are mostly centered around sentences, which makes higher-level structures hard to represent naturally.
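For reference, a minimal reader for the ten tab-separated CoNLL-U columns can be sketched as follows (a simplification that does not treat multiword tokens or empty nodes specially):

    # Sentences are separated by blank lines; comment lines start with '#'.
    FIELDS = ["id", "form", "lemma", "upos", "xpos",
              "feats", "head", "deprel", "deps", "misc"]

    def read_conllu(path):
        sentence = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:                  # blank line ends a sentence
                    if sentence:
                        yield sentence
                        sentence = []
                elif not line.startswith("#"):
                    sentence.append(dict(zip(FIELDS, line.split("\t"))))
        if sentence:
            yield sentence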
XML-based formats. The eXtensible Markup Language (XML) is by far the most frequently used format in the literature, typically for treebank annotation and other larger projects. Pustylnikov, Mehler, and Gleim (2008) give a short overview of some common treebanks, where most use XML-based formats in different variations. Notable ones are TIGERXML, used for the TIGER corpus (Brants, Dipper, Hansen, Lezius, & Smith, 2002), the Text Encoding Initiative (TEI) XML encoding (TEI Consortium, 2019), and, more recently, FoLiA (van Gompel & Reynaert, 2013). TIGERXML, TEI, LAF, and FoLiA are primarily format specifications; tooling is secondary. XML has a high degree of flexibility and well-proven, extensive tools for the general case. The primary weakness of XML-based formats is their variability and flexibility. XML supports schemas, which define the data layout, but this is not a requirement. In a real-world setting, format specifications such as TEI and FoLiA need automated verification tools to ensure compatibility of software tooling. The size of these specifications implies a high degree of effort to support all their aspects.
Variable representation models. The NLP Interchange Format (NIF) (Hellmann, Lehmann, Auer, & Brümmer, 2013) defines a way to attach linguistic information via the Resource Description Framework (RDF) standard used in the semantic web community. NIF defines a URI encoding to refer to spans, i.e., a stand-off encoding. As the original XML representation of RDF is very verbose, many competing encodings have been developed: Turtle, N-Triples, N3, JSON-LD, etc.
4.4 Creating a New Document Model: Docria
In this section, I explore three major revisions to the document representation I created. These revisions build on ideas found in pieces throughout the literature. Many document representation models are annotation-based: models that add properties to ranges and divide these annotations into groups. Annotation-based data models are old; offset-based models can be found in the TIPSTER project (Harman, 1992), and later in GATE (Cunningham, Humphreys, Gaizauskas, & Wilks, 1997). Encoding graph structures is secondary in pure annotation models; graphs are frequently encoded using unique references, requiring an extra processing pass to index and resolve them. Graph structures are standard in problems such as dependency parsing, semantic role labeling, and coreference resolution.
WIKIPARQ. The first iteration, WIKIPARQ (Klang & Nugues, 2016d), was based on the idea of using the RDF triple representation, i.e., a subject, a predicate, and an object. In our case, this meant transforming a multi-layer description with nodes and edges into RDF triples. A common issue with storing RDF naïvely is that it is highly redundant; RDF typically uses URLs, and these may be long. Parquet13 is a storage format designed by Twitter to store logs and process them in a scalable way. This made Parquet an ideal choice, as it was designed to reduce the effects of redundant information on disk. In combination with Spark SQL, a query language was provided that allowed many simple questions to be asked at once. A typical English Wikipedia dump could hold up to 7 billion entries. Combined with the storage tricks used by Parquet, most data can be discarded when running a simple query, by reading metadata or by reading only the columns required to determine whether a row should be included. This means that even though the number of rows is substantial, most can be excluded, yielding an acceptable performance. The assumption, and requirement, is that the query planner used by Spark SQL can transform the query into a sensible plan that minimizes a full shuffle. In practice, query performance is uneven, making the real query cost hard to predict. Many operations when processing documents are inherently local, which increases query complexity. As a storage format, WIKIPARQ was a good choice; as a general-purpose format for information storage, reprocessable resources, and retrieval, it was not.
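To illustrate the kind of query WIKIPARQ enabled, the sketch below reads a Parquet-stored triple table with Spark and runs a Spark SQL query over it; the path, column names, and predicate value are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wikiparq-query").getOrCreate()
    triples = spark.read.parquet("wikiparq/enwiki")   # columnar triple storage
    triples.createOrReplaceTempView("triples")
    spark.sql("""
        SELECT subject, object
        FROM triples
        WHERE predicate = 'wiki:links_to'
        LIMIT 10
    """).show()

Because Parquet stores data by column with per-block metadata, such a query can skip most rows and read only the columns it actually needs.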
13 http://parquet.apache.org/
Docforia. The second revision was Docforia (Klang & Nugues, 2017), a first attempt to solve the drawbacks of WIKIPARQ, particularly to eliminate the risk of data explosions when the query planner produced a poor plan. Docforia is a programmer-oriented architecture that defines a storage model, a serialization format, and query APIs to process the content found in documents. It is built on the idea of using a layered property graph for nodes and edges, with some heuristics to store token spans efficiently. Docforia solved almost all the practical issues. However, the node-edge duality introduced duplication and redundancy into the implementation. Combined with a convention-based schema, this increased the cost of implementing many performance optimizations.
Docria. This led us to the final revision: Docria (Klang & Nugues, 2019a). Docria was designed to reduce the effort required when porting to other programming languages. In addition, we removed the node-edge duality by using property hypergraphs that merge nodes and edges. Docria separates the text from the layers that add linguistic information. Each layer consists of a variable number of nodes, where each node has a set of typed fields defined in a schema. The schema is required per layer and can define fields that reference spans in a text. Spans are encoded in a way that minimizes portability issues when implemented in different programming languages. Using relational database terminology, a single Docria document is a database, in which each layer is a named table of rows with a fixed set of columns according to its schema. Additionally, the fields define types and spans for nodes. Docria was designed to separate serialization from its in-memory API, allowing future extensions. Docria defines and includes implementations for JSON, XML, and a binary format using MessagePack. The most compact encoding is MessagePack. MessagePack was ideal as it is self-describing, has many implementations in a diverse set of programming languages, has well-defined specifications, and is fast and compact enough. Docria can be used on a per-document basis or on a collection of documents in a file.
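The layer-and-schema idea can be pictured with plain Python structures; the sketch below is an illustration of the concepts only, not the actual Docria API:

    # A document holds a text and named layers; each layer declares a schema of
    # typed fields, and each node is a row whose fields may reference spans of
    # the text (or, in the full model, nodes in other layers).
    document = {
        "text": "Lund is a city.",
        "layers": {
            "token": {
                "schema": {"span": "span"},
                "nodes": [{"span": (0, 4)}, {"span": (5, 7)}, {"span": (8, 9)},
                          {"span": (10, 14)}, {"span": (14, 15)}],
            },
            "named_entity": {
                "schema": {"span": "span", "class": "str"},
                "nodes": [{"span": (0, 4), "class": "LOC"}],
            },
        },
    }
    start, end = document["layers"]["named_entity"]["nodes"][0]["span"]
    print(document["text"][start:end])  # Lund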
4.5 Language pipelines
The construction of large NLP systems frequently requires the assembly of different modules, where each module adds structured information to the overall representation. These modules may not conform to any particular specification and may be implemented in separate software packages. I designed Langforia to build processing pipelines from such, possibly disparate, modules. My explicit goal was to make the process more concrete, with fewer software layers. Langforia enables a researcher to compose pipelines that can be used interactively, in an online fashion, or be applied offline to a large corpus, i.e., in a scalable way.
Previous Work
GATE. The General Architecture for Text Engineering (GATE) (Cunningham et al., 1997) consists of a document model that supports stand-off annotations, a plug-in approach to constructing pipelines (CREOLE), and a graphical user interface that enables manual annotation, interactive pipeline construction, and the application of constructed pipelines to a collection of documents. GATE's document model consists of:
Annotation type, which defines types such as tokens and sentences; similar to layers in Docria.
Annotation, which corresponds to a node with a range reference into the original text. Annotations are associated with a type and can contain properties. Annotations are similar to nodes in Docria that have a single span field.
Annotation set, which is a named group of annotation nodes.
GATE supports a variety of input formats via plug-ins, and its default document output format is XML.
UIMA. The Unstructured Information Management Architecture (UIMA) is a framework that can be used to develop applications that require, e.g., text analysis by combining reusable components (Ferrucci & Lally, 2004). UIMA covers aspects ranging from design and research to pipeline construction and deployment at scale. UIMA was motivated by the observed difficulty of developing reusable technologies: technologies frequently start as research prototypes and require considerable effort to be converted into production-ready pipelines. UIMA defines a software architecture that aligns development practices, an analysis model for structured information access, and a component model to discover and encapsulate analysis tools. The entire process, from acquisition to analysis output, is covered. UIMA defines concepts such as:
Text Analysis Engine (TAE) is a recursive structure that consists of components that add annotations, such as named entity recognizers.
Common Analysis Structure (CAS) is the storage model representing a document that contains the output from annotators.
Annotator is a component responsible for adding structured information to a document; it operates on the CAS.
spaCy. spaCy14 is a software package written for Python consisting of ready-to-use NLP pipelines for multiple natural languages. It was designed for performance and to be easy to embed into an application. Unlike GATE and UIMA, spaCy was designed to support a set of concrete pipelines.
CoreNLP. CoreNLP is an NLP software package developed at Stanford University (Manning et al., 2014). Stanford distributes pretrained models for multiple languages. CoreNLP supports the training of new models and is similar to spaCy in that it was not designed to abstract arbitrary external pipelines. Recently, the NLP group at Stanford introduced a neural pipeline written in Python15 that also has an official wrapper for the Java-based CoreNLP server (Qi, Dozat, Zhang, & Manning, 2018).
Langforia
Langforia (Klang & Nugues, 2016b) was an attempt to abstract complex research-oriented pipelines using Docforia as the common ground in which to store the output. Another goal of this software was to embed it in a Spark pipeline to carry out the annotation using cluster computing. Langforia was further developed with a frontend/backend architecture, including a visualization component that can display the content of a Docforia document. The frontend can also be used for debugging, sending in documents, and running pipelines interactively over a web API. The embeddable version of Langforia is similar to UIMA and GATE, but different from spaCy and CoreNLP, in that it encapsulates external pipelines not part of the project. It differs from UIMA and GATE in that a pipeline must be known at compile time and all dependencies are packaged with the embedding application. This implies that any change to a pipeline requires a recompilation. We ultimately used Langforia to process large corpora such as the English Wikipedia, with processing times ranging from a few hours to a few days.
5 Evaluation
Before we deal with the techniques we developed and applied to recognize and link named entities, let us describe how we will evaluate their results. The evaluation metrics will then enable us to select the best techniques or models.
14 https://spacy.io/
15 https://github.com/stanfordnlp/stanfordnlp
Figure 6: Search space for documents or observations (the space is divided into relevant and irrelevant items, yielding the four regions tp, fp, fn, and tn)
5.1 F-Measure
In this dissertation, the F1-measure is used to evaluate named entity recognition and named entity linking. The F1 measure is often used to evaluate the classification performance of a model. Originally, the F1-measure was used to assess the performance of search in information retrieval. It then found its way into NLP as the evaluation metric of the systems in MUC-4 (Chinchor, 1992). F1 consists of two metrics, precision and recall, combined into one score.
Counting. Using the framework of information retrieval, the problem is to evaluate the performance of a search query applied to a set of documents. We can divide the search result into four classes; see Fig. 6 for a visual representation:
True positive (tp) is the set of relevant documents that are correctly retrieved by the query;
False positive (fp) is the set of irrelevant documents that are incorrectly retrieved by the query;
False negative (fn) is the set of relevant documents that are incorrectly not retrieved by the query;

True negative (tn) is the set of irrelevant documents that are correctly not retrieved by the query.
Given the sizes of these sets, we can compute precision and recall, where in our case a document corresponds to an observation.
Recall measures the share of the relevant documents that we actually found:
R = tp / (tp + fn),    (1)
Precision measures the share of documents we found that are relevant:
P = tp / (tp + fp),    (2)
Harmonic mean. The F1 measure combines recall R and precision P using the harmonic mean:
F1 = 2 · (P · R) / (P + R).    (3)
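The three measures are straightforward to compute from the counts above; a minimal helper:

    def precision_recall_f1(tp, fp, fn):
        # Precision and recall from the tp/fp/fn counts; F1 is their harmonic mean.
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)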
Intuition. Recall and precision tend to be opposites: a typical model can optimize and perform well on either precision or recall; performing well on both is a more difficult problem. Assume that a threshold is used when determining whether a document matches the query or not. The motivation is then as follows: to optimize for recall, we can lower the threshold for a match, which increases the number of false positives and thus decreases precision. Conversely, to optimize precision, we select only the documents we are more confident in. This can be done by raising the threshold, thereby reducing the number of false positives. The consequence is that the number of retrieved relevant documents drops, increasing the number of false negatives and reducing recall. The harmonic mean in the F1 measure reduces the score when recall and precision are out of balance.
Weighted F-measure. In some applications, precision might be more important than recall, or the reverse. The weighted F-measure sets an emphasis on either precision or recall by adjusting the weight β:
Fβ = (1 + β²) · P · R / (β² · P + R).    (4)
Typical values for β in a weighted F-measure are 0.5 or 2.
5.2 Mean Reciprocal Rank
The mean reciprocal rank (MRR) is a metric to evaluate the performance of a question answering system.
Given a question q_i ∈ Q, a system returns a ranked list of answers. The rank of the correct answer, rank_i, for each question is used to compute the MRR:
MRR = (1 / |Q|) · Σ_{i=1}^{|Q|} 1 / rank_i    (5)
If the correct answer is not found, the rank is typically set to a large number, or the reciprocal rank is set to zero.
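A minimal implementation, where an unanswered question contributes a reciprocal rank of zero:

    def mean_reciprocal_rank(ranks):
        # ranks[i] is the rank of the correct answer for question i;
        # None means the correct answer was not returned (reciprocal rank 0).
        return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

    print(mean_reciprocal_rank([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458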
5.3 Constrained Entity Alignment F-measure (CEAF)
The performance of entity linking or coreference resolution is more difficult to evaluate than that of named entity recognition. The MUC conferences introduced a coreference scoring scheme based on links (Vilain, Burger, Aberdeen, Connolly, & Hirschman, 1995), while B-cube (Bagga & Baldwin, 1998) is based on mentions. X. Luo (2005) proposed the constrained entity alignment F-measure (CEAF) as an improvement over the MUC link-based and B-cube F-measures. In this thesis, we followed TAC 2017 (Ji et al., 2017) and adopted CEAFm to rank the performance of competing entity linking systems.
The metric takes two sets of partitions: the reference set (gold standard), r ∈ R, and the system-predicted set, s ∈ S. Each element in R and S represents a partition of keys, i.e., a set of tuples. In TAC 2017 (Ji et al., 2017), the sets were created by partitioning the mentions according to their entity identifiers, and the final metric, CEAFmC+, defined the keys as tuples of mention span, entity target, and mention type, as shown in Figure 7.
At the core of the CEAF metric, there is a requirement that each reference key in R is aligned with at most one predicted key in S, resulting in a one-to-one mapping. This mapping g* is found by solving an assignment problem, i.e., finding the pairs of reference and system-predicted partitions that globally maximize the total similarity Φ(g*). Concretely, the assignment problem is solved using the Kuhn-Munkres algorithm (Kuhn, 1955), and the similarity between partitions is given by a function φ(r, s).
Mathematically:
Φ(g*) = Σ_{r ∈ R} φ(r, g*(r))    (6)
P_CEAF_φ = Φ(g*) / Σ_{s ∈ S} φ(s, s)    (7)
R_CEAF_φ = Φ(g*) / Σ_{r ∈ R} φ(r, r)    (8)
F_CEAF_φ = F1(P_CEAF_φ, R_CEAF_φ) = 2 · P_CEAF_φ · R_CEAF_φ / (P_CEAF_φ + R_CEAF_φ)    (9)
Two similarity functions were proposed by X. Luo (2005): a mention-based CEAFm and an entity-based CEAFe:
φ_CEAFm(r, s) = |r ∩ s|    (10)
φ_CEAFe(r, s) = 2 · |r ∩ s| / (|r| + |s|)    (11)
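A sketch of CEAFm following Equations 6-11, using SciPy's implementation of the assignment problem (the Kuhn-Munkres algorithm); the mention keys below are hypothetical identifiers loosely mirroring the example in Figure 7:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def ceaf_m(reference, predicted):
        # reference, predicted: lists of sets of mention keys.
        # phi_m(r, s) = |r & s|; the optimal one-to-one alignment g* is found by
        # the assignment solver (negated, since linear_sum_assignment minimizes).
        sim = np.array([[len(r & s) for s in predicted] for r in reference])
        rows, cols = linear_sum_assignment(-sim)
        phi_g_star = sim[rows, cols].sum()
        precision = phi_g_star / sum(len(s) for s in predicted)
        recall = phi_g_star / sum(len(r) for r in reference)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    reference = [{"m1"}, {"m2", "m4", "m6"}, {"m3", "m5"}]
    predicted = [{"m1", "m12"}, {"m2", "m6"}, {"m3", "m5", "m7"}]
    print(ceaf_m(reference, predicted))  # similarities 1 + 2 + 2 -> P ≈ 0.714, R ≈ 0.833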
Moosavi and Strube (2016) showed that CEAF and other commonly used metrics suffer from a "mention identification effect", with recall and precision "neither interpretable, nor reliable" as a result. Moosavi and Strube (2016) proposed the link-based entity-aware (LEA) evaluation metric to overcome these limitations. LEA was not used in this work and is provided as a reference for future work.
Figure 7: CEAFm example from a context in OntoNotes 5 used in CoNLL 2011 (Pradhan et al., 2011). Reference and predicted mention chains are compared through a similarity matrix, and the optimal one-to-one assignment g* is computed; in the example, Φ(g*) = 1 + 2 + 2 = 5, giving P = 5/(2 + 2 + 3) ≈ 71.4%, R = 5/(1 + 3 + 2) ≈ 83.3%, and F ≈ 76.9%.
6 Machine Learning
6.1 Definition
Machine learning is the field of constructing models and algorithms that learn from data (cf. past experience) to make predictions. The word learning is used metaphorically and is, in practice, frequently reduced to a gradient descent optimization problem. Arguably, a model has learned something when it is capable of generalizing from known observed data, ultimately producing good predictions on unseen data. In this thesis, I applied machine learning as a means to compute entity similarity and to predict classes for entities, in order to:
• Find names in sentences;
• Determine their type;
• Improve entity disambiguation by predicting context compatibility;
• Build language models to define the word meaning based on context.
6.2 Concepts
This section provides definitions of the important concepts used throughout this thesis:
Observation: An instance or example, e.g. a word or a sentence of words.
Prediction: The computed output using an observation as input with a trained model.
Feature: A discrete unit of information that is part of the observation, e.g. a word vector or a scalar indicating whether the word is title-cased. A set of features makes up the input;
Label: In many cases, each observation has a matching label, a category or value. This is also called the output, e.g. noun or verb for the part-of-speech task, or a tag per word indicating the first word, an inner word, or the last word of a name.
Model: An instance of a mathematical model, ŷ = f_θ(x), where ŷ is the predicted label given the input x for a model f_θ with parameters θ. For a dataset, we have ŷ = f_θ(X);
Architecture: The high-level structure of a model. Predictive models discussed later, such as feedforward, convolutional, or recurrent neural networks, can be constructed in many ways.
Loss: The scalar value to optimize during training. Related variants include cost, error, or learning objective. Loss, in particular, is a value which should be minimized.
Training set: Full collection of observations with matching labels used when fit- ting a model.
Test set: Collection of observations with matching labels withheld during training. Used to evaluate the performance of the model.

Validation set: Same as the test set, but used in situations in which we use evaluation data to improve the training process.
Batch: Limited collection of observations.
Parameter: Tunable weights which contribute to the output prediction given an input.
Hyper-parameter: High-level settings which alter aspects of the training and/or the structure of the chosen model.
Dense vector: A vector where a substantial number of elements are nonzero.
Sparse vector: A vector where most elements are zero.
6.3 Learning Paradigm
In this thesis, I used two paradigms of learning:
Supervised learning: Given an input, predict a result, compare it with the expected result, and minimize the difference between the two.
Unsupervised learning: Given an input, find patterns in the data.
Supervised learning requires a gold standard, a training set that maps an observation to the expected output. For instance, in the context of classifying names, given an observation Lund University, we expect an output in the form of a category such as organization. Conversely, unsupervised learning finds patterns using only the observations. Word2vec is an example of a method that produces word representations, embeddings, from word co-occurrence observations. These embeddings can be used to measure word similarity or to find patterns such as the fact that the words man and woman are analogous to king and queen.
Figure 8: CRISP-DM (Wirth & Hipp, 2000) modified to a research context (phases: research goals, understand data, prepare data, model & train, evaluate, and apply in research)
6.4 Machine Learning Methodology
The cross-industry standard process for data mining (CRISP-DM) (Wirth & Hipp, 2000) is a process model for carrying out data mining projects. A variant adapted to research and machine learning was used when creating the machine learning models, see Figure 8. The adapted process consists of the following parts:
Research goals, such as finding the named entities in text in a way that improves the state of the art (and publishing it), and then marking the named entity ranges for a subsequent named entity linker;
Understand data is a phase that attempts to bring clarity to what data is available and its particular properties. Most importantly, determine whether the available data is suitable to solve the goal;
Prepare data is a cleaning and transformation phase, removing noise and transforming raw data into a form that can be used to train models;
Model & train is the phase that attempts to find a suitable model and train it to fit the prepared data;
Evaluate is the evaluation phase that determines how general a suggested model is. The evaluation phase often gives hints at whether we have sufficient data or not, or whether some aspects have been misunderstood, suggesting an improvement in data understanding or acquisition;
Apply in research is the final phase, when a trained model is judged good enough to be encapsulated for future research.
7 Data Representation for Machine Learning
Machine learning frameworks, such as the popular PyTorch and TensorFlow, use tensors for input and output. Tensors are a generalized representation to arrange numbers in a variable number of dimensions. Technically, tensors are multidimensional arrays combined with a set of computational rules and methods. Processing and learning from natural language using frameworks such as PyTorch or TensorFlow requires a tensor representation. These representations are built from features. The common feature types I used in this thesis are:
Scalar: A numeric value, for instance to represent the share of pages with a particular name;
Embedding: A dense vector representing a symbol, e.g. a word, in an abstract space;
Categorical: A sparse vector indicating a class such as person or location for a name;
Bag-of-words: A sparse vector representing a set of symbols, frequently words.
In the following sections, I describe popular encoding methods for these features.
7.1 One-Hot
Definition. One-hot encoding is a method that encodes categorical features. The one-hot method maps each symbol to an element of a vector. Converting a symbol into a vector is carried out by setting the mapped element to one. Mathematically, the one-hot encoding of a symbol w_i ∈ W is a vector in R^n, where n = |W| is the number of unique symbols in the vocabulary W.
Example. Given the vocabulary W = {Red, Green, Blue}, with n = 3, this produces vectors in R³:
Red = (1, 0, 0)
Green = (0, 1, 0)
Blue = (0, 0, 1)
A notable property of one-hot vectors is that they are orthogonal to each other. Also, the vectors are sparse, with only a single nonzero element. In optimized software, a one-hot vector is frequently encoded as just its index, which is converted into a full dense vector when required.
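A minimal sketch of the encoding for the example vocabulary above:

    import numpy as np

    vocabulary = ["Red", "Green", "Blue"]
    index = {symbol: i for i, symbol in enumerate(vocabulary)}

    def one_hot(symbol):
        # Dense realization of the one-hot vector; in practice only the index
        # is stored and the full vector is materialized when required.
        v = np.zeros(len(vocabulary))
        v[index[symbol]] = 1.0
        return v

    print(one_hot("Green"))  # [0. 1. 0.]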
7.2 Bag of Words
The one-hot method produces vectors for single symbols. The bag-of-words encoding is a technique to create a single representation of multiple symbols, such as all the words in a document. As a word can appear in many forms, such as Car, car, Cars, and cars, these variants are often reduced to a single form in practice, the term. In English, words are often transformed into stemmed and normalized terms by removing frequent suffixes such as -s, -ing, and -ed, lowercasing all characters, and more. Building upon the one-hot method, a document can be represented through a sum operation:
Σ_{i=1}^{n} w_i.
This sum operation is a baseline method to encode bag-of-words features. If the words are allowed to repeat, this representation corresponds to a vector whose components are the term frequencies of all the words in the document. A notable property of bags of words is that they ignore the word order.
7.3 TF-IDF
The term frequency (TF) and inverse document frequency (IDF) are scalars that introduce a measure of word significance in a document. In natural language, some words are frequently used to fill out and connect sentences, such as the, and, in, and of. In practice, less frequent words tend to be more salient and thus more important than frequent ones. The idea of the inverse document frequency is to reduce the significance of words mentioned in many documents d ∈ D. To do so, it normalizes a word count with the number of documents the word appears in. This normalization mitigates the effect of frequent words dominating the representation. To produce TF-IDF vectors, the terms in a document d are first counted: C(term); this count may be scaled with respect to the total number of terms:
TF(term, d) = C(term) / Σ_{term_d ∈ d} C(term_d)    (12)
The IDF weight counts the number of documents where the term occurs and uses it to normalize the total number of documents:
IDF(term, D) = log(|{d | d ∈ D}| / |{d | term ∈ d, d ∈ D}|) = log(N / n_term)    (13)
Combining both terms, TF-IDF is computed as:
TF-IDF(term, d, D) = TF(term, d) · IDF(term, D)    (14)
Combined with the bag-of-words technique, we can produce representations for multiple words. A bag-of-words TF-IDF representation often provides a strong baseline, for instance in the task of document classification (Yang et al., 2016).
Example. For simplicity, assume there are four terms (LTH, Lund, Scania, and Sweden) in three documents d1, d2, and d3. The first step is to count the terms locally and globally in the documents. Local counting corresponds to the first three rows in Table 1, and the last row corresponds to the global count.
Term counts                LTH   Lund   Scania   Sweden
d1                           0     2      0        2
d2                           1     0      2        1
d3                           3     1      0        3
|{d | term ∈ d, d ∈ D}|      2     2      1        3
Table 1: Local and global counting of terms
The final step is to use the local and global counts to produce a final weight. In Table 2, each row corresponds to a vector and each column to a term. The term Sweden is part of all the documents and therefore has a weight of zero.
TF-IDF   LTH                       Lund                      Scania                    Sweden
d1       0                         2/4 · log(3/2) ≈ 0.088    0                         0
d2       1/4 · log(3/2) ≈ 0.044    0                         2/4 · log(3/1) ≈ 0.239    0
d3       3/7 · log(3/2) ≈ 0.075    1/7 · log(3/2) ≈ 0.025    0                         0
Table 2: Example TF-IDF representation
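The weights in Table 2 can be reproduced with a few lines of Python; note that the example uses a base-10 logarithm:

    import math

    documents = {
        "d1": {"Lund": 2, "Sweden": 2},
        "d2": {"LTH": 1, "Scania": 2, "Sweden": 1},
        "d3": {"LTH": 3, "Lund": 1, "Sweden": 3},
    }
    N = len(documents)
    # Global document frequency per term.
    df = {t: sum(1 for d in documents.values() if t in d)
          for t in {"LTH", "Lund", "Scania", "Sweden"}}

    def tf_idf(term, counts):
        tf = counts.get(term, 0) / sum(counts.values())
        return tf * math.log10(N / df[term])

    print(round(tf_idf("Lund", documents["d1"]), 3))    # 0.088
    print(round(tf_idf("Scania", documents["d2"]), 3))  # 0.239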
7.4 Embeddings
The one-hot and TF-IDF methods produce sparse vectors when transforming bags of words and categorical features. Both models typically ignore co-occurrences and rely entirely on exact symbol overlap.
Figure 9: CBOW vs. skip-gram mode in Word2vec (both slide a context window over the text, here "Lightning and thunder roared through the countryside"; CBOW sums the projections of the context words to predict the focus word, while skip-gram uses the focus word to predict the surrounding context words)
This reliance makes these models perform poorly when synonymous words of different forms are present in the representation. Embeddings, specifically word embeddings, are an attempt to mitigate this problem by learning a fixed low-dimensional dense representation that maps each symbol into a vector space. In the papers published in this thesis, I used word embeddings produced by two methods: Word2vec and GloVe. A newer method, FastText, builds upon these methods and adds the ability to create representations for out-of-vocabulary words by using subword embeddings (Bojanowski, Grave, Joulin, & Mikolov, 2017). Word embeddings are typically static; their representation is fixed per word regardless of the sentence. Some methods, which are more computationally intensive, overcome this limitation by contextualizing the word embedding based on where in a sentence the word appears. Examples of contextualized embedding methods include ELMo (Peters et al., 2018) and the Flair embeddings (Akbik, Blythe, & Vollgraf, 2018).
Word2vec. Word2vec is a model consisting of a shallow neural network that can operate in two modes: continuous bag-of-words (CBOW) and skip-gram (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013), see Figure 9. The first mode, CBOW, uses a context window that is continuously moved over the sentences. The goal of CBOW is to learn representations by predicting the most probable word given a context window. Conversely, skip-gram learns representations by predicting nearby words given a focus word. For a model such as Word2vec to work in practice, various tricks are used, such as negative sampling, which limits the number of negative word examples used during training, and hierarchical softmax, which reduces the computational complexity of large output spaces. In addition, word embeddings rely on the distributional hypothesis (Harris, 1954), assuming that the meaning of a word tends to be defined by the words occurring with it. GloVe differs from Word2vec in that word co-occurrences are used directly to learn representations, instead of the iterative process in Word2vec that moves a window over the training corpus. GloVe, as a consequence, requires more memory and a preprocessing step. However, Pennington, Socher, and Manning (2014) show that GloVe produces better embeddings, with higher accuracy in less time than Word2vec, regardless of mode, given sufficient data.
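As an illustration, the gensim implementation of Word2vec can be trained on a toy corpus in a few lines; real embeddings require large corpora, and the parameters below are arbitrary:

    from gensim.models import Word2Vec

    # sg=1 selects the skip-gram mode, sg=0 the CBOW mode.
    sentences = [["lightning", "and", "thunder", "roared", "through",
                  "the", "countryside"],
                 ["rain", "and", "thunder", "rolled", "over", "the", "fields"]]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    print(model.wv["thunder"].shape)             # (50,)
    print(model.wv.most_similar("thunder", topn=3))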
Intuition. Embeddings make related symbols converge in vector space, essentially embedding relevant information into the vector representation.
Practical use. Word embeddings provide a practical solution to the problem of sparse vectors, and their associated huge spaces, by reducing these vectors into a representation in a vector space of related symbols. Combined with recurrent models or encoding methods such as FOFE, tensors of embeddings can represent sequences of symbols. When training models for concrete tasks, embeddings are often initialized with pretrained values to save time.
7.5 Sequence Compression Encoding
Single words in NLP are rarely useful: we need a sequence of them to define or convey an idea. Consequently, most models have to deal with sequences. Recurrent neural networks, discussed later, model this sequential behavior directly; the encoding is then embedded within the structure of the model. Due to computational requirements, simplified methods are sometimes good enough. A baseline method, using the bag-of-words technique, is to compute a mean vector ē: add all the vectors together and normalize the sum by its length:
ē = Σ_t e_t / ‖Σ_t e_t‖
This approach may work for short sequences.
FOFE. Another approach is to use the fixed-size ordinally-forgetting encoding (FOFE) (Zhang, Jiang, Xu, Hou, & Dai, 2015), which uses an exponential weighting scheme with a decay factor α to produce an encoded vector z:
z_t = e_t + α · z_{t−1}    (1 ≤ t ≤ T)
This decay factor models the receptive field, and it can be proven that, if α is tuned properly, the encoding results in unique vectors that a neural network can successfully fit.
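A small sketch of the FOFE recursion, assuming z_0 = 0 and one embedding vector per token:

    import numpy as np

    def fofe(embeddings, alpha=0.7):
        # z_t = e_t + alpha * z_{t-1}, with z_0 = 0; returns the encoding of the
        # full sequence, z_T.
        z = np.zeros(embeddings.shape[1])
        for e in embeddings:
            z = e + alpha * z
        return z

    tokens = np.eye(4)          # four one-hot "embeddings" for illustration
    print(fofe(tokens, 0.5))    # [0.125 0.25  0.5   1.  ]

The most recent token dominates the encoding, while earlier tokens decay geometrically with α.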
(Figure: a sequence model architecture with token representations of shape (N, ·), an attention vector producing context scores, a dense tanh layer, and a CRF output layer.)