
From Information Extraction to Knowledge Discovery: Semantic Enrichment of Multilingual Content with Linked Open Data

Thesis presented in fulfilment of the requirements for the academic degree of Doctor in Information and Communication

Max De Wilde

Supervisor: Prof. Isabelle Boydens
Co-supervisor: Prof. Pierrette Bouillon

Academic year 2015–2016

“Caught between two languages, thought is not reduced to a double servitude; on the contrary, it draws a renewed freedom from this constant confrontation.” Hélène Monsacré

“We don’t need more information. We need more meaning.” Paul Salopek

Contents

Introduction
  1 Motivation
  2 Objectives
  3 Structure

I Information Extraction
  1 Background
  2 Named-Entity Recognition
  3 Relations and Events

II Semantic Enrichment with Linked Data
  1 Making Sense of the Web
  2 Semantic Resources
  3 Enriching Content

III The Humanities and Empirical Content
  1 Empirical Information
  2 Digital Humanities
  3 Historische Kranten

IV Quality, Language, and Time
  1 Data Quality
  2 Multilingualism
  3 Language Evolution

V Knowledge Discovery
  1 MERCKX: A Knowledge Extractor
  2 Evaluation
  3 Validation

Conclusions
  1 Overview
  2 Outcomes
  3 Perspectives

A Source Code

B Guidelines for Annotators

C Follow-up

Detailed Contents

Bibliography


Acknowledgements

I would like to express my gratitude on the one hand to Pierrette Bouillon, whose Master course of Ingénierie linguistique allowed me to get acquainted with natural language processing – and who was also the first to suggest I take the job as a teaching assistant which financed this thesis – and on the other hand to Isabelle Boydens and Françoise D’Hautcourt for making a bet on me after a quick, last-minute interview one evening in late August 2009. Isabelle and Pierrette proved to be patient, understanding, and complementary supervisors: I sincerely thank them both for their precious counselling and continuing support.

My interest in NLP subsequently led me to enrol, concurrently with my doctorate, for an advanced Master in computational linguistics at Universiteit Antwerpen, where Walter Daelemans successfully rid me of all the administrative obstacles I encountered. Under his committed supervision and the mentoring of Roser Morante, I completed a highly rewarding intensive course in biomedical information extraction and attended my first international workshop in Cambridge in 2011. Later on, in 2013, I met Tobias Blanke during another workshop at the CEGES and was pleased to discover that we shared many research interests, from named-entity recognition to OCR and from the digital humanities to graph-based approaches. Walter and Tobias readily accepted to sit on my thesis committee, which I must admit makes me feel honoured.

My thanks go to Liesbeth Thiers, Hilde Cuyt, and Renee Mestdagh from the CO7 Erfgoedcel, and especially to Jochen Vermote from the Ypres City Archive, for welcoming me into their work environment and entrusting me with their dataset and analytics, even allowing me on first encounter to drive home with their 1 TB hard drive of data, to be brought back the following week! They had to wait a few years before seeing any tangible results; I hope they are not disappointed with the outcome of the thesis.
Thanks also to Robert Tiessen and Walter Tromp from Picturae for inviting me to their headquarters in Heiloo, near Amsterdam, for a presentation and in-depth talk.


Cheers to my colleagues Seth van Hooland and Ruben Verborgh for all our trips and talks as part of the Free Your Metadata world tour, from 8:00 a.m. pastrami with Amalia in a deserted diner to Marvin in cheap hotel rooms, and from Whit Monday library sequestration to ukulele recordings in Seth’s living room. Special thanks to Ruben for asking me to be co-author of Using OpenRefine, which was a great collaborative experience and also a (morally, if not financially) rewarding achievement.

Thanks to my fellow PhD students Laurence, Raphaël and Simon (in alphabetical order) for partially relieving me of my teaching and administrative chores in the final months of my writing up; I doubt I could have made it in time without them. Having worked alone for the first four years, I can appreciate the added value of teamwork and constant brainstorming on matters ranging from the best definition of ontologies to the choice of the cutest lemur picture for a new research project.

I am definitely indebted to my family for providing constant support – to my dad in particular for technical assistance, from language detection to Python helpdesk, and to my mum for reading this work three times over without getting bored to death. To my brother Sam and his fiancée Vané for delaying their wedding to just after my submission (or is it the other way round?), and to my brother Tom for our not-so-productive-but-still-rewarding working nights while he wrote his Master thesis. Big thanks also to my other diligent proof-readers: Simon again, Lionel, and my uncle Marc.

Of course my gratefulness goes to my partner Hélène for encouraging, supporting and putting up with me during these six long years. She anticipated right from the start that I should have begun writing much sooner, so I am extra grateful to her for having granted me long undisturbed periods of writing time that proved crucial for the finalisation of this dissertation.
I deeply admire her for the patience she kept throughout the whole process. I feel these acknowledgements would not be complete without a quick nod to the 7e tasse teashop which provided me the fuel to go on. I obviously cannot count the number of cups gulped since 2009, but 13 000 is my rough guess, which averages to about 50 for every page written. Music-wise, Mr. Django remained, as ever, a great source of inspiration in the late night hours. And last but not least, this one goes out to my 2-year-old daughter Jeanne for regularly diverting me from my work, which could sometimes be annoying but in the end proved a blessing in disguise, allowing me to put things into perspective...

List of Figures

1 Disambiguation and enrichment with a …

I.1 Illustration of the Turing test
I.2 Information retrieval
I.3 Information extraction
I.4 …
I.5 SoNaR named entity typology
I.6 Illustration of the Ypres Salient
I.7 Candidate event for template filling

II.1 DIKW Pyramid
II.2 The layer cake
II.3 Overly complex TEI workflow
II.4 Snapshot of the Linked Open Data cloud
II.5 Preview of the “Ostend” resource in DBpedia
II.6 Google’s …
II.7 Arbitrariness of the linguistic sign

III.1 Close and …
III.2 Browsing named entities
III.3 The Hype cycle
III.4 Historische Kranten homepage
III.5 Bilingual article from De Handboog
III.6 Noisy Google Analytics data

IV.1 Quality trade-off
IV.2 OCRised article from the Messager d’Ypres
IV.3 Problematic NER due to multilingual ambiguity
IV.4 Evolution of language
IV.5 Relative salience of terms in US inaugural addresses

V.1 NERD platform


V.2 Calais Viewer
V.3 Babelfy
V.4 … pipeline
V.5 X-Link
V.6 SemEval workflow
V.7 From corpus to knowledge
V.8 Place Browser
V.9 People Finder
V.10 Discovering related entities
V.11 Geographical dispersion of Perelman’s correspondents

Unless otherwise stated, all images are either own work, screenshots from referenced websites, or in the public domain. Images under Creative Commons licences are indicated with one of these: CC BY-SA (Attribution – Share Alike), CC BY-NC-SA (Attribution – Non Commercial – Share Alike), or CC BY-NC-ND (Attribution – Non Commercial – No Derivatives). Visit https://creativecommons.org/ for more details about these licences.

List of Tables

I.1 Summary of related fields
I.2 Relation ontology
I.3 Event ontology

II.1 Results of a SPARQL query for Canada-related explorers
II.2 Linked datasets by topical domain
II.3 First 80 items of …
II.4 Comparison of popular knowledge bases

III.1 Periodical titles by number of XML files
III.2 Different file formats used
III.3 Breakdown of files into pages and articles
III.4 Periodicals with declared non-Dutch articles
III.5 Distribution of languages across periodicals
III.6 Confusion matrix with three languages
III.7 Truncated confusion matrix with only two languages
III.8 Accuracy scores for language detection
III.9 Estimated distribution after language detection
III.10 Top 25 search terms on Historische Kranten

IV.1 DBpedia quality issues
IV.2 Spelling shift from Passchendaele to Passendale
IV.3 Hot topics from the Ypres Times

V.1 URIs of fish species
V.2 Labels of fish species
V.3 Summary of the extracted places
V.4 Linguistic coverage of NER tools
V.5 Quality characteristics of SQuaRE
V.6 Evaluation of systems with SQuaRE
V.7 Cohen’s kappa for our GSC


V.8 Simple entity match (ENT)
V.9 Strong annotation match (SAM)
V.10 Pentaglossal corpus
V.11 Generalisation to other languages
V.12 Generalisation to other domains
V.13 Perelman’s correspondence volume
V.14 Variations of Perelman’s surname due to OCR errors
V.15 Most frequent places in Perelman’s letters
V.16 Organisations from the Pentaglossal News
V.17 People mentioned in Perelman’s contacts
V.18 Concepts extracted from the Ypres sample

List of Abbreviations

ACE    Automatic Content Extraction
ACL    Association for Computational Linguistics
ACM    Association for Computing Machinery
AI     Artificial Intelligence
API    Application Programming Interface
CLEF   Cross-Language Evaluation Forum
CLTE   Cross-Lingual Textual Entailment
CoNLL  Conference on Computational Natural Language Learning
DBMS   Database Management System
DH     Digital Humanities
DIKW   Data, Information, Knowledge & Wisdom
EDL    Entity Discovery and Linking
GATE   General Architecture for Text Engineering
GSC    Gold-Standard Corpus
GUI    Graphical User Interface
HMM    Hidden Markov Model
HTML   HyperText Markup Language
HTTP   HyperText Transfer Protocol
IE     Information Extraction
IR     Information Retrieval
ISO    International Organization for Standardization
JSON   JavaScript Object Notation
KB     Knowledge Base
KBP    Knowledge Base Population


LAM    Libraries, Archives & Museums
LCSH   Library of Congress Subject Headings
LOD    Linked Open Data
LREC   Language Resources and Evaluation Conference
MBL    Memory-Based Learning
MERCKX Multilingual Entity/Resource Combiner & Knowledge eXtractor
MET    Multilingual Entity Task
MUC    Message Understanding Conference
NEE    Named-Entity Extraction
NER    Named-Entity Recognition
NLP    Natural Language Processing
NLTK   Natural Language Toolkit
OCR    Optical Character Recognition
OWL    Web Ontology Language
POS    Part of Speech
RDF    Resource Description Framework
SAM    Strong Annotation Match
SER    Slot Error Rate
SKOS   Simple Knowledge Organization System
SPARQL SPARQL Protocol And RDF Query Language
SQuaRE Systems and software Quality Requirements and Evaluation
TAC    Text Analysis Conference
TDQM   Total Data Quality Management
TEI    Text Encoding Initiative
TREC   Text REtrieval Conference
URI    Uniform Resource Identifier
URL    Uniform Resource Locator
VIAF   Virtual International Authority File
W3C    World Wide Web Consortium
WSD    Word Sense Disambiguation
XML    eXtensible Markup Language

Introduction

Outline

This first part reviews the key features of our doctoral work and provides the reader with the background material necessary to grasp the potential impact and significance of the thesis, but also its inevitable limitations. The stakes and challenges are broadly sketched out in order to give a good overview of what will be accomplished later in this dissertation, with a special focus on incentives, goals, and structure.

Section 1 starts with the assertion of the main thesis defended within the dissertation, followed by a thorough presentation of our various motivations for undertaking this work. We also expose the interdisciplinary theoretical framework in which this research is set and offer some justification for the current relevance of information extraction and semantic enrichment in a multilingual environment.

The objectives of our PhD are then outlined in Section 2. To sustain our premise, we formulate four research questions that we propose to investigate in the course of the thesis (Section 2.1) and define our methodology to answer them (Section 2.2). In order to illustrate our point effectively, we introduce a specific use case based on the semantic enrichment of a trilingual (Dutch/French/English) archive (Section 2.3).

Finally, the articulation of the five chapters is presented in Section 3: from information extraction to knowledge discovery through semantic enrichment of empirical content1 with Linked Data, including constraints of data quality, language, and time. Along the way, our governing principle will be the constant confrontation of theoretical considerations with the operational reality of our case study and the practical needs of end users.

1 As opposed to deterministic content; see Chapter III.


Contents
  1 Motivation
  2 Objectives
    2.1 Research questions
    2.2 Method
    2.3 Use case
  3 Structure

1 Motivation

In this dissertation, we maintain that a high degree of generalisation is often more beneficial to the discovery of knowledge than an excess of specialisation. This affirmation stands in clear contrast with decades of mainstream work on the development of ever more refined approaches to information extraction for a single language (Bender, 2011), domain (Chiticariu et al., 2010), or type of content. Likewise, we insist that the increased compartmentalisation of disciplines has led to an artificial partition of knowledge acquisition techniques, even inside a single field of study. This observation holds particularly true for empirical domains, which will be introduced in detail in Chapter III.

To counter this restrictive tendency, we believe in the added value of interdisciplinary work which integrates advances from a broad range of disciplines in order to reason outside the narrow boxes of their respective communities. We explicitly propose to overcome the limitations of traditional information extraction by augmenting its output with Semantic Web technologies (Hitzler et al., 2009), along with insights from history, data quality, formal logic, and the philosophy of language.

Our ultimate aim is the automated harvesting and extraction of relevant new knowledge out of multilingual content and its free and open dissemination to end users. Finding information is no obvious task, even less so online. The loose structure of the Web, which has contributed to its worldwide success by letting billions of non-expert users participate in the global experience in a completely decentralised manner, also raises many questions in terms of coherence and completeness.

The Web, as a matter of fact, is far from being a consistent place, with lots of redundant or contradictory facts in addition to an endless variety of formats in use simultaneously without any form of centralised control. Conversely, this redundancy never guarantees that a web search will be complete, as the very same lack of a central organisation hampers comprehensiveness. The strengths of the Web, it appears, are also its weaknesses.

Relying on search engines for information retrieval is essential due to the amount of online material to process in a very short time, but it also raises the question of the suitability of these engines for the task. While dealing with unstructured text is natural for human users, it remains a highly challenging task for machines, which need to rely on formal patterns. Tried and tested retrieval algorithms and indexation practices that worked well on limited collections of data often fail to scale up to the dimensions of the Web (Banko and Brill, 2001).

Over the last decades, we have therefore witnessed huge progress in the automated processing of text written in various human languages (Jurafsky and Martin, 2009), and important efforts have also been made to standardise the Web and make it more machine-friendly (Berners-Lee et al., 2001; Bizer et al., 2009a). At the same time, many institutions from the cultural sector and elsewhere have invested time and money into the massive digitisation of billions of documents (Coyle, 2006; Hahn, 2008; Tanner et al., 2009), prompting the information overload that led to the Big Data phenomenon.

However, the seamless integration of these different aspects is far from optimal: unstructured documents still coexist with Semantic Web resources, and metadata from digitisation projects with primary sources, without necessarily interacting with one another efficiently. One of the contributions of this dissertation is therefore to offer a conclusive case study teaching us what can be gained for end users from an integrated approach to online collections of documents, focusing primarily on historical and multilingual material.

The coexistence of hundreds of languages on the Web is indeed both a challenge and an opportunity. A challenge, because the risk is to build a new digital Babel where resources in multiple languages accumulate separately without ever communicating. But also a huge opportunity, because it has the potential to unlock access to other cultures and to build bridges between them in a way that would have been unthinkable before the advent of the Web. The success of cross-lingual collaborative projects such as Wikipedia is heartening in this respect, but they raise the important issue of data quality and of the control of information. To what extent can the sum of knowledge be distributed and crowd-sourced, and to what extent should it remain under the responsibility of a central authority? Whereas dictionaries and encyclopaedias used to be the strict prerogative of scholars – by no means a guarantee of objectivity, but at least a form of peer-reviewed control – anyone can now contribute to them online.

In a world ruled by data, the mapping of knowledge is less than ever a trifling task but carries a significant impact on our worldview, as emphasised by the exhibition currently ongoing at the Mundaneum, the visionary “paper Google” dreamed up by pioneers Paul Otlet and Henri La Fontaine at the end of the 19th century. In this age of information overload, let us ask: how can we make sense of the growing mass of documentation available? How do we transcend Big Data to get at their meaning? How do we discover new knowledge?2

2 Knowledge being an elusive concept in philosophy, we deliberately adopt the pragmatic definition of Davenport and Prusak (1998): “Knowledge is a fluid mix of framed experience, values, contextual information, expert insight and grounded intuition that provides an environment and framework for evaluating and incorporating new experiences and information”.

Like all interdisciplinary work, this doctoral research falls within the scope of not one but several interconnected theoretical frameworks. Our starting point is information extraction (IE), an application of natural language processing related to artificial intelligence. IE emanates from computational linguistics and includes various tasks such as named-entity recognition (NER), relation detection, and event extraction. From this point, we move on to explore potential improvements thanks to the contribution of several other fields.

Pioneering work in IE, mainly conducted in the United States, primarily focused on the English language, often gratuitously assuming a degree of generalisability to other languages without demonstrating it in practice. In the nineteen nineties, Palmer and Day (1997) argued that “there ha[d] been little discussion of the linguistic significance of performing N[amed] E[ntity] recognition”.

The 2000s saw a shift in this trend, with the appearance of many IE systems specialised for one or another language, incorporating basic language-specific resources – like diacritic characters and stopwords – but also more complex ones such as specific syntactic rules and semantic features. Such systems were often adapted from previous ones that had been specifically designed for English, requiring a huge effort to eliminate features hard-coded with only English in mind, thereby laying bare the Anglo-centric myth of the intrinsic universality of this particular language. Other researchers chose to start from scratch in order to better accommodate the specificities of another language (say, French or Dutch), but their systems in turn became harder to adapt further because of this high level of linguistic specialisation.

Few approaches, however, have considered handling natural language as a whole (i.e. any human language, as opposed to formal languages), relying only on common, universal features. The idea of a universal grammar is not new, going as far back as the 13th century with the English philosopher Roger Bacon observing that all languages share a common structure. This theory was mainly developed by Noam Chomsky (1956) and taken over by Steven Pinker (1994), who both argued that the ability to acquire language is hard-coded in the brain, making Universal Grammar a biological evolutionary trait.3 While it cannot capture certain subtleties, a language-independent conception of IE offers the advantage of being widely reusable and adaptable to new resources in a variety of languages.

3 Although most linguists accept Universal Grammar, a vocal minority rejects the theory for various reasons. See Hinzen (2012) for a summary of these criticisms. A recent paper revived the debate by claiming evidence of a universal feature in 37 languages (Futrell et al., 2015).

Several recent initiatives show that IE-related technologies are attaining maturity and are therefore ready to be scaled up in a business context, as the awareness and interest of the industry in these technologies are increasing. Shortcomings, however, remain to be addressed, such as the volatility of evaluation relative to the application domain (Alexopoulos et al., 2015) and the quality of knowledge bases with regard to industrial needs.

In another respect, the importance of multilingualism in our globalised society hardly needs to be highlighted, even less so in the context of the European Union with its 24 official working languages.4 Several European projects focus on linguistic diversity, and some of them would definitely benefit from general-purpose, language-independent IE tools accessible to the public. The scale of these projects, and the time and money invested in them, also testify to the utter relevance of these research topics today.

Combining these two themes, the FREME project,5 co-funded by the H2020 Framework Programme for Research and Innovation,6 aims precisely to “build an open innovative commercial-grade framework of e-services for multilingual and semantic enrichment of digital content”. The semantic enrichment of digital content in a multilingual context is thus a deeply contemporary and relevant concern. According to Tang et al. (2015), “recognizing entity instances in documents according to a knowledge base is a fundamental problem in many data mining applications”. In a recent and thorough survey of entity linking techniques, Shen et al. (2015) also emphasise the current stakes for information extraction:

“The amount of Web data has increased exponentially and the Web has become one of the largest data repositories in the world in recent years. Plenty of data on the Web is in the form of natural language. However, natural language is highly ambiguous, especially with respect to the frequent occurrences of named entities. A named entity may have multiple names and a name could denote several different named entities. This intrinsic ambiguity makes language technologies more relevant than ever”

if we are to take advantage of the huge potential of the Web in an automated manner. In what follows, we will therefore define concrete objectives and a clear methodology in order to bring the techniques of information extraction one step further towards the ideal of knowledge discovery.
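The intrinsic ambiguity that Shen et al. describe boils down to a many-to-many relation between surface forms and entities. This can be made concrete with a minimal sketch; the mapping below is a toy example of my own (the URIs follow DBpedia's naming scheme but are chosen purely for illustration):

```python
# Toy many-to-many mapping between names and entities (illustrative only):
# one entity can have several names (Ypres/Ieper), and one name can denote
# several entities (Washington).
SURFACE_TO_ENTITIES = {
    "Ypres": ["http://dbpedia.org/resource/Ypres"],
    "Ieper": ["http://dbpedia.org/resource/Ypres"],  # Dutch name, same entity
    "Washington": [
        "http://dbpedia.org/resource/Washington,_D.C.",
        "http://dbpedia.org/resource/Washington_(state)",
        "http://dbpedia.org/resource/George_Washington",
    ],
}

def is_ambiguous(name):
    """A mention is ambiguous when it may denote more than one entity."""
    return len(SURFACE_TO_ENTITIES.get(name, [])) > 1

def variants(uri):
    """All surface forms that may denote a given entity."""
    return [name for name, uris in SURFACE_TO_ENTITIES.items() if uri in uris]
```

Resolving an ambiguous mention such as "Washington" to a single URI is precisely the disambiguation step that classical IE omits and that entity linking adds.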

4 http://ec.europa.eu/languages/policy/linguistic-diversity/official-languages-eu_en.htm
5 http://www.freme-project.eu/
6 http://ec.europa.eu/programmes/horizon2020/

2 Objectives

The main objective of the present work is to investigate the limitations of specialised information extraction techniques and to offer generic alternatives in a multilingual context. In order to address this issue, we will consider several areas of enquiry, from traditional named-entity recognition to the disambiguation of online resources, and from the specificities of empirical content to problems related to quality, language, and time. Let us introduce these various aspects of our work.

Information extraction is commonly defined as the identification and classification of relevant information in unstructured text. In its classical sense, IE does not involve any disambiguation phase but rather stops right after categorisation, which is problematic because natural language is intrinsically ambiguous. To overcome this limitation, we propose to rely on external referents providing unique identifiers. This will allow us to bring IE a step further with techniques of entity linking and semantic relatedness.

The problem is made even more complex by a number of external constraints. First, empirical information is not deterministic and is subject to human interpretation, raising concerns about objectivity that have to be tackled upstream. Second, monolingual IE tools designed for English are not necessarily well suited to operate in our increasingly multilingual information society. Third, natural languages are not static objects that can be processed once and for all, but rather “dynamic and variable systems” (Diller, 1996) that evolve over time. All of these constraints have important consequences which are insufficiently addressed in the IE literature, so we offer to remedy this.

Following the work of Blanke and Kristel (2013) for the European Holocaust Research Infrastructure (EHRI), our final goal is to “use information extraction services to enrich the researchers’ experience”. Concretely, this amounts to providing additional knowledge to users in a Web search interface, through a mechanism of semantic annotation, disambiguation and enrichment. This can only be achieved by considering IE from a broader point of view than its traditional epistemological conception.

As highlighted in Section 1, there is currently a crying need for solution-oriented approaches to semantic enrichment. Focusing on the fitness for use principle, which will be introduced in detail in Chapter IV, will ensure we keep this objective in mind in order to deliver operational recommendations and tools meeting the actual needs of people looking for information, without being carried away by a theoretical approach to knowledge discovery disconnected from the reality of field work.
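To give an intuition of what semantic relatedness between linked entities can mean in practice, here is one minimal sketch: it measures the overlap of two entities' outgoing links with the Jaccard coefficient. The link sets are made up for the example, `dbr:` abbreviates DBpedia resource URIs, and Jaccard overlap is just one possible measure among many, not a commitment of the thesis.

```python
# Hypothetical link sets: which resources each entity points to.
LINKS = {
    "dbr:Ypres": {"dbr:Belgium", "dbr:World_War_I", "dbr:West_Flanders"},
    "dbr:Passendale": {"dbr:Belgium", "dbr:World_War_I", "dbr:Zonnebeke"},
    "dbr:Ottawa": {"dbr:Canada", "dbr:Ontario"},
}

def relatedness(a, b):
    """Jaccard overlap of two entities' outgoing links (0 = unrelated, 1 = identical)."""
    shared, total = LINKS[a] & LINKS[b], LINKS[a] | LINKS[b]
    return len(shared) / len(total)
```

Under this toy measure, Ypres and Passendale (two out of four links shared) score 0.5, while Ypres and Ottawa (no shared links) score 0.0: relatedness falls out of the link structure itself, without any language-specific analysis.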

2.1 Research questions

To sustain our main thesis, we will articulate our argumentation around four central research questions which are closely linked to the first four chapters of this dissertation, as detailed below.

Question 1 Is the philosophical and linguistic distinction between named entities (proper nouns) and terms (common nouns) justified? Does it not maintain an arbitrary separation between objects of thought? For instance, is an encyclopaedia entry about Henri La Fontaine intrinsically different from an entry about the concept of peace, notwithstanding the ontological differences between the two? Practically speaking, is there a valid reason for using different tools for the extraction of these two types of semantic units?

Question 2 Does interdisciplinarity7 improve our understanding of the world? Can Linked Data and Semantic Web technologies (and other disciplines that are not traditionally associated with computational linguistics) contribute to bringing information extraction (IE) a step further through full disambiguation? Can the transposition of concepts from history or philosophy to natural language processing help overcome problems that have been left unaddressed for lack of interest or short-term vision?

Question 3 Are generic approaches to IE more productive than domain-specific ones? Does this hold true for any application domain? Does the compartmentalisation of knowledge into disciplines not impede the quest for integrated, comprehensive search on any topic? Does the cost of manually specialising a tool for a given domain not outweigh the benefits obtained? Even with automated or semi-automated approaches, domain specialisation of IE tools remains a resource-consuming task, so to what extent is it profitable?

Question 4 Similarly, are language-independent approaches to IE more sensible than language-specific ones? If so, in which cases? Is the gain in accuracy obtained by fine-tuning systems for a given language (say, English) counterbalanced by the loss in portability to other languages? With over 7 000 languages spoken on the planet8 (of which almost 300 have their own version of Wikipedia), is it not self-defeating to restrict IE systems to one or a few languages if we intend to access the sum of knowledge?

7 We use the terms interdisciplinary and interdisciplinarity rather than “multidisciplinary” and “multidisciplinarity” to indicate a higher level of interaction between disciplines than mere juxtaposition, in much the same way that interculturalism goes further than multiculturalism.
8 http://www.ethnologue.com/ lists 7 102 of them as of May 18, 2015.

These research questions will guide our reflection on IE and related top- ics throughout the thesis, and will influence the way in which we address the existing literature. More specifically, Question 1 is about air-tight semantic categories and will be investigated in ChapterI, which presents the traditional approach to information extraction from a computational linguistics perspec- tive. Question 2 is about the added value of mixing input from different dis- ciplines and will mainly, but not exclusively, be discussed in ChapterII which shows what can be learned from the Semantic Web. Question 3 is about the specificities of application domains and will be addressed in Chapter III which deals with the typology of sciences and the particularities of empirical data. Finally, Question 4 is about multilingualism, one of the core issues tackled in ChapterIV. Investigating these research questions will then allow us to discuss practical implementation of a knowledge discovery system in ChapterV. Our questions are not straightforward, i.e. they cannot be easily dismissed with a plain yes or no. On the contrary, they are complex interrogations under- pinning whole visions of the world and therefore need to be evaluated in terms of costs and benefits, balancing each element against the others. Accounting for these different worldviews and their operational implementations for the representation, extraction and discovery of knowledge is at the heart of the approach adopted in this dissertation. In order to answer these questions, we will strive to reconcile the tradi- tional view of IE in computational linguistics with a more pragmatic, results- oriented approach that has gained momentum with the Semantic Web and the ability to disambiguate entities by linking them to a knowledge base. 
For cultural institutions and other players interested in the enrichment of unstructured content, these techniques, grouped under the common denomination of the “digital humanities”, constitute an opportunity to gain value out of existing material at a low cost.

We are confident that our research questions are helpful in building a better approach to IE, while humbly recognising with Woody Allen that confidence is often what you have before you understand the problem.9 Inevitably, some answers will be only half-answers and will maybe appear inconclusive. Yet as the American poet and novelist Don Williams, Jr. said: “our lessons come from the journey, not the destination”.10

To sum up, this thesis could be considered the utopian quest for the perfect cross-lingual, cross-domain IE system that will massively enhance the daily search experience of end users. If we do not find it, then let us hope that the search will be worth the trouble...

9 In James Geary’s Guide to the World’s Great Aphorists (2007, p. 9).
10 In Eric J. Keese’s Timeless for Living Gigantic: 1001 Quotations (2015, p. 82).

2.2 Method

In order to achieve our objectives and to help answer our research questions, we develop an original methodology relying on several components which will be introduced progressively. Our method involves some basic linguistic processing steps such as tokenisation (lexical segmentation of a text into individual words or tokens) and relies on formal patterns to detect entity candidates, but it does not include any language-specific analysis.

To go beyond entity identification and classification, we propose to use Linked Data resources singled out by unique IDs. By building comprehensive gazetteers of labels (i.e. concepts lexicalised in natural languages), we aim to disambiguate a wide range of textual mentions and to filter them by type. This approach, commonly called entity linking, is evaluated on the basis of a case study (Historische Kranten project) introduced in the next section.

To overcome the difficulties raised by the empirical nature of historical content, we design a gold-standard corpus (GSC) of places (by far the most popular type of entities among users of the http://historischekranten.be/ website) and evaluate popular NER and entity linking tools against it. This evaluation brings to light some recurring shortcomings of such tools: they work unequally well on different languages, suffer from cross-lingual ambiguity and often retrieve either too much or too little information.

We therefore propose a custom tool called MERCKX and compare its performance with existing tools, showing how simple language- and domain-independent insight can improve the extraction and disambiguation process. In order to handle noisy content resulting from optical character recognition (OCR), we suggest relying on clustering algorithms, although this functionality is still under development.
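The pipeline just described — language-independent tokenisation followed by type-filtered lookup in a multilingual label gazetteer — can be illustrated with a minimal sketch. All identifiers and labels below are invented for illustration and are not drawn from the actual MERCKX gazetteers:

```python
import re

# A toy gazetteer mapping lexicalised labels (in any language) to a unique
# identifier and a semantic type. Real gazetteers would be built from
# Linked Data resources; these URIs are illustrative placeholders.
GAZETTEER = {
    "ostend":   ("http://example.org/id/Ostend_BE", "PLACE"),
    "ostende":  ("http://example.org/id/Ostend_BE", "PLACE"),
    "oostende": ("http://example.org/id/Ostend_BE", "PLACE"),
    "ieper":    ("http://example.org/id/Ypres_BE",  "PLACE"),
}

def tokenise(text):
    """Language-independent lexical segmentation into lower-cased tokens."""
    return re.findall(r"\w+", text.lower())

def link_entities(text, wanted_type="PLACE"):
    """Return (mention, URI) pairs for tokens found in the gazetteer,
    filtered by semantic type."""
    hits = []
    for token in tokenise(text):
        if token in GAZETTEER:
            uri, entity_type = GAZETTEER[token]
            if entity_type == wanted_type:
                hits.append((token, uri))
    return hits

print(link_entities("Deux grandes fêtes à Ostende"))
```

Note how the Dutch and French spelling variants resolve to the same identifier: disambiguation happens through the shared URI, not through any language-specific analysis.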
The usefulness of the results obtained is assessed with regard to user search statistics covering the four years of the project website’s existence, showing that the extracted information matches the needs of the users. A qualitative analysis shows that this approach has the potential to improve the search experience of users at a reduced cost. In our research perspectives, we contemplate further validation by getting direct feedback from the users on the relevance of the entities extracted.

Finally, we report on the portability of our method and its generalisation to other languages, domains, and types of content. The lightweight approach we adopt allows us to add support for components unrelated to our original case study in an agile way, as we demonstrate with two extra corpora. A proof-of-concept implementation of knowledge discovery applications is offered to demonstrate the usability of this workflow.

The idea of semantic enrichment is not new, so the originality of our work resides less in the topic itself than in the way we address it from an innovative angle, guided along the way by our four research questions. Our personal contribution can be articulated around three central themes which will be introduced in detail in Chapter IV.

Quality. Quality control originates in industry and has been applied to a variety of domains including modern-day databases, but data quality has seldom been considered seriously as an issue affecting either the output of information extraction or the Semantic Web. Using an original temporal model detailed below, Isabelle Boydens developed a framework for monitoring the quality of administrative databases and showed that it could be generalised to any empirical system, with concrete operational consequences measurable with a cost-benefit analysis. We propose to build upon this model in order to apply it to another kind of empirical information: digitised cultural heritage and the contents of knowledge bases.

Language. The multilingual aspect of our work is equally important. In order to mine the World Wide Web in all its linguistic variety and to amass knowledge, semantic enrichment algorithms cannot be limited to one or a few languages. Although our corpus is quite modest in this respect, with only three Western European languages represented, it nevertheless allows us to experiment with cross-lingual approaches which can subsequently be validated on other, more diverse languages. Developing language-independent techniques is particularly challenging due to linguistic ambiguity, but also to the absence of one-to-one correspondence between concepts in different languages.

Time. When dealing with historical material spanning a century and a half, the temporal dimension cannot be ignored. Accounting for the drift of concepts over time and tracking the emergence of new terms is one of the ideas explored later on in this dissertation. Building upon Fernand Braudel’s framework of stratified timescales (temporalités étagées) but also on Norbert Elias’ concept of evolving continuums (Wandlungskontinua), which allows us to account for the bidirectionality of temporal fluxes,11 we propose a generic model for language evolution and show that knowledge representation must be apprehended dynamically.

11 Both these theories were originally transposed by Boydens (1999) to grasp the interactions between reality and the norm in the context of Belgian social security.

2.3 Use case

“T’as p’têt’ fait le tour exotique : celui de Tournai à Doornik...” André Bialek, Visite guidée

Throughout this dissertation, examples will be drawn whenever possible from the Historische Kranten corpus which will be presented in detail in Chapter III, allowing us to illustrate our point effectively while anchoring our research in a concrete case. This trilingual (Dutch/French/English) archive, provided by the city of Ypres, consists of over a million files from 41 Belgian periodicals published between 1818 and 1972 and focuses primarily on the Westhoek region.

The limits of specialisation are indeed better brought to light with a specific use case. Let us imagine a researcher interested in the port of the Belgian coastal city of Ostend, especially in its evolution over time in all respects (biodiversity and ecology, tourism and economy, etc.).

Traditional full-text search is clearly not satisfactory, since words such as “port” and “Ostend” are naturally ambiguous: port can refer to sweet Portuguese wine, while cities named Ostend also exist in England, Germany and New Zealand. Not taking these cases into account introduces noise in the search. The presence of another Ostend in a Belgian corpus may seem as unlikely as the occurrence of Paris, Texas in French literature, but the periodicals we used show that this possibility is not completely remote:

“Nous apprendrons probablement à nos lecteurs qu’il existe dans le comté d’Essex un village que l’on appelle Ostend. Il n’existe apparemment aucune similitude ethnographique entre notre “Ostende” et “Ostend” en Angleterre et nous ne trouvons aucune raison linguistique susceptible d’avoir conduit les Anglais à donner ce nom à un village.” (Le Sud, Sunday 4 August 1935)

Moreover, concepts are often named differently from one language to another (haven in Dutch, puerto in Spanish) but also within a single language with synonymy (harbour). Even for proper nouns, local variants can arise: Ostend is called Ostende in French and Oostende in Dutch. The failure to account for this phenomenon generates silence.

Example 1. Special week-end trips to Ostend (without passports)

Example 2. Deux grandes fêtes à Ostende

Example 3. Een boot met wapens aangeslagen te Oostende

This issue is perhaps even more severe in our case, since a literal search for the term “Ostend” returns only 476 hits, such as Example 1, from a million-document archive. Using the French spelling returns 8,396 results (among which Example 2). With the Dutch one we get 39,104 more (e.g. Example 3). To counter the twin issues of noise and silence, we will demonstrate that Linked Data can bring us a step further towards full disambiguation and enrichment of user queries. An example of this process is shown in Figure 1 for the disambiguation of the city of Ostend.

Figure 1: Disambiguation and enrichment with a knowledge base

We see that document r contains a reference to the beaches of Ostend. After performing named entity recognition, the extracted entity “Ostend” can be categorised as a location but remains ambiguous due to the existence of multiple places with the same name. Querying a knowledge base with q allows us to select the correct uniform resource identifier (URI) based on frequency of use or the surrounding context (the mention of beaches for instance). Moreover, multilingual enrichment can be performed thanks to the owl:sameAs12 and rdfs:label properties once the entity has been properly disambiguated.
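This enrichment step can be clarified with a toy in-memory triple store. The property names follow the W3C rdfs:label and owl:sameAs vocabulary terms, but the triples, prefixes, and URIs themselves are invented for illustration:

```python
# Toy triple store of (subject, predicate, object) statements, mimicking the
# rdfs:label / owl:sameAs pattern described above. All data is illustrative.
TRIPLES = [
    ("ex:Ostend", "rdfs:label", ("Ostend", "en")),
    ("ex:Ostend", "rdfs:label", ("Ostende", "fr")),
    ("ex:Ostend", "rdfs:label", ("Oostende", "nl")),
    ("ex:Ostend", "owl:sameAs", "wd:Q9955"),  # hypothetical external identity link
]

def labels(uri, store=TRIPLES):
    """All lexicalisations of a disambiguated entity, keyed by language."""
    out = {}
    for s, p, o in store:
        if s == uri and p == "rdfs:label":
            label, lang = o
            out[lang] = label
    return out

def expand_query(uri):
    """Enrich a user query with every language variant of the entity,
    countering the 'silence' problem of literal full-text search."""
    return sorted(labels(uri).values())

print(expand_query("ex:Ostend"))  # all three spellings
```

A query for the disambiguated entity can thus be expanded into all its spellings at once, instead of the user having to try “Ostend”, “Ostende”, and “Oostende” separately.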

12 This property, which allows identity between resources to be expressed, will be presented in Chapter II, while problems arising from its use will be discussed in Chapter IV.

Despite what Figure 1 might suggest, knowledge bases do not operate as “black boxes”,13 i.e. they are not opaque systems taking a query as input and sending back a response without offering clues to how this response was reached. On the contrary, they are fully transparent repositories allowing the enrichment process to be tracked at every step in the pipeline, from a simple string of characters to a fully disambiguated concept.

But technical approaches are nevertheless always in danger of becoming black boxes to neophyte users, as underlined by Ramsay and Rockwell (2012): “either the technique is encapsulated inside the black box of magical technology or it is unfolded in tedious detail obscuring the interpretation – tedious detail which ends up being a black box of tedium anyway”. The devil, as ever, is in the details, and the clarity of arguments should remain a priority.

Most knowledge bases (KBs) are constructed in part by the collaborative work of non-expert users and in part by the input of robots.14 Depending on the application domain, bots can be more or less prominent compared to human beings (Steiner, 2014). In both cases, quality is a central issue: KBs are not curated vocabularies maintained by professionals and can therefore suffer from important downsides.

While many KBs contain knowledge in multiple languages, the structure used for the representation of this knowledge can vary from one KB to another. Some KBs (like DBpedia) have different pages for each language that are linked to one another, while others (like Wikidata) prefer to keep a single central page that can be served in various languages. However, the absence of a systematic one-to-one correspondence between concepts in some languages makes both models vulnerable. The French resource “port”, for instance, links to “haven” in Dutch. “Haven”, however, is also an English resource, a synonym for “harbour”.
A harbour can be either natural or artificial, but this resource in turn links to “havre” in French and “natuurlijke haven” in Dutch. This partial matching of close terms makes disambiguation and enrichment even more arduous. Nevertheless, KBs offer an unequalled coverage of human knowledge accessible at virtually no cost.

Although the Historische Kranten corpus will only be presented thoroughly in Chapter III in the broader context of the digital humanities and empirical content, every chapter will draw on the material of this case study to emphasise the usefulness of the various components introduced in the course of the thesis in order to reach our objectives.

13 Contrary to most commercial NER tools, as will be seen in Chapter V.
14 A notable exception is the expert knowledge base OpenCyc: http://opencyc.org/.

3 Structure

The dissertation is divided into five chapters covering both the state of the art and our personal contribution to the field, at a conceptual and technical level. We start with information extraction to reach knowledge discovery through the exploitation of Linked Open Data for semantic enrichment, and show how the union of these domains helps to overcome known limitations related to the handling of empirical content, focusing along the way on data quality, multilingual approaches, and the evolution of concepts over time.

The governing principle underlying the whole thesis is the design of a functional tool to improve the search experience of users of the http://historischekranten.be/ website. The argument will therefore repeatedly refer to this unifying theme, anchoring our research in a concrete application case. But our aim in doing so is of course to show the broader potential of semantic enrichment technologies, not only for similar digital humanities projects but also for a range of domains dealing with empirical content on a daily basis. Generalisation is therefore an essential component of our work, which will constantly oscillate between high-level conceptual theorisation and down-to-earth practical implementation.

Chapter I starts with a historical account of the appearance of information extraction within the community of natural language processing and artificial intelligence. It goes on with a discussion of the achievements reached over the years in its three main subtasks: named-entity recognition (NER), relation extraction and event extraction. NER receives particular treatment since it has been seen by many IE researchers as the crystallisation point of natural language ambiguity (Ehrmann, 2008; Hoffart et al., 2011), and has therefore become a classical task for competitions with its own conventions, although the ontological description of a named entity has remained quite elusive.

Chapter II introduces the Semantic Web – a structured layer designed to allow a better exploitation of the Web by machines – and the progressive formation of the Linked Open Data (LOD) cloud under the efforts of Tim Berners-Lee. Whereas traditional NER was necessarily limited to plain recognition of an entity’s boundaries and its classification in broad semantic categories, the explosion of online information resulting in the Big Data phenomenon has enabled the full disambiguation of entities through the mechanism of URIs.15

15 A special variety of unique identifiers of which traditional Web URLs are a subclass. The distinction between URIs and URLs, along with the specificities of URIs compared to traditional unique IDs, will be discussed in Chapter II.

Central to the LOD project are knowledge bases, which come with their own organisations (ontologies) and work as big repositories of structured data in the RDF format. From the fusion of IE and LOD, the new task of entity linking is born, paving the way for the semantic annotation and enrichment of textual data.

Chapter III discusses the intrinsic characteristics of empirical data, especially in the humanities but also elsewhere, and reflects on how these differ from deterministic data. After a critical presentation of the digital humanities (DH), including a consideration of the concept of distant reading (Moretti, 2005), we present an original case study based on the digitisation and online publication of over one million historical documents from a local archive, already sketched in Section 2.3. This project is situated in the context of similar initiatives in the cultural heritage sector, which has taken a special interest in the use of Semantic Web technologies and LOD for the promotion of their collections.16 The needs of the different stakeholders are carefully established in order to collect a number of specifications that will serve as guidelines for the implementation part.

Chapter IV gives special attention to internal and external constraints related to semantic enrichment projects at various levels. Data quality is an important issue affecting both input material (poor OCR, inconsistent XML, etc.) and the tools used in the enriching process (incomplete and/or incoherent Linked Data, irrelevant NER results, etc.). The challenges of multilingualism are also an important dimension of our work that will be dealt with extensively. The implications of a language-independent context for IE tools will be investigated, along with the specificities of cross-lingual corpora and the recent evolutions of the Semantic Web in an increasingly multilingual online reality. Finally, some issues related to the evolution of language and concepts over time will be taken into account, concept drift being a sizeable problem when dealing with historical material ranging over decades or even centuries.

16 The financial constraints often undergone in this field make it a particularly interesting object of study. Whereas better-funded application domains manipulating empirical data – such as social security, medicine, defence, law, finance, etc. – can afford to pay for more controlled knowledge sources, the cultural sector and related DH fields are compelled by lack of funds to reuse existing online material. We can also argue that the consequences of bad quality are less dire for cultural metadata than in domains where huge monetary losses, reputation or even human lives are at stake, although this is not an absolute hierarchy but rather depends on the values of our civilisation (Elias, 1996).

Chapter V finally puts into practice the lessons learned from the previous four chapters in order to construct a workable model for knowledge discovery. We first evaluate a variety of tools related to NER and semantic annotation. By retaining interesting components and adding a few insights of our own, we propose an original tool called MERCKX (Multilingual Entity/Resource Combiner & Knowledge eXtractor) for the disambiguation of relevant entities and the discovery of related information. In order to evaluate our approach, we compare our results against those of related tools and show where there is still room for improvement. We conclude this final chapter with a proof-of-concept implementation of our work for the Historische Kranten project and discuss how this model could profitably be generalised and transposed to other languages and application domains.

After these five chapters, we close the dissertation with a summary of the main findings and outcomes – including tentative answers to our research questions – and a discussion of the limitations of our work. We conclude by providing concrete recommendations for transposing our methodology to other corpora and examining a few potential future applications for the model developed.

Chapter I

Information Extraction

Outline

In order to start our thesis on sound theoretical foundations and to provide building blocks for the semantic enrichment of content, this chapter explores the emergence of information extraction (IE) as a sub-field of natural language processing (NLP), along with its relationship to artificial intelligence (AI).

Section 1 provides some historical background about NLP and IE, showing how the latter was progressively driven to the forefront, as high expectations regarding the potential of NLP had to be lowered in order to meet more realistic goals. Specifically, we draw a distinction between IE and several related, partially overlapping fields and justify the choice of the term used.

We then focus in Section 2 on one of the main IE tasks – named-entity recognition (NER) – providing a thorough survey of state-of-the-art research into NER. Particular attention is given to the definition of the NER task, but also to the typology of named entities, different types of NER systems, and finally to the question of named-entity ambiguity and disambiguation.

In Section 3, we explore other relevant IE applications: relation detection, event extraction, and template filling. Relation detection makes it possible to find links between entities extracted during the NER stage, event extraction deals with actions in a temporal context, while template filling focuses on situations recurring in a fixed, predefined format.

All these IE subtasks are relevant to our main objective since they are not domain-specific but allow realities to be extracted from a broad range of subjects. However, IE has some known limits which have to be addressed, including the absence of unique identifiers, a poor handling of multilingualism, and some quality issues. We therefore ground our work in formal linguistics while already pointing to epistemological limitations related to current challenges.


Contents
1 Background ...... 21
1.1 Natural language processing ...... 21
1.2 Information extraction ...... 24
1.3 Related fields ...... 26
2 Named-Entity Recognition ...... 29
2.1 Task definition ...... 29
2.2 Typologies ...... 31
2.3 Entity ambiguity and disambiguation ...... 35
3 Relations and Events ...... 40
3.1 Relation detection ...... 41
3.2 Event extraction and temporal analysis ...... 44
3.3 Template filling ...... 48

1 Background

This first section provides some context about natural language processing in general (Section 1.1) and information extraction in particular (Section 1.2) in order to put the next sections into perspective. For historical material, we are greatly indebted to Jurafsky and Martin (2009), whose classic handbook remained a source of inspiration for us throughout this thesis. In Section 1.3, we then proceed to distinguish IE from related tasks such as data mining, text analytics and term extraction.

1.1 Natural language processing

The idea of NLP can be traced back to the design of the Turing (1950) test. Instead of answering the question “Can machines think?” – which he considered “too meaningless to deserve discussion”1 – Turing designed a test he called the Imitation Game and proposed to replace the original question by another: “are there imaginable digital computers which would do well in the imitation game?” In order to pass the test, a machine has to fool an examiner into believing it is a human being by answering their questions as a real person would. If the computer is able to trick the interrogator, Turing argues, then it can be considered “intelligent” (see Figure I.1):

“I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10⁹, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.”

This famous test, which is also the starting point of artificial intelligence (Russell and Norvig, 2009), has been heavily criticised over the last half-century for a range of reasons,2 but it remains the first attempt to demonstrate what a thinking machine would be able to do. Turing had anticipated most of the objections to his “imitation game”, which makes his visionary work

1 Although Turing adds: “Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted”. This prediction, among many others of Turing’s, has been proved right: French speakers will typically say l’ordinateur réfléchit (“the computer is thinking”) while the hourglass (wait) cursor is displayed on the screen.
2 See, for instance, Searle (1980) and his “Chinese room” challenge to AI, or the repeated critiques of Dreyfus (1972); Dreyfus and Dreyfus (1986).

Figure I.1: Illustration of the Turing test, adapted from Saygin et al. (2000)

all the more remarkable, although the rather simplistic specifications of the test make it relatively easy to pass.3

Among these objections, we find the arguments from various disabilities, which Turing says are of the form: “I grant you that you can make machines do all the things you have mentioned but you will never be able to make one do X”. X can be replaced by a variety of things, some of which do not seem so remote today: be beautiful (think of Apple’s design efforts), make someone fall in love with it,4 learn from experience (the purpose of machine learning), etc.

The list also contains the item “use words properly”. What properly exactly means remains of course open to interpretation, but the objection clearly stems from the fact that computers traditionally rely on formal languages rather than natural ones. Even Turing (1950) concedes: “Needless to say [instructions do] not occur in the machine expressed in English”.5

3 The question whether the Eugene Goostman computer program of the University of Reading actually passed the Turing test in 2014 is still open for debate: see http://www.reading.ac.uk/news-and-events/releases/PR583836.aspx and http://www.wired.com/2014/06/turing-test-not-so-fast/ for opposite views on that matter.
4 A theme convincingly exploited in Spike Jonze’s movie Her, among other works of fiction.
5 And later: “It can also be maintained that it is best to provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English” (emphasis added). For all his clear-sightedness, Turing did not envision that computers could have been interested in other languages.

From then on, the idea of computers being able to understand and produce human language was pursued in parallel in several fields, the main ones being computer science (which called it natural language processing) and linguistics (which called it computational linguistics). The early history of NLP is marked by the division between these two main approaches: a probabilistic, stochastic paradigm in computer science versus a more symbolic and formal/logical approach in linguistics. The computer scientists used Bayesian models (Bledsoe and Browning, 1959) and developed the hidden Markov model (HMM), while Chomsky and fellow linguists applied finite-state automata to natural language in order to create the generative grammar (Chomsky, 1956) and subsequently developed algorithms (Harris, 1962).

The stochastic approach further expanded in the 1970s and 1980s, notably with the work of researchers at IBM’s Thomas J. Watson Research Center and at AT&T’s Bell Laboratories, while Chomsky’s context-free grammar inspired the design of unification grammars such as the Definite Clause Grammar (Pereira and Warren, 1980) and the Lexical Functional Grammar (Bresnan and Kaplan, 1982).

The 1990s saw the rise of hybrid systems, as the two historically disjoint branches of the field began to come together. Rule-based systems started to incorporate more and more probabilistic and data-driven models, as the development of the Web made available amounts of data previously unheard of. This tendency increased in the 2000s, which saw the rise of machine learning algorithms, with systems becoming increasingly unsupervised6 thanks to the sophistication of statistical techniques and the advent of Big Data (Banko and Brill, 2001). In the 2010s, purely linguistic systems have become almost extinct, most approaches nowadays combining expert rules with some amount of statistical calculus (such as maximum entropy models and support vector machines) or being purely data-driven.
Today, NLP remains a very prolific area of research boasting its own professional society (Association for Computational Linguistics), international conferences (International Joint Conference on Natural Language Processing, Conference on Empirical Methods in Natural Language Processing, Conference on Computational Natural Language Learning, etc.) and journals (Computational Linguistics, ACM Transactions on Speech and Language Processing, Linguistic Issues in Language Technology, etc.).

6 Unsupervised learning does not require any manual input, whereas supervised techniques derive knowledge from labelled training data.

1.2 Information extraction

Extracting information from unstructured text is a critical task in order to make sense of the vast amount of data that we have been accumulating in the digital era. The aim of information extraction (IE) is therefore to extract useful, structured information from human-readable documents and to store it in data structures (such as relational and graph databases), thereby allowing their exploitation by machines.

In the field of medicine, for instance, practitioners would need to read around the clock to keep up with the literature of their speciality: in total, over a million medical articles are published every year. “Faced with this flood of information, clinicians routinely fall behind, despite specialization and sub-specialization” (Gillam et al., 2009). IE allows scores of scientific papers to be processed and mined, reducing the workload imposed on physicians and other specialists.

The exact definition of information extraction is subject to some debate among scientists and its scope varies with the perspective in which it is considered. Building upon former attempts to define IE by Cowie and Lehnert (1996) and Riloff and Lorenzen (1998), Moens (2006, p. 4) offers the following definition (emphasis added):

“Information extraction is the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources, such as natural language text, making the information more suitable for information processing tasks.”

The two main steps of IE are thus:

1. identification: recognising a given text string as useful information

2. classification: putting this information into a pre-established semantic category
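A deliberately naive sketch makes these two steps concrete: formal patterns identify candidate strings, and each pattern carries the semantic class used for classification. The patterns and class labels below are invented for illustration and are far simpler than any real IE system:

```python
import re

# Step 1 (identification): formal patterns flag candidate strings.
# Step 2 (classification): each pattern is tied to a semantic class.
PATTERNS = [
    (re.compile(r"\b\d{1,2} (January|February|March|April|May|June|July|"
                r"August|September|October|November|December) \d{4}\b"), "DATE"),
    (re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b"), "NAME"),
]

def extract(text):
    """Return (surface string, semantic class) pairs found in the text."""
    results = []
    for pattern, semantic_class in PATTERNS:
        for match in pattern.finditer(text):
            results.append((match.group(), semantic_class))
    return results

# Note that the naive patterns overlap: "August" inside the date is also
# flagged as a NAME, a small illustration of natural language ambiguity.
print(extract("Le Sud appeared on 4 August 1935 in Belgium."))
```

Note that, as in Moens’ definition, nothing here disambiguates the extracted strings: “Belgium” is classified, not linked to any unique identifier.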

Note that disambiguation is totally absent from the definition: this particular task is indeed not on the agenda of traditional IE, although this limitation can be overcome as will be shown in Chapter II. Moens also stresses the need for suitability: IE is not an end in itself but rather a way to achieve further processing of the information extracted. This is an important dimension to which we will come back in detail in Chapters III and IV, when introducing the notion of fitness for use.

Information extraction is not to be confused with information retrieval, the task performed by search engines which consists in finding sets of relevant documents (Figure I.2), whereas IE focuses on the relevant facts in order to store them in a structured way (Figure I.3).7 Transforming the information contained in a document into usable knowledge is also an essential part of the process of semantic enrichment which is at the heart of this dissertation.

Figure I.2: Information retrieval

Figure I.3: Information extraction

Information retrieval mostly relies on the “bag-of-words” model which uses a simplified view of language as a loose sequence of unrelated words, while IE aims to take the syntactic and semantic dimensions of language into account. This makes IE systems more challenging to design in a multilingual context, but also more apt to capture knowledge in all its linguistic diversity.

Itself a sub-field of natural language processing, information extraction can be further divided into more concrete applications. Jurafsky and Martin (2009, pp. 761–791) list named-entity recognition, relation detection, event extraction, temporal analysis, and template filling as examples of such tasks, all of which will be covered to some extent in the remainder of this chapter.
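The difference between the two representations can be shown in a few lines. The bag-of-words side is a genuine word-frequency multiset; the structured record is filled in by hand here purely to show the target output of an IE system (the field names are invented for illustration):

```python
from collections import Counter

sentence = "Special week-end trips to Ostend without passports"

# Information retrieval view: an unordered bag of words, useful for ranking
# documents but blind to any structure or relations between the words.
bag = Counter(sentence.lower().split())

# Information extraction view: a structured record suitable for a database.
# Filled by hand here; a real IE system would derive the fields automatically.
record = {
    "event": "trip",
    "destination": "Ostend",
    "frequency": "special week-end",
}

print(bag["ostend"], record["destination"])
```

The bag can tell us that “ostend” occurs once; only the record can tell us that Ostend is the destination of a trip, which is precisely the structure a database query needs.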

7 Both figures are reproduced from https://gate.ac.uk/ie/ (CC BY-NC-SA).

1.3 Related fields

The choice of the term information extraction over other valid designations to describe the starting point of this dissertation is significant and justified by our integration into the NLP tradition. IE is also the oldest term to refer to the process of transforming unstructured text into processable information. This field of study should be distinguished from related tasks, which partly overlap with IE but sometimes originate from another period, in other domains or from other communities of researchers. These different fields are briefly presented below and summarised in Table I.1.

Data mining covers knowledge discovery from structured sources such as databases. Interest in this topic started in the late 1980s with the Association for Computing Machinery's (ACM) Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) and was studied in depth during the 1990s, notably in the work of Fayyad et al. (1996).

Terminology extraction (also called glossary extraction, term extraction, (automatic) term recognition or terminology mining) is interested in terms (common nouns) rather than entities (proper nouns). It can be traced back to the work of Bourigault et al. (1996) for machine translation. Multilingual and language-independent approaches were successfully applied to terms thanks to sub-sentential linguistic alignment, allowing the extraction of complex multi-word terms (Lefever et al., 2009).

Text mining (first described by Hearst (1999) as text data mining) is a broader field making use of IE techniques “at the intersection of natural-language processing, machine learning, data mining and information retrieval” (Mooney and Nahm, 2003). Data mining can be applied to structured information acquired from text by IE techniques, and in turn used to improve IE, as shown in Figure I.4.

Content extraction is the wording used in the Automatic Content Extraction campaigns8 from 1999 onwards. It is closely related to IE, since the ACE campaigns focus on entities, relations and events. The term is seldom used outside the Automatic Content Extraction programme, except by a few authors.9

8 http://www.itl.nist.gov/iad/mig/tests/ace/
9 Gupta et al. (2005) for instance.

Figure I.4: Text mining, reproduced from Mooney and Nahm (2003)

Ontology learning (also ontology acquisition, ontology extraction or ontology generation) appeared in the early 2000s (Maedche and Staab, 2001) and is similar to terminology extraction, but focuses on the semi-automatic creation of an ontology10 for a given domain.

Text analytics is treated as an equivalent to information extraction by Jurafsky and Martin (2009, p. 759), while some other authors consider it to be closer to text mining (Feldman and Sanger, 2007). It is the favoured term in corpus linguistics.

Table I.1 gives a summary of all the fields presented above along with an approximate date of appearance of the term and broad domain of origin.

Field                    Date   Domain
Information extraction   1982   Natural language processing
Data mining              1989   Computer science
Terminology extraction   1996
Text mining              1999   Computational linguistics
Content extraction       1999   ACE campaign
Ontology learning        2001
Text analytics           2004   Corpus linguistics

Table I.1: Summary of related fields

10 In the information science sense, see Section 2.2 in Chapter II for more details.

In this section, we have introduced natural language processing and information extraction, which offer a general framework for the later achievements to be carried out in this thesis. We have shown how IE differentiates itself from the simpler task of information retrieval, but also from a range of closely related, partially overlapping fields. The next section will now focus on named-entity recognition, which is a central task in IE.

2 Named-Entity Recognition

Named-entity recognition (NER) is a core subtask of information extraction consisting in the automatic identification and classification of named entities (proper nouns) in unstructured documents. It is an essential component of any semantic enrichment or knowledge discovery system and therefore needs to be discussed in detail before moving on to these more complex tasks. Consider Example 4, which shows a typical sentence from a newspaper:

Example 4. Captain Stuart Oswald, an old member of the Ypres League, died yesterday in Amiens.

This example contains entities of various types, such as a person (Captain Stuart Oswald), an organisation (the Ypres League) and a geographical location (Amiens). The aim of NER is to recognise all of them. NER has several concrete applications beyond IE, for instance in machine translation (Babych and Hartley, 2003) and question answering (Mollá et al., 2006). After giving some initial context on the scope of NER (Section 2.1), we will survey common classifications of both named entities and NER systems (Section 2.2), before tackling the core subjects of named-entity ambiguity and disambiguation (Section 2.3).

2.1 Task definition

The concept of a “named entity”, proposed by Grishman and Sundheim (1996), originally covered names of people, organisations, and locations as well as expressions of time, amounts of money, and percentages. Similarly, named entities were defined for the CoNLL 2002 and 2003 shared tasks as “phrases that contain the names of persons, organizations, locations, times, and quantities” (Tjong Kim Sang, 2002). Over the years, NER has attracted the attention of researchers in many fields such as financial journalism (Farmakiotou et al., 2000), biology and biomedicine (Ananiadou and McNaught, 2006), and business intelligence (Saggion et al., 2007). As a result of this diversification of NER, the original range of named entities was progressively extended to include brands, genes, and even college courses (McCallum, 2005). Nadeau and Sekine (2007), however, denounce this misuse of language and suggest that the word “named” in “named entity” effectively restricts its sense to entities referred to by rigid designators, as defined by Kripke (1980, p. 77): “a rigid designator designates the same object in all possible worlds in which that object exists and never designates anything else”.

According to this view, a distinction should be made between a named entity and a plain (unnamed) entity, but this nuance is ignored today by most NER applications, which use “entities” and “named entities” interchangeably. There is, as a result, no real consensus on the exact definition of a (named) entity, which remains largely domain-dependent. For Jurafsky and Martin (2009, pp. 725–726), a named entity is anything that can be referred to with a proper name, but they note that “what constitutes a proper name and the particular scheme used to classify them is application specific”. This absence of scientific agreement is by no means an oddity: the relative nature of empirical data makes them open to interpretation, as will be seen in detail in Chapter III. The fitness for use principle (introduced in Chapter IV), originating from data quality, is a good practical guide to entity validity: if an entity is useful for our needs, then it can be argued that it is valid, notwithstanding epistemological considerations. This is better understood with the help of an example:

Example 5. Excessive consumption of Côte d’Or chocolate has been shown to cause type-2 diabetes.

A NER application interested in companies or brands will count “Côte d’Or” as a valid entity, but will probably tend to disregard “diabetes” altogether. In contrast, a biomedical application will extract “type-2 diabetes” as an entity of type DISEASE, for instance, while ignoring the brand: the term “chocolate” could possibly be recognised as a CAUSE or disease-inducing entity, but the company producing it is probably irrelevant in this context. Ehrmann (2008) provides a comprehensive overview of the evolution of the concept of a named entity from linguistics to NLP. She offers the following justification for the distinction between NER and terminology (p. 174):

“It is important here to underline the difference between named-entity recognition and terminology: the latter is interested in the terms of a language insofar as they represent the traces of concepts of a given domain, while the former is interested in linguistic expressions insofar as they represent the traces of referents of a given model. If there is no model, there are no named entities.”

However, the argument seems flawed, since the model used could just as validly choose the concepts as referents. Returning to Example 5, the term diabetes could be seen as the mention of a concept of a given domain, or alternatively be considered a referent from a given model. The distinction between entities and terms is therefore not as clear-cut as Ehrmann would like us to think.

Another interesting approach was proposed by Chiticariu et al. (2010), who drew up a list of criteria for the domain customisation of NER, including entity boundaries, scope and granularity among others. The definition of a named entity, according to them, is never fixed but depends both on the data to process and on the application processing it. However, building a reference corpus for evaluation purposes (see Chapter V) requires entity categories to be well defined in order to give precise instructions to human annotators. In fact, named entities being empirical by nature, the objective evaluation of their correctness is practically impossible, unless some external referent is substituted for the real world. Using such a gold-standard corpus (GSC) makes it possible to evaluate entities in a pseudo-deterministic manner. By disregarding some subtleties inherent to the richness of reality, this artificial construction becomes the ground truth under the closed-world assumption: all entities in the GSC are considered to be necessarily correct, while those not in the GSC are necessarily incorrect. We will now examine some named-entity classifications that have been used for similar purposes in the past.

2.2 Typologies

In this section, we present common classifications of both named entities and NER systems. These typologies are useful for the evaluation of applications, as explained above, but they also reduce the flexibility that users are entitled to expect from an extraction tool in order to meet their needs. The aim is therefore to understand how and why these classifications were proposed, in order to build on their strengths while discarding some of their rigidity. In doing so, we will partially answer our first research question and thereby strengthen our thesis, according to which generalisation and decompartmentalisation can lead to better NER.

2.2.1 Named-entity typology

Although several typologies of named entities have been proposed over the years, “there is no fixed and final set of classes for named entities” (Bingel and Haider, 2014). Stern (2013, p. 16) even argues that every NER task defines its own typology. Levels of granularity vary widely: for instance, some tools distinguish between Countries, Cities, Mountains, etc. (Sekine et al., 2002), whereas other authors group all these entities into a single Location category. Some of the most widespread typologies, developed in the framework of vast evaluation programmes, are detailed below.

Message Understanding Conferences (MUC) started in the late 1980s and lasted until 1997, aiming to encourage innovative approaches to IE. A special focus on NER and co-reference resolution was added for MUC–6 and MUC–7. In their retrospective paper on MUC, Grishman and Sundheim (1996) reflect on the division of expressions into TIMEX (temporal expressions: date and time mentions), NUMEX (numeric expressions: amounts and percentages), and ENAMEX (entity names), the latter comprising Locations, Organisations and Persons.

Conferences on Computational Natural Language Learning (CoNLL) were launched in 1997 at the initiative of the Association for Computational Linguistics' special interest group on natural language learning, and were held up to 2008. From 1999 onwards, they included a shared task for participants to compete on a given IE topic. The CoNLL 2002 and 2003 shared tasks were centred on the question of language-independent NER, for the purpose of which named entities were divided into person names (PER), organisations (ORG), locations (LOC) and miscellaneous names (MISC) (Tjong Kim Sang, 2002). The MISC category is a handy catch-all for rare types of entities but obviously raises epistemological questions regarding its validity and usability.

Automatic Content Extraction (ACE) was a programme run from 1999 to 2008 for developing IE technologies and centred on the extraction of entities, relations and events. ACE entities were categorised into seven types: Geo-political entity, Facility, Location, Organisation, Person, Vehicle, and Weapon (Doddington et al., 2004). These categories allowed for more fine-grained distinctions between entities (with Geo-political entity accounting for cases where it would be difficult to choose between Location and Organisation) but also included entities that were very specific and not strictly proper nouns (like vehicles and weapons), pushing the case for a more inclusive approach.

National and regional programmes also introduced their own specificities. In France, for instance, the ESTER11 campaign added Function (political, military...) and Human Production (award, work of art...) to PER, ORG and LOC. In the Netherlands, the SoNaR12 project (Oostdijk et al., 2008) divided named entities into six broad categories: PER, ORG, LOC, PRO (products), EVE (events) and MISC, which were then further divided into 19 subcategories. Figure I.5 shows the SoNaR categories in order to illustrate this diversity.

11 http://www.afcp-parole.org/camp_eval_systemes_transcription/
12 Loose acronym for STEVIN (Spraak- en Taaltechnologische Essentiële Voorzieningen In het Nederlands) Nederlandstalig Referentiecorpus: http://lands.let.ru.nl/projects/SoNaR/.

Figure I.5: SoNaR named entity typology, reproduced from Hoste (2009)

2.2.2 Types of NER systems

Broadly speaking, all NER systems can be divided into linguistic, probabilistic and hybrid approaches. This tripartite division originates from the development of NLP in general, as detailed in Section 1.1. In order to comprehend the present challenges faced by NER, we quickly outline this evolution and show how more recent approaches make the best of both worlds in order to gain in robustness and generalisability, showing a clear shift in the direction of the main thesis defended within this dissertation.

Linguistic or rule-based approaches The first NER systems, designed in the early nineties, were largely symbolic, relying on refined linguistic knowledge and rules. As these rules had to be encoded manually by linguists, these approaches were extremely time-consuming and costly to maintain. Some linguistic systems also made use of gazetteers, i.e. large lists of common first names or locations for instance, which helped to increase the coverage of the rules. For other types of entities, however, such as organisations or products, gazetteers are hardly capable of providing an exhaustive list of existing possibilities,13 and patterns aiming at recognising them are always incomplete to some extent.
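A caricature of such a system, assuming one hand-written title-based rule and a three-entry gazetteer (both invented here), might look as follows; real rule-based systems involve hundreds of such patterns.

```python
import re

# Hand-written rule: a title followed by capitalised words signals a person name.
PERSON_RULE = re.compile(r"\b(?:Captain|Lord|Earl)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)")

# Small gazetteer of known locations (never exhaustive in practice).
LOCATIONS = {"Ypres", "Amiens", "Brussels"}

def rule_based_ner(text):
    """Apply the person rule, then look up gazetteer entries."""
    entities = [(m.group(1), "PER") for m in PERSON_RULE.finditer(text)]
    entities += [(w, "LOC") for w in sorted(LOCATIONS) if re.search(r"\b%s\b" % w, text)]
    return entities

print(rule_based_ner("Captain Stuart Oswald, of the Ypres League, died in Amiens."))
# [('Stuart Oswald', 'PER'), ('Amiens', 'LOC'), ('Ypres', 'LOC')]
```

Note how the organisation “Ypres League” is missed entirely (its first word is even mis-tagged as a location): no rule or list covers it, illustrating the coverage problem described above.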

Probabilistic or data-driven approaches The availability of huge corpora prompted a shift towards more statistical models, taking advantage of machine learning techniques (such as support vector machines and conditional random fields) on large corpora. Rau (1991), for instance, automatically extracted thousands of company names from over one million words of financial news, reporting a 25% increase in recall compared to human annotators.14 However, totally unsupervised approaches often suffer from a lack of precision, leading researchers to experiment with semi-supervision (human intervention in the automated process). Ji and Grishman (2006) make use of two semi-supervised learning algorithms to improve name-tagging, i.e. the recognition of the names of persons (and therefore a subset of NER). Contrary to Banko and Brill (2001), they conclude that more data does not always mean better results, some amount of human intervention being necessary for the feature selection process.

13 Although some resources can become available in a closed-world perspective. In Belgium, companies could be disambiguated by their number in the Crossroads Bank for Enterprises.
14 While linguistic approaches typically achieve good precision thanks to the manual work of language experts, they lack in coverage due to the difficulty of predicting all future situations. Probabilistic approaches, with their ability to process much more information, can make up for this shortcoming and achieve better recall (see Chapter V for a discussion of these metrics).

Hybrid approaches Today, most IE systems combine some linguistic insight with the power of data-driven algorithms (Watrin, 2006). An example of such a technique is maximum entropy (MaxEnt), in which potentially relevant features are listed by a human being but the combining and weighting tasks are left for the machine to compute. For instance, one could hypothesise that two capitalised words in a row is a potential indicator of a person name (Forename Surname) without having to precisely define rules and exceptions that can rapidly become burdensome. Examples of systems making use of the MaxEnt algorithm can be found in Chieu and Ng (2002) and Bender et al. (2003), among others. Curran and Clark (2003) show that this approach is also promising for language-independent NER (cf. Chapter IV). For a more complex approach combining different types of classifiers, see Florian (2002); Florian et al. (2003). Kozareva et al. (2007) also combine three approaches – Hidden Markov Models (HMM), Maximum Entropy (MaxEnt) and Memory-based learning (MBL) – using a voting strategy, thereby achieving 98.5% accuracy for named-entity identification and almost 85% for classification on Spanish data. It is unclear, however, how their system would fare on other languages, as the algorithms were trained on language-specific data. Brun et al. (2007, 2009) present a hybrid method for the resolution of named-entity metonymy (see Section 2.3.3), in particular names of places and organisations. In the framework of the SemEval 2007 challenge, their XRCE-M system developed at the Xerox Research Center combines a syntactic parser with distributional analysis. Hybrid approaches are also favoured by entity linking tools, as will be seen in Chapter V.
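A sketch of the division of labour in such a MaxEnt system: the human supplies feature functions like the capitalised-bigram cue mentioned above, while the weighting and combination are left to the (omitted) learner. All feature names here are illustrative.

```python
def features(tokens, i):
    """Binary features for token i, to be weighted by a MaxEnt learner."""
    w = tokens[i]
    return {
        "is_capitalised": w[:1].isupper(),
        "prev_capitalised": i > 0 and tokens[i - 1][:1].isupper(),
        # The hypothesised cue: two capitalised words in a row (Forename Surname).
        "cap_bigram": i > 0 and w[:1].isupper() and tokens[i - 1][:1].isupper(),
        "is_sentence_initial": i == 0,
    }

tokens = "Stuart Oswald joined the Ypres League".split()
print(features(tokens, 1))  # features for "Oswald": cap_bigram is True
```

The human never states when the cue applies or fails; the learner estimates that from annotated data.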

2.3 Entity ambiguity and disambiguation

Natural language is intrinsically ambiguous.15 As a result, several cases of ambiguity can arise while performing NER. This section briefly discusses the main forms of ambiguity to be found at named-entity level, including synonymy, homonymy, polysemy and metonymy. While some of these may seem to be one and the same, subtle variations of meaning can lead to differences in their handling later on. The accurate disambiguation of entities is a critical challenge for several NLP-related applications, and a key dimension of the general approach defended within this dissertation. In fact, using uniform resource identifiers (URIs) to disambiguate entities is neither domain-specific nor language-dependent, as we will show in Chapter II.

15 This ambiguity is reinforced by the fact that “natural language is its own metalanguage”, as noted by Boydens (2011, p. 125): clarifying the use of a word or construction can only be done with more words that are in turn ambiguous, ad nauseam.

In the context of semantic enrichment, particularly, entity ambiguity can be a major problem, because accurate NER is a prerequisite for further processing such as entity linking and related search recommendations. Errors occurring early in the pipeline can result in the attribution of a wrong URI to an entity, and in turn lead to irrelevant (or even, in the worst-case scenario, offensive) suggestions for the users.

For instance, a user interested in finding information about the right-wing mayor of the French town of Châtenay-Malabry, Georges Siffredi, may enter the search keywords siffredi UMP. A wrong identification of the French snowboarder Marco Siffredi may cause unwanted suggestions about fellow Everest climbers, which is relatively harmless, but a disambiguation as the Italian pornographic actor Rocco Siffredi could have more damaging consequences. Similarly, a mistaken understanding of the abbreviation UMP as Universal Music Publishing or Universale Maschinenpistole (instead of “Union pour un Mouvement Populaire”) could trigger a negative user experience.

By investigating different types of language ambiguity, we reach a level of understanding which will prove crucial later on for the correct disambiguation and reuse of entities in real-world applications. Several typologies of lexical relations have been proposed (see Allan (1986) for instance). Our own order of presentation is based on the effects on IR: silence on the one hand (caused by synonymy) and noise on the other hand (caused by homonymy, polysemy, or metonymy).

After this short survey of the various types of ambiguity to be found in natural language, Section 2.3.4 will focus on the disambiguation process (which is not explicitly included in IE in the traditional sense) and will quickly investigate ways to overcome this limitation, a point that will be further developed in Chapter II.
As already emphasised, disambiguation is central to the semantic enrichment task applied to our case study of multilingual periodicals, and will also bring us a step closer to generalisation, in accordance with our main thesis, by granting an equal status to all information units without any distinction of language, domain, or type of content.

2.3.1 Synonymy

Just as common nouns can be synonymous to some extent and share part or all of their meanings, a single entity can be referred to by different, overlapping names.

Example 6. The city of Ypres (Dutch Ieper) was nicknamed “Wipers” by British troops during World War I.

As shown by Example 6, synonymy occurs within a single language (Ypres versus Wipers) but even more pervasively across languages (Ypres versus Ieper). Moreover, different references to the same entity can (and often will) coexist inside a given document. Co-reference resolution, including the subtask of anaphora resolution, is therefore an essential part of NER:

Example 7. Jozef De Veuster was born in 1840 in Tremelo, Belgium. He is better known as Father Damien, his chosen religious name.

In Example 7, the two named entities in small caps refer to the same person. Furthermore, two pronouns in italics also refer to Damien. Linking Jozef De Veuster to Father Damien is the task of co-reference resolution, while understanding to whom the pronouns he and his refer is the task of anaphora resolution.

2.3.2 Homonymy and polysemy

Since words often hold several meanings, homonymy can be seen as the main cause of ambiguity in text, generating noise in search results. Homonymy comes in different forms:

Proper vs. common noun: The same word can be used to refer to a company (Apple) or a fruit (apple), the latter being a common noun and thus not a valid named entity in the stricter sense. Such cases are the easiest to resolve, since the two words differ by their capitalisation (consider, however, the use of a common noun at the beginning of a sentence, as in “Apple is my favourite fruit.”).

Cross-type homonymy: A named-entity mention can refer to completely different types of entities: for instance, Bern can refer either to the capital of Switzerland or to the French journalist Stéphane Bern. In such cases, named-entity detection is not sufficient: a classification into categories is needed to differentiate between the two uses.

Same-type homonymy: Finally, a name can also refer to different entities of the same type, such as Paris (capital of France) and Paris, Texas, or George Bush (father and son). These cases are the most difficult to handle, as even a classification into type LOC or PER is not sufficient to unambiguously identify the entity actually mentioned in the text. Full disambiguation is necessary here, as we will discuss in Chapter II.

Polysemy is closely related to homonymy but differs from it in that polysemes share a common origin, while homonyms are coincidental. Consider the following examples:

Example 8. The Benevolent Bank is located on the bank of the Yser.

Example 9. South American people still believe in the American dream.

In Example 8, the two senses of bank are homonyms, because their sharing the same spelling is coincidental. Compare this with Example 9, where the term American is polysemous, since it can refer either to a person living on the continent America, or to a citizen (or, here, a value) of the United States, the two senses being etymologically related. Most of the time, however, there is no practical difference in the handling of these two types of ambiguity.

2.3.3 Metonymy

Cross-type ambiguity frequently arises from the use of metonymy, a figure of speech consisting in replacing a name by another. Typical cases include geo-political entities such as Ypres, which can be used either as locations or as organisations depending on the context, as can be seen in the following examples:

Example 10. No one was allowed into [LOC Ypres] on any account.

Example 11. Josse Destrée got financial support from [ORG Ypres].

Capital cities are also commonly used to refer to their country:

Example 12. [ORG Brussels] agreed to fund our research project.

The case of Brussels is even more ambiguous, since its metonymic use can refer either to the Belgian government or to the European institutions. Following the work of Markert and Nissim (2006), Brun et al. (2007, 2009) have studied in detail the various cases of metonymy and its implications for NER in the context of the SemEval 2007 campaign.16 This type of ambiguity can arise in more subtle ways and is generally more difficult to detect, especially due to the diversity of linguistic variations.

2.3.4 Disambiguation

According to pioneers Palmer and Day (1997), “the goal of the N[amed] E[ntity] task is to automatically identify the boundaries of a variety of phrases in a raw text, and then to categorize the phrases identified” (emphasis added).

16 http://nlp.cs.swarthmore.edu/semeval/

This echoes the steps of identification and classification we presented in Section 1.2, as defined by Moens (2006) for information extraction in general. Similarly, competitors in the CoNLL 2002 shared task such as Carreras et al. (2002) divide the more general task, which they call Named Entity Extraction, into two main subtasks: recognition and classification. They choose to address these tasks separately, but note that “an open line of research to be addressed is the simultaneous approach of named-entity recognition and classification tasks, so each decision may take advantage of the synergy between both knowledge levels”. These definitions, however, restrict the NER task to a mere classification: categorising Washington as a PER allows us to rule out the LOC meaning of Washington, DC (cross-type homonymy), but it fails to resolve the more insidious same-type homonymy: are we talking about President George Washington, the singer Dinah Washington or some other person still? Ehrmann (2008, p. 162) sees lexical disambiguation as the finer-grained level of classification:

“The second use of meaning in [Natural Language Processing], reformulation, is an operation that is more internal to the language: it consists in providing, for words, paragraphs or texts, semantically equivalent words or paraphrases. At the lexical level, reformulation corresponds to disambiguation and can be carried out for nouns on the one hand, and for the other categories, verbs, adjectives, etc., on the other. Nominal disambiguation in a way joins the operation of categorisation, insofar as we are here interested in the finest level of classification: for the word barrage, for example, the task is to identify the real-world object in question, a police roadblock or a hydraulic dam.”

Likewise, Stern (2013, p. 17) notes that named-entity categorisation is insufficient for the needs of semantic annotation, in which entities have to be treated as unequivocally identifiable, extra-linguistic individuals. In order to know what a document is really about (and not only what kind of entity it is about), a referential link between a mention and its conceptual representation (in the form of a knowledge base entry, for instance) is needed. Although fully automated disambiguation was unthinkable when NER appeared in the nineties, it can now be envisioned with the help of knowledge bases such as DBpedia, as will be detailed in Chapter II. This method, however, also raises issues about the quality of reference material, a problem that will be dealt with in Chapter IV.
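As a toy illustration of knowledge-base disambiguation (the URIs and descriptions below are invented, not real DBpedia entries), candidates for an ambiguous mention can be ranked by simple lexical overlap between the document and each candidate's description:

```python
# Toy knowledge base: each candidate referent has a URI and a description.
CANDIDATES = {
    "http://example.org/George_Washington": "first president of the United States",
    "http://example.org/Dinah_Washington": "American jazz and blues singer",
}

def disambiguate(mention_context, candidates):
    """Pick the candidate whose description best overlaps the context."""
    ctx = set(mention_context.lower().split())
    def overlap(uri):
        return len(ctx & set(candidates[uri].lower().split()))
    return max(candidates, key=overlap)

context = "Washington sang jazz and blues in New York"
print(disambiguate(context, CANDIDATES))
# http://example.org/Dinah_Washington
```

Real entity linking systems replace this word overlap with much richer context models, but the principle of resolving a mention to a unique identifier is the same.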

3 Relations and Events

In addition to NER, two IE applications have retained our attention as potential components of a semantic enrichment system: the extraction of relations between entities and the processing of events (including, in its simpler form, template filling). Both rely on NER for their success, since a relationship is generally established between two entities, while events involve temporal and geographical components expressed by named entities. Example 13 gives a better notion of what is involved in the two tasks:

Example 13. LORD FRENCH TO THE NATION
This month, the seventh anniversary of the birth of the Ypres Salient,17 has been signalized by the following letter to the Press by F.-M. Earl French, the President of the Ypres League.

Figure I.6: Illustration of the Ypres Salient

17 In military terms, a salient is a battlefield feature that projects into enemy territory (see https://en.wikipedia.org/wiki/Ypres_Salient). The Ypres Salient is shown in Figure I.6.

The fact that Lord French was the president of the League is a relation between two entities that can be represented in the following manner:

[PER Lord French] IS-PRESIDENT-OF [ORG Ypres League]

Note that extracting this relationship requires, as a preliminary step, resolving the coreference between the two entity mentions “Lord French” and “F.-M. Earl French”, which refer to the same person. In addition to this relation, two events can be deduced from the text by using the date of the article (1 October 1923):

EVENT-1:  EVENT-TYPE: birth
          SUBJECT: Ypres Salient
          DATE: October 1916

EVENT-2:  EVENT-TYPE: announcement
          ANNOUNCER: F.-M. Earl French
          RECIPIENT: the Press
          DATE: October 1923

We hereafter briefly present these tasks, as they will be of interest to us in the next chapters. More specifically, relation detection will be associated with semantic relatedness between concepts in Chapter II, and the extraction of events and temporal analysis will be of decisive use in the historical context detailed in Chapter III. Of course, relations and events are also affected by multilingualism: attention will be duly paid to this important question in Chapter IV (Section 2.1.2). Ultimately, understanding how entities relate to one another – and how they are progressively constructed over time – will help us to handle them efficiently in order to build useful applications in Chapter V. Starting with relation detection in Section 3.1, we will discuss typologies and systems before doing the same for event extraction and temporal analysis in Section 3.2, finishing with template filling in Section 3.3.
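Event templates of the kind deduced from Example 13 map naturally onto attribute-value structures; in code, plain dictionaries suffice (a sketch, not the representation used by any particular system):

```python
# The two events of Example 13, as simple attribute-value dictionaries.
event_1 = {
    "EVENT-TYPE": "birth",
    "SUBJECT": "Ypres Salient",
    "DATE": "October 1916",
}
event_2 = {
    "EVENT-TYPE": "announcement",
    "ANNOUNCER": "F.-M. Earl French",
    "RECIPIENT": "the Press",
    "DATE": "October 1923",
}

def slots_filled(event, required):
    """Template filling succeeds when every required slot has a value."""
    return all(event.get(slot) for slot in required)

print(slots_filled(event_2, ["EVENT-TYPE", "ANNOUNCER", "DATE"]))  # True
```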

3.1 Relation detection

Relation detection and classification, also called relation extraction, focuses on the relationships between entities discovered in the NER phase. These relations can be of very diverse types and are sometimes unpredictable, but most systems rely on the recurrence of pre-established patterns. We will now focus on typologies of relations and on a number of relation detection systems in order to learn what can be gained for our semantic enrichment purpose, and how these can be improved with Linked Open Data.

3.1.1 Typology of relations

Several models have been put forward to categorise relations between entities. The ACE programme (Doddington et al., 2004) included 24 subtypes of relations grouped into five general types: located, near, part, role and social. Aone and Ramos-Santacruz (2000) went even further by proposing an ontology of 37 relation types extended from MUC, shown in Table I.2.

Place Relations: Place–Name&Aliases, Place–Type, Place–Subtype, Place–Descriptor, Place–Country

Artifact Relations: Artifact–Name&Aliases, Artifact–Type, Artifact–Subtype, Artifact–Descriptor, Artifact–Maker, Artifact–Owner

Organization Relations: Org–Name&Aliases, Org–Descriptor, Org–FoundationDate, Org–Nationality, Org–TickerSymbol, Org–Location, Org–ParentOrg, Org–Owner, Org–Founder, Org–StockMarket

Person Relations: Person–Name&Aliases, Person–Type, Person–Subtype, Person–Descriptor, Person–Honorific, Person–Age, Person–PhoneNumber, Person–Nationality, Person–Affiliation, Person–Sibling, Person–Spouse, Person–Parent, Person–Grandparent, Person–OtherRelative, Person–BirthPlace, Person–BirthDate

Table I.2: Relation ontology, adapted from Aone and Ramos-Santacruz (2000)

Jurafsky and Martin (2009) list generic relations such as employment (PER works for ORG; Example 14), family (PER is the son of PER; Example 15), part-whole (ORG is a division of ORG; Example 16) and geo-spatial relations (PER/ORG is based in LOC; Example 17).

Example 14. De 30-jarige Serge Zaidman, die werkt voor diamanthandel Bativier uit Antwerpen, werd te Hongkong door 2 mannen overvallen. (“The 30-year-old Serge Zaidman, who works for the diamond trader Bativier of Antwerp, was attacked by two men in Hong Kong.”)

Example 15. Major Tubb was the son of Harry and Emma E. Tubb, of St. Helena, Longwood East, Victoria, Australia.

Example 16. La Verrerie de Zeebrugge, filiale de l’Union des Verreries mécaniques de Dampremy a été fermée jeudi pour un temps indéterminé. (“The Verrerie de Zeebrugge, a subsidiary of the Union des Verreries mécaniques of Dampremy, was closed on Thursday for an indefinite period.”)

Example 17. [ ...] via Nieuport and Dunkerque (where we saw the first bomb fall on that peaceful town) to Poperinghe, where Corps H.Q. were established in the Convent School [ ...]
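To make the idea of pattern-based relation detection concrete, a single hand-written pattern suffices to capture the employment relation of Example 14 (“X werkt voor Y”); a toy sketch, in which the regular expression and the naive capitalisation-based entity approximation are our own assumptions rather than an actual system:

```python
import re

# One lexico-syntactic pattern for the Dutch employment relation
# "X, die werkt voor Y" (X, who works for Y). Entity boundaries are naively
# approximated by capitalised word sequences; a real system would operate
# on NER output instead of raw text.
PATTERN = re.compile(
    r"(?P<per>[A-Z][\w-]*(?: [A-Z][\w-]*)*), die werkt voor (?:\w+ )*?(?P<org>[A-Z][\w-]*)"
)

def extract_employment(sentence: str):
    """Return (PER, WORKS-FOR, ORG) triples found in the sentence."""
    return [
        (m.group("per"), "WORKS-FOR", m.group("org"))
        for m in PATTERN.finditer(sentence)
    ]

sentence = ("De 30-jarige Serge Zaidman, die werkt voor diamanthandel "
            "Bativier uit Antwerpen, werd te Hongkong door 2 mannen overvallen.")
# extract_employment(sentence) → [("Serge Zaidman", "WORKS-FOR", "Bativier")]
```

The brittleness of such a rule – it only matches one Dutch construction – is precisely what motivates the pattern-acquisition systems surveyed in the next section.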

3.1.2 Relation detection systems

We present hereafter a few notable systems in chronological order. Seminal research in relation identification and extraction was conducted in the late 1990s. At the IBM Watson Research Center, Byrd and Ravin (1999) outlined the key features of the task: the use of patterns, frequency filters, selectional restrictions, and the organisation of the relations in a lexical network.

Bouillon et al. (2001) used part-of-speech and semantic tagging in order to “automatically acquire N[oun]-V[erb] pairs whose components are linked by one of the qualia structure roles”. Incidentally, their approach applies to terms (common nouns) rather than entities (proper nouns), but we can notice that the underlying principle is very similar.

In the biomedical domain, Ramakrishnan et al. (2006) focused on explicit and implicit relationships between entities and developed a rule-based method for the extraction of these relations from unstructured text and their conversion into RDF triples. They used Medical Subject Headings identifiers as URIs for entities and aimed to uncover hidden, valuable relationships from what they call “Undiscovered Public Knowledge” (in English only, however).

Auer and Lehmann (2007) exploited Wikipedia templates in order to discover relations between concepts. These templates include People, Organisations and Geographic entities (broadly equivalent to our named-entity classes), but also Plants and Education, demonstrating once again the artificiality of the distinction between entities and terms.

Kramdi et al. (2009) offer a generic approach to relation extraction based on ontologies and the Semantic Web. They propose the extraction of relations between concepts or instances without domain-specific knowledge, making their approach adaptive and robust, by modifying the (LP)2 algorithm (Ciravegna, 2001) in order to account for relevant context. The system is largely entity-agnostic, the aim being to build ontologies and annotate texts for the general domain, whereas most systems are domain-specific.

Akbik and Broß (2009) explore the idea of building large knowledge bases, extracting semantic relations with the help of dependency grammar patterns. In discussing the general applicability of their approach, the authors list a number of advantages of building a system on the Wikipedia model, including “the coverage and diversity, the high quality of content, the actuality, the factual language and the internal link structure” (italics theirs). However, they note that “the hypothesis underlying [their] approach is a purely linguistic one and therefore is only bound to English language”.

Wong et al. (2009, p. 267) offer an original view of relation detection relying on ontologies rather than on entities and patterns. They note that “relation acquisition techniques which require named entities have restricted applicability since many domain terms with important relations cannot be easily categorised. In addition, the common practice of extracting triples using only patterns and grammatical structures tends to disregard relations between syntactically unrelated terms”. To overcome these shortcomings, they propose a hybrid approach divided into two phases: term mapping and term resolution.

More recently, Nebhi (2013) investigated rule-based methods with syntactic parsing for the extraction of relations from DBpedia, while Augenstein et al.
(2014) discarded knowledge bases as too incomplete and preferred to extract the relations directly from unstructured Web text. These alternative methods will be investigated in Chapter II, while a critical assessment of their quality will be performed in Chapter IV.

3.2 Event extraction and temporal analysis

Event extraction consists in finding and analysing what happened. According to Hogenboom et al. (2011), an event can be represented as “a complex combination of relations linked to a set of empirical observations from texts”. Events are key ways to apprehend knowledge: an event can become “révélateur de réalités autrement inaccessibles” – revealing realities otherwise inaccessible (Pomian, 1984, p. 35). Similarly to named entities and relations, several types of events exist and can be divided into thematic categories. The REES system of Aone and Ramos-Santacruz (2000), for instance, covers 60 types of events, as shown in Table I.3.

Vehicle: Vehicle departs, Vehicle arrives, Spacecraft launch, Vehicle crash

Transaction: Buy artifact, Sell artifact, Import artifact, Export artifact, Give money

Personnel Change: Hire, Terminate contract, Promote, Succeed, Start office

Business: Start business, Close business, Make artifact, Acquire company, Sell company, Sue organization, Merge company

Crime: Sexual assault, Steal money, Seize drug, Indict, Arrest, Try, Convict, Sentence, Jail

Financial: Currency moves up, Currency moves down, Stock moves up, Stock moves down, Stock market moves up, Stock market moves down, Stock index moves up, Stock index moves down

Political: Nominate, Appoint, Elect, Expel person, Reach agreement, Hold meeting, Impose embargo, Topple

Conflict: Kill, Injure, Hijack vehicle, Hold hostages, Attack target, Fire weapon, Weapon hit, Invade land, Move forces, Retreat, Surrender, Evacuate

Family: Die, Marry

Table I.3: Event ontology, adapted from Aone and Ramos-Santacruz (2000)

3.2.1 Event annotation

For the annotation of events, Pustejovsky et al. (2003) introduced TimeML,18 a robust specification language for temporal expressions in natural language. In 2009, ISO recognised TimeML as a standard for time and event markup and annotation.19 However, the XML-like syntax used by TimeML, along with the high number of possible attributes, makes it a complex markup language that can be daunting to master. For instance, a sentence as simple as:

Example 18. John taught 20 minutes every Monday.20

will be formalised in TimeML as:

John <EVENT eid="e1" class="OCCURRENCE">taught</EVENT> <TIMEX3 tid="t1" type="DURATION" value="PT20M">20 minutes</TIMEX3> <TIMEX3 tid="t2" type="SET" value="XXXX-WXX-1" quant="EVERY">every Monday</TIMEX3>

This complexity explains the limited success of TimeML outside academia, although the annotation of events remains very much an issue.

3.2.2 Event extraction systems

For the authors of the SoNaR corpus (Oostdijk et al., 2008), events are just another type of named entity (see Section 2.2). However, as events are often more complex in form than people or places, we chose to handle them separately. Most NER tools are not well equipped to locate events: Buitinck and Marx (2012), for instance, report a relatively low F-score of 71% on the EVE category against 90% for locations.

18 http://www.timeml.org/
19 ISO 24617-1:2009, later integrated into the SemAF framework (ISO 24617-1:2012).
20 Example borrowed from http://www.timeml.org/.

A complete overview of event extraction techniques can be found in Hogenboom et al. (2011). Defining the task as an application of text mining consisting in “deducing specific knowledge concerning incidents referred to in texts”, the authors offer a survey of various event extraction methods divided into data-driven, knowledge-driven and hybrid methods. Data-driven approaches rely on probabilistic models and Big Data, but fail to establish valid meaning for events. In the words of Hogenboom et al. (2011), “statistical relations do not necessarily imply semantically valid relations, nor relations that have proper semantic meaning”.

In contrast, knowledge-driven approaches require a certain level of expert domain knowledge in order to build linguistic patterns, either lexico-syntactic (using regular expressions) or lexico-semantic (using gazetteers or ontologies). Hybrid approaches combine the use of large amounts of data with expert knowledge, thereby overcoming the limitations of both methods (lack of semantics and poor pattern maintainability, respectively). The authors conclude that knowledge-based event extraction techniques are better suited and easier to grasp for casual users, whereas data-driven and hybrid methods are more powerful for advanced users.

The extraction of temporal events has gained particular attention in the field of biomedicine, since events are crucial in determining how proteins and genes interact with one another. Like NER, biomedical event extraction has evolved from fully linguistic systems (Yakushiji et al., 2001) to more robust, semi-supervised (or even unsupervised) approaches (Zhou and He, 2011). In 2009, the BioNLP conference shared task focused on event extraction (Kim et al., 2009), with more than twenty participants competing to build the best system able to detect complex biomolecular events such as binding and regulation.
Miwa et al. (2010) used machine learning with rich features in order to design an event detector that outperforms the best system from the BioNLP'09 shared task challenge.

However, events are also important in other domains. In their T-REX system, Albanese and Subrahmanian (2007) use RDF21 to extract cultural information from various domains. By extracting instances associated with user-specific RDF schemas, T-REX is able to find relevant information about Pakistani-Afghan tribes, but also occurrences of totally unrelated violent events, at the rate of nearly 50 000 web pages per day. Despite favouring this domain-independent approach, the authors insist on using language-specific extraction rules, as they argue that “the current state of the art of Machine Translation does not guarantee sufficiently high quality [ ...]”.

21 RDF will be presented in detail in Chapter II.

While translating all news items to English before analysing them would clearly be questionable, the introduction of very specific rules also has its drawbacks, as it makes the system more complex and harder to maintain. An alternative would be to have a system that is both domain-independent and language-independent, bearing in mind that there always exists a trade-off between generalisation and extraction quality. Using such a multilingual approach, Busemann and Krieger (2004) developed a system based on the shallow information extraction platform SProUT (Becker et al., 2002) in order to assess travelling risks from press releases published on British, French, and German governmental websites. Research perspectives along these lines will be contemplated in Chapter IV, and further elaborated upon in our conclusions.
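The lexico-syntactic rules discussed in this section are, by nature, language-specific; a toy illustration for English, where the “Hire” event type is taken from Table I.3 but the regular expression itself is our own illustrative assumption, not an actual system:

```python
import re

# A single hand-written lexico-syntactic pattern for a "Personnel Change"
# event of type "Hire". Writing and maintaining such rules is exactly the
# expert effort that knowledge-driven approaches require, and the pattern
# only covers one English construction.
HIRE = re.compile(r"(?P<org>[A-Z]\w+) (?:hired|appointed) (?P<per>[A-Z]\w+ [A-Z]\w+)")

def extract_hire_events(text: str):
    """Return one event template per match of the Hire pattern."""
    return [
        {"EVENT-TYPE": "Hire", "ORG": m.group("org"), "PER": m.group("per")}
        for m in HIRE.finditer(text)
    ]
```

Porting this rule to Dutch or French would require rewriting it entirely, which is why domain- and language-independence come at the cost of extraction quality, as noted above.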

3.3 Template filling

Fortunately, the information contained in text is not completely random and often follows known patterns representing stereotypical situations in the real world. IE can take advantage of this recurring structure of information to anticipate what is to be found by providing templates containing a fixed number of slots that need to be filled. Slot filling, or template filling, is thus one of the simpler subtasks of IE, merely consisting in finding values in text for a given set of attributes. For instance, elections at various levels happen fairly often, especially in a country with such complex political structures as Belgium. An example from the Journal d’Ypres is provided in Figure I.7.

Figure I.7: Candidate event for template filling

Elections typically involve an administrative level, a geographical entity, a period of time, and a winner. Knowing this in advance enables looking for these particular elements in text, and filling the template accordingly:

ELECTION: [LEVEL: municipal; ENTITY: Ypres; DATE: 1911-10-15; WINNER: Catholic Party]

In this way, template filling allows for better comparison of similar events. In most cases, however, it remains very difficult to predict in advance what will have to be extracted from an unknown document. Generalisable, open extraction methods will therefore achieve better results than narrow templates, except for very specific domains.
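The filling process itself can be sketched as one naive pattern per slot; the slot names follow the election template above, while the patterns are illustrative assumptions that would, in a real system, operate on NER and date-normalisation output rather than raw text:

```python
import re

# Election template with one naive extraction pattern per slot.
SLOT_PATTERNS = {
    "LEVEL": re.compile(r"\b(municipal|provincial|national)\b"),
    "ENTITY": re.compile(r"\bin ([A-Z]\w+)"),
    "DATE": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "WINNER": re.compile(r"\bwon by (?:the )?([A-Z]\w+ Party)\b"),
}

def fill_template(text: str) -> dict:
    """Fill each slot with the first match found in the text, or None."""
    template = {}
    for slot, pattern in SLOT_PATTERNS.items():
        m = pattern.search(text)
        template[slot] = m.group(1) if m else None
    return template

text = ("The municipal election held in Ypres on 1911-10-15 "
        "was won by the Catholic Party.")
```

Leaving a slot at None when no pattern matches is what makes partially filled templates comparable, which is the point of the approach; the narrowness of the patterns is also its weakness.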

Summary

In this first chapter, we provided an overview of the field of information extraction (IE), mainly concentrating on the core task of named-entity recognition (NER) but also discussing relation detection, event extraction, and template filling. While the focus was more on theoretical and epistemological considerations than on practical matters, these components will later serve as building blocks for the operational implementation of our knowledge discovery tool.

Going back to the origins of artificial intelligence and natural language processing, we started from the Turing test in order to understand the deep motivations behind the drive to build thinking machines. Addressing the counterargument anticipated by Turing (1950) that computers could never “use words properly”, generations of linguists and computer scientists have striven to design programs able to extract the vast amount of information contained in unstructured documents in an automated way.

At the forefront of this large-scale initiative is NER, a task concerned with the identification and subsequent (or concurrent) categorisation of proper names present in a text. Natural language being intrinsically ambiguous, getting formal machines to learn what constitutes a valid entity is not trivial, and much work has been invested to reach this goal. While the classification into semantic categories (such as persons, organisations, and locations) helps to solve some cases of ambiguity, “un typage, aussi fin et précis soit-il, ne constitue pas l’établissement explicite d’un lien de référence entre une mention et une entité” (Stern, 2013) – a typing, however fine and precise, does not in itself establish an explicit referential link between a mention and an entity. The only realistic path toward reconciling the signifier and the signified therefore seems to involve the full disambiguation of entities and concepts.
In the next chapter, we will introduce the Semantic Web and Linked Data as alternative approaches to information extraction, exploiting the vast amount of knowledge present on the Web in a variety of languages to annotate and disambiguate entities. This approach, however, will not come without its lot of difficulties, from the elusive nature of empirical content to the quality of external references, and from language-independence to the evolution of concepts over time. While these issues will now temporarily be set aside, it is only to come back to them in Chapters III and IV, where they will be discussed extensively in order to assess their operational impact on the claim to generalisability defended in our main thesis.

Chapter II

Semantic Enrichment with Linked Data

Outline

Chapter I introduced information extraction with its strengths and limits. Named-entity recognition, in particular, is a crucial task to identify and categorise proper nouns in unstructured text. Full disambiguation, however, calls for new computational techniques to be applied on digitised or born-digital corpora, and traditional information extraction systems are currently not well equipped to provide them. The building-up of the Semantic Web, as dreamt by Tim Berners-Lee (2000, pp. 157–158), offers a framework to go beyond entity recognition and classification.

Section 1 exposes the reasons behind the structuring of the Web, from the original vision to new standards for data interoperability and the transition towards the Web of Data, laying the foundations for semantic enrichment. Linked Open Data are considered as a gateway toward better accessibility and visibility for the cultural heritage players that will be introduced in Chapter III.

In Section 2, we focus on various resources that will be useful for our purposes, especially knowledge bases such as DBpedia but also the ontologies behind them. The crucial role of URIs, which are used as unique identifiers for disambiguation, is discussed there, along with some known limitations related to their use that will be further developed in Chapter IV.

Finally, Section 3 introduces semantic enrichment of content in itself, which mainly relies on the task of entity linking. Since the ambiguity of what constitutes a valid (named) entity has already been highlighted in the previous chapter, we also extensively discuss how entities relate to terms, and terms to concepts. By critically assessing entity linking and its technical foundations, we go a step further toward the design of an original methodology for knowledge discovery that will be implemented in Chapter V.


Contents

1 Making Sense of the Web...... 53
1.1 The original vision...... 54
1.2 Data structure and interoperability...... 56
1.3 From the Semantic Web to Linked Data...... 61
2 Semantic Resources...... 64
2.1 Knowledge bases...... 64
2.2 Ontologies...... 70
2.3 Identifiers...... 73
3 Enriching Content...... 76
3.1 Terminology...... 77
3.2 Entity linking...... 81
3.3 Semantic relatedness...... 85

1 Making Sense of the Web

The democratisation of the World Wide Web in the 1990s revolutionised our relationship to documents and propelled mankind into the Information Age, fulfilling the original vision of Paul Otlet (1934, p. 428):1

“De là une [ ...] hypothèse, réaliste et concrète celle-là, qui pourrait, avec le temps, devenir fort réalisable. Ici la Table de Travail n’est plus chargée d’aucun livre. À leur place se dresse un écran et à portée un téléphone. Là-bas au loin, dans un édifice immense, sont tous les livres et tous les renseignements, avec tout l’espace que requiert leur enregistrement et leur manutention, avec tout l’appareil de ses catalogues, bibliographies et index, avec toute la redistribution des données sur fiches, feuilles et en dossiers [ ...] Le lieu d’emmagasinement et de classement devient aussi un lieu de distribution, à distance avec ou sans fil [ ...] De là on fait apparaître sur l’écran la page à lire pour connaître la réponse aux questions posées par téléphone [ ...]”

The exponential explosion of information, however, has made it increasingly difficult for human agents to perform exhaustive search on the material available online. Tim Berners-Lee therefore pushed the utopia one step further and imagined a Web that would be meaningful not only to humans but also to computers, an evolution that Otlet (1934, p. 428) had also predicted (emphasis added):

“Une telle hypothèse, un Wells certes l’aimerait. Utopie aujourd’hui parce qu’elle n’existe encore nulle part, mais elle pourrait bien devenir la réalité de demain pourvu que se perfectionnent encore nos méthodes et notre instrumentation. Et ce perfectionnement pourrait aller peut-être jusqu’à rendre automatique l’appel des documents sur l’écran (simples numéros de classification, de livres, de pages); automatique aussi la projection consécutive, pourvu que toutes les données aient été réduites en leurs éléments analytiques et disposées pour être mises en œuvre par les machines à sélection.”
The next three sections look into the emergence of the Semantic Web in the early 2000s and its later developments until today, focusing on the original vision of its co-founders, the underlying mechanisms for data structure and interoperability, and the progressive transition to the Web of Data.

1 For a detailed analysis of the parallel between Otlet’s vision and the workings of Hypertext in the context of the Web, see Rayward (1994).

1.1 The original vision

In their seminal article, Berners-Lee et al. (2001) imagined a new form of the Web that would make sense to machines and could be automatically exploited by them in order to meet new challenges posed by information overload:

“Most of the Web’s content today is designed for humans to read, not for computer programs to manipulate meaningfully. [ ...] The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. The first steps in weaving the Semantic Web into the structure of the existing Web are already under way. In the near future, these developments will usher in significant new functionality as machines become much better able to process and “understand” the data that they merely display at present.”

The vision of Tim Berners-Lee is often described as a shift in focus from raw data and information (i.e. unstructured documents loosely aggregated) to superior levels of understanding called knowledge or even wisdom. Figure II.1 shows the original DIKW (Data, Information, Knowledge, & Wisdom) Pyramid inspired by Ackoff (1989). Leal et al. (2012) sum up the project of the Semantic Web as follows:

“Currently, the Web is a set of unstructured documents designed to be read by people, not machines. The semantic web – sponsored by W3C – aims to enrich the existing Web with a layer of machine-interpretable metadata on Web resources so that computer programs can predictably exchange and infer new information. This metadata is usually represented by a general purpose language called Resource Description Framework (RDF).”

American journalist Clay Shirky (2003), however, is pessimistic about the fulfilment of this ideal: “the Semantic Web, with its neat ontologies and its syllogistic logic, is a nice vision. However, like many visions that project future benefits but ignore present costs, it requires too much coordination and too much energy to effect in the real world, where deductive logic is less effective and shared worldview is harder to create than we often want to admit”.

Figure II.1: DIKW Pyramid, reproduced from Chris Alvarez (CC BY-SA)

Do these criticisms entail that the Semantic Web is doomed to fail, or that it is useless? We do not think so, and answer with Greenberg and Méndez (2007, p. 3):

“Criticism is useful for addressing current shortcomings and planning the next step in developing a Semantic Web. The downside of criticisms is that they often fail to note where important progress has been made. What is important and stands as evidence of major progress is the wide range of communities with a growing interest in information standards, data interoperability, and open information. Never in our time has there been a more universal interest in producing structured, standardized information. The idea of the Semantic Web initiative will, at the very least, help many more initiatives to benefit from standardized organization and access to information.”

While realistically acknowledging that the Semantic Web is no panacea, we propose to build upon its technical foundations and to take advantage of what semantic resources, such as knowledge bases and ontologies, have to offer in order to improve information extraction techniques.

1.2 Data structure and interoperability

In order to make the Web of documents exploitable by machines, data need to be organised in a structured form, but also published according to standards making them interoperable. In this section, we will review the fundamental languages behind the Web of Data: XML and RDF, SPARQL, and SKOS.2

1.2.1 XML and RDF

Standards are an integral part of our case study: XML is used to encode the news articles in the corpus, while RDF is the format of the facts we will use to enrich them. Berners-Lee et al. (2001) herald the importance of these recommendations of the World Wide Web Consortium (W3C) for the Semantic Web:

“Two important technologies for developing the Semantic Web are already in place: eXtensible Markup Language (XML) and the Resource Description Framework (RDF). [ ...] XML allows users to add arbitrary structure to their documents but says nothing about what the structures mean. Meaning is expressed by RDF, which encodes it in sets of triples, each triple being rather like the subject, verb and object of an elementary sentence.”

In place of the verb, the term predicate is used to encompass a broader set of relations: S:Ypres P:postalCode O:8900, for instance, is a valid triple despite the fact that “postal code” is not a verb. These (s,p,o) triples are the basis of knowledge bases such as DBpedia. They are also known as facts, and the action of identifying them in text as fact spotting, a task which can “help derive valuable training data for entity linking and relation extraction tasks” (Tylenda et al., 2014). In order to identify the different parts of a triple unequivocally, the Semantic Web makes use of unique identifiers called uniform resource identifiers (URIs), as explained by Berners-Lee et al. (2001):

“In RDF, a document makes assertions that particular things (people, Web pages or whatever) have properties (such as “is a sister of”, “is the author of”) with certain values (another person, another Web page). This structure turns out to be a natural way to describe the vast majority of the data processed by machines. Subject and object are each identified by a Universal Resource Identifier (URI), just as used in a link on a Web page.”

2 The Web Ontology Language (OWL) is also an important component of the Semantic Web, but its discussion will be delayed until Section 2.2, which deals with ontologies.

What makes RDF more useful and more dynamic, however, is that not only entities are represented by URIs, but also the properties and relationships linking them together. Berners-Lee et al. (2001) continue:

“The verbs are also identified by URIs, which enables anyone to define a new concept, a new verb, just by defining a URI for it somewhere on the Web. [ ...] The triples of RDF form webs of information about related things. Because RDF uses URIs to encode this information in a document, the URIs ensure that concepts are not just words in a document but are tied to a unique definition that everyone can find on the Web.”

Creating concepts out of nowhere “just by defining a URI” may sound idealistic, but the important part is that RDF does not rely on predefined properties. The DIKW pyramid finds its technical counterpart in the Semantic Web stack (or layer cake), which describes the different technology layers needed to achieve our goal (Figure II.2). The building blocks are URIs (see Section 2.3) and Unicode (crucial to deal with diacritics), while higher abstraction levels include XML and RDF, but also SPARQL (introduced in Section 1.2.2) and OWL (Section 2.2.1). The other layers will not be discussed here for lack of space.
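To make the notion of (s,p,o) facts concrete, the Ypres triple mentioned earlier (S:Ypres P:postalCode O:8900) can be serialised in N-Triples, one of the plain-text formats for RDF; a minimal sketch, where the dbr: and dbo: URIs follow DBpedia's naming scheme but the helper function itself is our own illustration:

```python
# Represent an RDF fact as two URIs (subject, predicate) and a literal
# (object), then serialise it in N-Triples syntax: one "<s> <p> "o" ." line
# per triple. The property URI follows DBpedia's dbo: namespace.
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

triple = (DBR + "Ypres", DBO + "postalCode", "8900")

def to_ntriples(s: str, p: str, o: str) -> str:
    """Serialise one (s, p, o) fact with a literal object."""
    return f"<{s}> <{p}> \"{o}\" ."

line = to_ntriples(*triple)
```

Because each line is self-contained, N-Triples files can be concatenated or split freely, which is one reason the format is popular for exchanging large knowledge-base dumps.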

Figure II.2: The Semantic Web layer cake

XML relies on the traditional alternation between an intensional model and an extensional document. The intension is realised by an XML schema (.xsd file), while the actual document appears in a .xml file. This model is pervasive on the Semantic Web, being also used to articulate ontologies and knowledge bases (see Section 2.2). In contrast, the simplicity of the RDF model nearly amounts to schemalessness, or rather schema neutrality: “By simplifying to a maximum the data model, all of the semantics are made explicit by the triple itself. By doing so, there is no longer a need for a schema to interpret the data.” (van Hooland and Verborgh, 2014, p. 44).3

As the authors note, however, “schema-neutral does not mean that no schema-related issues remain”. We still depend on schemas, but the methods and tools to use them are becoming increasingly open and standardised. Moreover, although XML is widely used in the cultural heritage domain in order to embed metadata in text, it has also been criticised for a number of inadequacies. Schmidt (2010) lists the following potential problems with markup:

• the impossibility of overlapping tags

• the inclusion of potentially obsolescent technical and subjective information into texts that are supposed to be archivable for the long term

• the manual encoding of information that could be better computed automatically

• the obscuring of the text by highly complex technical data

Since we will be using millions of XML files in our case study presented in Chapter III, we need to be aware of these limitations, which can potentially arise in the context of any digitisation project.

Spaeth (2004) warns against the temptation of using XML for its own sake in cases where keeping data in their original format would have been sufficient, although he argues it can still be useful as a data exchange format. Respecting Text Encoding Initiative (TEI) standards can also make projects so complex that they defy understanding, as illustrated in Figure II.3: twenty distinct steps are required in order to publish diplomatic documents online.

3 The idea of schemalessness has been pervasive since the rise in popularity of alternatives to relational databases. While evaluating the added value of graph databases for the humanities, Blanke and Kristel (2013), for instance, write that “NoSQL databases [ ...] often give up on the schema-centricism of traditional databases”. See the blog entry http://blog.jooq.org/2014/10/20/stop-claiming-that-youre-using-a-schemaless-database/ for a critique of the term schemaless as used by the MongoDB NoSQL DBMS. The capabilities of graph databases for implementation will be discussed in Chapter V and again in our conclusions.

Figure II.3: Overly complex TEI workflow, reproduced from the research blog http://cvcedhlab.hypotheses.org/98

1.2.2 SPARQL

The SPARQL Protocol And RDF Query Language4 (a recursive acronym) is a W3C recommendation that allows Linked Open Data (LOD) resources published in RDF (such as those contained in knowledge bases) to be queried in a simple, straightforward way. It requires a SPARQL endpoint (an RDF triple store) to be set up. SPARQL comes in various implementations, e.g. OpenLink Virtuoso.5 For instance, querying DBpedia's endpoint for all persons sharing the same occupation as Jacques Cartier and having the term “Canada” mentioned in their biography (i.e. all Canadian navigators or explorers of Canada) can be achieved as shown below:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?person WHERE {
    dbr:Jacques_Cartier dbo:occupation ?job .
    ?person dbo:occupation ?job .
    ?person dbo:abstract ?biography .
    FILTER regex(?biography, "Canad")
}

4 http://www.w3.org/TR/sparql11-overview/
5 http://virtuoso.openlinksw.com/

The first two lines of code declare that the prefixes dbr and dbo stand for <http://dbpedia.org/resource/> and <http://dbpedia.org/ontology/> respectively, so that the full URIs need not be repeated every time.6 The remainder of the query selects all distinct persons meeting the following conditions:

• Jacques Cartier has an occupation we will name ?job

• the person we seek must also have this occupation

• furthermore, we look into the person’s summary and call it ?biography

• we then filter this biography with a simple regular expression to see if it contains the string “Canad” (covering the terms Canada and Canadian)

Note that the names of the variables (?person, ?job, ?biography) have no special meaning for SPARQL and are left to the personal choice of the user. Selected results for this query are shown in Table II.1.

person
http://dbpedia.org/resource/Samuel_de_Champlain
http://dbpedia.org/resource/David_Thompson_(explorer)
http://dbpedia.org/resource/Joseph_William_McKay
http://dbpedia.org/resource/John_Davis_(English_explorer)
http://dbpedia.org/resource/John_Rae_(explorer)
http://dbpedia.org/resource/Leif_Erikson
http://dbpedia.org/resource/Antoine_de_la_Mothe_Cadillac
http://dbpedia.org/resource/Mathieu_de_Costa
http://dbpedia.org/resource/Peter_Fidler_(explorer)
http://dbpedia.org/resource/Thomas_Button

Table II.1: Results of a SPARQL query for Canada-related explorers

SPARQL queries are essential to access knowledge bases programmatically and restrict results to types of interest, as will be shown again in Section 2.2.1. This mechanism is also an integral part of MERCKX, the knowledge extractor that we will be introducing in ChapterV.
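Accessing an endpoint programmatically typically means sending the query text over HTTP. A minimal sketch with Python's standard library, using the public DBpedia endpoint mentioned above (only the request URL is built here; no network call is made, and the JSON result format is one common choice among several):

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # DBpedia's public SPARQL endpoint

query = """PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?person WHERE {
    dbr:Jacques_Cartier dbo:occupation ?job .
    ?person dbo:occupation ?job .
    ?person dbo:abstract ?biography .
    FILTER regex(?biography, "Canad")
}"""

# SPARQL endpoints accept the query text as an HTTP GET parameter;
# requesting JSON results keeps client-side parsing trivial.
params = urlencode({"query": query, "format": "application/sparql-results+json"})
request_url = ENDPOINT + "?" + params
print(request_url[:80])
```

The resulting URL can then be fetched with urllib.request.urlopen and the returned bindings iterated over, which is how enrichment tools integrate SPARQL into a processing pipeline.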

6 Actually these are very common, predefined prefixes, so it is not even necessary to declare them explicitly. See http://dbpedia.org/sparql?nsdecl for a full list of DBpedia's predefined namespace prefixes.

1.2.3 SKOS

The Simple Knowledge Organization System7 is a W3C recommendation for the representation of controlled vocabularies such as thesauri, taxonomies, and subject headings (e.g. the Library of Congress Subject Headings) within the framework of the Semantic Web (Miles and Bechhofer, 2009). Conceived both to help vocabulary maintainers publish their taxonomies as Linked Data and as a tool to manage them, SKOS aims to foster interoperability. Meunier (2014) offers an insightful critique of SKOS, underlining some of the contradictions raised by the tension between these two goals: publication on the open Web and closed-world representation. Despite these shortcomings, SKOS remains an opportunity for libraries, archives, and museums wishing to maintain their collections in a sustainable manner. In particular, the skos:Concept class makes it possible to query linked datasets for relevant concepts.

1.3 From the Semantic Web to Linked Data

Although some of the predictions of Tim Berners-Lee have come true thanks to the efforts of the W3C, the Semantic Web envisioned in 2000 remains largely underdeveloped fifteen years on. The main cause of this partial failure can be found, ironically, in the very foundational paper of the Semantic Web: the Web has notoriously always been resistant to any form of regulation, so while browser vendors can more or less be wooed into adopting standards, coercing users into adopting semantic markup is doomed to fail.

While individual web pages remain largely un-semantic, more and more collections of structured data have been published online and interconnected, a phenomenon now known as Linked Data. Linked Data can be seen as the next best thing to the Semantic Web: it does not yet fulfil the old AI dream of intelligent agents8 managing our schedules and making appointments on our behalf, but it nonetheless provides a Web of Data that can be browsed by machines in order to discover new knowledge. In 2006, Tim Berners-Lee acknowledged the non-advent of the Semantic Web and restated its goals in a less ambitious manner, formulating the four principles9 of Linked Data:

7 http://www.w3.org/2004/02/skos/
8 In AI, an intelligent (or rational) agent is an agent able to “select an action that is expected to maximize its performance measure, given the evidence provided by [its sensors] and whatever built-in knowledge the agent has” (Russell and Norvig, 2009, p. 37). According to Heylighen (2014), “the task of the intelligent agent is [...] to transform or process the input information (problem, initial state, ‘question’) via a number of intermediate stages into the output information (solution, goal state, ‘answer’)”.
9 http://www.w3.org/DesignIssues/LinkedData.html

1. Use URIs as names for things.

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).

4. Include links to other URIs so that they can discover more things.
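The third principle, in particular, is implemented through HTTP content negotiation: the same URI yields HTML for a browser and RDF for a machine, depending on the Accept header. A sketch of the machine side with Python's standard library (request construction only, no network call; application/rdf+xml is one of the standard RDF media types):

```python
import urllib.request

uri = "http://dbpedia.org/resource/Brussels"

# A browser asking for text/html is redirected to the human-readable
# documentation page; a Linked Data client negotiates an RDF
# serialisation of the same resource instead.
req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
print(req.full_url, "->", req.get_header("Accept"))
```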

Bizer et al. (2009a) explore the concept and technical principles of Linked Data and establish a list of related initiatives. The authors use the terms Semantic Web and Web of Data10 more or less interchangeably, but concede that the latter in its current state is more modest than the original idea of the Semantic Web:

“By publishing Linked Data, numerous individuals and groups have contributed to the building of a Web of Data, which can lower the barrier to reuse, integration and application of data from multiple, distributed and heterogeneous sources. Over time, with Linked Data as a foundation, some of the more sophisticated proposals associated with the Semantic Web vision, such as intelligent agents, may become a reality.”

As of August 2014, just over 1 000 datasets were officially listed in the “State of the LOD cloud” report.11 Table II.2 shows the distribution of these datasets across topical domains. Note that the quality of these datasets is uneven, an issue that will be examined with special attention in Chapter IV.

Topic                    Datasets        %
Social web                    520   51.28%
Government                    183   18.05%
Publications                   96    9.47%
Life sciences                  83    8.19%
User-generated content         48    4.73%
Cross-domain                   41    4.04%
Media                          22    2.17%
Geographic                     21    2.07%
Total                       1 014  100.00%

Table II.2: Linked datasets by topical domain

10 http://www.w3.org/2013/data/
11 http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/

As a result of the multiplication of these datasets (from 295 in 2011 to 1 014 in 2014), the representation of the LOD cloud itself has grown increasingly difficult to read as a whole. Indeed, Valsecchi et al. (2015) deplore the fact that “researchers that are not experts in Semantic Web technologies often lose themselves in the intricacies of the Web of Data”. Figure II.4 shows a small subset of it, centred on DBpedia.12

Figure II.4: Snapshot of the Linked Open Data cloud

What is immediately striking is the tight connectedness of the graph around the central resources, along with the prominence of GeoNames,13 which grew from a marginal resource in 2011 to a major one in 2014. Other useful datasets for IE and NER include OpenCalais (top left), Freebase and YAGO (bottom), and the Virtual International Authority File (VIAF, right). Notice that these other datasets are more loosely connected than DBpedia, with the exception of Freebase. To understand how these datasets work and the stakes attached to their use in the context of semantic enrichment, the next section focuses in depth on semantic resources such as knowledge bases, ontologies, and URIs, and shows how they can be leveraged to achieve entity linking.

12 Adapted from the Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/ (CC BY-SA)
13 http://www.geonames.org/

2 Semantic Resources

In order to make the Semantic Web functional, information needs to be more tightly structured than on the Web of documents. This requirement is not new: knowledge representation has long striven to represent data in a machine-processable way. The novelty of the Semantic Web, however, is to achieve this without a single central authority, but rather by linking several authorities or knowledge bases together (Berners-Lee et al., 2001):

“Information varies along many axes. One of these is the difference between information produced primarily for human consumption and that produced mainly for machines. [...] To date, the Web has developed most rapidly as a medium of documents for people rather than for data and information that can be processed automatically. The Semantic Web aims to make up for this. [...] For the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning. [...] Knowledge representation [...] contains the seeds of important applications, but to realize its full potential it must be linked into a single global system.”

Resources available to achieve this goal include knowledge bases (“structured collections of information”) and ontologies (“sets of inference rules” plus taxonomies), which will be presented in Sections 2.1 and 2.2 respectively. Both rely on the mechanism of URIs, which will be discussed in Section 2.3.

2.1 Knowledge bases

Knowledge bases (KBs) are central to the Linked Data approach: they contain the information, or rather the knowledge (in the practical sense defined in our introduction), necessary to semantically enrich content. KBs differ from traditional relational databases partly by their structure, which is often graph-based,14 but mainly by their function: their aim is not merely to store data but rather to derive new information from established facts. In what follows, we present a few of the most popular KBs, starting with DBpedia, on which we will rely for knowledge extraction in Chapter V.
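This difference of function can be sketched in a few lines: given stored facts and a single inference rule, new facts are derived rather than merely retrieved. The facts and the transitivity rule below are our own illustration, not taken from any particular KB:

```python
# Facts as (subject, predicate, object) triples, the basic KB structure.
facts = {
    ("Ostend", "locatedIn", "Belgium"),
    ("Belgium", "locatedIn", "Europe"),
}

# One inference rule: locatedIn is transitive.
def infer(triples):
    """Derive new triples until a fixed point is reached."""
    derived = set(triples)
    changed = True
    while changed:
        changed = False
        for a, p1, b in list(derived):
            for c, p2, d in list(derived):
                if p1 == p2 == "locatedIn" and b == c \
                        and (a, "locatedIn", d) not in derived:
                    derived.add((a, "locatedIn", d))
                    changed = True
    return derived

# The triple Ostend locatedIn Europe was never asserted, only derived.
print(("Ostend", "locatedIn", "Europe") in infer(facts))
```

A relational database would answer only what was stored; the KB, equipped with the rule, answers questions about facts nobody ever entered.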

14 The idea behind graph databases is to provide a more flexible model than relational databases by allowing each record to have its own schema, using Euler's graph model with nodes (resources) and edges (relationships).

2.1.1 DBpedia

DBpedia15 is “a community effort to extract structured information from Wikipedia and make this information available on the Web” (Lehmann et al., 2015). Presented as a crystallisation point for the Web of Data (Bizer et al., 2009b) and often considered the de facto centre of the Linked Open Data cloud, DBpedia is maintained in 128 languages and covers over 38 million topics (3 billion facts). Figure II.5 shows an example of a DBpedia resource.

Figure II.5: Preview of the “Ostend” resource in DBpedia

DBpedia is designed to be browsed by either humans or machines. The documentation about the resource representing the city of Brussels, for instance, can be read in HTML format at http://dbpedia.org/page/Brussels (where the corresponding resource redirects, see Section 2.3), but it can also be harvested programmatically in RDF/XML or JSON,16 among other formats. As such, DBpedia is an effective implementation of the Write Once, Publish Many strategy. Additionally, it provides a SPARQL endpoint17 to perform more complex queries on the knowledge base as a whole. However, not every resource has an equivalent in every language, which raises some questions about the multilingual structure of DBpedia. We will come back to this important issue in Chapter IV, where we will also tackle the problem of the evolution of concepts.

15 http://dbpedia.org/
16 http://dbpedia.org/data/Brussels.rdf / http://dbpedia.org/data/Brussels.json
17 http://dbpedia.org/sparql

2.1.2 YAGO

YAGO18 (for Yet Another Great Ontology) is a semantic knowledge base built upon the aggregation of Wikipedia, GeoNames, and the English lexical database WordNet.19 As of April 2015, it contains over ten million topics and more than 120 million facts (Mahdisoltani et al., 2015). The latest version (YAGO3) includes a special focus on multilingual data, harvesting information from Wikidata (see below) and from the Wikipedia categories and infoboxes of ten languages,20 with a reported precision of over 95%. Maintained at the Max Planck Institute for Informatics, YAGO is less ubiquitous than DBpedia in the LOD cloud but offers the interesting characteristic of discarding the distinction between concepts and terms, demonstrating that the approach hinted at in our first research question is workable. The multilingual version of YAGO is indeed merged with the English WordNet (Miller, 1995; Fellbaum, 1998) to create a coherent and comprehensive KB containing both proper and common nouns.

2.1.3 Freebase

Created by Metaweb and later acquired by Google, Freebase21 presented itself as a “community-curated database of well-known people, places, and things”. With over 47 million topics and nearly 3 billion facts, Freebase was one of the biggest KBs available and a serious challenger to DBpedia, raising the often overlooked question of the economics behind knowledge representation (i.e. public versus business-owned). In January 2015, however, Google announced the “retirement” of Freebase as of June 30, 2015,22 and the transfer of its content to Wikidata (see below) in an effort to avoid the multiplication of LOD sources. The move is not atypical of Google – which has been known to speedily withdraw other products in the past23 – but it calls into question the permanence of URIs, which is key to Linked Data in general and entity disambiguation in particular.

18 http://www.yago-knowledge.org/
19 http://wordnet.princeton.edu/
20 In their experiment, Mahdisoltani et al. (2015) ran the YAGO extraction system on English, German, French, Dutch, Italian, Spanish, Romanian, Polish, Arabic, and Farsi. The deliberate inclusion of non-European, non-Latin-script languages is particularly interesting in a language-independent context.
21 https://www.freebase.com/
22 https://goo.gl/rxyCvn
23 Compare the radical U-turn on Google Wave, which saw the massive creation of 100 000 accounts in 2009 followed by the sudden discontinuation of the project as early as 2010.

2.1.4 Wikidata

Wikidata24 is an initiative by the Wikimedia Foundation to bring the content of Wikipedia into a structured form by providing a knowledge base that can be read and edited by both humans and machines (Vrandečić, 2012; Vrandečić and Krötzsch, 2014). As of July 2015, it contains over 14 million topics (110 million facts). In contrast to its competitor DBpedia, which uses Wikipedia article titles for its URIs, Wikidata relies on numeric unique identifiers for each concept documented in the knowledge base. While this is more language-neutral and more robust in the case of article renaming, it comes at the cost of a loss of transparency for the user: https://www.wikidata.org/wiki/Q76, for instance, is clearly less meaningful than dbr:Barack_Obama, although both refer to the same real-world entity and can be linked to each other with the owl:sameAs property (see Section 2.2). Table II.3 shows the corresponding items for the IDs ranging from Q1 to Q80, from the universe all the way to Tim Berners-Lee. Notice that some placeholders were left blank (Q6, Q7, Q9–12, etc.), perhaps for future usage. From the point of view of ID integrity, this practice gives an illusion of objectivity but is in fact as bad as – or even worse than – the semantically rich DBpedia URIs: sequentially ordered, humanly controlled numbers do not offer the stability of randomised computer-generated IDs.25
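The practical value of such owl:sameAs links is that anything known about one URI can be transferred to its equivalents. A minimal sketch (the Q76/Barack_Obama equivalence is the one cited above; the property values and the merging logic are our own illustration):

```python
# owl:sameAs assertions between Wikidata and DBpedia identifiers
same_as = [
    ("https://www.wikidata.org/wiki/Q76",
     "http://dbpedia.org/resource/Barack_Obama"),
]

# Facts initially attached to each URI separately
facts = {
    "https://www.wikidata.org/wiki/Q76": {"occupation": "politician"},
    "http://dbpedia.org/resource/Barack_Obama": {"birthPlace": "Honolulu"},
}

# Merge the property sets of URIs declared equivalent
for a, b in same_as:
    merged = {**facts.get(a, {}), **facts.get(b, {})}
    facts[a] = facts[b] = merged

# Each identifier now exposes the union of both descriptions.
print(facts["https://www.wikidata.org/wiki/Q76"])
```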

2.1.5 ConceptNet

ConceptNet26 presents itself as a “semantic network containing lots of things computers should know about the world”, but it is essentially a KB (Havasi et al., 2007; Speer and Havasi, 2012). It contains common-sense information such as:

Example 19. saxophone – UsedFor → jazz

ConceptNet covers 3.6 million topics (15 million facts) in over 1000 languages. As noted on the project's homepage, “it would not adequately represent human knowledge if it didn't contain other languages besides English, as well”:

Example 20. книга – MadeOf → бумага27

ConceptNet also ignores the distinction between terms and entities, which confirms the hypothesis formulated in our first research question that there is no necessary division between the two.

24 https://www.wikidata.org/
25 See Section 2.3 for a comparison of URIs with other types of unique IDs.
26 http://conceptnet5.media.mit.edu/
27 A book is made of paper.

ID   Item                        ID   Item
Q1   universe                    Q41  Greece
Q2   Earth                       Q42  Douglas Adams
Q3   life                        Q43  Turkey
Q4   death                       Q44  beer
Q5   human                       Q45  Portugal
Q6                               Q46  Europe
Q7                               Q47
Q8   happiness                   Q48  Asia
Q9                               Q49  North America
Q10                              Q50
Q11                              Q51  Antarctica
Q12                              Q52  Wikipedia
Q13  triskaidekaphobia           Q53  Club-Mate
Q14                              Q54  all your base are belong to us
Q15  Africa                      Q55  Netherlands
Q16  Canada                      Q56  lolcat
Q17  Japan                       Q57  Never Gonna Give You Up
Q18  South America               Q58  penis
Q19  cheating                    Q59  PHP
Q20  Norway                      Q60  New York City
Q21  England                     Q61  Washington, D.C.
Q22  Scotland                    Q62  San Francisco
Q23  George Washington           Q63
Q24  Jack Bauer                  Q64  Berlin
Q25  Wales                       Q65  Los Angeles
Q26  Northern Ireland            Q66  Boeing
Q27  Ireland                     Q67  Airbus
Q28  Hungary                     Q68  computer
Q29  Spain                       Q69  Courrendlin
Q30  United States of America    Q70  Bern
Q31  Belgium                     Q71  Geneva
Q32  Luxembourg                  Q72  Zürich
Q33  Finland                     Q73  IRC
Q34  Sweden                      Q74  Breighton
Q35  Denmark                     Q75  Internet
Q36  Poland                      Q76  Barack Obama
Q37  Lithuania                   Q77  Uruguay
Q38  Italy                       Q78  Basel
Q39  Switzerland                 Q79  Egypt
Q40  Austria                     Q80  Tim Berners-Lee

Table II.3: First 80 items of Wikidata

Table II.4 recapitulates the number of languages, topics/concepts, and facts/assertions covered by the five knowledge bases presented above.

KB           # lang.   # topics   # facts
DBpedia          128      38.3M     3040M
YAGO              10        10M      120M
Freebase          18      48.3M     2980M
Wikidata         279      14.6M      110M
ConceptNet     1000+       3.6M     15.2M

Table II.4: Comparison of popular knowledge bases

KBs are essentially built by hand by scores of contributors (crowdsourcing), but automated computer programs (bots) also play a significant role in editing and maintaining semantic resources (Steiner, 2014), although machine approaches still struggle in terms of coverage (Russell and Norvig, 2009, p. 1047):

“There is great promise in using the Web as a source of natural language text [...] to serve as a comprehensive knowledge base, but so far machine learning algorithms are limited in the amount of organized knowledge they can extract from these sources.”

The evolution of knowledge over time is indeed not trivial and calls for constant monitoring, as will be discussed in depth in Chapter IV and in the research perspectives mentioned in our conclusions.

2.2 Ontologies

In order to establish a one-to-one correspondence between the various knowledge bases presented in the previous section, the Semantic Web uses ontologies28 which formally define the relationships between concepts. An ontology is “an explicit specification of a conceptualization” (Gruber, 1995); that is, a way to translate a conceptual model into a meaningful set of terms that can be shared by a community of users.29 Ontologies and knowledge bases are closely related, the former formalising “the intensional aspects of a domain, whereas the extensional part is provided by a knowledge base that contains assertions about instances of concepts and relations as defined by the ontology” (Buitelaar et al., 2005). In other words, ontologies provide the building bricks to establish meaningful links between the facts stored in knowledge bases.30

Ontologies consist of a taxonomy and a set of inference rules, and provide equivalence relations with one another, allowing interoperability. They also offer a potential solution to the problem of language ambiguity, since each distinct concept is given a different URI in a KB, and consistency between URIs referring to the same concept in various KBs is ensured by the equivalence relation owl:sameAs.31 Buitelaar et al. (2005) give an overview of several methodologies that can be used to derive ontologies from unstructured text. The authors build upon Gruber's definition provided above, adding three restrictions:

1. the ontology should be formal, i.e. machine-readable

2. the conceptualisation should be shared, i.e. accepted by a group or community

3. it should be restricted to a given domain of interest, i.e. ontologies are useful only for a particular application domain

28 Although distantly related to the philosophical study of the nature of existence, ontologies in their modern sense only appeared in the second half of the 20th century, and the use of the term itself is generally credited to Gruber (1995).
29 Note how this definition also goes against Ehrmann's argument, discussed in Section 2.1 of Chapter I, that the model is what distinguishes entities from terms.
30 Although the OWL guide (http://www.w3.org/TR/owl-guide/) affirms that an “ontology may include [...] instances”, we claim with Buitelaar that this is a corruption of the original meaning of an ontology, which should by nature be concerned only with the conceptual level.
31 See http://www.w3.org/TR/owl-ref/#sameAs-def for a full definition of this property. According to Moura and Davis (2014), however, few LOD sources correctly use this property. Similarly, Halpin et al. (2010) note that the boundary between sameness and similarity is sometimes porous, to the extent that the misuses of owl:sameAs outnumber its “correct” uses on the Web of Data. See Chapter IV for a discussion of the consequences of the blind use of this property.

While the first two are easily agreed upon, we contest the third restriction, which stands in opposition to the main thesis defended in this dissertation, i.e. the usefulness of general-domain resources for information extraction. The DBpedia ontology,32 for instance, is clearly convenient for any domain it covers, although some properties are necessarily more specific than others. We therefore maintain that general ontologies are workable, even if they are arguably not well adapted to very specialised domains. In what follows, we present a brief overview of the Web Ontology Language widely used on the Semantic Web, with concrete examples of how it can be implemented for our needs. We then sum up some of the main critiques that have been formulated against ontologies.

2.2.1 Web Ontology Language

The Web Ontology Language (OWL)33 is a W3C recommendation ensuring the coherence between intension and extension. For instance, dbr:Place describes what a place is, whereas dbo:Place puts the class of places in a hierarchy, as a subclass of things and a superclass of populated places, for instance. Using the prefixes introduced in Section 1.2.2, the fact that Ypres is a place can be represented by:

Example 21. dbr:Ypres rdf:type dbo:Place

Extracting all places contained within DBpedia can then easily be achieved with a basic SPARQL query,34 as explained in Section 1.2.2:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?place WHERE {
    ?place a dbo:Place
}

This simple mechanism allows us to build a comprehensive gazetteer of locations that will be used in Chapter V to extract place mentions from our corpus. Of course, this methodology can easily be adapted to different needs by simply selecting another relevant property from the ontology, making it quickly portable to other application domains.
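Since the endpoint caps results at 10 000 per query, harvesting the full gazetteer means looping over offsets. A sketch of the query generation only (the page size matches the endpoint's documented limit; fetching and result handling are left out):

```python
PAGE_SIZE = 10_000  # the endpoint returns at most 10 000 results per query

def place_queries(n_pages):
    """Yield successive SPARQL queries paging through the dbo:Place class."""
    for page in range(n_pages):
        yield (
            "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
            "SELECT DISTINCT ?place WHERE { ?place a dbo:Place } "
            f"LIMIT {PAGE_SIZE} OFFSET {page * PAGE_SIZE}"
        )

# Each generated query covers the next slice of 10 000 places.
for query in place_queries(3):
    print(query.splitlines()[-1])
```

Each query is sent to the endpoint in turn, and the loop stops once a page comes back with fewer than PAGE_SIZE results.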

32 http://dbpedia.org/ontology/
33 http://www.w3.org/TR/owl-ref/
34 With a limit of 10 000 results at a time, which makes it necessary to run the query several times with different offsets (a task that can be automated with a loop).

2.2.2 Limits of ontologies

When they exist for a given domain, ontologies are a powerful way to classify documents. In many cases, however, “such [a] classification does not exist and the cost of creating and maintaining an ontology would be unbearable” (Leal et al., 2012). Moreover, Enache and Angelov (2010) warn that “developing large scale ontologies is always an error-prone process”, especially due to the lack of integrity constraints: “most [ontology description] languages are based on some kind of untyped logic which allows to assert axioms which are not well-formed. In contrast, even the simplest database systems are equipped with some database schemas which rule out incorrect records”. We will see in Chapter IV that database integrity constraints are not sufficient to prevent incorrect values from arising, but their complete absence indeed makes ontologies all the more vulnerable to formal errors of various kinds. Shirky (2003, italics his) is similarly disparaging about the possibility of a general, worldwide ontology:

“Any attempt at a global ontology is doomed to fail, because metadata describes a worldview. The designers of the Soviet library's cataloguing system were making an assertion about the world when they made the first category of books “Works of the classical authors of Marxism-Leninism.” Melvil Dewey was making an assertion about the world when he lumped all books about non-Christian religions into a single category, listed last among books about religion. It is not possible to neatly map these two systems onto one another, or onto other classification schemes – they describe different kinds of worlds.”

On the other hand, ontologies that are too specialised are also self-defeating and cannot easily be reused for purposes not foreseen by their creators. As noted by Maturana et al. (2013), “ontologies are deeply focused on solving problems and satisfying interests of [...] professional groups” rather than user-oriented. The question of maintenance also tips the scale in favour of lightweight ontologies as opposed to very specific ones: robust ontologies with general properties about the world are less likely to break over time than those conceived for specialised technical domains, which are subject to significant concept drift, a topic that will be covered in Chapter IV (Section 3.3).

2.3 Identifiers

As the name suggests, identifiers are used to establish the identity of either a unique object (instance) or a unique class of objects. The concept of a “unique identifier” may thus seem redundant at first sight, but it is justified by this double role: model number VPCZ13M9E is an identifier of a class of computers, while serial number 27536461 5000245 is a unique identifier of a single machine. This distinction sends us back to the opposition between named entities and plain entities discussed in Chapter I, to which we will return in Section 3.1.

The classic literature about unique IDs warns about the pitfalls of using numbers that have any semantics attached to them. Instead, most authors recommend the use of “unintelligent numbers”, i.e. “purely random number[s] which can only be interpreted by reference to a central database; examining the number itself tells you nothing about the object which it identifies” (Green and Bide, 1996). The difficulty of maintaining “intelligent” (semantic) IDs in the long run has been highlighted repeatedly (Paskin, 1999).

In Belgium, for instance, social security numbers contain a 3-digit sequence which is even for women and odd for men. The distinction seemed stable enough when the practice was extended to the whole population in the 1980s. However, transgender persons now challenge this dichotomy,35 which raises problems that were unforeseen at the time of the creation of the system but would have been prevented by the adoption of semantically neutral IDs. The depletion of IPv4 addresses, and the costly transition to IPv6, also emphasise the need for a very cautious stance when designing new identifiers: the pool of 4 billion IPv4 addresses created in the 1980s was exhausted within thirty years. When designing identifiers destined to be used on a large scale, the rule of thumb remains extreme prudence.
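The “unintelligent numbers” recommended by Green and Bide are exactly what modern UUID generators produce. A sketch with Python's standard library (the semantic ID shown for contrast is a fictitious Belgian-style national number, invented for the illustration):

```python
import uuid

# A semantic ID: encodes birth date and gender (odd/even sequence),
# and therefore breaks when those social assumptions change.
semantic_id = "85.07.30-033.84"  # fictitious, Belgian-style format

# An unintelligent ID: purely random, interpretable only via a registry.
opaque_id = uuid.uuid4()
print(opaque_id)  # reveals nothing about its bearer

# The registry, not the number itself, carries all the semantics.
registry = {str(opaque_id): {"name": "Jane Doe", "gender": "X"}}
```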

2.3.1 Uniform resource identifiers

For Web resources, a persistent issue is the difficulty of drawing a line between a URI representing something on the one hand, and the actual documentation about this thing on the other. For instance, should we consider that the URI http://www.wikidata.org/wiki/Q80 disambiguates Tim Berners-Lee the person, or the webpage about him? Since a URI must be unique by definition, this seemingly double reference (known as the httpRange-14 issue) is problematic.

35 The German constitutional court officially recognised a third gender in 2013 (source: https://www.smalsresearch.be/data-simplification-and-abstraction-part-i/).

Bunescu and Pasca (2006) offer an exemplification of this commonplace confusion when they say that “because each article [of Wikipedia] describes a specific entity or concept,36 the remainder of the paper sometimes uses the term ‘entity’ interchangeably to refer to both the article and the corresponding entity.” While this may appear innocuous, it challenges the uniqueness of URIs, since they are used to represent two resources at the same time. DBpedia settled (or rather circumvented) the issue by duplicating each URI: while http://dbpedia.org/resource/Tim_Berners-Lee identifies the human being, http://dbpedia.org/page/Tim_Berners-Lee identifies the documentation about him, the former automatically redirecting to the latter. This naive solution gives the illusion of solving a centuries-old question with a technical fix, while leaving the user none the wiser about which properties refer to the subject, to the document, or to both.

These epistemological reflections on the nature of the difference between an object and its representation send us back to the classic distinction between the signifier and the signified (French signifiant and signifié) formalised by de Saussure (1916, p. 99):

“We propose to keep the word sign [signe] to designate the whole, and to replace concept and sound-image respectively with signified [signifié] and signifier [signifiant] [...] The bond uniting the signifier and the signified is arbitrary; or, since we mean by sign the whole that results from the association of a signifier with a signified, we can say more simply: the linguistic sign is arbitrary.”

Wismann (2012, pp. 232–233) goes even further by attributing the origin of the argument to Heraclitus: “in language, there is the signified and the signifier, the thing one speaks about and the thing that speaks about the thing one speaks about. And Heraclitus's thesis is that there is an insurmountable difference between what language says and the very saying of language”. If the actual thing and the way to talk about it cannot be reconciled, communication becomes impossible. But the link between the two remains arbitrary, and a simple Web redirection from one URI to another will not solve this old philosophical problem. To complicate matters even further, some identifiers also serve other purposes, as detailed in the next section.

36 Note that entities and concepts are not formally distinguished but coexist at the same level. More about that in Section 3.1.

2.3.2 Identifiers and locators

Uniform resource locators (URLs) should not be confused with URIs: URLs are a subtype of URIs fulfilling a different role. Shirky (2003) notes that “the fact that a URL itself doesn’t have to mean anything is essential – the Web succeeded in part because it does not try to make any assertions about the meaning of the documents it contained, only about their location”. In the real world, locators are seldom used as unique identifiers. A library call number like 2SIC 025.04 BOYD makes it possible to locate a book within a library, but not to identify it universally, since the same number could refer to another book in another library. Conversely, an ISBN like 978-2-80271268-8 will completely disambiguate a book, but will offer no clue as to where it can be found in practice. The same analogy holds for human beings: a postal address makes it possible to locate someone, but not to identify them uniquely, since several people can live in the same house, whereas a social security number is a unique ID but carries no information about the person’s whereabouts. On the Web, however, the separation between the ID and the locator is sometimes blurred by the double function of the URI/URL. DBpedia, for instance, uses URIs like http://dbpedia.org/resource/Holy_Grail, which also happens to be a URL.37 Although this is in accordance with the second Linked Data Principle (“Use HTTP URIs so that people can look up [the] names [of things]”), the fact remains that this state of affairs can be confusing to the uninformed user. Despite these shortcomings, which will be duly addressed in Chapter IV, the added value of URIs as unique identifiers to disambiguate entities and concepts is not to be denied. In Chapter V, we will see that most semantic enrichment tools rely on such resources for their linking components, as we will also do with our own knowledge extractor.

37 A locator for the documentation about the resource, not for the resource itself, alas!
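The identifier/locator split can be sketched as a trivial lookup (a toy illustration: the library names and the second call number are invented, only the ISBN and first call number come from the example above):

```python
# Identifier vs. locator: one global identifier (the ISBN) can map to
# several purely local locators (library call numbers).
catalogue = {
    "978-2-80271268-8": {                   # identifier: unique worldwide
        "ULB Library": "2SIC 025.04 BOYD",  # locator: meaningful only here
        "Another Library": "025.04 DEW",    # hypothetical second locator
    },
}

def locate(isbn, library):
    """Resolve a globally unique identifier to a local shelf location."""
    return catalogue.get(isbn, {}).get(library)

print(locate("978-2-80271268-8", "ULB Library"))  # → 2SIC 025.04 BOYD
```

The same identifier resolves to different locators depending on the context, whereas no single locator identifies the book universally.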

3 Enriching Content

Looking for information on the Web is not as straightforward as sometimes assumed. According to Leal et al. (2012), “searching effectively on a comprehensive information source [...] usually boils down to using the right search terms”. But does it? Finding the answer to a simple question such as “How old was Dutch astronomer Christiaan Huygens when he died?”, for instance, involves at least four non-intuitive steps:

1. converting the question formulated in natural language to a suitable request made of keywords, or searchese (“christiaan huygens date birth death” for instance)

2. browsing the hundreds of thousands of results for a relevant page

3. locating the dates of birth and death in the text

4. mentally subtracting the numbers to get the age of the person

This makes the case for the integration of more linguistic knowledge into search engines, in order to improve the performance of information retrieval systems (Bouillon et al., 2000; Moreau et al., 2007), something that has been made easier by the advent of Linked Data and the availability of general-purpose linguistic resources such as WordNet (see Section 2.1.2). Today, Google’s Knowledge Graph38 is capable, to a limited extent, of mining structured knowledge bases such as DBpedia and Freebase in order to serve the answer to the user directly. As a result, a simple search for “age of Huygens?” or even “age huygens” will yield the infobox displayed in Figure II.6. The fact that Huygens died at 66 is represented by an RDF triple of the form (Christiaan Huygens, Age at death, 66). Google correctly identifies the string “huygens” in the user’s query as the person Christiaan Huygens (subject) and the string “age” as the predicate (or property) Age at death, which prompts it to return the object (or value) 66 along with the dates used to infer this number. Along the way, additional information is provided about related persons whom the user might be interested in. While Google can afford this kind of implementation of question answering on such a massive scale, this is clearly not the case for smaller companies, even less for cultural institutions and other players from the humanities.

38 http://www.google.be/insidesearch/features/search/knowledge.

Figure II.6: Google’s Knowledge Graph
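The inference behind this triple — deriving an “age at death” value from two date-valued triples — can be sketched in a few lines of Python (a toy triple store with hypothetical predicate names, not Google’s actual machinery):

```python
from datetime import date

# A tiny triple store: (subject, predicate) -> object.
# Predicate names are hypothetical; the dates are Huygens's actual ones.
triples = {
    ("Christiaan_Huygens", "birthDate"): date(1629, 4, 14),
    ("Christiaan_Huygens", "deathDate"): date(1695, 7, 8),
}

def age_at_death(entity):
    """Infer an 'age at death' value from birth- and death-date triples."""
    born = triples[(entity, "birthDate")]
    died = triples[(entity, "deathDate")]
    # Subtract one year if the birthday had not yet passed in the death year.
    return died.year - born.year - ((died.month, died.day) < (born.month, born.day))

print(age_at_death("Christiaan_Huygens"))  # → 66
```

The point is that the answer 66 is nowhere stored explicitly: it is derived on demand from two structured facts, which is precisely what a keyword-based search engine cannot do.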

However, libraries, archives, and museums can take advantage of several initiatives gathered under the common name of digital humanities39 in order to make the best of available semantic techniques. Chapter III will investigate what the digital humanities have to offer for the semantic enrichment of a multilingual archive, to what extent, at what cost, and to whose benefit.

3.1 Terminology

Building a gold-standard corpus (GSC) for evaluation purposes requires manually annotating potentially relevant content, using clear-cut categories. As we have seen in Chapter I, named entities (proper nouns) are traditionally separated from terms (common nouns) for this purpose. Failing to do so makes it extremely difficult to reach sufficient agreement between annotators as to what constitutes a valid entity (van Hooland et al., 2015). In the absence of an unequivocal frame of reference, a GSC is the next best thing to compare the output of an automated system to “reality”. But it should be kept in mind that there is no necessary isomorphism between the GSC and reality itself: the artificial reference is always subject to interpretation by the human annotators, even when there is a high degree of agreement between them. The distinction between terms, concepts, entities, mentions, labels, etc. is indeed not always entirely clear. For instance, Leal et al. (2012) define terms as “labels of concepts in an ontology”, which somewhat blurs the whole picture. In order to define the task of entity linking in Section 3.2, we first investigate the relation between terms and concepts on the one hand (Section 3.1.1), and between terms and entities on the other hand (Section 3.1.2).

39 This quite recent term covers a reality that is far from new, but goes back to older quantitative approaches in the humanities. Rather than a genuine field of study, we see the digital humanities as a community of scholars with a keenness for computational practices.

3.1.1 Terms and concepts

As with the distinction between resources and pages, the opposition between concepts and terms can be traced back to de Saussure’s division of the linguistic sign into signifier and signified (see Section 2.3), illustrated in Figure II.7.

Figure II.7: Arbitrariness of the linguistic sign

The linguistic sign being intrinsically arbitrary (as shown by the fact that the same object has different names in different languages, or even in the same language), Frege (1960, p. 60) drew a distinction between the idea, the sense, and the reference of an object:

“The reference of a proper name is the object itself which we designate by its means; the idea, which we have in that case, is wholly subjective; in between lies the sense, which is indeed no longer subjective like the idea, but is yet not the object itself.”

To illustrate his point, Frege (1960, p. 57) argues that “The reference of ‘evening star’ would be the same as that of ‘morning star’, but not the sense”. To clarify this further, he provides the analogy of someone observing the Moon through a telescope. The Moon itself is the reference, the physical object, whereas the optical image projected through the glass in the interior of the telescope can be compared to the sense. This image is clearly not the Moon, but it is also distinct from the retinal image of the observer (which Frege calls the idea or experience). Must a concept necessarily be abstract, or can it also cover a more concrete reality? For Prost (1996, pp. 126–127), “one hesitates to speak of concepts for [period designations], because these terms have an indisputably concrete content. [...] for a word to become a concept, a plurality of meanings and experiences must enter into that single word” (italics his). We immediately notice the subjectivity of the distinction between a “plain” word and a full concept, which can vary with the context and/or the interpretation supplied. The ISO 25964 standard (ISO, 2011a, p. 3) also distinguishes terms from concepts and defines the latter as follows: “Concepts can often be expressed in a variety of different ways. They exist in the mind as abstract entities independent of terms used to express them.” This definition is criticised by Meunier (2014) for a number of reasons, not the least being that abstract entities in the mind are not very useful when dealing with very concrete thesaural relations in an operational context. The community of computational linguists and NLP scholars is by no means immune to this concept/term dissociation fallacy, as can be seen from the point of view adopted by the organisers of the Automatic Content Extraction (ACE) evaluation in their guidelines for competitors, as reported by Ahn (2006):

“Within the ACE program, a distinction is made between entities and entity mentions (similarly between event and event mentions, and so on). An entity mention is a referring expression in text (a name, pronoun, or other noun phrase) that refers to something of an appropriate type. An entity, then, is either the actual referent, in the world, of an entity mention or the cluster of entity mentions in a text that refer to the same actual entity. The ACE Entity Detection and Recognition task requires both the identification of expressions in text that refer to entities (i.e., entity mentions) and coreference resolution to determine which entity mentions refer to the same entities.”

Whereas the pervasive synonymy phenomenon resulting in an entity being referred to by various literal forms (as discussed in Chapter I) is not disputed here, what constitutes “the actual referent, in the world” and how this relates to the “cluster of entity mentions” remains entirely unclear. The nature of the link between the signifier (term) and the signified (concept) is as elusive as ever.

3.1.2 Terms and entities

Although terminology extraction and named-entity recognition have been conducted in the past as distinct research fields (see Section 1.3 of Chapter I), there is a strong case for considering them together, since the two tasks share a number of similarities. From a strictly practical point of view, there is no intrinsic difference between the DBpedia resources for Ypres (dbr:Ypres) and for ruins (dbr:Ruins): both are identified by a URI and share common properties. Although they are listed as different types of resources (dbo:Place and yago:Decay respectively), both are part of classes that can ultimately be traced back to the owl:Thing superclass, which covers all the resources of DBpedia. The blending of entities and terms can seem confusing at first because of the traditional distinction between common and proper nouns adopted by most lexicographers, but the arbitrariness of the separation is laid bare when looking at it from an information retrieval perspective: when searching for information, common and proper nouns are used indistinctly, and the previously insurmountable distinction between capitalised and uncapitalised words promptly disappears before search engines. Let it suffice to look at the query proposed at the beginning of Section 3 to persuade ourselves of this fact: “christiaan huygens date birth death” indeed mixes terms and concepts without any regard for semantic distinctions. Even more convincingly, the field survey presented in Chapter III will show that top search terms include both terms and entities, and that Ypres and the Titanic stand alongside war and ruins in the preoccupations of users. Moreover, some mentions are simply too ambiguous to be successfully categorised as either a term or an entity: they lie in-between. Prost (1996, p. 131) proposes the example of “Crise économique d’Ancien Régime” (“economic crisis of the Ancien Régime”) and calls it a half-proper noun or imperfect common noun.
In fact, the expression is too general to be considered a named entity in the strict sense defined by Kripke (see Chapter I), but still more precise than the plain word “crise”. Considering such an empirical concept as one or the other is not a deterministic choice, but rather a matter of interpretation relative to a given usage, as will be argued in Chapter III.

Some NLP researchers have explicitly acknowledged this issue. Alfonseca and Manandhar (2002), for instance, consider that plain NER is too restrictive in only recognising persons, organisations, and locations. Instead, their methodology extends to all kinds of concepts contained in WordNet, blurring the distinction between entities and terms/concepts. Moreover, being unsupervised, their approach is applicable to various languages and domains. The authors nevertheless maintain the distinction and argue that “it would be desirable that each synset [of WordNet] had a flag indicating whether it represents an instance or a concept” (a synset being a cluster of synonymous terms). Kulkarni et al. (2009) also criticise former work on entity annotation for being “biased toward specific entity types like persons and places”. In order to account for this reality, we need tools that are not confined to the extraction of specific categories of words, as strict NER systems and terminology extractors are, but are more focused on what users actually search for (user-oriented) and are interested in discovering (result-oriented). Some of the tools presented in Chapter V clearly go in this direction, and we will push the case for following this path. While accepting that there may be some circumstances where the distinction between terms and entities is still productive, we reckon that this discussion allows us to answer our first research question conclusively in the context of general-purpose semantic enrichment: all concepts are equal when considered from the perspective of casual end users.40 Determining which mentions are relevant in a document should therefore remain the prerogative of users in relation to their needs, rather than be imposed by restrictive technology.

3.2 Entity linking

Entity linking is a relatively new task that has emerged over the last few years and has also been called entity resolution (Alexopoulos et al., 2015), named-entity extraction (NEE) (Fafalios et al., 2015), or record linkage (Tylenda et al., 2014). Interestingly, Fafalios et al. (2015) note that “Entity Linking is also considered a way of Named Entity Disambiguation (NED), since a resource (e.g. a URI or a Wikipedia page) can determine the identity of an entity”. Whereas classic NER limits the disambiguation of an entity to a categorisation, entity linking goes further by trying to resolve the meaning of the entity with a unique identifier in a knowledge base. According to Fafalios et al. (2015), “a major challenge for the Semantic Web is the extraction of structured data through the development of automated NEE tools”.

40 Although Orwell would probably consider that some concepts are more equal than others...

In this section, we will investigate three tasks that appeared separately but that can now be reconciled under the common heading of entity linking: disambiguation to Wikipedia or Wikification, semantic annotation of documents, and Knowledge Base Population. All these early attempts go in the direction of the generalisation and decompartmentalisation of IE defended in this thesis, and we will build on them in Chapter V. For a complete survey of state-of-the-art entity linking techniques, see Shen et al. (2015).

3.2.1 Wikification

As stated in Chapter I, natural language is intrinsically ambiguous. Despite improvements in the field of word sense disambiguation (Moro et al., 2014), this ambiguity remains a major challenge for all kinds of NLP applications. Efforts in this direction using Wikipedia as a component are detailed below. In the context of geoparsing, for instance, which involves the identification of place names in unstructured text, Moura and Davis (2014) note that “place names are often ambiguous with other place names and with nouns used to designate people and objects”. Accordingly, the authors make a distinction between geo/geo ambiguity (i.e. a place mention ambiguous with other place names) and geo/non-geo ambiguity (a place ambiguous with other entities). This situation is by no means restricted to places and reflects the various types of ambiguity detailed in Chapter I (Section 2.3). To solve this problem, Moura and Davis (2014) argue that “Wikipedia is a good external source of evidence, both for recognition and for disambiguation”. For Kulkarni et al. (2009), “the challenge of Web mining systems is to harness the chaotic ‘wisdom’ of the crowds into relatively clean knowledge” (reminding us of the DIKW pyramid of Ackoff (1989), see Section 1.1). Their aim is to “identify textual references (called ‘spots’) to named entities and annotate the spots with unambiguous entity IDs (called ‘labels’) from a catalog”. To do so, the authors propose a general collective disambiguation approach for the aggressive open-domain annotation of Web pages, treating entity mentions globally at document level rather than individually. To illustrate their method, they give the following example: while Michael Jordan and Stuart Russell are fairly ambiguous names that can refer to a lot of different persons, “a page where both Michael Jordan and Stuart Russell are mentioned is almost certainly about computer science, disambiguating them completely”.
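The collective intuition behind this example — choosing the joint assignment of candidates that maximises mutual coherence, rather than disambiguating each mention in isolation — can be sketched as a brute-force search (a toy version with invented candidate names and coherence scores, not the actual algorithm of Kulkarni et al.):

```python
from itertools import product

# Hypothetical candidate entities per mention and pairwise coherence scores.
candidates = {
    "Michael Jordan": ["Michael_Jordan_(basketball)", "Michael_I._Jordan_(scientist)"],
    "Stuart Russell": ["Stuart_Russell_(AI_researcher)", "Stuart_Russell_(actor)"],
}
coherence = {
    frozenset({"Michael_I._Jordan_(scientist)", "Stuart_Russell_(AI_researcher)"}): 0.9,
    frozenset({"Michael_Jordan_(basketball)", "Stuart_Russell_(actor)"}): 0.1,
}

def disambiguate(candidates, coherence):
    """Pick the joint assignment with the highest pairwise coherence."""
    mentions = list(candidates)
    best, best_score = None, -1.0
    for assignment in product(*(candidates[m] for m in mentions)):
        score = sum(coherence.get(frozenset({a, b}), 0.0)
                    for i, a in enumerate(assignment)
                    for b in assignment[i + 1:])
        if score > best_score:
            best, best_score = dict(zip(mentions, assignment)), score
    return best

print(disambiguate(candidates, coherence))
```

On this toy input, the co-occurrence of the two names pushes both mentions towards their computer-science readings, exactly as in the example quoted above. Real systems replace the exhaustive enumeration with approximate optimisation, since the joint space grows exponentially with the number of mentions.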
Charton and Torres-Moreno (2009) operate what looks like a complete reversal of the problem of named-entity disambiguation: instead of using the full potential of Wikipedia to disambiguate entities, they try to classify them back into categories in order to fit the traditional definition of the NER task.

In this approach, the requirements of the ESTER evaluation campaign (see Chapter I, Section 2.2.1) clearly supplant the potential added value for users. Finally, Singh et al. (2011, 2012) use an impressive “labeled corpus of 1.5 million disambiguated mentions in Web pages by selecting link anchors referring to Wikipedia entities” in order to train their entity linking system. This approach is biased, however, since “Wikipedia mentions are disproportionately likely to have corresponding Wikipedia pages” compared to mentions from general text, according to Ratinov et al. (2011), who also identify the “key remaining challenge: determining when mentions refer to concepts not captured in Wikipedia”, a task similar to NIL clustering (see Section 3.2.3). All these approaches are limited by the fact that Wikipedia is not a real knowledge base but an encyclopaedia: most of the information it contains is in unstructured form, and links to external sources of knowledge are sparse. In order to improve the coverage and reliability of the information extracted to enrich content, a larger panel of resources needs to be considered, which can only be achieved with Linked Data.

3.2.2 Semantic Annotation

According to Stern (2013), semantic annotation is the task of linking annotated elements in text to resources from the Semantic Web, such as knowledge bases. In agreement with the vision of Tim Berners-Lee, its aim is to take advantage of the potential of human knowledge hidden in unstructured data, and to connect it in a way that is workable for intelligent agents. Semantic annotation can be seen as the operational counterpart of the Semantic Web, linking information diluted in unstructured text to models of formalised knowledge. Tamilin et al. (2010), for instance, used this technique to semantically enrich an Italian news archive. IE can therefore be seen as a necessary component of semantic content enrichment, since it provides the technical framework for the extraction of relevant information and its automated exploitation. Wilks and Brewster (2009) in fact opened the way to this interdisciplinary approach by considering natural language processing as an essential foundation of the Semantic Web. Annotating existing content with semantic metadata improves its future interpretation. To do so, semantic annotation needs to cross the border between linguistic data and formal representation. It remains that annotation is not an end in itself but should serve a tangible goal. While Stern (2013) deplores the fact that the Semantic Web suffers from a lack of annotated objects, Wilks (2008) conversely warns against the “apotheosis of annotation”, which risks obfuscating the real meaning of documents.

3.2.3 Knowledge Base Population

Although efforts to map natural language text onto databases have existed for a long time (Mazlack and Feinauer, 1980), Knowledge Base Population (KBP) is a relatively new area of research consisting either in enriching an existing KB with new facts and relations extracted from unstructured text, or even in populating an entirely new KB from scratch (cold-start KBP). It differentiates between non-collective approaches, which process each entity mention separately, and collective approaches, which take advantage of the global coherence of a document to disambiguate related entities simultaneously. KBP emerged as a separate track in the 2009 edition of the Text Analysis Conference (TAC),41 with the aim to “promote research in and to evaluate the ability of automated systems to discover information about named entities and to incorporate this information in a knowledge source”.42 Since 2014, TAC-KBP includes an entity discovery and linking (EDL) task, which “requires a system to take raw texts as input, automatically extract entity mentions, link them to a knowledge base, and cluster NIL mentions”43 (Ji et al., 2014). This task was initially focused on English but has been extended to Chinese and Spanish for the 2015 campaign, opening the way to multilingual entity linking (although mapping remains unidirectionally linked to an English KB). Whereas a simpler entity linking task (aiming at linking a given named-entity mention to a KB) had been part of the TAC-KBP track since 2009, EDL introduces the idea of an end-to-end pipeline, thereby recognising the need to merge NER with entity linking. In this sense, EDL is related to Wikification but remains restricted to traditional named-entity categories: persons, organisations, and geo-political entities.
EDL also differs by the importance given to NIL clustering: whereas simple Wikification can afford to ignore entities without a Wikipedia entry, EDL (and KBP in general) is more demanding in that it precisely aims to enrich a KB with new concepts. This can prove particularly tricky when a similar entry is present within the KB, which requires the EDL algorithm to assign a stronger weight to the absence of a link (NIL) than to the best candidate. Let us consider the following example:

Example 22. Guido Calogero (1904–1983) enseigna l’histoire de la philosophie à l’Université de Rome. Calogero dirigea également la revue Panorama. (“Guido Calogero (1904–1983) taught the history of philosophy at the University of Rome. Calogero also edited the journal Panorama.”)

41 http://www.nist.gov/tac/2009/
42 Quoted from the task description of TAC-KBP.
43 A NIL mention is a found named entity that is considered to be without a corresponding entity in a given KB. NIL clustering is central to KBP, since it allows variants of a yet undocumented concept to be grouped before deciding to create a new entry in the KB.

While the Italian philosopher Guido Calogero has his own page on the Italian version of Wikipedia,44 it lacks a counterpart in either English or French (i.e. https://en.wikipedia.org/wiki/Guido_Calogero and https://fr.wikipedia.org/wiki/Guido_Calogero do not exist). The challenge for an EDL system would be to recognise both “Guido Calogero” and “Calogero” as valid entity mentions, to cluster them together, but to link them to NIL with respect to the French Wikipedia, rather than establishing an erroneous link to https://fr.wikipedia.org/wiki/Calogero, which refers to the French singer–songwriter. A related task is the extension of ontologies, sometimes called knowledge acquisition. Alfonseca and Manandhar (2002) propose a completely unsupervised approach based on what they call General Named Entity Recognition, “a task that covers, and is harder than both Named Entity Recognition and Word Sense Disambiguation”. Finally, Exner and Nugues (2012) propose a method to automatically transform information contained in unstructured text into RDF triples, mapping them to the DBpedia namespace.45 However, several questions are left unanswered in their approach, such as the quality of the reference, the adaptation of this method to other languages, or the evolution of the triples over time, all of which will be tackled in Chapter IV.
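The NIL decision described above can be sketched as a simple thresholded comparison (a toy illustration with an invented similarity score and threshold, not an actual EDL system):

```python
# Toy NIL-linking decision: prefer NIL over a weak best candidate,
# so that a merely similar KB entry does not absorb an unknown entity.
def link(mention, kb_scores, nil_threshold=0.5):
    """Return the best-scoring KB entry, or 'NIL' if none is strong enough."""
    if not kb_scores:
        return "NIL"
    best, score = max(kb_scores.items(), key=lambda kv: kv[1])
    return best if score >= nil_threshold else "NIL"

# "Calogero" superficially matches the French singer's page, but the
# philosophical context makes the (invented) similarity score low:
print(link("Calogero", {"fr:Calogero_(chanteur)": 0.3}))  # → NIL
```

Real systems learn this trade-off rather than hard-coding a threshold, but the asymmetry is the same: the cost of a wrong link (here, to the singer) outweighs the cost of declaring a mention NIL and clustering it for later KB enrichment.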

3.3 Semantic relatedness

According to Leal et al. (2012), “extracting the semantic relatedness of terms is an important topic in several areas, including data mining, information retrieval and web recommendation”. The relationships existing between terms send us back to the relations between entities discussed in Chapter I, but computing semantic relatedness will also allow us to design an alternative approach to search suggestions in Chapter V. Pioneering work in this field was conducted by Gabrilovich and Markovitch (2007), who proposed an original method based on Wikipedia and named it “explicit semantic analysis”. One of the interesting features of their approach is that they do not make a difference between the lexical level and the document level: their method “treats both words and texts in essentially the same way” (compare with Blanke and Kristel (2013)). Considering words in context, the authors then compare their occurrence patterns across a large collection of natural language documents in order to obtain an accurate representation of the meaning of a text as a vector of Wikipedia-based concepts.

44 https://it.wikipedia.org/wiki/Guido_Calogero (also available in Basque and Polish).
45 The authors made their extracted corpus available at http://semantica.cs.lth.se.

Wubben and van den Bosch (2009) investigate the difference between semantic similarity (synonymy) and semantic relatedness. They argue that while WordNet shortest paths are well suited for the former, they do not export well to other types of networks such as Wikipedia and ConceptNet. The authors propose a new estimate metric based on a bidirectional breadth-first search algorithm run on an open graph structure extracted from the KBs. While the results for ConceptNet are lower due to its lack of coverage, the score achieved on a Wikipedia dump significantly outperforms former approaches, showing that “free link structure in conceptual networks is better suited for finding semantic relatedness than hierarchical structures organized along taxonomic relations as WordNet” (Wubben and van den Bosch, 2009). Building on these insights, Leal et al. (2012) propose a novel approach based on a graph extracted from DBpedia. They introduce the notion of proximity, including a connectedness component, to replace the traditional measure of distance between two nodes in a graph: “rather than focusing solely on minimum path length, proximity also balances the number of existing paths between nodes”. To achieve their goal, the authors rely on Apache Jena,46 a free and open-source Java framework for building Semantic Web and Linked Data applications. Unfortunately, their approach is self-defeated by the incompleteness of DBpedia for the specific domain they planned to cover, Portuguese alternative music.47 In a more ambitious study, Mikolov et al. (2013) from Google Research show that high-quality vector representations can be derived using simple model architectures on a huge corpus (1.6 billion words) at a low computational cost. Their approach outperforms more complex ones such as neural networks and significantly improved the state-of-the-art performance for measuring semantic similarity.
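The graph-based intuition underlying these relatedness measures — the fewer hops between two concepts in the link structure, the more related they are — can be sketched with a plain breadth-first search over a toy concept graph (edges invented for illustration; the metric of Wubben and van den Bosch is bidirectional and runs on far larger networks):

```python
from collections import deque

# Toy undirected concept graph; edges are invented for illustration.
graph = {
    "Ypres": {"Belgium", "World_War_I"},
    "Belgium": {"Ypres", "Europe"},
    "World_War_I": {"Ypres", "Europe"},
    "Europe": {"Belgium", "World_War_I"},
}

def shortest_path_length(graph, start, goal):
    """Breadth-first search for the shortest path between two concepts."""
    if start == goal:
        return 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return None  # no path: the concepts are unrelated in this graph

print(shortest_path_length(graph, "Ypres", "Europe"))  # → 2
```

The proximity measure of Leal et al. (2012) refines exactly this quantity by also counting how many such paths exist, so that two densely connected nodes score as closer than two nodes joined by a single chain of the same length.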
Although these different experiments make it possible to improve on traditional relation extraction by leveraging the links already established between semantic resources in knowledge bases, semantic relatedness measures can also suffer from the shortcomings of Linked Data already brought to light, as shown by the failed project of Leal et al. (2012), for instance. Before introducing our case study in the context of the digital humanities and empirical sciences in general, let us take stock of what has been achieved in this second chapter.

46 http://jena.apache.org/
47 See Chapter IV for a detailed analysis of the common quality problems related to DBpedia.

Summary

Chapter II introduced the idea of a Web that can be consumed by machines, a Web that makes its semantics explicit (or at least less obscure than in the Web of documents). From the original vision of Berners-Lee (2000), the project evolved into an aggregation of individual collections of structured facts, known as Linked Open Data or as the Web of Data. We have seen, however, that many innovations of the Semantic Web are in fact regressions compared to well-established database practices: the use of URIs as unique identifiers in knowledge bases raises unforeseen issues of coherence, while the lack of integrity constraints in ontologies makes them more error-prone than traditional relational databases. Despite these known limitations, KBs offer the potential to improve on information extraction techniques by allowing the disambiguation of concepts through the mechanism of entity linking. While named-entity recognition stopped at the classification of entities into broad semantic categories, entity linking goes one step further and attempts to establish a one-to-one correspondence between a mention in text and a semantic resource identified by a URI. This operation is far from trivial, however, and can fail in numerous ways, from lack of context to data sparsity, as will be amply illustrated in Chapter IV. It remains that when entity linking succeeds, a whole world of possibilities is opened in terms of semantic enrichment and knowledge discovery: new properties can be learned about the entities extracted, and interesting relationships to other entities can be uncovered in an automated way, as we will see in Chapter V. While the quality of this material can remain a blocking factor for some critical applications, empirical domains in general – and the humanities in particular – can benefit from this gold mine to enrich existing content at a reduced cost, as we will now explore in the next chapter.

Chapter III

The Humanities and Empirical Content

Outline

In this chapter, we show how humanities players can take advantage of Linked Open Data in order to gain more value out of existing content. In particular, cultural heritage institutions such as libraries, archives, and museums can benefit from extraction tools to increase the visibility of their collections. Without targeting a specific sub-domain, we make the case for the exploitation of semantic technologies in any empirical context, showing that a high level of generalisation is not incompatible with useful operational results. Section 1 starts by establishing the distinction between deterministic and empirical information: while the former can always rely on a stable model to check its validity at any time, the latter is subject to human interpretation and therefore needs to be considered from a different perspective, allowing for a subjective arbitration between conflicting needs. As shown in Section 2, the humanities epitomise empirical science, and as such constitute a playground for the computational techniques highlighted in the first two chapters of this dissertation. Under the common denomination of the digital humanities, we group all kinds of practices designed to exploit documents in an automated manner, providing complementary insights to the traditional intellectual exploitation by scholars. Since these practices do not enjoy unanimous support from humanities scholars, we address critiques that have been formulated against them to get things into perspective. In order to demonstrate the general applicability of information extraction and semantic enrichment techniques for cultural heritage, we then introduce in Section 3 a case study based on the Historische Kranten project at the Ypres City Archive, an effort to publish online a collection of over one million Belgian periodical articles written in three languages over the course of a century.


Contents

1 Empirical Information...... 91
1.1 Deterministic and empirical data...... 92
1.2 Crossover application domains...... 93
1.3 Specificities of the humanities...... 95
2 Digital Humanities...... 96
2.1 Context...... 96
2.2 Close and distant reading...... 100
2.3 Critiques...... 103
3 Historische Kranten...... 107
3.1 Structure...... 109
3.2 Linguistic distribution...... 112
3.3 People and needs...... 118

1 Empirical Information

“One reason why mathematics enjoys special esteem, above all other sciences, is that its laws are absolutely certain and indisputable, while those of other sciences are to some extent debatable and in constant danger of being overthrown by newly discovered facts.” Albert Einstein (1922)

The border between human sciences and exact sciences is not airtight. As Boydens (1999) reminds us, the same object can be approached simultaneously from several disciplines depending on the scientific method adopted: “la différence entre sciences exactes de la nature et sciences inexactes de l’homme et du vivant n’est pas une différence de fond mais une différence de choix” [“the difference between the exact sciences of nature and the inexact sciences of man and of the living is not a difference of substance but a difference of choice”] (Moles, 1995, p. 39). Similarly, Rickert (1986, p. 54) tells us that “empirical reality becomes nature when we conceive it with reference to the general. It becomes history when we conceive it with reference to the distinctive and the individual.” Ladrière (1984) divides science into three categories:

1. formal sciences (mathematics, logic, etc.)

2. social sciences and the humanities (law, history, literature, etc.)

3. natural sciences (chemistry, physics, medicine, etc.)

The particular status of formal sciences, i.e. the perfect isomorphism between the model and reality, allows them to draw a bijective function (one-to-one correspondence) between the object of study and its representation. New discoveries occur, but they never contradict the pre-existing model. However, as Boydens (1999, p. 141) notes, a database (or knowledge base) containing exclusively mathematical or logical formulas has little meaning whatsoever, which rules out formal sciences as an application domain. This leaves us with categories 2 and 3 – which we group under the common denomination of “empirical sciences” – for which there is always a necessary gap between the model and the observations. In other words, theory in empirical sciences always constitutes a “reconstruction conjecturale de la réalité” [“a conjectural reconstruction of reality”] (Ladrière, 1984, p. 39). The present discussion about the nature of empirical sciences will allow us to anticipate specific issues related to our case studies, but also to detect common traits with related domains in order to generalise our approach. Section 1.1 investigates the difference between deterministic and empirical information. Section 1.2 will then envision application domains at the crossover of empirical and formal sciences, while Section 1.3 focuses on the specificities of the humanities.

1.1 Deterministic and empirical data

Formal knowledge, such as the laws of arithmetic, exhibits a relative stability that is not constantly questioned. The validity of the equation 1+1=2, for instance, is considered universal and not subject to change any time soon. In contrast, empirical data, whether in structured form (databases, XML files) or unstructured form (text), is always open to interpretation because of the absence of a reference: “il n’existe aucun référentiel absolu en vue de valider la correction de l’information représentée dans une base de données relative à un domaine d’application empirique et humain” [“there exists no absolute referential against which to validate the correctness of the information represented in a database relating to an empirical and human application domain”] (Boydens, 1999, p. 143). We have seen in Chapter II that knowledge units (facts) consist of a subject, a predicate and an object formalised as an (s,p,o) RDF triple. Similarly, a data element can be considered a triplet (i,d,v) consisting of a unique identifier i, a domain of definition d and a value v. The differences between deterministic and empirical data, directly reminiscent of the distinction between formal sciences on the one hand and social and natural sciences on the other hand, are emphasised in the work of Isabelle Boydens (2011, p. 118, italics hers):

“It is important to distinguish deterministic data from empirical data. The first are characterized by the fact that there is, at any moment, a theory which makes it possible to decide whether a value (v) is correct. [...] But for empirical data, which are subject to human experience, theory changes over time along with the interpretation of the values that it has made possible to determine.”

Empirical concepts do not follow any hard-coded rules. On the contrary, they are “construits par une série de généralisations successives et définis par l’énumération d’un certain nombre de traits pertinents, qui relèvent de la généralité empirique et non de la nécessité logique” [“constructed through a series of successive generalisations and defined by the enumeration of a certain number of relevant features, which pertain to empirical generality and not to logical necessity”] (Prost, 1996, p. 129). Crucially, the absence of a referential makes empirical information impossible to assess independently of the reality it represents (Boydens, 1999, p. 469):

“La question de l’exactitude de l’information empirique est en elle-même dépourvue de sens. [...] Afin de valider l’information répertoriée dans une base de données, il faudrait idéalement connaître a priori une réalité qu’elle seule nous permet de connaître.” [“The question of the exactness of empirical information is in itself devoid of meaning. [...] In order to validate the information recorded in a database, one would ideally need to know a priori a reality that the database alone allows us to know.”]

The same holds true for most of the facts contained in knowledge bases in the form of RDF triples: their empirical nature makes them immune to simple validation procedures without access to the underlying reality. This is not to say that empirical data escape control completely, but their quality is always relative to usage, as will be detailed in Chapter IV.
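The contrast can be made concrete with a small illustrative sketch. This is a toy, not an implementation from the literature: the (i,d,v) notation follows the definition above, while the domain names and the validator are hypothetical. A deterministic domain comes with a stable theory that can decide, at any moment, whether a value is correct; an empirical element admits no such function.

```python
from typing import Callable, Optional

# A data element as a triplet (i, d, v): identifier, domain of definition, value.
Element = tuple[str, str, object]

# Deterministic domains come with a stable theory: a total validation function
# that can decide, at any moment, whether a value is correct.
DETERMINISTIC_THEORIES: dict[str, Callable[[object], bool]] = {
    "arithmetic/sum-of-1-and-1": lambda v: v == 1 + 1,   # hypothetical domain name
}

def validate(element: Element) -> Optional[bool]:
    """Return True/False for deterministic data, None for empirical data."""
    identifier, domain, value = element
    theory = DETERMINISTIC_THEORIES.get(domain)
    if theory is not None:
        return theory(value)   # deterministic: decidable against the model
    return None                # empirical: no absolute referential exists

assert validate(("e1", "arithmetic/sum-of-1-and-1", 2)) is True
assert validate(("e2", "history/cause-of-WWI", "alliances")) is None
```

The `None` returned for the empirical element is the point: the validator does not fail, it simply has no theory against which to decide, mirroring the absence of an absolute referential discussed above.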

Drucker (2012, p. 90) also notices the palpable tension between deterministic and empirical realities: “probability is not the same as ambiguity or multivalent possibility within the field of humanistic inquiry. The task of calculating norms, medians, means, and averages will never be the same as the task of engaging with anomalies and taking their details as the basis of an argument”. Some domains, however, lie in-between and combine both types of tasks.

1.2 Crossover application domains

It is commonly admitted that data from the social sciences are empirical and therefore subject to human interpretation. This property makes them particularly interesting to study with new techniques of analysis and visualisation, as we will do in Chapter V. The reverse does not hold true, however: empiricism is not intrinsically related to the social sciences but is also inherent to disciplines from the natural sciences, ranging from medicine to physics and aeronautics. These “empirico-formal” sciences rely on formal models but deal on a daily basis with empirical data that they need to confront with the model. In contrast to the laws of arithmetic, which are immutable, models from empirico-formal sciences evolve over time with the “boomerang of reality”. In particle physics, for instance, the Standard Model, on which the whole theory of nuclear interactions relies, could have been abandoned if the existence of the Higgs boson had been disproved by experiments at the Large Hadron Collider (Guasch and Sola, 1998).1 Although they rely on complex formalisms, the natural sciences do not escape human interpretation, as shown by plenty of examples where observations of real-world phenomena led to theory change. In the case of stratospheric databases, for instance, measurements of low ozone levels recorded by NASA were initially discarded as anomalies because the corresponding theory at the time could not account for them: the integrity constraints of the database ensured they were rejected. Only after ozone depletion was discovered a decade later were the data reassessed and reinterpreted in light of the new theory (Wiener, 1994, p. 37), as will be further discussed in Chapter IV. Life sciences, and particularly biomedicine, have shown great interest in information extraction techniques (Ananiadou and McNaught, 2006), as we already mentioned in Chapter I.
The developments presented in Chapter II to disambiguate concepts and enrich content with Linked Data would definitely benefit the field, since most biomedical concepts (such as diseases and genes) are indeed empirical and subject to changes in their interpretation.

1As it happened, the existence of the boson was confirmed and the Standard Model reinforced by the discovery, earning François Englert and Peter Higgs a joint Nobel prize.

As Zheng et al. (2014) remark, however, “simply applying a news-trained entity linker [on biomedical information] produces inadequate results”. This inadequacy can have far more damaging consequences in biomedicine, where public health is at risk, than when handling empirical content from the humanities. In order to extract knowledge efficiently while ensuring that its quality measures up to the needs of users, other sources can be investigated, but the methodology remains the same:

“Wikipedia is a popular knowledge base that is often used for entity linking because it contains structured information such as titles, hyperlinks, infoboxes as well as unstructured texts. However, in order to take advantage of richer structures and domain knowledge which are not offered by Wikipedia, we constructed a knowledge base from 300 biology-related ontologies from BioPortal.2 Based on the rich structure contained in these ontologies, we created a web of data (WOD).”

In this way, Linked Data can improve the experience of users from a broad range of domains, although some remain understandably less keen to engage with crowdsourced, uncurated resources when financial losses are at risk or human lives at stake. The incorporation of domain knowledge might seem to contradict the ideal of generalisation formulated in our third research question, but switching knowledge bases does not amount to specialisation, since the underlying technology (OWL and RDF) remains the same, and so does the linking principle. In fact, the mechanism of identity implemented by the owl:sameAs property will even allow expert systems and popular knowledge bases to coexist and to complement one another.3 In Chapter V we will see examples of tools that allow using various ontologies selectively without any need to redesign the whole application for each application domain. X-Link, for instance, offers a fully configurable model which can be used in a wide range of contexts, and is actually implemented in two very different projects related to marine activity4 and patent search5 (Fafalios et al., 2014).
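A minimal sketch of this identity mechanism may help. The triples below are illustrative (the BioPortal-style URI is hypothetical, and a real system would use an RDF store rather than a Python set): once an owl:sameAs link is asserted between a popular and a domain knowledge base, the properties recorded in either can be aggregated for the same entity.

```python
# owl:sameAs is a real OWL property; the second URI below is hypothetical.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

triples = {
    # popular knowledge base (DBpedia-style description)
    ("http://dbpedia.org/resource/Aspirin", "category", "Drug"),
    # domain knowledge base (hypothetical BioPortal-style description)
    ("http://purl.bioontology.org/ontology/X/aspirin", "inhibits", "COX-2"),
    # identity link between the two descriptions
    ("http://dbpedia.org/resource/Aspirin", SAME_AS,
     "http://purl.bioontology.org/ontology/X/aspirin"),
}

def properties_of(uri):
    """Collect properties of uri and of everything declared owl:sameAs it."""
    aliases = {uri}
    for s, p, o in triples:                 # one-step sameAs closure
        if p == SAME_AS and (s in aliases or o in aliases):
            aliases |= {s, o}
    return {(p, o) for s, p, o in triples
            if s in aliases and p != SAME_AS}

props = properties_of("http://dbpedia.org/resource/Aspirin")
# props now merges properties asserted in both knowledge bases
```

A single pass over the identity links suffices for this toy; chains of sameAs assertions would require computing a fixpoint, and, as noted below, provenance tracking becomes essential once facts from several sources are merged this way.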

2http://bioportal.bioontology.org/
3Although this raises the important question of provenance, which is often underestimated in the context of Linked Data (Hartig and Zhao, 2010).
4http://www.i-marine.eu/
5http://www.perfedpat.eu/

1.3 Specificities of the humanities

The subjective dimension of the humanities is generally accepted as a matter of course, with historical views on a past event regularly changing and interpretations of literary works varying at the whim of new theories. However, the formal modelling of empirical knowledge often indulges in postulating a direct bijection between the real object and its formalised representation. As Boydens (1999, pp. 142–143) explains, the fallacy of the isomorphism between the real and the represented can be brought into focus by the very nature of the social sciences, for which reflexivity is a key component:

“L’hypothétique rapport biunivoque entre le réel étudié et sa représentation [en sciences humaines] est d’autant plus illusoire qu’à la différence des sciences dites empirico-formelles, le sujet observant est de même nature humaine que l’objet observé auquel il est immanent.” [“The hypothetical one-to-one correspondence between the reality studied and its representation [in the human sciences] is all the more illusory in that, unlike in the so-called empirico-formal sciences, the observing subject is of the same human nature as the observed object in which it is immanent.”]

Similarly, Rickert (1986, p. 114) argues that “the logical distinctiveness of an empirical science is to be understood in terms of the relationship the content of its concepts bears to empirical reality in its unique and distinctive form”. The positivist tradition in history long attempted to pretend that its methods were as objective as those of the exact sciences: “j’étais devant mon sujet comme devant la métamorphose d’un insecte” [“I stood before my subject as before the metamorphosis of an insect”] wrote Hippolyte Taine (1875, p. 5). But this myth was deconstructed by later historians conscious of their own necessary subjectivity. Ginzburg (1989, p. 179) remarks that this faces the humanities with an unpleasant dilemma: “ou assumer un statut scientifique faible pour arriver à des résultats marquants, ou assumer un statut scientifique fort pour arriver à des résultats négligeables” [“either assume a weak scientific status to reach significant results, or assume a strong scientific status to reach negligible results”]. The humanities have regularly oscillated between those two extremes. In the next section, we try to reconcile both approaches by embracing, to a certain extent, what the digital humanities have to offer, while keeping a critical eye on the unnecessary hype that often accompanies technological advances presented as new although they bear a striking likeness to well-worn considerations. While focusing on the humanities, we will keep in mind that the border between the social and natural sciences is porous and that this compartmentalisation of empirical sciences is somewhat artificial: “Nous avons pris l’habitude de diviser conceptuellement l’univers selon les lignes de partage des différents domaines universitaires de spécialisation” [“We have grown used to dividing the universe conceptually along the dividing lines of the various academic domains of specialisation”] (Elias, 1996, p. 97). A concrete consequence of this observation is that the results obtained can easily be generalised to other empirical application domains.

2 Digital Humanities

This section introduces the field of the digital humanities (DH), an interdisciplinary area of research at the crossroads of computer science and the social sciences. DH strives to put modern technology at the service of humanities students and scholars, thereby opening up new possibilities for the analysis of empirical objects. The purpose of the section is to show how expert tools and methods can be popularised in order to benefit this larger audience. More specifically, it will consider the application of computing techniques to corpora from traditional humanities subfields, paving the way for the exploitation of a historical archive to be introduced in Section 3. We first show how DH have gained momentum with the advent of the Web of Data, redefining disciplines by making huge corpora available online (Section 2.1), before taking a closer look at the notion of “distant reading” coined by literary scholar Franco Moretti in opposition to the traditional practices of close reading (Section 2.2). The origins of the latter, in the context of the New Criticism literary movement, are investigated before addressing the more recent concept of distant reading, along with the debate on the end of literary theory. Both are then reconciled to overcome the artificial division between these two complementary approaches, opening the way for an integrated exploitation of our archive. The polemic claims of some DH evangelists have nonetheless been met with scepticism, and a number of counterarguments have been formulated in order to refute the most excessive stances. The heated debate around DH illustrates the fact that the enthusiasm for computational methods must be handled with care (Section 2.3), although distant reading and related techniques can nevertheless prove a useful complement to human analysis when comprehensiveness is rendered impractical by the volume of data to process.

2.1 Context

Faced with increased budget cuts, libraries, archives, and museums are forced to become more pragmatic with their metadata creation and management. Funding bodies and grant providers expect short-term results and push cultural heritage institutions to gain more value out of their own existing metadata by linking them to external data sources (van Hooland et al., 2013). It is precisely in this context that Linked Open Data (LOD) have gained momentum in the cultural heritage sector (van Hooland and Verborgh, 2014). For small institutions, the perspective of reusing existing knowledge to bridge their collections to the Web has important implications in terms of visibility.

2.1.1 From humanities computing to digital humanities

Although some literary scholars and historians have been using computing techniques for a long time, the field of humanities computing – as imagined by pioneer Roberto Busa (1980) – remained relatively niche until the nineteen-nineties, when the Internet became available to the general public. In their comprehensive introduction to the field, Schreibman et al. (2008) chose to use the term digital humanities rather than humanities computing in order to put the stress on the revolution that has taken place over the last decades: instead of a fringe branch of computing dedicated to humanists, DH mark the entry of humanities scholarship into the digital era. The distinction may appear insignificant, but it symbolises the re-appropriation of a whole research area by domain experts, in contrast to a technocratic approach to the humanities. Rather than a brand new domain, DH can therefore be conceived as a redefinition of the traditional field of the humanities with a view to encompassing new methods made available by recent advances in computer science, thereby empowering it and enabling it to deal with new challenges.

2.1.2 The era of digitisation

For libraries, archives and museums (LAM), the advent of mass digitisation has proven both a tremendous opportunity and a major challenge (Coyle, 2006; Hahn, 2008). While it admittedly enabled these institutions to achieve unprecedented visibility on the Web through the publication of their collections, it also raised the issue of data quality and accessibility for users, since traditional search tools are not necessarily optimised for the retrieval of large chunks of unstructured text (Tanner et al., 2009). With limited funding, LAM are often unable to invest in developing and maintaining costly classification schemes such as thesauri, and are put under growing pressure to gain more value from their existing data (van Hooland et al., 2015). In this context, DH constitute a real opportunity to exploit rich material that has accumulated over the years but proves too time-consuming to process manually. The quality of the optical character recognition (OCR) involved in digitisation projects can also vary considerably, with small LAM unable to afford state-of-the-art OCR for large collections or to control the quality of the output produced by unscrupulous third parties, who often outsource the job to subcontractors.

Google Books6 is arguably the best-known, largest-scale digitisation project, but several institutions have launched their own initiatives, such as the Dutch Royal Library, which digitised over 8 million newspaper pages spanning three centuries and made them available online. In Belgium, unfortunately, open access to digitised material is less widespread and plagued by copyright issues, as lamented by Thomas Crombez (2015, p. 197):

“Less recommendable is the newspaper digitization initiative of the Belgian Koninklijke Bibliotheek “Albert I” (Royal Library). Although the institution digitized c. 3.2 million pages from seventy periodicals of the nineteenth and twentieth century, it is only possible to consult this invaluable resource through “five special PCs in the reading room.””

Fortunately, smaller institutions sometimes have a more open-minded approach to the dissemination of their material, which allowed us to work on a concrete case study – introduced in Section 3 – without having to sacrifice either the local anchorage or the multilingual dimension.

2.1.3 Information extraction for cultural heritage

Named-entity recognition (NER) and more advanced IE techniques such as entity linking have gained attention from DH enthusiasts, since they allow small institutions to enrich their collections with semantic information at a relatively low cost. According to Blanke and Kristel (2013), “semantically enriched library and archive federations have recently become an important part of research in digital libraries and archives”. The growth of the Linked Open Data cloud and the availability of free online tools have facilitated access to IE for librarians, archivists and collection managers who are not IT experts but are eager to experiment with new technologies. The LOD Around The Clock (LATC) project of the European Commission, for instance, was started to “help institutions and individuals in publishing and consuming quality Linked Data on the Web”.7 Its main declared goal is to “continuously monitor and improve the quality of data links within the Linking Open Data cloud” (see Chapter IV for a detailed account of the impact of poor quality on LOD). The mere existence of such large-scale incentives demonstrates the potential of LOD for the semantic enrichment of collections maintained in libraries, archives and museums.

6https://books.google.com/
7http://cordis.europa.eu/project/rcn/95552_en.html

A number of cultural institutions have therefore experimented with NER over the last decade. The Powerhouse Museum in Sydney has implemented OpenCalais within its collection management database, although no evaluation of the extracted entities has been performed. Lin et al. (2010) also explore NER in order to create a faceted browsing interface for users of large museum collections, while Segers et al. (2011) offer an interesting evaluation of the extraction of people, locations and events from unstructured text in the collection management database of the Rijksmuseum in Amsterdam. Maturana et al. (2013) showed how LOD could be successfully integrated in a museum platform to enhance the experience of end users. Their innovative semantic platform MisMuseos, a meta-museum aggregating 17 000 works from seven Spanish museums, offers users a facet-based search module, semantic content creation and graph navigation, among other functionalities. In the specific domain of archives, Rodriquez et al. (2012) compared the results of several NER services on a corpus of mid-20th-century typewritten documents. A set of test data, consisting of raw and corrected OCR output, was manually annotated with people, locations, and organisations. This approach allows an evaluation of the different NER services against the manually annotated data, as explained in Chapter I. The BBC also set up a system to connect its vast archive with current material through Semantic Web technologies (Raimond et al., 2013). Bingel and Haider (2014) compared the performance of various entity classifiers on the DeReKo corpus of contemporary German (Kupietz et al., 2010), which they say exhibits a “strong dispersion [with regard to] genre, register and time”. However, the authors later concede that newspaper texts are largely prevailing and that “relatively few texts reach back to the mid-20th century”, which casts doubt over the actual temporal dispersion of this corpus.
Moreover, although the study of NER in German is particularly challenging due to its use of capital letters for all common nouns, their evaluation remains monolingual and does not offer any insight as to how the classifiers would perform on a linguistically diverse corpus. Agirre et al. (2012) and Fernando and Stevenson (2012) considered how to adapt entity linking to cultural heritage content, but both focused exclusively on English data and did not take advantage of the multilingual structure of the Semantic Web. Frontini et al. (2015) exploited the French DBpedia and combined it with the BnF Linked Data8 in order to extract mentions of lesser-known authors, but their graph-based approach9 also remained monolingual.

8http://data.bnf.fr/semanticweb
9See Chapter II (Section 2.1) for a short presentation of graph databases.

Finally, the Aggregation and Indexing Plan for Europeana periodicals, which produced metadata for 18 million pages of news and full text from OCR for around 10 million pages, also includes a NER component performed by the National Library of the Netherlands.10 A new website11 was launched in December 2014, allowing users to cross-search and reuse over 25 million digital items and over 165 million bibliographic records. However, this European Library does not use LOD resources to enrich documents, relying instead on its own ontology developed specifically for the project, a methodology that few institutions can afford to follow. For registered users, a personal key is also provided to interact with the Application Programming Interface (API). Although some basic documentation12 is provided, it does not reflect the actual state of the API. For instance, the PDF mentioned above says that a query on “Romanov” with a JSON output can be performed with http://data.theeuropeanlibrary.org/opensearch/catalogue?q=Romanov&format=&apikey=yourkey whereas the correct URL at the time of writing is http://data.theeuropeanlibrary.org/opensearch/json?query=Romanov&apikey=yourkey. This state of affairs is sadly not uncommon with APIs, which rely too heavily on underlying technology and are therefore exposed to architecture changes. For a critique of the API craze, especially by Europeana and the Digital Public Library of America, see Verborgh et al. (2015).

2.2 Close and distant reading

This section offers a brief historical account of the appearance of the concept of close reading, and of its converse, i.e. distant reading. The two approaches are then reconciled to transcend the false opposition between them and move toward a more enlightened understanding of the humanities.

2.2.1 Close reading and New Criticism

Close reading consists in carefully interpreting individual texts by paying close attention to words and syntax, postulating literary self-sufficiency. Although the practice of close reading can be traced back to the Victorian era and beyond, the formal concept is generally attributed to I. A. Richards, who developed it in the context of the New Criticism literary movement (Richards, 1929).

10http://blog.kbresearch.nl/2014/03/03/ner-newspapers/ reported on a preliminary experiment on Dutch, French and German, accounting for about half the corpus.
11http://www.theeuropeanlibrary.org/
12http://www.theeuropeanlibrary.org/confluence/download/attachments/8880494/TheEuropeanLibrary_API_V2+0. (accessed on January 22, 2015).

The New Critics opposed the alternative current of Historical Criticism, which took biographical and historical elements into account in order to interpret a text. In his collection of essays in praise of close reading, Cleanth Brooks (1947) – another major figure of the New Criticism – famously described a literary work as a well wrought urn: i.e. an autonomous artefact able to speak for itself without any recourse to biographic or exegetic material.

2.2.2 Distant reading or the end of theory

The notion of distant reading, as defined by literary scholar Moretti (2005), has been gaining considerable attention. Instead of traditional methods of close reading, consisting of manually reading and interpreting a very limited corpus, cultural heritage institutions are increasingly experimenting with NLP to allow distant reading practices by end users, using analysis and visualisation techniques such as graphs, maps, and trees. When confronted with huge volumes of documents, Moretti argues, one should distance oneself from the text and stop considering it as the only object worthy of attention, as in the tradition of close reading. To emphasise his point, he coined the opposite approach of distant reading, which consists of making sense of texts without actually reading them, by making use of a variety of computational techniques (Moretti, 2005). Initiatives such as the one led by librarian Eric Lease Morgan and other digital humanists at the Center for Research Computing at the University of Notre-Dame13 allow readers to explore books by either close or distant reading (see Figure III.1). Selecting the “Do distant reading” option enables browsing frequent entities such as Names and Organisations, as seen on Figure III.2.

Figure III.1: Close and distant reading

13http://dh.crc.nd.edu/

Figure III.2: Browsing named entities
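As a toy illustration of what such a distant-reading facet might compute, the sketch below surfaces the most frequent candidate named entities without anyone reading the texts. It uses a naive capitalisation heuristic rather than a real NER system, and the sample sentences are invented:

```python
import re
from collections import Counter

def frequent_entities(documents, top=5):
    """Count capitalised words that do not open a sentence (naive NER proxy)."""
    counts = Counter()
    for text in documents:
        # crude sentence split, so sentence-initial capitals can be skipped
        for sentence in re.split(r"[.!?]\s+", text):
            words = sentence.split()
            for word in words[1:]:          # skip the sentence-initial word
                if word[:1].isupper() and word[1:].islower():
                    counts[word.strip(".,;:")] += 1
    return counts.most_common(top)

# Invented sample documents standing in for OCRed newspaper articles.
docs = [
    "The archive in Ypres holds newspapers. Ypres was rebuilt after 1918.",
    "Many issues mention Ypres and Brussels.",
]
print(frequent_entities(docs))  # → [('Ypres', 2), ('Brussels', 1)]
```

Ranked entity lists of this kind are precisely what the “Do distant reading” facets expose, letting a reader decide which texts deserve close reading.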

Anderson (2008) goes further by stating that data will supplant science: “We can stop looking for models. We can analyse the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot”. This extreme view has contributed to making distant reading unpopular among traditional literary scholars, giving the impression that reading books was no longer desirable nor necessary.

2.2.3 Reconciling the two approaches

The tension between close and distant reading, exacerbated by some DH scholars, is not as dichotomous as they claim: as shown in Figure III.1, the two approaches can complement each other fruitfully. Nevertheless, it raises the question of the right distance from which to look at a historical object, already raised by Ginzburg (2002) in the context of micro-history. In the domain of archives, distant reading can take on another dimension and become a prerequisite of close reading: getting a mental overview of the composition of archive collections – for instance in terms of the most common named entities or geographical dispersion – is a useful preparation before actually visiting the institution physically. This is especially true when high-quality digital copies of the archives are not available online.14

14A good example of this is the Perelman archive at the Université libre de Bruxelles. Its website (http://perelman.ulb.be/) makes clear that “la base de données ‘Archives Perelman’ n’a pas vocation à remplacer entièrement la consultation des archives ‘papier’. Elle se veut un outil de recherche qui puisse en rendre compte de manière significative, et permettre un accès direct vers une sélection de pièces” [“the ‘Archives Perelman’ database is not meant to entirely replace the consultation of the ‘paper’ archives. It is intended as a research tool that can account for them in a meaningful way and provide direct access to a selection of items”]. See Chapter V for a generalisation of our method to this corpus.

While the idea of distant reading is an original concept that can bring a new depth to existing collections, other developments associated directly or indirectly with the digital humanities have not been met with the same enthusiasm, meeting instead with indifference at best and bare contempt at worst. In the next section, we review some of the objections raised against DH.

2.3 Critiques

Predictably, polarised arguments about the superiority of distant reading have faced sharp criticism, both from within and from outside the humanities. We will look at an example of each, starting with the dangers of over-zealousness when applying computational techniques, and moving on to the more general issue of the hype surrounding the adoption of any new technology. To balance these views, we will then adopt a more moderate stance towards the usefulness of the digital humanities and take stock of what can be gained from them.

2.3.1 Over-interpretation

While the digital humanities clearly offer new possibilities to exploit larger corpora, there is always a significant risk linked to the automatic, unsupervised analysis of literary works. In an insightful critique of the excesses of DH, literary theorist Stanley Fish issues a warning about the semi-automatic detection of patterns and their over-interpretation in the context of literary analysis.15 To illustrate his view, Fish takes an example from Areopagitica, John Milton’s pamphlet on freedom of speech and expression. Milton writes about Presbyterians resenting the censorship of bishops but becoming censors themselves at the same time. As a result, Fish argues, “Bishops and Presbyters are the same to us both name and thing”. More than a likeness in their actions, their names actually look alike. Indeed, the words Bishops and Presbyters contain the consonants “b” and “p”, forming a chiasmus. Both are labial plosives in phonological terms, which reinforces the similarity. What is more, the passage contains a myriad of related words containing the same consonants: prelaty, pastor, parish, Archbishop, books, pluralists, bachelor, parishioner, private, protestations, chop, Episcopacy, palace, metropolitan, penance, pusillanimous, breast, politic, presses, open, birthright, privilege, Parliament, abrogated, bud, liberty, printing, Prelatical and people.

15http://opinionator.blogs.nytimes.com/2012/01/23/mind-your-ps-and-bs-the-digital-humanities-and-interpretation/

Such patterns could easily be detected by specific tools and offer the ground for a new understanding of Milton’s work. But the point, as Fish concludes after a long argument, is that there are only twenty-six letters in our alphabet and therefore that the co-occurrence of those “b’s” and “p’s” is purely coincidental. In other words, DH provides ways to find facts that were previously undetectable, but it does not necessarily follow that these new discoveries are relevant or worthy of attention.

2.3.2 The Hype cycle

“The computer industry is the only industry that is more fashion-driven than women’s fashion.” Larry Ellison, CEO of Oracle16

The Hype cycle is a term coined by IT consulting firm Gartner17 to represent the degree of maturity of technologies (Fenn and Raskino, 2008). As shown in Figure III.3, a “new” technology almost always raises exaggerated expectations when first launched, provoking subsequent disillusionment among users when they realise that the solution they have been sold does not live up to their initial, unrealistic hopes.

Figure III.3: The Hype cycle, reproduced from Jeremy Kemp (CC BY-SA)

16Quoted by Jim Finkle on http://blogs.reuters.com/mediafile/2008/09/25/what-on-earth-is-cloud-computing/
17http://www.gartner.com/

The digital humanities do not escape this pattern, with enthusiasts claiming they will revolutionise the humanities and sweep out former methods: “l’historien de demain sera programmeur ou il ne sera plus” (“the historian of tomorrow will be a programmer or will cease to be”), Emmanuel Le Roy Ladurie (1973, p. 14) dramatically proclaimed.18 This peak of inflated expectations was inevitably followed by a trough of disillusionment, well illustrated by the vitriolic remarks of Stone (1979): “whenever possible, sampling by hand is preferable and quicker than, and just as reliable as, running the whole universe through a machine”. In this light, Anderson’s suggestion to “throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns” sounds strangely hollow. Stone insists and drives the point home:

“It is just those projects that have been the most lavishly funded, the most ambitious in the assembly of vast quantities of data by armies of paid researchers, the most scientifically processed by the very latest in computer technology, the most mathematically sophisticated in presentation, which have so far turned out to be the most disappointing.”

This echoes the “unpleasant dilemma” faced by the humanities according to Ginzburg (1989), evoked in Section 1.3. And Stone again: “the sophistication of the methodology has tended to exceed the reliability of the data, while the usefulness of the results seems – up to a point – to be in inverse correlation to the mathematical complexity of the methodology and the grandiose scale of data-collection”. Unfortunately, the plateau of productivity sometimes seems to have been traded for fresh excessive expectations by a new generation of scholars. Havelange (1993) regrets this state of affairs in these terms:

“C’est là sans doute l’un des principaux écueils d’une histoire quantitative qui, née paradoxalement du rejet explicite de la tradition positiviste, n’a pas toujours su éviter le piège d’un nouveau positivisme, d’un nouveau formalisme, d’un nouveau scientisme, aussi désséché que le premier.” (“This is no doubt one of the main pitfalls of a quantitative history which, born paradoxically from the explicit rejection of the positivist tradition, has not always managed to avoid the trap of a new positivism, a new formalism, a new scientism, as withered as the first.”)

If History repeats itself, how can one guard against the negative effects of computerisation for its own sake while still gaining benefits from state-of-the-art data processing and visualisation techniques? The next section will attempt to answer this important question.

18Ironically, one of the leading websites promoting the digital humanities today is called The Programming Historian: http://programminghistorian.org/.

2.3.3 Picking the low-hanging fruit

Despite the shortcomings highlighted in the previous section, we argue that cultural institutions should try to make the best of what the DH have to offer. In the same mindset, Svensson (2010) provides an overview of current research in the field of the digital humanities at large, while distancing himself from the hype inevitably linked to the birth of a (seemingly) new discipline.
The Free Your Metadata project19 was launched to demonstrate that players from libraries, archives, and museums can effectively use computational methods and tools in order to enhance the management of their collections (van Hooland et al., 2013). Without lapsing into DH evangelism, this initiative encourages cultural institutions to take a pragmatic stance towards metadata management and to reuse material freely available online in order to energise and promote their collections at a relatively low cost.
While metadata management may seem a long way from our main objective and only distantly related to information extraction, the two tasks are not without similarities. Some authors indeed consider named entities and other resources derived from semantic enrichment to be metadata in the broader sense, i.e. content about the content (Stern, 2013, p. 11, italics hers):

“À la différence des métadonnées de documents entendues au sens usuel, telles que les informations de date, d’auteur ou de propriété associées au document mais distinctes du contenu informatif, ces métadonnées sont ancrées dans le contenu textuel et relient les éléments marqués à des ressources extérieures au document, par le mécanisme des URI [...].” (“Unlike document metadata in the usual sense, such as date, author, or ownership information associated with the document but distinct from its informative content, these metadata are anchored in the textual content and link the marked-up elements to resources external to the document, through the mechanism of URIs [...].”)

Obtaining this meta-information to enrich documents on the largest scale necessarily requires the use of computational techniques. This is not always a bad thing: if we carefully mind our Ps and Bs, the digital humanities can still prove an inspiring way to deal with a sizeable collection of documents, as will be demonstrated in the next section introducing our case study. All things considered, the idea of distant reading makes a lot of sense when facing the monumental task of extracting knowledge from millions of digitised articles containing empirical information in a multilingual context.

19http://freeyourmetadata.org/

Figure III.4: Historische Kranten homepage

3 Historische Kranten

The specificities of empirical information, and the usefulness of the digital humanities as a community of practices, better reveal themselves when considered from a practical perspective. This section presents an original case study based on a trilingual (Dutch/French/English) archival corpus provided by the city of Ypres, Belgium (Figure III.4).20 Like Scholz (2010), “notre corpus est construit sur le présupposé selon lequel le champ discursif ne s’arrête pas aux frontières des langues” (“our corpus is built on the presupposition that the discursive field does not stop at the borders of languages”): we consciously refrain from adopting a comparative methodology in favour of a more unifying approach.
This Historische Kranten corpus21 consists of 1 144 516 XML files from 41 Belgian periodicals published between 1818²² and 1972, totalling 3.2 GB of text. The composition of the corpus is detailed in Table III.1. From now on, we will refer to periodical titles by their 3-letter code. Section 3.1 presents the general structure and Section 3.2 covers the linguistic distribution of the corpus, while Section 3.3 discusses the people involved and their expectations.

20We gratefully acknowledge the invaluable assistance of the Stadsarchief Ieper and of the Erfgoedcel CO7 which granted us unconditional access to this corpus for research purposes.
21http://www.historischekranten.be/
22Strictly speaking, periodicals ranging from 1818 to 1829 cannot be called “Belgian” since the Independence of Belgium only occurred in 1830, but this nuance is ignored here.

periodical title (years covered)                Lang.   Code  # files
Journal d’Ypres (1866–1913)                     FR      JDY   173 451
Het Ypersche – La région d’Ypres (1920–1944)    NL/FR   HYP   155 335
Het Ypersch nieuws (1930–1971)                  NL      HYN   135 976
Le Progrès (1841–1912)                          FR      PRG   134 899
Le Propagateur (1819–1871)                      FR      PRP   102 102
Het Wekelijks Nieuws (1946–1953)                NL      HWN    96 780
De Halle (1925–1940)                            NL      DHA    71 262
Nieuwsblad van Yperen (1872–1912)               NL      NVY    54 816
De Toekomst (1862–1894)                         NL      DTO    42 597
Le Sud (1934–1940)                              FR      LSU    40 903
Het Weekblad van IJperen (1886–1906)            NL      HWY    37 882
Het Ypersche Volk (1910–1932)                   NL      HYV    22 967
De Weergalm (1904–1914)                         NL      DWE    19 287
L’Opinion (1863–1873)                           FR      LOP    13 976
De Strijd – La Lutte (1894–1899)                NL/FR   LUT    10 914
De Ypersche bode (1927–1928)                    NL      DYB     8 040
The Ypres Times (1921–1936)                     EN      TYT     5 167
De Kunstbode (1880–1883)                        NL      DKU     4 350
De Poperinghenaar (1940)                        NL      DPO     3 147
De Dorpsbode van Rousbrugge (1856–1862)         NL      DVR     3 104
Gazette van Yperen en Poperinghe (1957–1961)    NL      GVY     2 096
Tuinklokke (1928–1940)                          NL      TUI     1 811
Gazette van Ypre (1857–1858)                    NL      GYV     1 732
Liberté (1947)                                  FR      LIB       580
Het Poperinghenaartje (1915–1918)               NL      HPO       554
De Raadselbode (1901–1909)                      NL      RAA       199
De Grensgalm (1895–1904)                        NL      GRE        96
Le Courrier d’Ypres (1858–1911)                 FR      COU        93
Het Veld (1914)                                 NL      VEL        93
Le Messager d’Ypres (1890)                      FR      MES        49
L’Annonce d’Ypres (1854–1859)                   FR      ANN        41
De Handboog (1889)                              NL      HAN        40
Burgersbelang (1910)                            NL      BUR        28
L’Indicateur (1861)                             FR      IND        25
La Commune d’Ypres (1849)                       FR      COM        22
La Publicité (1840)                             FR      PUB        20
De Herbergier (1901)                            NL      HER        19
De Volksvriend (1859)                           NL      VOL        17
Den Klappenden Ekster (1850)                    NL      KLP        16
De Yperling (1853)                              NL      YPE        16
La Vérité (1857)                                FR      VER        14
Total                                                       1 144 516

Table III.1: Periodical titles by number of XML files

3.1 Structure

The XML files from the Historische Kranten corpus can be divided into two categories: page files containing no raw text but only general properties (metadata) on the one hand, and article files with actual news content (data) on the other hand. Since a page from a periodical typically contains several short articles, the second category is necessarily much more populated than the first.
Page files are organised by periodical code, year, month, date and 3-digit page number. For instance, the second page of the 3 October 1859 edition of “L’Annonce d’Ypres” will be found under ANN-18591003-002.xml. Most article files are built upon a page filename with the 3-digit article number appended. The fourth article of the page above would therefore be ANN-18591003-002004.xml.
However, not all files follow this convention, and no explanation is provided for the discrepancy. A large number of articles have an additional zero in their page number, making it ANN-18591003-0002004.xml instead of ANN-18591003-002004.xml for instance (which makes it more difficult to link them to the parent page ANN-18591003-002.xml). A few articles (only 24, from two periodicals) have a shorter format with no initial zeros but an extra zero in the article number: HYP-19381224-20001.xml for instance.
Moreover, a significant number of page files have a completely different format: DKU-01_0012-001-18800530-0109.xml for instance. These unexplained variations in format make it more complex to process the whole corpus programmatically, since several cases have to be taken into account. Table III.2 synthesises the various file formats, whereas Table III.3 shows the breakdown of the number of XML files by periodical into the number of pages and articles (sorted on the latter). As we immediately notice, there are gross disparities between the numbers of articles by periodical, ranging from 10 to well over 100 000.

# chars  # files   example                            type
20        63 531   ANN-18540826-001.xml               page
22            24   HYP-19381224-20001.xml             article
23       482 466   DKU-18800530-001001.xml            article
24       546 065   ANN-18540826-0001003.xml           article
33        52 430   DKU-01_0012-001-18800530-0109.xml  page

Table III.2: Different file formats used
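The naming conventions above can be disentangled with a small amount of pattern matching. The following sketch is a hypothetical illustration rather than the project’s actual code: it classifies the five filename formats of Table III.2 with regular expressions and recovers the periodical code, date, page and (for articles) article number.

```python
import re

# Hypothetical patterns for the five filename formats of Table III.2.
PAGE = re.compile(r"^(?P<code>[A-Z]{3})-(?P<date>\d{8})-(?P<page>\d{3})\.xml$")
LONG_PAGE = re.compile(
    r"^(?P<code>[A-Z]{3})-\d{2}_\d{4}-\d{3}-(?P<date>\d{8})-(?P<page>\d{4})\.xml$"
)
ARTICLE = re.compile(r"^(?P<code>[A-Z]{3})-(?P<date>\d{8})-(?P<rest>\d{5,7})\.xml$")

def classify(name):
    """Return (type, code, date, page, article) for a corpus filename."""
    m = PAGE.match(name)
    if m:
        return ("page", m["code"], m["date"], m["page"], None)
    m = LONG_PAGE.match(name)
    if m:
        return ("page", m["code"], m["date"], m["page"], None)
    m = ARTICLE.match(name)
    if m:
        rest = m["rest"]
        if len(rest) == 5:    # short format, e.g. HYP-19381224-20001.xml
            page, art = rest[:1], rest[1:]
        elif len(rest) == 6:  # regular format, e.g. DKU-18800530-001001.xml
            page, art = rest[:3], rest[3:]
        else:                 # extra zero, e.g. ANN-18540826-0001003.xml
            page, art = rest[1:4], rest[4:]
        return ("article", m["code"], m["date"], page, art)
    return ("unknown", None, None, None, None)
```

Normalising page numbers this way also makes it possible to re-attach articles with the extra-zero format to their parent page files.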

Code   # files   # pages   # articles
JDY    173 451    15 716      157 735
HYP    155 335    17 363      137 972
HYN    135 976    11 006      124 970
PRG    134 899    17 991      116 908
HWN     96 780     4 070       92 710
PRP    102 102    17 425       84 677
DHA     71 262     3 531       67 731
NVY     54 816     3 733       51 083
DTO     42 597     5 442       37 155
LSU     40 903     4 228       36 675
HWY     37 882     3 365       34 517
HYV     22 967     1 581       21 386
DWE     19 287     1 759       17 528
LOP     13 976     2 108       11 868
LUT     10 914     1 086        9 828
DYB      8 040       713        7 327
DKU      4 350       482        3 868
TYT      5 167     2 053        3 114
DPO      3 147       182        2 965
DVR      3 104       658        2 446
GVY      2 096       296        1 800
GYV      1 732       282        1 450
TUI      1 811       443        1 368
LIB        580        64          516
HPO        554       244          310
RAA        199        36          163
GRE         96        16           80
COU         93        16           77
VEL         93        16           77
MES         49         4           45
ANN         41         6           35
HAN         40         8           32
BUR         28         4           24
IND         25         4           21
COM         22         4           18
PUB         20         4           16
HER         19         6           13
VOL         17         4           13
KLP         16         4           12
YPE         16         4           12
VER         14         4           10
Total 1 144 516  115 961    1 028 555

Table III.3: Breakdown of files into pages and articles

The XML structure of a typical article file is as follows:


After the declaration of the XML version and character encoding, each article is encapsulated in a root element (a clip). A clip consists of three sections:

1. source: information on the provenance of the article (paper, issue date, page and geometric position on the page)

2. content: actual article, including headers, pictures and captions (the main text, sometimes divided into paragraphs)

3. meta: metadata about the article such as the title (sometimes distinct from the printed header), author (when known) and language
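Since the original listing did not survive in this copy, the skeleton below illustrates the structure described above. Only the root element and the three section names come from the description; everything else is an illustrative assumption, not the corpus’s actual markup.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Skeleton only: inner element names are assumptions -->
<clip>
  <source>
    <!-- provenance: paper, issue date, page, geometric position on the page -->
  </source>
  <content>
    <!-- headers, pictures, captions, and the main text,
         sometimes divided into paragraphs -->
  </content>
  <meta>
    <!-- title (sometimes distinct from the printed header),
         author (when known), language -->
  </meta>
</clip>
```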

3.2 Linguistic distribution

Given the linguistic context of Belgium on the one hand and the history of the city of Ypres/Ieper on the other hand, it should come as no surprise that the corpus is multilingual, with French and Dutch (including Flemish dialects) more or less equally represented, plus the appearance of an English newspaper after World War I (during which Ypres was the theatre of three major battles). In this section, we experiment with different methods in order to get a clearer picture of the language distribution across the corpus. Since the multilingual dimension is key to our work, we devoted a fair amount of resources to this task, as it conditions the soundness of our claim to language independence.

3.2.1 Hard-coded language tag

Our first intuition was to look for the language tag in the XML meta section, but it proved not to be reliable at all: with a few exceptions shown in Table III.4, almost all articles (82.6% of the corpus) were labelled “nl” regardless of the actual language used in the article (including those from the English paper TYT), making this piece of information meaningless.
Only three periodicals (accounting for a mere 17.4% of the corpus) are annotated with the “fr” tag denoting French content: JDY (the largest in the collection), LOP, and LUT (despite the latter being overtly bilingual). Other oddities include the isolated appearance of a supposedly English article in HYN,23 and of 15 of them in HWN. There is even one article titled “Football” (with no body) which is labelled “it”, which raises the question of the integrity constraints used for the language tag encoding: if any language could appear, why just in this single pseudo-Italian article? Finally, two articles in NVY have no language declared at all, having just a single empty tag.

23Which is in fact in Dutch.

Code   # articles   nl        fr        en   it   n/a
JDY       157 735             157 735
HYP       137 972   137 971                  1
HYN       124 970   124 969             1
HWN        92 710    92 695             15
NVY        51 083    51 081                       2
LOP        11 868              11 868
LUT         9 828               9 828

Table III.4: Periodicals with declared non-Dutch articles

Openly French titles such as PRG and PRP were considered Dutch-only, while the bilingual titles HYP and LUT were deemed monolingual (Dutch- and French-only respectively), and the English title TYT was not taken into account despite the (irrelevant) appearance of the “en” language tag elsewhere in the corpus. All these inconsistencies clearly show that the language metadata cannot be relied on. We therefore looked for alternative ways of assessing the distribution of languages in the corpus.
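A tally like the one behind Table III.4 is straightforward to produce programmatically. The sketch below is illustrative only: it assumes a <language> element inside the meta section, whereas the actual element name in the corpus files may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative sketch: tally the declared language of each article.
# Assumption: the language is stored in a <language> element under <meta>;
# the real element name in the corpus may differ.
def declared_language(xml_string):
    root = ET.fromstring(xml_string)
    node = root.find(".//meta/language")
    if node is None or not (node.text or "").strip():
        return "n/a"  # empty or missing tag, as in the two NVY articles
    return node.text.strip()

sample = [
    "<clip><meta><language>nl</language></meta></clip>",
    "<clip><meta><language>fr</language></meta></clip>",
    "<clip><meta><language/></meta></clip>",
]
counts = Counter(declared_language(s) for s in sample)
```

Run over the full corpus, a tally of this kind immediately exposes the anomalies discussed above (the lone “it” label, the two empty tags in NVY).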

3.2.2 Periodical titles

In the light of the evidence provided above, the titles of the newspapers seemed a promising track to follow, since the title often reflects the language a paper is written in. Table III.5 shows that Dutch (including its local varieties) accounts for almost two thirds of the titles (25 out of 41). However, it corresponds to just over 45% of the articles (far from the 82% computed from the XML language tags), while French covers almost 40%. Two papers (HYP and LUT) are overtly bilingual, accounting for just under 15% of the corpus, and another one (TYT) is written in English but covers only 0.3%.

Language                   # titles   # articles   Global %
Dutch                      25         469 040      45.6%
French                     13         408 601      39.7%
Dutch+French (bilingual)    2         147 800      14.4%
English                     1           3 114       0.3%

Table III.5: Distribution of languages across periodicals

This summary is, unfortunately, an oversimplification of reality. Most periodicals, despite having either a French or a Dutch title, do not stick to one particular language but rather mix articles in both languages. In practice, articles written in French can be found in Dutch-named periodicals and conversely: ANN contains news pieces in Dutch, while BUR has articles in French.
Additionally, the actual proportion of each language in the bilingual titles could not be determined. An estimation postulating that Dutch and French each account for half of the articles in these publications would lead to the 14.4% being split evenly amongst them, increasing the overall percentages to 52.8% and 46.9% for Dutch and French respectively, with English staying at 0.3%.
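The back-of-the-envelope arithmetic behind this estimate, starting from the Global % column of Table III.5:

```python
# Shares of articles by title language (Table III.5).
dutch, french, bilingual, english = 45.6, 39.7, 14.4, 0.3

# Postulate: the bilingual titles (HYP and LUT) split evenly
# between Dutch and French.
dutch_est = dutch + bilingual / 2    # 45.6 + 7.2 = 52.8
french_est = french + bilingual / 2  # 39.7 + 7.2 = 46.9
```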

3.2.3 Language detection

Since it is impossible to read over one million articles individually in order to detect the language used in each of them, we decided to rely on a language detection algorithm. The best-known algorithm of this kind is the one included in the Google Translate web tool24, but unfortunately its access through an API has been subject to a fee since 2011. The task has been the focus of recent research (Zampieri et al., 2014), with state-of-the-art systems achieving over 95% accuracy even when the languages to distinguish between are closely related (Goutte et al., 2014).
We selected the tool langid.py25 (Lui and Baldwin, 2012), which recognises 97 languages, offers good robustness on a range of datasets (from 89% accuracy on very noisy corpora to almost 99% accuracy on the more standardised EuroGOV corpus) and is implemented in Python, which makes it easy to call from within our other Python scripts used to load the corpus and parse the XML files.26 We needed to evaluate, however, to what extent the low OCR quality of our dataset would negatively affect the language detection output.
For this purpose, we created a random sample of 800 articles representative of the corpus. We then manually annotated each text with its corresponding language, removing undecidable cases (i.e. OCR too poor to even read the text), blanks and mixed languages (title in one language and body in another, for instance), leaving us with a gold-standard corpus (GSC) of 779 language-disambiguated texts (466 Dutch, 311 French and 2 English).

24https://translate.google.com/
25https://github.com/saffsd/langid.py
26Another choice could have been TextCat (Cavnar and Trenkle, 1994), but Derczynski et al. (2015) indicate that langid.py outperforms it on French and English, while being only marginally less accurate on Dutch. Moreover, it is implemented in Perl, which makes its integration in Python less straightforward.

The langid.py utility allows us to restrict the number of possible languages from 97 to a more practical subset, thereby decreasing the risk of error. We first set this parameter to Dutch, French and English with langid.set_languages(["nl","fr","en"]), obtaining satisfactory accuracy. However, several errors arose from a confusion between Dutch and English in very short texts. Since English is used in only a small fraction of the corpus, we tried to reduce the allowed languages to just “nl” and “fr”, further reducing the error rate. Tables III.6 and III.7 show the confusion matrices for both attempts, while Table III.8 summarises the results, showing a state-of-the-art accuracy of 97% on our GSC with the binary classifier.

            Gold
Guess     NL    FR   EN
  NL     452     8    0
  FR       5   293    1
  EN       9    10    1

Table III.6: Confusion matrix with three languages: Gold represents the hu- man annotation and Guess the machine’s guess (for instance, 8 articles were written in French but wrongly identified as Dutch)

            Gold
Guess     NL    FR   EN
  NL     453     9    1
  FR      13   302    1

Table III.7: Truncated confusion matrix with only two languages: the two En- glish articles in the sample are de facto ignored, leading to one more error but dramatically reducing the error rate by not allowing French or Dutch articles to be marked “en”

Languages   # Errors   Accuracy
NL/FR/EN    33         95.76%
NL/FR       24         96.92%

Table III.8: Accuracy scores for language detection
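The scores in Table III.8 follow directly from the confusion matrices: accuracy is the number of articles on the matrix diagonal (guess = gold) divided by the size of the GSC (779). A minimal sketch of this computation:

```python
# Confusion matrices of Tables III.6 and III.7, as guess -> gold -> count.
three_lang = {
    "NL": {"NL": 452, "FR": 8, "EN": 0},
    "FR": {"NL": 5, "FR": 293, "EN": 1},
    "EN": {"NL": 9, "FR": 10, "EN": 1},
}
two_lang = {  # English can no longer be guessed, only misclassified
    "NL": {"NL": 453, "FR": 9, "EN": 1},
    "FR": {"NL": 13, "FR": 302, "EN": 1},
}

def accuracy(matrix):
    """Share of articles whose guessed language matches the gold label."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[lang].get(lang, 0) for lang in matrix)
    return correct / total

# accuracy(three_lang) = 746/779 ≈ 95.76%
# accuracy(two_lang)   = 755/779 ≈ 96.92%
```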

In order not to penalise the small proportion of English articles by limiting the range of choices to only two languages when launching the full analysis of the 1 028 555 articles, a compromise was found: langid.py was run with all three languages on TYT27 and with Dutch and French only on the rest of the corpus. This allowed us to limit the number of English false positives while ensuring maximum accuracy.
Table III.9 shows the estimated distribution of languages across the corpus after the language detection process, corroborating the estimation by periodical titles with a margin of error of only 0.4%: 52.4% instead of 52.8% for Dutch and 47.3% instead of 46.9% for French. The results also confirm that the gap between the number of articles in Dutch and French is only 5%, much less than would have appeared from the 2:1 ratio for periodical titles in Table III.5.
The identification of the language in which each article is written with a degree of certitude of about 97% opens the door to a language-specific approach to NER. However, difficulties remain because some articles are simply inconsistent or deliberately bilingual, as illustrated in Figure III.5. Although uncommon, such odd cases make the case for a language-independent approach to IE, already outlined in Chapters I and II. How this additional constraint of language can be dealt with in practice will be detailed in Chapter IV.

Figure III.5: Bilingual article from De Handboog

After this extensive exploration of the contents, structure, and linguistic distribution of the Historische Kranten corpus, we will now focus our attention on the players involved in the digitisation project on the one hand, and on the expectations of the end users taking advantage of the online publication of this material on http://www.historischekranten.be/ on the other hand.

27This was done in order to allow for the possibility of non-English articles in TYT, although in the event no such articles were detected, as shown by the third column of Table III.9.

Code    nl        fr        en
JDY      18 443   139 292
HYP      88 747    49 225
HYN      97 734    27 236
PRG      10 654   106 254
HWN      89 711     2 999
PRP      19 582    65 095
DHA      36 180    31 551
NVY      44 961     6 122
DTO      32 324     4 831
LSU         916    35 759
HWY      28 579     5 938
HYV      20 858       528
DWE      14 048     3 480
LOP      11 399       469
LUT       5 992     3 836
DYB       7 060       267
DKU       2 396     1 472
TYT          20        74   3 020
DPO       2 805       160
DVR       2 023       423
GVY       1 582       218
GYV       1 068       382
TUI       1 355        13
LIB           4       512
HPO          61       249
RAA         128        35
GRE          56        24
COU          55        22
VEL          77
MES           6        39
ANN          15        20
HAN          17        15
BUR          19         5
IND          15         6
COM                    18
PUB                    16
HER          13
VOL           9         4
KLP          11         1
YPE          12
VER                    10
TOTAL   538 935   486 600   3 020
%          52.4      47.3     0.3

Table III.9: Estimated distribution after language detection

3.3 People and needs

After focusing on numbers in the previous two sections, we should remind ourselves that a semantic enrichment project, like any technological project, is first and foremost about human beings and their expectations rather than technical achievements for their own sake. In this section, we present the different people involved and the outcomes they expected from the project, and continue with an analysis of the search habits of users in order to derive specifications serving as guidelines for the design of a specific tool in Chapter V.

3.3.1 Stakeholders

In order to satisfy a demand, we need to identify the actual wishes and needs of both the managers of the archive and its users, and ideally to reconcile both. Not taking user views into account presents the risk of being disconnected from reality and designing a system that does not answer a real need in the first place, what Ariès (1986, p. 216) calls “l’insupportable vanité du technicien qui demeure à l’intérieur de sa technique, sans jamais tenter de regarder au dehors” (“the unbearable vanity of the technician who remains inside his technique, without ever attempting to look outside”). The point of view of the client is not monolithic but rather polyvocal, since we have had several interlocutors in this project:

Ypres City Archive: The Ieper Stadsarchief28 owns the collection and is in charge of its preservation. Our contact there was Jochen Vermote, who gave us a tour of the physical archive and drew attention to potential quality problems due to poor conservation or chemical alteration of old newspapers.

Heritage Cell CO7: Erfgoedcel CO729 is the service in charge of the collection’s promotion. Our contact was Liesbeth Thiers until she left in 2013, then Hilde Cuyt. Access to and support for the Google Analytics account of http://www.historischekranten.be/ was provided by Renee Mestdagh.

Picturae: The project’s website was reassigned in 2014 to Picturae30, now in charge of its online propagation and technical maintenance. Picturae used scans originally performed by X-CAGO31 but redid the OCR and indexing. Our contact there was Robert Tiessen, who invited us to their headquarters in Heiloo (The Netherlands) in order to discuss implementation.

28http://www.ieper.be/
29http://erfgoedcelco7.be/
30https://picturae.com/
31http://www.x-cago.com/

3.3.2 Field survey

In the context of Holocaust research, Blanke and Kristel (2013) argue that “research users often have more demands on semantics than is generally provided by archival metadata. For instance, in archival finding aids place names are often only mentioned in free-form narrative text and not especially indicated in controlled access points for places. Researchers would like to search for these locations”. But does this observation apply to our particular context?
In order to understand the profiles of users and their search patterns, we collected statistics from the project website over a four-year period from March 2, 2010 to March 2, 2014. Google Analytics32 (through its Behaviour > Site Search submenu) showed that, over this interval, 35 996 users performed a total of 123 984 queries on the website. All in all, one visitor out of two (49.5%) used the site search functionality, and users spent four minutes on average exploring the results, visiting 3.86 pages. As shown in Figure III.6, the raw search terms are quite noisy and needed some streamlining before being suitable for analysis. The cleaning was performed with OpenRefine,33 an open-source tool for working with messy data (Verborgh and De Wilde, 2013).

Figure III.6: Noisy Google Analytics data

32http://www.google.com/analytics/
33http://openrefine.org/

The top twenty-five search terms are displayed in Table III.10, along with the number of hits and categories (either entity type or generic “Concept” for common nouns). The data show that places are indeed preponderant among the preoccupations of the users: 30 of the 50 most popular search terms (60%) are locations, with only 3 persons (6%) and 2 organisations (4%). These statistics also show that users do not operate a formal distinction between concrete named entities (proper nouns) such as Ieper and Hitler on the one hand, and more abstract concepts such as “war” (oorlog) and “murder” (moord) on the other hand: 12 out of 50 top terms (24%) are such concepts (the remaining 6% representing rarer cases such as Titanic).

#    Term          Hits   Category
1.   Zillebeke     398    Location
2.   Passendale    351    Location
3.   Westouter     259    Location
4.   Ieper         197    Location
5.   oorlog        178    Concept
6.   Reninghelst   163    Location
7.   Bikschote     149    Location
8.   Merkem        127    Location
9.   Geluveld      125    Location
10.  Wijtschate    121    Location
11.  Zonnebeke     112    Location
12.  Hollebeke     108    Location
13.  moord         102    Concept
14.  Poperinge      98    Location
15.  Watou          93    Location
16.  Langemark      91    Location
17.  Proven         75    Location
18.  Titanic        75    Vehicle
19.  Abeele         73    Location
20.  Zandvoorde     71    Location
21.  Diksmuide      57    Location
22.  Becelaere      49    Location
23.  Hitler         49    Person
24.  Noordschote    49    Location
25.  1914           47    Time

Table III.10: Top 25 search terms on Historische Kranten

3.3.3 Specifications

Based on discussions with the stakeholders and the results of the survey, we drafted a bill of specifications for the conception of our semantic enrichment and knowledge discovery tool MERCKX, which will be presented in Chapter V. These specifications are summarised below. As emphasised in Section 3.3.1, the various players involved in the Historische Kranten project have different concerns, needs, and expectations. These preoccupations can be summed up with three P-words: preservation, promotion, and propagation.

The Ypres City Archive is concerned with preservation, and is therefore keen to have quality digital copies of the original periodicals, which risk deteriorating over time. While our project does not directly address this need, duplicating the material and reusing it in several forms participates in the effort of long-term conservation, especially in an open data perspective.

The CO7 Heritage Cell is responsible for the promotion of the work and for gaining more symbolic value out of it. Our initiative is therefore clearly in line with their objectives, as confirmed by personal communication with the team stating that they “want to encourage scientific research on the materials in their possession and [...] are very pleased with the cooperation as it exists today”. In particular, data visualisation tools and the dissemination of research results to the wider scientific community appeal to the heritage cell.

Picturae takes care of the propagation part, and therefore expects to preserve its image of expertise without needing to invest too much development work in the maintenance of existing tools. A turnkey semantic enrichment solution would thus constitute a real opportunity for them, allowing them to enhance the website’s functionalities at a low cost with search suggestions and links to external resources, while ensuring maximum benefits for the users.

Building on the lessons from the field survey, the clear preference for places also encourages us to concentrate our efforts more specifically on the extraction of locations: in Chapter V, the system and evaluation will therefore focus on this type of entity, but without sacrificing generalisability to other types. The observation that common and proper nouns freely intermix corroborates our assertion, formulated in our first research question and already confirmed in Chapter II (Section 3.1.1), that the separation between entities and terms in language processing is somewhat artificial and does not necessarily reflect the interests of the end users of the applications developed.

Summary

In Chapter III, we introduced the difference between formal sciences and the humanities, and the corollary distinction between deterministic and empirical information. We have seen that while the truth value of a logical or mathematical expression can always be determined unambiguously, the same does not hold true for domains that are subject to human experience, in which interpretation plays a decisive role. In the absence of a formal referent (of which a gold-standard corpus is only an avatar), correctness and quality become void concepts per se and can only be apprehended meaningfully relative to usage.
The humanities are particularly affected by this relativism, which has prompted some scholars to adopt more rigorous and statistical techniques in an attempt to objectify their practice. The digital humanities, a catch-all branding for all kinds of computational approaches to literary and historical material, are no exception to this tendency. But transforming the social sciences into a deterministic object of study is a fallacy and is doomed to fail because of the intrinsic empirical nature of these disciplines. Instead, their inherent subjectivity must be acknowledged and the interpretative dimension of human analysis accepted as a necessary component.
To illustrate this view, we presented an original case study based on the Historische Kranten project, which involved the digitisation and online publication of over one million historical documents from a multilingual historical archive. This corpus was analysed in depth and many examples are drawn from it in order to illustrate the relevance of our work. In Chapter IV, we will focus on three important dimensions of this collection that can be generalised to similar projects: data quality, multilingualism and the evolution of language and concepts over time.

Chapter IV

Quality, Language, and Time

Outline

In this chapter, we address three important aspects that can alternatively be considered constraints, threats, or opportunities in the context of the semantic enrichment of empirical content: data quality, the variety of language, and its evolution over time. Section 1 introduces the notion of data quality, which is essential when dealing with uncurated data such as the material present on the Web. Through the prism of Isabelle Boydens' work, we critically assess the quality of the OCR output of cultural heritage projects (and of Historische Kranten in particular), before evaluating the quality of Linked Open Data resources, mainly focusing on DBpedia, which will constitute the backbone of our future system MERCKX. In Section 2, we investigate the notion of multilingualism and what the handling of multiple languages entails for information extraction techniques. Language-independent methods are reviewed and compared to language-specific ones, with a special emphasis on multilingual corpora and the need to increase portability from one language to another. Finally, Section 3 raises the issue of language evolution and of its practical implications for a semantic enrichment system. Accounting for changes in the meaning of terms and concepts over time requires a temporal framework, which we borrow from Boydens once again in order to transpose it to our particular case. The understanding of the three central themes of quality, language, and time – along with their operational repercussions – will allow us to identify the missing links in our theoretical approach to knowledge discovery before implementing it on a practical level in Chapter V.


Contents
1 Data Quality...... 125
1.1 Fitness for use...... 126
1.2 Optical character recognition...... 128
1.3 Linked Open Data...... 131
2 Multilingualism...... 137
2.1 Language-independent information extraction.... 138
2.2 The Semantic Web in other languages...... 144
2.3 Multilingual corpora...... 146
3 Language Evolution...... 147
3.1 The generative lexicon...... 147
3.2 Stratified timescales...... 148
3.3 Concept drift...... 151

1 Data Quality

The notion of quality is central to any kind of evaluation, but the reality hidden behind this concept can vary significantly from one context to another. While everybody would agree that good quality – or at least a good price-quality ratio – is a must for business-grade applications or products, what exactly constitutes this quality is left to the judgement of the customer. Following Juran (1951), we consider that quality is not an absolute ideal but rather always relative to usage. Juran coined the term fitness for use to stress the fact that the meaning of quality depends on the expectations and actual needs of users. The ISO 9000 family of quality management systems standards1 acknowledged this reality by adopting the analogous designation of fitness for purpose. In the 1990s, Redman (1997) applied this conception of quality to information management, laying the foundations of modern data quality. With the growth of Big Data, quality became a pervasive issue and a major stake for companies. As early as 2001, the Data Warehousing Institute2 estimated that poor data quality was costing US businesses in excess of 600 billion dollars annually. Several works testify to the severity of data quality problems in industry (Olson, 2003; Batini and Scannapieco, 2006). Despite the name of Richard Wang's (1998) Total Data Quality Management (TDQM)3 programme, the original assessment of Juran remains valid today: total quality does not exist; it is always relative to a number of criteria. This reality was recently reaffirmed in a collection of case studies from the computer science community (McCallum, 2012). Depending on the context, the needs of the users, and the purpose for which the data will be used, one or several components can gain more importance than the others. Quality is therefore not a unipolar measure but rather a complex property exhibiting many conflicting dimensions.
Boydens (1999) went beyond the limitations of the field by drawing the useful distinction between deterministic and empirical data, presented in Chapter III. This allowed her to reformulate the problem of “correct” data, insoluble in an empirical context in the absence of a stable referent, in terms of the permanent construction of data over time and their constant (re)interpretation. The fitness for use principle is introduced in Section 1.1 and is then applied to OCR (Section 1.2) and Linked Open Data (Section 1.3).

1 http://www.iso.org/iso/iso_9000/
2 https://tdwi.org/
3 http://web.mit.edu/tdqm/

1.1 Fitness for use

We have seen that the underlying reality covered by the notion of quality is subject to important variations of interpretation depending on the context and the needs of the users. Per se, it is not such a useful concept, since it cannot be measured efficiently and unequivocally. Quality therefore needs to be redefined for each purpose with regard to external constraints. In order to turn quality into an operational property, objective quality indicators have to be defined upstream of any evaluation process. These (mostly) quantifiable indicators then make it possible to measure concrete quality deficiencies relative to a certain purpose and to implement improvement strategies. Specifying these indicators requires an understanding of the nature of a data element, and more importantly of what is considered a “correct” data element (Boydens, 2011). As seen in Chapter III, any piece of data can be represented in the form of an (i,d,v) triple where i is its unique identifier, d its domain of definition, and v its value. Whereas establishing the correctness of a given value does not raise any problem in the case of deterministic data, for which a formal model exists, the analogous operation for empirical data is made tricky, in the inexorable absence of a stable referent, by changes in theory following human interpretation. It follows that the closed-world assumption does not hold for empirical data, whose underlying reality is constantly evolving. If there is no absolute reference against which to compare the correctness of a data element, we must deduce that data are not given once and for all but rather progressively constructed over time, as will be discussed in Section 3. As a consequence, the appropriateness of empirical data to the needs of the users “can be determined only indirectly, via a series of lateral indicators” (Boydens, 2011, p. 121). Most of these quality indicators are quantifiable (e.g. in terms of precision and recall of documents for a user; see Chapter V for more detailed evaluation metrics) but some useful indicators, such as relevance, are non-quantifiable and therefore subject to interpretation.4 Concretely, what constitutes data quality is always an arbitration, a trade-off between several conflicting components, only measurable by balancing costs and benefits for the client rather than in absolute terms. This is pleasantly illustrated by the following business mantra: “our services are good, fast and cheap but you can only pick two: if it's good and cheap, it won't be fast; if it's fast and good, it won't be cheap; if it's cheap and fast, it won't be good”. A representation of this quality trade-off conception is displayed in Figure IV.1.
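The quantifiable indicators just mentioned can be made concrete; a minimal sketch computing precision and recall as lateral quality indicators (the document identifiers are hypothetical, not drawn from the corpus):

```python
# Precision and recall as quantifiable "lateral" quality indicators.
# The document identifiers below are purely illustrative.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents
    against the set of documents actually relevant to the user."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["doc1", "doc2", "doc3", "doc4"]   # what the system returned
relevant = ["doc2", "doc3", "doc5"]            # what the user needed

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f}, recall={r:.2f}")    # 2/4 retrieved are relevant, 2/3 relevant found
```

Both measures are quantifiable per user and per purpose, which is exactly what makes them usable as indirect indicators of fitness for use; a non-quantifiable indicator like relevance has no such formula.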

4 Although efforts have been made to reduce the subjectivity involved in the evaluation of NLP systems, see for instance Daelemans and Hoste (2002); van Zaanen and Freeman (2004), and more recently Snow et al. (2008) under this apt title: “Cheap and Fast—But is it Good?”.

Figure IV.1: Quality trade-off, reproduced from R. Sabatini (CC BY-NC-ND)

Moreover, the notion of quality applied to empirical information is always subject to the “boomerang of reality”: a well-tested model can suddenly prove to be outdated due to unforeseen observations that challenge its validity and force the whole representation of reality to be called into question. This dynamic between observed facts and their representation is essential in order to avoid being stuck with a model that is completely disconnected from reality and impervious to new discoveries. This issue is well illustrated by Wiener (1994, p. 37), already mentioned in Chapter III, who explains that before scientific theory accounted for the ozone hole, all NASA measurements of extremely low ozone levels were systematically considered bad data, because validating them would have been unthinkable at the time: “when one does not have an independent source of information, it can prove very dangerous to reject a reading because it does not match preconceived ideas”. In what follows, we will apply this conception of quality to the concrete applications of OCR and Linked Open Data, focusing on the needs of the users of the Historische Kranten corpus presented in Chapter III. Particular attention will be paid to the problem of identity within Semantic Web resources, and to the impact of the quality of knowledge bases for operational purposes. In Section 3, the fitness for use principle will acquire a new dimension by being extended to language evolution.

1.2 Optical character recognition

The relevance of optical character recognition (OCR) quality for our work is obvious, since the whole Historische Kranten corpus was OCRised prior to its online publication. As emphasised in Chapter III, the past decade has seen a tremendous transformation of the humanities due to the explosion of data, both born-digital and digitised. Scholarly practice has evolved accordingly, with ambitious digitisation projects – such as Google Books and Europeana Newspapers – transforming the relationship researchers entertain with their documents (Jankowski, 2009). In this context, OCR, the technique that converts scanned images into indexable text, has played a major role in the exploitation of archival collections and other historical material. Commercial solutions like ABBYY FineReader5 have been shown to achieve over 98% accuracy on clean input text (Holley, 2009), but they are relatively expensive, and small cultural institutions cannot always afford to hire consultants to perform top-grade OCR. Moreover, these commercial products invariably operate as black boxes that do not allow field workers to understand how the OCR pipeline works, nor to interfere with it before evaluation and correction at post-processing time. This state of affairs creates frustration among collection managers who know their material very well but are unable to offer any useful input during the digitisation process, between early specifications and the final product. Blanke et al. (2012, p. 660) reflect on this problematic situation:

“This use of "black box" tools, and even more so the outsourcing of OCR processing, leads to a skills and knowledge gap among researchers and archives staff involved in digitisation, which results in a failure to appreciate the problems and opportunities that OCR approaches offer the scholarly community.”

The alternative is in-house digitisation, which allows for greater control over the output, but often at the expense of quality (Alex et al., 2012). In the context of Historische Kranten, the quality of the OCR performed by X-CAGO is very uneven and not quite state-of-the-art, as shown by Figure IV.2 and its transcript below.6

5 http://www.abbyy.com/finereader/
6 A second OCR was performed recently by the company Picturae with ABBYY FineReader 10, yielding better results, but we could not access the new files in time to establish a thorough comparison in this dissertation. Ideally, the implementation phase described in Chapter V will have to take the later OCR into account, although the extraction scores will not necessarily be dramatically affected, as explained below.

Figure IV.2: OCRised article from the Messager d’Ypres

Dimanche,M9 Oclobre 1890. Quatrième année. D’YPRES, Journal d’Annoncesdes Notaires, Négociants, Maisons de Commerce, etc. Ce JOURNAL paraît le Samedi soir, il est distribué et affiche dans la ville d’Ypres, envoyé dans les Maisons Communales et principaux estaminets de la province de la Flandre Occidentale. 11 est, en outre, expédié a tous les Notaires, Huissiers et Agents d’affaires de la dite prbvince, et partout oü dans l’intérêtdela publicité des Annonces, sa distribution pourrait offrir une utilité particuliere. Le prix d’insertion est de IMX centimes la ligne. — Les affiches, circulaires et annonces im- priraées au Bureau de ce Journal regoivent une insertion gratis. On traite a forfait pour les Annonces Commerciales pour trois mois, six mois et toute une année a des conditions très-avantageuses. — Tout ce qui concerne le Messager doit y ètre adressé franco. Le Bureau est établi chez AAifiJ/iV-J/A r/f/sT?, iniprimeur-Editeur, rue au Beurre, 20, Ypres.

While a quantitative analysis of a larger sample will be performed later on in Chapter V in order to evaluate the impact of OCR quality on entity linking, a qualitative assessment of the types of errors is equally relevant to understand the issues affecting OCR output.

Different quality problems can indeed be identified in this article:

• undetected chunks of text: vertical stamps but also mentions such as “N° 42.” and the main heading “LE MESSAGER” (probably in too big a font to be handled)

• confusion involving digits (“M9” instead of 19, “11” instead of Il)

• problems with special characters (long dash – converted into HTML entity “—”)

• issues with alternative typographies (“IMX” instead of DIX in bold, “AAifiJ/iV-J/A r/f/sT?” instead of LAMBIN-MATHÉE in italics)

• incorrect diacritics (“oü” instead of où, “ètre” instead of être)

• various misrepresented characters (“prbvince” instead of province, “iniprimeur” instead of imprimeur)

In the late 1990s already, Palmer and Day (1997) wondered “how the existing high-scoring [NER] systems would perform on less well-behaved texts, such as single-case texts, non-newswire texts, or texts obtained via optical character recognition”. In this section, we explore this question in order to formulate hypotheses that will then be tested by the evaluation process in Chapter V. By testing four NER tools on raw OCR data, Rodriquez et al. (2012) made the counterintuitive discovery that “manual correction of OCR output does not significantly improve the performance of named-entity recognition”. This could be due to very good OCR quality from the start, but the authors also note that they used an open source OCR system, leading to a transcript quality that “would be considered quite low for human readers, but even when uncorrected can offer value for search indexing and other machine processing purposes”. Similarly, Tanner et al. (2009) note that the need for high-quality OCR is context-dependent: most search applications are able to retrieve text strings containing errors with the help of fuzzy matching algorithms, while human users seldom have access to the actual OCR output but will rather read from the original images. Thus, a suboptimal OCR performance score is not necessarily related to poor extraction results. Moreover, commercial OCR is not necessarily best suited for cultural heritage applications and collections of historical documents in general, principally due to its lack of easy customisation. Blanke et al. (2012, p. 660) show that open source OCR can achieve a higher degree of flexibility on historical data (emphasis added):

“There are key differences between large-scale digitisation efforts (as attempted by Google or in the context of the Europeana initiative) and those that have concentrated more on historical material with a specific focus on humanities research. Firstly, there is the question of resources. Most research funding is project-based and notoriously limited, making it less likely that additional resources are available for buying in OCR expertise. More importantly, however, the material produced by humanities digitisation projects is known to be the result of interpretation. This means that there may be a need to revisit aspects of the digitisation process from time to time when new discourses about the source material emerge.”

This interpretable nature of historical/empirical material brings us back to the practice of data quality set forth by Boydens. The evolving nature of digitisation is also emphasised, an issue that will be dealt with in depth in Section 3.
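The fuzzy matching that Tanner et al. credit with making error-ridden OCR text retrievable can be illustrated with Python's standard difflib; a minimal sketch, where the garbled tokens come from the Messager d'Ypres transcript above and the toy lexicon is an assumption:

```python
import difflib

# Toy lexicon of correct forms; in a real search index this would be a
# dictionary or gazetteer covering the collection's languages.
lexicon = ["province", "imprimeur", "être", "où", "distribution"]

# Garbled tokens observed in the OCR transcript above.
ocr_tokens = ["prbvince", "iniprimeur", "ètre"]

for token in ocr_tokens:
    # get_close_matches ranks lexicon entries by similarity ratio;
    # cutoff=0.6 is difflib's default threshold.
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=0.6)
    print(token, "->", matches[0] if matches else "no match")
```

With these inputs, each garbled form recovers its intended spelling, which is what allows a search application to tolerate imperfect OCR even when the transcript itself reads poorly.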

1.3 Linked Open Data

Linked Open Data (LOD) are key to the tasks of semantic enrichment and entity linking, on which we will build our methodology in Chapter V. Whereas the LOD trend has made huge datasets available online, the question of the quality of these data largely remains unaddressed. As we have explained before, this quality can only be understood in terms of fitness for use: do the data live up to the standards of the people actually using them? This open question is summed up humorously by the Pedantic Web Group (Hogan et al., 2010) in these terms:

“You publish RDF data on the Web, and thereby contribute to our shared passion: the emerging global information space that we call the Web of data. Thank you for that! Thank you for sharing your data! But your data is broken. Syntax errors, unescaped characters, encoding problems, broken links, ambiguous identifiers, undefined vocabulary terms, mismatched semantics, unintended inferences: if you publish anything on the Web of data, chances are that there is some problem.”7

These shortcomings are inherent to the nature of Open Data: most of the information present on the Web is not curated and therefore seldom suitable for use in strategic contexts involving high human or monetary stakes, such as medicine, defence, law, or finance. In these domains, decision-making processes need to rely on high-confidence data, necessarily requiring the ad hoc creation of domain-specific resources like ontologies. LOD can nevertheless be useful in other domains where optimal reliability is less crucial, such as those described in Chapter III. As noted by Tylenda et al. (2014), “facts extracted through IE are not completely error-free; errors may result either from incorrect statements at the source or be induced by the extraction process”. The first two editions of the Workshop on Linked Data Quality8 showed an increased interest among Linked Data researchers in issues of quality. In the remainder of this section, we will focus on the classical problem of identity between resources, and then on the specific case of DBpedia.

7 From https://www.smalsresearch.be/linked-open-data-quality-around-the-clock/

1.3.1 owl:sameAs and identity

According to the official definition,9 “an owl:sameAs statement indicates that two URI references actually refer to the same thing: the individuals have the same ‘identity’”. For simple cases, the meaning of “identity” is quite obvious. We can, for instance, state that Charles Buls and Karel Buls are one and the same person:

In practice, however, very few cases are as straightforward. The resource for Brussels, for instance, is well represented in LOD datasets. Its DBpedia page (http://dbpedia.org/page/Brussels) lists 128 equivalent resources in various KBs, from Freebase to LinkedGeoData and to Global Administrative Areas (GADM). Among those, we can find two URIs from GeoNames: http://sws.geonames.org/2800866/ and http://sws.geonames.org/2800867/, the former corresponding to Brussels as the capital of a political entity and the latter to Brussels Capital as a first-order administrative division. The first has a population of 1 019 022, the second of 1 830 000. So how can these two URIs actually refer to the same thing?
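What naive substitution over such links entails can be simulated in a few lines; a sketch in Python, with triples reduced to tuples, the Buls assertion and the population figures taken from the examples above, and a deliberately simplified merging step (a full implementation would need union-find and transitive closure over sameAs links):

```python
# Naive illustration of the substitution principle behind owl:sameAs:
# merging all properties of resources declared identical.
# Triples are simplified to (subject, property, value) tuples;
# the prefixed URIs are shorthand, not exact resource identifiers.

triples = [
    ("dbr:Charles_Buls", "owl:sameAs", "dbr:Karel_Buls"),  # unproblematic case
    ("geo:2800866", "population", 1_019_022),  # Brussels, capital
    ("geo:2800867", "population", 1_830_000),  # Brussels Capital, admin. division
    ("geo:2800866", "owl:sameAs", "geo:2800867"),
]

def merge_same_as(triples):
    """Substitute sameAs-linked subjects for one another and collect
    every value asserted for each (subject, property) pair."""
    # First pass: map each URI to a canonical representative
    # (one-step only; no transitive closure in this sketch).
    canon = {}
    for s, p, o in triples:
        if p == "owl:sameAs":
            canon[o] = canon.get(s, s)
    # Second pass: gather property values under the canonical subject.
    merged = {}
    for s, p, o in triples:
        if p == "owl:sameAs":
            continue
        key = (canon.get(s, s), p)
        merged.setdefault(key, set()).add(o)
    return merged

merged = merge_same_as(triples)
conflicts = {k: v for k, v in merged.items() if len(v) > 1}
print(conflicts)  # one (subject, property) pair now carries two conflicting values
```

A single sameAs link thus leaves one canonical Brussels resource with two contradictory population values, which is precisely why Halpin et al. find that most uses of owl:sameAs on the Web of Data violate its strict logical semantics.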

8 http://ldq.semanticmultimedia.org/
9 http://www.w3.org/TR/owl-ref/#sameAs-def

As seen in Chapter II, Halpin et al. (2010) noted that the boundary between sameness and similarity is sometimes porous, to the extent that the misuses of owl:sameAs outnumber its correct uses on the Web of Data. Although the meaning of this property seems intuitive enough for most Linked Data users, their intuitions “almost always violate the rather strict logical semantics of identity demanded by owl:sameAs as officially defined”. The authors trace the concept of identity back to Leibniz's Law, also referred to as the identity of indiscernibles, which states that if two objects have all the same properties, then they are identical. This Law can be formalised logically as (IV.1):

∀x∀y[∀P (P x ↔ P y) → x = y] (IV.1)

The reverse (indiscernibility of identicals) is also true: if two objects are iden- tical, then they share the same properties (IV.2).

∀x∀y[x = y → ∀P (P x ↔ P y)] (IV.2)

However, this definition is clearly too restrictive to be applied to the empirical sciences as defined in Chapter III. Whereas first-order logic deals with stable, ethereal concepts, their transposition to mundane reality is conditioned by temporal context: is Karel Buls the adult the sameAs Karel Buls the child? Was Brosella in the Middle Ages the sameAs Brussels today? For Elias (1996, p. 53), “identity is not so much that of a substance as that of the continuity of the transformations leading from one stage to the next”. Since we cannot enumerate all the properties of an object, we must abstract a subset of them. Identity based on property matching is thus necessarily under-specified. Even so, we cannot assume that different references across the LOD cloud will retain the same properties for a given object. As stated by Halpin et al. (2010), “the problems with sameAs start when we apply the principle of substitution to it, by inferring from a sameAs assertion that its subject and object share all the same properties”. In order to avoid transforming the Web of Data into “the semantic equivalent of mushy peas”, the authors recommend providing additional documentation and finer-grained properties to distinguish between a variety of nuances that are currently loosely covered by owl:sameAs alone: different senses with the same reference (Frege, 1960), matching,10 similarity, and relatedness.

10 As in skos:exactMatch, which “indicates a high degree of confidence that two concepts can be used interchangeably across a wide range of information retrieval applications” (Miles and Bechhofer, 2009). Note how this definition is far more oriented toward operationality.

1.3.2 Quality of DBpedia

Using crowdsourcing, Zaveri et al. (2013) estimated that about 12% of DBpedia triples have quality problems, ranging from broken links to irrelevant information and objects incorrectly extracted from Wikipedia (see Table IV.1 for their full typology of 16 quality issue sub-categories). A common issue is related to the use of parentheses in Wikipedia, and hence DBpedia, to disambiguate concepts. Leal et al. (2012) illustrate this with an example: “"Queen (band)"@en is a different concept from "Queen"@en, but in a music setting the term in brackets is not only irrelevant but would disable the identification with the term "Queen" when referring to the actual band”. These parentheses thus need to be removed in some contexts in order to improve the matching with mentions in text, but at the cost of an unavoidable loss in precision. Wrong DBpedia properties can also create havoc in LOD-based semantic extraction systems. Let us consider the following example. When specifically restricting retrieved entities to places (dbo:Place property from the DBpedia ontology), one can notice that “January” still pops up as a valid entity. A quick lookup of https://en.wikipedia.org/wiki/January_(disambiguation) shows that January can be a first name, a surname, a song, or a book, as well as the first month of the year, but not a single toponym is mentioned, which makes it even more puzzling. A closer look at http://dbpedia.org/page/January shows that January is indeed tagged with type dbo:Place (as well as with equivalent properties such as http://schema.org/Place, d0:Location and dbo:Wikidata:Q532), but the origin of the error remains unclear. Further investigation reveals that January is also considered a dbo:part of Dubravica, Zagreb County, Croatia. How this came to happen is open to questioning, since https://en.wikipedia.org/wiki/Dubravica,_Zagreb_County does list ten settlements of this municipality, but nothing remotely close to “January”.
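The effect of such a wrong type assertion on entity filtering can be mimicked in a few lines; a sketch with hand-made data mirroring the error just described (the type sets are illustrative, not queried from DBpedia):

```python
# Filtering candidate entities by ontology type, as a LOD-based
# extractor might do. The type assignments below reproduce the error
# discussed in the text: "January" wrongly carries dbo:Place.

entity_types = {
    "dbr:Ypres": {"dbo:Place", "dbo:Settlement"},
    "dbr:Charles_Buls": {"dbo:Person", "dbo:Politician"},
    "dbr:January": {"dbo:TimePeriod", "dbo:Place"},  # erroneous dbo:Place
}

def filter_by_type(candidates, types, wanted="dbo:Place"):
    """Keep only the candidates carrying the wanted ontology type."""
    return [e for e in candidates if wanted in types.get(e, set())]

places = filter_by_type(list(entity_types), entity_types)
print(sorted(places))  # ['dbr:January', 'dbr:Ypres'] -- January slips through
```

The filter does exactly what it is told, so a single wrong type statement in the knowledge base propagates straight into the set of retrieved "places": noise of this kind cannot be caught downstream without additional quality checks.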
Such property errors are not infrequent in DBpedia and other open KBs, and they introduce noise into the set of retrieved entities. But this does not entail that LOD are unusable in all cases: quality, as we have seen, is always defined relative to a given task, according to the fitness for use principle. Zaveri et al. (2013) give a concrete illustration of this fact:

“In the case of DBpedia, for example, the data quality is perfectly sufficient for enriching Web search with facts or suggestions about common sense information [...] For developing a medical application, on the other hand, the quality of DBpedia is probably completely insufficient.”

Table IV.1: DBpedia quality issues, adapted from Zaveri et al. (2013)

Dimension: Accuracy
  Category: Triple incorrectly extracted
    Sub-categories: Object value is incompletely extracted; Special template not properly recognised
  Category: Datatype problems
    Sub-category: Datatype incorrectly extracted
  Category: Implicit relationship between attributes
    Sub-categories: One fact encoded in several attributes; Several facts encoded in one attribute; Attribute value computed from another attribute value

Dimension: Relevancy
  Category: Irrelevant information extracted
    Sub-categories: Extraction of attributes containing layout information; Redundant attribute values; Image related information; Other irrelevant information

Dimension: Consistency
  Category: Representation of number values
    Sub-category: Inconsistency in representation of number values

Dimension: Interlinking
  Category: External links
    Sub-category: External websites
  Category: Interlinks with other datasets
    Sub-categories: Links to Wikimedia; Links to Freebase; Links to Geospecies; Links generated via Flickr wrapper

Since our task in this thesis (i.e. the semantic enrichment of a multilingual archive) is closer to the former than to the latter, we consider DBpedia to be reasonably accurate for our purposes, while keeping in mind that quality issues can always occur. Moreover, it should be noted that a manually curated knowledge base (such as the late Freebase) would not necessarily yield better results, as shown by the experiments of Mooney and Nahm (2003):

“[...] an automatically extracted database will inevitably contain significant numbers of errors. An important question is whether the knowledge discovered from this “noisy” database is significantly less reliable than knowledge discovered from a cleaner database. This paper presents experiments showing that rules discovered from an automatically extracted database are close in accuracy to that discovered from a manually constructed database.”

Another problem, which is not explicitly taken into account by Zaveri et al. (2013), is the incompleteness of KBs in general, and of DBpedia in particular. When trying to compute semantic relatedness between musical concepts, Leal et al. (2012) are hindered by the fact that the ontology they extracted from DBpedia does not satisfactorily cover their domain of application. Augenstein et al. (2014) also observe that even the largest openly accessible KBs, such as Freebase and Wikidata, are far from complete. The Agile Knowledge Engineering and Semantic Web (AKSW) research group of the University of Leipzig pursues the work of Zaveri et al. (2013) with its DBpediaDQ project,11 which aims to monitor and improve the quality of DBpedia resources in a user-driven perspective, involving an evaluation campaign of quality issues.12 Knuth (2014) also evaluates the quality of DBpedia, with a focus on change management. In a recent talk,13 Laura Hollink emphasised two related implications for the use of Linked Data: credibility (who created it and how?)14 and frequency of update. She also underlined the fact that compared to other data sources, “the need for dataset evaluation is exacerbated when using linked data”. Maintaining a level of quality sufficient to meet user needs remains a key challenge for DBpedia and other LOD sources, while keeping a balance with coverage and automation is also necessary. As stressed by Fafalios et al. (2015), “both the number of the URIs that match an entity name and their quality (in terms of relevance) highly depend on the KBs that we exploit”.

11 http://aksw.org/Projects/DBpediaDQ.html
12 http://nl.dbpedia.org:8080/TripleCheckMate/
13 http://www.slideshare.net/LauraHollink/to-e-dhbenelux2015
14 Echoing the question on provenance evoked in Chapter III (Hartig and Zhao, 2010).

2 Multilingualism

“Without a bridging of the language gap, Archimedes is just another naked Greek man shouting in his bathroom.” John Gallagher15

The quotation from Hélène Monsacré opening this dissertation, written as a foreword to Wismann (2012), argues that thought caught between two languages is not condemned to double servitude but rather finds a space of extra freedom in the permanent confrontation of these languages. Later in his book, Wismann (2012, p. 103) writes in the same vein that “the space that can be seen emerging between languages is not primarily a line of transmission along which traditions would communicate, but sometimes a strange place, where the confrontation of two languages begets a third, irreducible one: a space of re-creation”. The purpose of this section is to investigate this “strange place” at the crossroads of languages in order to offer working solutions towards a bridging of the language gap. As we have seen in Chapter III, the Historische Kranten corpus is more or less evenly distributed between French and Dutch titles, with English accounting for a small part. A common approach to such a corpus would be to treat the three languages separately, with ad hoc linguistic resources and parameters for each part. Indeed, several semantic enrichment tools that will be presented in Chapter V adopt this point of view. We would like, however, to reflect on the relative necessity and opportunity of such a treatment. If, as we are inclined to believe following Chomsky, all languages share a common structure that transcends individual particulars,16 then to what extent can natural language processing exploit this common structure to create completely language-independent tools that are effortlessly portable from one language to another? The question we ask is not trivial and has important operational implications, in terms of costs and benefits for the development of new tools and semantic resources, but also more generally for the very perception of the discipline.
Indeed, taking the argument a step further, we could foster a debate about the scope of NLP: does it aim to process natural language in general (French langage) or individual natural languages (French langues)? The nuance is not mere wordplay, as we would like to demonstrate here.

15 The Guardian Weekly, 17–23 April 2015, p. 36.
16 The theory of Universal Grammar, referenced in our introduction, recently gained broad empirical support from large-scale evidence of dependency length minimization in 37 languages (Futrell et al., 2015).

This section is structured to enlighten the content of the first three chapters with multilingual input. Section 2.1 looks at information extraction from the angle of language independence, while Section 2.2 investigates the multilingual Semantic Web and Section 2.3 considers the specificities of a multilingual corpus, comparing the Historische Kranten with other such corpora.

2.1 Language-independent information extraction

In contrast to mainstream language specialisation, several initiatives have explicitly favoured the language-independent approach defended in this dissertation. Bender (2011) offers a thorough survey of the language-independence trend in NLP, with a special focus on the evaluation of such systems, and recommends enriching machine learning algorithms with input from linguistic typology. According to her, the advantages of language independence are manifold (Bender, 2011):

“Truly language-independent NLP technology would be very valuable from both practical and scientific perspectives. From a practical perspective, it would enable more cost-efficient creation of NLP technology across many different language markets as well as more time-efficient creation of applications [...]. In addition, language independence means that technology is more likely to be deployed for languages that have less economic clout. NLP technology for so-called low-density languages also has scientific interest [...]. Finally, in the ideal scenario, language-independent NLP systems can teach us something about the nature of human language, and what human languages share in common.”

But then, Bender asks, which languages? She lists “all currently spoken human languages” (about 7 000) and “all languages with established writing systems” (about half of that)17 as reasonable answers, while noting that in practice language independence is often understood on a subset thereof, without this being explicitly specified.

Indeed, many studies implicitly assume that their methods, tried on a given language, could apply equally well to others. While this may be true, Bender argues, “it is not the case that ‘apply’ entails ‘work’. [...] And if it doesn’t work reasonably well, then it is not truly language independent”. To make a strong claim of independence, systems then have to perform equally (or almost equally) well on every language.

17 http://www.ethnologue.com/enterprise-faq/how-many-languages-world-are-unwritten (accessed on May 18, 2015).

Even when no language-specific information (such as syntactic rules or stopwords) is intentionally included in a system, some features may be making unconscious assumptions about the structure of language because of the linguistic background of its makers. In fact, “languages show variation beyond that which one might imagine looking only at a few familiar (and possibly closely related) languages” (Bender, 2011). As a matter of fact, the range of languages covered by NLP research is much narrower than could be expected and not at all representative of world language diversity. In a survey of over 200 papers from the ACL and EACL (European chapter of the ACL), Bender found that a mere 8% of language families were effectively represented, with articles focusing on English alone accounting for over half of the scientific production.

Even more disturbingly, a large proportion of these papers neglected to declare the studied language altogether, thereby assuming universality. All of them but one were about English. “The lack of specification of the language of study seems to follow from a sense that English is a suitably representative language, combined with the idea that any methodology not explicitly coded to include language-specific knowledge must therefore be language-independent” (Bender, 2011). Leaving the language unspecified may seem an innocuous omission at first glance, but Bender argues convincingly that it gives the studies in question an (undeserved) aura of language-independence.

Linguistic variation is not infinite, but it is still much greater than can be guessed from the knowledge of a few European languages.18 Building truly language-independent tools therefore requires making informed decisions based on an awareness of the full variety of human languages.

Finally, we should distinguish between three degrees of language independence.
Multilingual tools simply work on two or more languages, possibly with distinct instructions for each of them. Cross-lingual applications go a step further by postulating some common features between the languages covered, and a crossing of their linguistic borders. In machine translation, this can include resorting to an interlanguage. Truly language-independent systems make the strongest claim by assuming a universal grammar which allows them (in theory at least) to perform equally well on any human language. In the remainder of this section, we first focus on the task of named-entity recognition, for which the idea of multilingualism has already been tested for some time due to the relatively language-neutral morphology of proper nouns, before tackling cross-lingual relation detection and event extraction.

18 Browsing the World Atlas of Language Structures (http://wals.info/) can be a humbling experience in this respect. For an application to directive speech acts, see De Wilde (2009).

2.1.1 Multilingual NER

Whereas very competitive NER results have been achieved on English data since the nineties, the topic of language-independent NER has attracted the attention of researchers only recently. Noteworthy exceptions include the Multilingual Entity Task (MET)19 (Merchant et al., 1996) and the 2002 and 2003 Computational Natural Language Learning (CoNLL)20 shared tasks, which were dedicated to that very subject21 (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003).

The adaptation of English systems to other languages, though, is not a straightforward task and can be complicated by occurrences of previously unambiguous entities suddenly displaying ambiguity in a multilingual context, as shown by the following examples:

Example 23. Avant, je travaillais à l’UCL.

Example 24. Dieser Mann hat sich viele Namen gegeben.

In Example 23, an English-only NER tool could mistakenly recognise the first word (French for “before”) as the town of Avant, Oklahoma (or the American rapper Avant, see Figure IV.3), while the acronym would presumably be expanded to University College London rather than Université catholique de Louvain out of context. Similarly, capitalised common nouns in Example 24 could wrongly be interpreted as named entities by a system unaware of this German peculiarity: Mann (German for “man”) could easily be mistaken for the writer Thomas Mann or the director Michael Mann, while Namen (German for “names”) could be identified as the Dutch name for the Belgian city of Namur.

Efforts are being made on the part of government bodies to encourage the development of multilingual IE systems in general, and NER systems in particular. Lonsdale et al. (2010) list NIST TREC22 in the U.S., CLEF23 in Europe and NTCIR24 in Japan as initiatives fostering research and evaluation of new, language-independent information extraction systems.

19 http://www.itl.nist.gov/iaui/894.02/related_projects/tipster/met.htm
20 http://ifarm.nl/signll/conll/
21 The 2002 shared task focused on Spanish and Dutch, while the 2003 shared task was concerned with English and German. See http://www.clips.uantwerpen.be/conll2002/ner/ and http://www.clips.uantwerpen.be/conll2003/ner/ for more details about the tasks involved and the results obtained. It should be noted that very little research has been conducted on French NER in a cross-lingual context.
22 http://trec.nist.gov/
23 http://www.clef-initiative.eu/
24 http://ntcir.nii.ac.jp/

Figure IV.3: Problematic NER due to multilingual ambiguity

In some cases, achieving English-grade results in other languages only requires a limited amount of adaptation, e.g. the addition of a few language-specific rules. Richman and Schone (2008), for instance, exploit the (asymmetric) multilingual structure of Wikipedia to extend NER coverage to other languages including French, Polish, Portuguese, Russian, Spanish and Ukrainian, bootstrapping25 their system with English data.

Nonetheless, as Palmer and Day (1997) pointed out a decade earlier, “it is not [always] clear what resources are required to adapt systems to new languages”. As a matter of fact, systems that were conceived with only English in mind and achieved state-of-the-art performance on English text have been shown to score rather poorly on other languages, including closely related Germanic languages, even after some amount of tedious adaptation work.26

25 Bootstrapping is a machine learning technique consisting in the iterative improvement of a system’s performance by feeding it initial data and letting it discover further relevant material.
26 The system of Chieu and Ng (2003), for instance, which participated in the CoNLL-2003 shared task, ranked amongst the best on English data by achieving over 88% F-score, while it lagged behind on German data with barely 65% F-score, although the authors experimented with a wide range of local and global features for their maximum entropy tagger.

Overcoming this bias entails planning the conception of a system in a language-independent way right from the start. In the words of Hallot (2005), “the multilingual dimension of applications implies some data structuring needs that should be directly tackled since the inception of the application”. Concretely, this requires some basic linguistic knowledge, or at the very least, as Bender (2011) humorously remarks, knowing some helpful linguists.

With the emergence of refined machine learning algorithms, statistical NER systems have increasingly been supplanting rule-based techniques, achieving remarkable results on multiple languages, although not yet rivalling human performance. In order to be successful, these systems need to be designed from the start to be maximally language-independent, as noted by Cucerzan and Yarowsky (1999). De Meulder and Daelemans (2003) experiment with English and German NER, using a memory-based learner (a type of supervised machine learning)27 and unannotated data (also see Hendrickx and Van Den Bosch (2003)). They note that while gazetteers are not directly beneficial for English data, they lead to substantial improvement on German, a language reputedly resistant to traditional NER techniques due to its lack of discrimination between common and proper nouns, all substantives being capitalised. This general idea will be exploited in Chapter V by using dictionaries of labels extracted from KBs.

When working with languages such as Russian, Chinese or Arabic that do not make use of the Latin script, the handling of a Unicode character encoding such as UTF-8 has to be supported. Failure to do so may cause unexpected results or even a complete breakdown of algorithms. Erroneous encoding can also result in inaccurate disambiguation of acronyms:

Example 25.
Генсек НАТО осудил российскую гуманитарную миссию28

While НАТО is in fact the Russian spelling of NATO, a literal Latin reading (“HATO”) could seem to refer to the Highways Agency Traffic Officers, for instance, altering the meaning of the sentence. A Unicode-aware system would be able to avoid this mistake,29 which is confusing even for humans (English speakers generally pronounce CCCP literally although it should be transliterated to SSSR). This quick round-up amply illustrates the various types of problems that can arise when dealing with named entities in a multilingual context, even though they are commonly considered fairly language-independent textual units.
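To make the homoglyph issue concrete, the following minimal Python sketch (an illustration, not part of any system described in this dissertation) shows that the visually identical Latin and Cyrillic capitals are distinct code points, so that a Unicode-aware component can tell the two scripts apart before attempting any acronym expansion:

```python
import unicodedata

def script_of(ch):
    """Derive a rough script label from the character's Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    if name.startswith("LATIN"):
        return "Latin"
    return "Other"

latin_h = "\u0048"      # H (Latin capital H)
cyrillic_en = "\u041D"  # Н (Cyrillic capital En)

# Visually identical, yet distinct code points:
assert latin_h != cyrillic_en
assert script_of(latin_h) == "Latin"
assert script_of(cyrillic_en) == "Cyrillic"

# A Unicode-aware system can thus distinguish Cyrillic "НАТО" from Latin "HATO".
scripts = {script_of(c) for c in "НАТО"}
print(scripts)  # {'Cyrillic'}
```

A production system would rely on the Unicode Script property rather than on character names, but the principle is the same: the byte-level identity of the characters, not their visual appearance, should drive disambiguation.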

27 In the words of Daelemans et al. (1999), “the unique property of memory-based approaches which sets them apart from other learning methods is the fact that they are lazy learners: they keep all training data available for extrapolation” (italics theirs).
28 NATO Secretary General condemned Russian humanitarian mission.
29 The Latin capital H corresponds to U+0048, while the Cyrillic capital “En” is rendered as U+041D.

2.1.2 Other cross-lingual applications

While most of the work in language-independent natural language processing has focused on named-entity recognition due to the relatively stable lexicalisation of named entities from one language to another, some studies have also experimented with more ambitious claims to language independence involving common semantic roles or even syntactic structures.

Mehdad et al. (2010), for instance, introduce the task of cross-lingual textual entailment (CLTE) as a semantic relation between two text portions in different languages. Given a text T written in English and a hypothesis H in French, for instance, it can be said that T entails H if a human being would be able to infer H after reading T:

Example 26. T : Art Nouveau architect Victor Horta is a native of Ghent. H: Horta est né à Gand

CLTE was subsequently adopted as a task in the 2012 SemEval campaign. The added value of integrating textual entailment components into NLP systems has been widely acknowledged, including for information extraction (Romano et al., 2006).

Cross-lingual approaches were also developed for other information extraction tasks such as relation detection (Kim et al., 2010) and event mining (Vossen et al., 2012). Kim et al. (2010) propose a method that “leverages parallel corpora to bootstrap a relation detector without significant annotation efforts for a resource-poor language”. This allows them to achieve significant results for detecting relations in the Korean language, without adding many language-specific components, since their system also works for English.

Using a methodology closer to our own, Vossen et al. (2012) propose a concept-based event mining system relying on parallel wordnets as a shared knowledge interface for multiple languages.30 With a single semantic model and using cross-links, the authors report transferring their English event-mining patterns to Dutch in less than a day’s work – replacing English prepositions by Dutch equivalents, adapting the word order, etc. – while keeping the semantic backbone. Such complex tasks, however, obviously involve more ingenuity than lower-level ones such as NER. Taking the intricacies of individual languages into account can only be done at a cost to the level of language independence, and a trade-off between the complexity of the model and the degree of language specialisation will therefore become necessary.

30 See http://compling.hss.ntu.edu.sg/omw/ for a complete list of available avatars based on the original WordNet.

2.2 The Semantic Web in other languages

Linking online datasets in multiple languages requires making their semantics explicit, which can be achieved through the use of linguistic metadata. For unstructured documents, however, this task is far from obvious. As Zhang et al. (2015) remark, “many multilingual documents on the Web have no complete metadata. Consequently, text analysis based on natural language processing becomes the alternative way to establish the semantic links based on the full texts of documents”.

Recently, Buitelaar and Cimiano (2014) edited a comprehensive volume dedicated to the multilingual Semantic Web, with contributions focusing on a large variety of topics ranging from resource and terminology management to Linked Data publishing, ontology engineering and complex applications such as the interoperability of multilingual Web services. Much remains to be done, however, in order to build a truly universal Web of Knowledge, as emphasised for instance by the fourth chapter on under-resourced South African languages. Specifically, the claim to multilingual comprehensiveness of Wikipedia and related knowledge bases deserves a closer look.

In many tech-related domains, English resources predate the appearance of multilingual ones, as we have already seen for entity extraction. Knowledge bases are no exception, and the articulation of (often more comprehensive) English-only resources with newer, more modest ones in various languages – complicated by the potential absence of a one-to-one correspondence between concepts from one language to another – raises issues in terms of coherence and uniqueness. Cabrio et al. (2014, p. 137) sum up this situation:

“In order to publish information extracted from language-specific pages of Wikipedia in a structured way, the Semantic Web community has started an effort of internationalization of DBpedia. Language-specific DBpedia chapters can contain very different information from one language to another; in particular, they provide more details on certain topics or fill information gaps. Language-specific DBpedia chapters are well connected through instance interlinking, extracted from Wikipedia. An alignment between properties is also carried out by DBpedia contributors as a mapping from the terms in Wikipedia to a common ontology, enabling the exploitation of [...] language-specific DBpedia chapters. However, the mapping process is currently incomplete, it is time-consuming as it is performed manually, and it may lead to the introduction of redundant terms in the ontology.”

The state of affairs can vary a lot from one KB to another. DBpedia models its structure on Wikipedia and thus offers a complete semantic network for each language, which somewhat contradicts its ambition to map the sum of knowledge in a single graph-based structure (see Chapter II). The concept of the planet Earth, for instance, can alternatively be referred to as http://dbpedia.org/resource/Earth or as http://fr.dbpedia.org/resource/Terre or even http://ru.dbpedia.org/resource/Земля.

Since each language chapter31 has its own properties, not all speakers are equal in their access to knowledge. Admittedly, the multiple URIs identifying a resource in various languages are well interlinked via the owl:sameAs property, but we have seen the limits of this mechanism in Section 1.3 and it remains at best a dubious way to proceed if we are sincere about language independence. Moreover, as already mentioned in Chapter II with the example of the philosopher Guido Calogero (whose page is only available in Italian, Basque and Polish), some resources are missing altogether from the main DBpedia, which is thus far less comprehensive without taking its multilingual addenda into account.

The case of Wikidata is very different, since its makers chose to represent concepts with numbers rather than lexicalised strings, in an effort towards universality. The URI for Earth, for instance, is https://www.wikidata.org/wiki/Q2, which is less explicit but can be used indifferently by English, Russian or Chinese natives. To accommodate these users with documentation in their own languages, Wikidata does not use alternative subdomains but rather adds an extra GET parameter to the URL, in this form: https://www.wikidata.org/wiki/Q2?uselang=fr.32

However, none of these various technical implementations satisfactorily represents the fact that there is no necessary one-to-one correspondence of concepts across languages.
The concept of “future”, for instance, can be expressed in French both as “futur” and “avenir”, while conversely the Dutch word “koninklijk” has three English equivalents – “kingly”, “royal”, and “regal” – with subtle meaning variations and fixed collocations.

To sum up, most knowledge bases include a multilingual dimension, but all of them simplistically assume that the structure of language can be more or less adequately represented by establishing links between equivalent concepts. While this assumption appears to hold true most of the time, the model suffers from obvious shortcomings, especially when dealing with concepts evolving over time, as will be discussed in Section 3.
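The design difference between the two approaches can be sketched with a toy data structure (the values below are written out by hand for illustration, not fetched from either service): DBpedia bakes the lexicalisation into one URI per language chapter, whereas Wikidata attaches per-language labels to a single opaque identifier and falls back to English when a label is missing, mirroring the behaviour of the uselang parameter:

```python
# Toy models of the two identification schemes (illustrative values only).
dbpedia_style = {
    "en": "http://dbpedia.org/resource/Earth",
    "fr": "http://fr.dbpedia.org/resource/Terre",
    "ru": "http://ru.dbpedia.org/resource/Земля",
}

wikidata_style = {
    "id": "https://www.wikidata.org/wiki/Q2",
    "labels": {"en": "Earth", "fr": "Terre", "ru": "Земля"},
}

def label(entity, lang, fallback="en"):
    """Resolve a language-specific label, defaulting to English
    (as Wikidata does when uselang is absent or unavailable)."""
    return entity["labels"].get(lang, entity["labels"][fallback])

print(label(wikidata_style, "fr"))  # Terre
print(label(wikidata_style, "zh"))  # Earth (fallback)
```

The opaque-identifier design keeps the conceptual graph language-neutral, at the price of human readability; the lexicalised design is self-documenting but multiplies URIs for one and the same concept.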

31 http://wiki.dbpedia.org/about/language-chapters/
32 Note that leaving this attribute out unsurprisingly defaults to English.

2.3 Multilingual corpora

While monolingual corpora – mainly English ones such as the Brown Corpus (Francis, 1964) and the Penn Treebank (Marcus et al., 1993) – have dominated the field of corpus linguistics for a few decades, this trend is changing fast with information globalisation, and corpora are becoming more and more multilingual (Zhang et al., 2015). The Web itself has been considered by some researchers to be a huge, heterogeneous, multilingual corpus (Liu and Curran, 2006).

The Multilingual Corpus I, collected by the European Corpus Initiative33 (ECI/MCI) and distributed on CD-ROM since 1994, was a first step in this new direction. It was followed by the Reuters corpus,34 initially released in 2000, the second volume of which is dedicated to multilingual news stories. Other major initiatives include the Europarl35 parallel corpus (Koehn, 2005) and more recently the Pentaglossal corpus (Forsyth and Sharoff, 2014), which will be introduced in more detail in Chapter V for generalisation purposes.

The Historische Kranten corpus used in this dissertation is multilingual in a moderate sense: it covers three languages (Dutch, French and English) from two branches (Germanic and Romance) of a single family (Indo-European). However, Indo-European languages account for 45% of speakers worldwide, and our corpus is at least representative of some of the linguistic diversity found in Western Europe.

Handling a multilingual corpus can be achieved in either a language-specific or a language-independent way. A language-specific approach would involve a preliminary language detection module prior to any linguistic processing, unless the language of each item has already been encoded. We have seen in Chapter III that although such a tag exists in the XML files of the Historische Kranten corpus, its misuse makes it utterly useless. The alternative is a language-independent approach, which will be favoured here, relying on the multilingual structure of knowledge bases described above.
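As an illustration of what such a preliminary language detection module could look like, here is a deliberately naive stopword-counting guesser for the three languages of the corpus (the stopword lists are small invented samples, not the resources an actual system would use):

```python
# Illustrative stopword samples per language (nl = Dutch, fr = French, en = English).
STOPWORDS = {
    "nl": {"de", "het", "een", "en", "van", "ik", "niet", "dat"},
    "fr": {"le", "la", "les", "un", "une", "et", "de", "que"},
    "en": {"the", "a", "an", "and", "of", "to", "that", "is"},
}

def guess_language(text):
    """Return the language whose stopwords appear most often in the text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)

print(guess_language("de burgemeester van het dorp"))     # nl
print(guess_language("le bourgmestre et les habitants"))  # fr
```

Even this crude sketch exposes the fragility of the language-specific route on noisy OCR text: ambiguous function words (de is both Dutch and French) and short snippets quickly degrade the guess, which is one more argument for the language-independent approach favoured here.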
Even subtler is the multilingualism involved in the juxtaposition of Flemish dialects, sometimes radically diverging from standard Dutch, and their evolution over time. Even for a relatively codified language such as French, significant changes can occur over the course of a century and a half, and the French spoken today is not quite the same as that spoken by the journalists of the young Belgium of 1830. The next section is dedicated to this important issue.

33 http://www.elsnet.org/eci.html
34 http://trec.nist.gov/data/reuters/reuters.html
35 http://www.statmt.org/europarl/

3 Language Evolution

“Le temps altère toutes choses : il n’y a aucune raison pour que la langue échappe à cette loi universelle.” Ferdinand de Saussure (1916)

Language is, by nature, in constant evolution. Deutscher (2005, p. 1) starts his book on the origins of language with this apparently contradictory statement: “Language is mankind’s greatest invention – except, of course, that it was never invented”. Instead, Deutscher prefers to talk about the unfolding of language, an unstoppable phenomenon that defies all regulations, following its course unshakably century after century, and surviving all changes.

In contrast, human attempts to capture the essence of language have almost always reduced it to a fixed, static object. Whether at the lexical (dictionaries), syntactic (grammars) or semantic (thesauri) level, linguists rely on a snapshot of language from which they are seldom able (or willing) to distance themselves. Research in NLP has been no exception to this synchronic use of language, with very few systems taking its temporal dimension into account.

3.1 The generative lexicon

One of the first computational linguists to challenge this state of affairs was Pustejovsky (1991) with his idea of a generative lexicon – a lexicon in constant evolution that does not need to be set in stone once and for all.36 This open model allows meaning to be generated (Pustejovsky and Boguraev, 1993, p. 194):

“The traditional organization of lexicons in natural language processing (NLP) systems assumes that word meaning can be exhaustively defined by an enumerable set of senses per word. Computational lexicons, to date, generally tend to follow this organization. [...] One disadvantage of such a design follows from the need to specify, ahead of time, the contexts in which a word might appear; failure to do so results in incomplete coverage. [...] Rather than taking a “snapshot” of language at any moment of time and freezing it into lists of word sense specifications, the model of the lexicon proposed here does not preclude extensibility: it is open-ended in nature and accounts for the novel, creative, uses of words in a variety of contexts by positing procedures for generating semantic expressions for words on the basis of particular contexts.”

36 Originally developed for the English language only, this model was partially extended to French by Pierrette Bouillon (1997), with a special focus on adjectives.

Pustejovsky and Boguraev (1993, p. 195) go on to highlight the advantages of their model for designing better NLP tools:

“Adopting such a model presents a number of benefits. [...] From the point of view of a natural language processing system, it can offer improvements in robustness of coverage. [Moreover,] some classically difficult problems in lexical ambiguity are resolved by viewing them from a different perspective.”

This new perspective amounts to a “dynamic interpretation of a word in context”, in contrast to approaches enumerating word senses and subsequently forcing an algorithm to select one of these predefined senses. Static approaches fail to account for the diversity and intrinsic creativity of language, since “external contextual factors alone are not sufficient for precise selection of a word sense; additionally, often the lexical entry does not provide enough reliable pointers to critically discriminate between word senses” (Pustejovsky and Boguraev, 1993). Another consequence of predefining word senses is their multiplication for every minor variation of meaning.37 In general, a dynamic model is better suited to represent the shifting nature of reality, where different levels of understanding constantly interact.

3.2 Stratified timescales

In order to account for the multiplicity of meanings across time, a robust model is needed to “substituer à la continuité insaisissable du temps une structure signifiante” (Prost, 1996, p. 115). Braudel (1949) introduced the framework of stratified timescales (temporalités étagées), which distinguishes between the long-term duration of geographical structures, the medium term of economic conjunctures and the short term of political events.

These different layers of time are not independent but interact with one another in more or less obvious ways: the geo-strategic position of the Mediterranean Sea influenced the economic prosperity of the region, which in turn had an effect on everyday political life. But conversely, daily events had, in the longer run, an influence on the economy and ultimately on the geography of the Mediterranean world. This reverse effect is not acknowledged in Braudel’s original work but can be made explicit thanks to Norbert Elias (1996) and his explanation of all temporal phenomena as evolving continuums (Wandlungskontinua) interacting with one another in incessant fluxes.

37 WordNet is notoriously guilty of this kind of over-segmentation, with 44 distinct senses listed for the verb “to give”, for instance.

In this section, we first see how Isabelle Boydens applied this framework to the domain of administrative databases and showed how it could be generalised to all empirical databases in order to obtain operational results measurable in terms of costs and benefits, notably by reducing the number of formal anomalies arising over time thanks to a better understanding of the interaction between the different timescales involved in the management of these databases. We will then transpose the model of stratified timescales to our own field of interest and show how it applies to language evolution as a whole.

3.2.1 Application to empirical databases

Boydens (1999) generalised and extended Braudel’s paradigm by applying it to the domain of databases containing strategic empirical information in the context of Belgian social security. She described the interaction between three levels of temporality (Boydens, 2011, p. 119):

“Three levels of transformation are interacting within the information system: the evolution of jurisprudence, the changes made within databases, and the categories observable in the field. These three levels of reality [...] operate, according to their nature, on different timescales. Thus we have the long term for legal rules, renewed from one quarter or one year to the next, the medium term for the management of databases, and the short term for the observable reality, that is, that of the citizens or companies subject to administration, which is continuously evolving.”

These three levels of reality are deeply interlinked but evolve in an asynchronous manner: a sudden change in the reality of a field will produce anomalies in the corresponding information system if the laws regulating the system had not foreseen the change. After legal interpretation, an adaptation of the legislation can trigger a restructuring of the database, with direct effects on the underlying reality.

This framework thus has very concrete operational consequences for database systems and the people affected by them. Boydens (2011, p. 120) takes the example of private nursing homes, which were originally excluded from the non-market sector due to their profitable nature. The declaration of such institutions as non-market was regarded as erroneous in databases until legal interpretation eventually included them in this sector, forcing a re-engineering of the corresponding database schema:

“This restructuring was the result of a human decision aimed at bringing the model temporarily into line with the new observations. This phenomenon of transformation corresponds to the so-called “strange loop” mechanism defined by [Hofstadter (1980)]. In the absence of such an intervention, the gap between the database and the reality widens.”

The concept of a strange loop therefore refers to the adaptation of a model due to contradictory observations, whereas the model should normally shape the perception of reality itself. Failure to account for the evolution of reality often results in a so-called ghost factory: a company or government body may invest significant resources (in time, money, and manpower) to produce anomalies and subsequently correct them.

3.2.2 Application to language evolution

We have seen that Braudel’s framework of stratified timescales is a productive environment that can be applied to heterogeneous domains in order to account for the interaction between different temporalities. Pushing the analogy further, we consider that language38 materialises in this stratified time, ceaselessly evolving on three different layers interacting with one another:

• the long term of language appearance and extinction, or rather slow transformation (e.g. from Latin to Romance languages) over centuries

• the medium term of common usage, of dictionaries and grammar rules that evolve over years or decades

• the short term of everyday speech, of creative word uses and neologisms that appear on a daily basis

We see that the long term has an influence on the medium term (Latin declensions eroded and disappeared but left traces in the lexicon of modern French) and the medium on the short (prescriptive grammar influences the way speakers express themselves in a given context), but also, perhaps less obviously, that the short term affects the medium term (descriptive grammar tries to keep up with the way people actually speak, resulting in fashionable new words entering dictionaries) and the medium the long (the standardisation of the Italian language was partly based on literary classics). See Figure IV.4 for an illustration of the interactions between these three layers of temporality.

38As earlier (Section 2), the English term is used here in the broader sense of French langage (i.e. the human system of communication), as opposed to particular languages (French langues).

Figure IV.4: Evolution of language, adapted from Boydens (2011)

In the context of the semantic enrichment of our historical newspapers, all three levels are important. The proliferation of Dutch varieties versus the economic dominance of Standaardnederlands is an example of how long-term evolution affects language: are Flemish dialects from the early 19th century still sufficiently related to present-day Dutch to be considered the same language? Moreover, new realities appear and are expressed either with new words or with existing ones, which thereby acquire new meanings. Archaic sentence constructions and words from centuries past constitute the focus of medium-term language evolution, sometimes requiring ancient dictionaries to retrieve word meanings. Finally, short-term lexical usage is the hardest level for information extraction systems, not only for distant historical sources but for virtually every kind of content: the task of linking mentions in tweets to a stable external reference, for instance, can prove very challenging (Derczynski et al., 2015), especially in a multilingual context.

3.3 Concept drift

Shifts in the meaning of words have been widely studied. In the domain of machine learning, for instance, the fact that “real world concepts are often not stable but change with time” (Tsymbal, 2004) has long been identified as a potential issue affecting systems. Widmer and Kubat (1996) have called this problem concept drift and linked it to the related notion of hidden contexts, stressing the fact that the cause of the change is often not known a priori.

Similarly, but in relation to history, Prost (1996, p. 63) notes that “les concepts ont beaucoup changé de sens, et ceux qui nous paraissent transparents sont les plus dangereux. [ ...] Aussi pourrait-on ériger l’histoire des concepts en préalable de toute autre histoire” (concepts have changed meaning considerably, and those that seem transparent to us are the most dangerous ones; the history of concepts could thus be made a prerequisite for any other history). Tracking these shifts is therefore a laudable goal in order to understand how languages evolve over time, and a critical component that could be integrated into NLP systems to improve specific modules such as coreference resolution and entity disambiguation.

To convince ourselves of the pervasiveness of this phenomenon, let us focus on the term bourgeois. Initially (and etymologically), it was a neutral word referring to a town-dweller (from French bourg, Late Latin burgus, Old High German burg), as in Example 27. Later, it became more closely associated with the bourgeoisie and upper-middle-class people, as seen in Example 28. Today, the term is rarely used in either sense and appears mainly as a family name (Example 29) or in some codified fixed forms (Example 30).

Example 27. il fit distribuer 17 000 pains aux malheureux bourgeois d’Ypres (he had 17 000 loaves of bread distributed to the unfortunate burghers of Ypres)

Example 28. un bourgeois dont le gros sac fait concurrence au gros ventre (a bourgeois whose fat moneybag rivals his fat belly)

Example 29. André Bourgeois in Raad van Advies voor Vreemdelingen

Example 30. Château La Montagne Cru Bourgeois Medoc

Tsymbal (2004) distinguishes between sudden and gradual concept drift, and also refers to virtual concept drift when it is not an actual concept that is changing but rather the underlying data distribution. On a practical level, however, the author notes that “it is not important, what kind of concept drift occurs, real or virtual, or both. In all cases the current model needs to be changed”. See also Masud et al. (2010) for a more technical discussion of concept evolution in data streams.

According to Derczynski et al. (2015), “systems trained on one dataset may perform well on that dataset and other datasets gathered at the time, but not actually generalise well in the long run”. This observation has important consequences for corpora spread over a broad time span such as the Historische Kranten. Working with KBs implies that we have access to the sum of knowledge at a given moment. While the Web of Documents is very dynamic and in constant evolution, Linked Data tend to be more static, sometimes lagging behind recent trends. In other words, the Web (especially the social Web) evolves in the short term, whereas the Semantic Web only gets updated in the medium term, accumulating encyclopaedic knowledge for the long term.

This is especially true when dealing with large KB dumps that take a long time to back up and download: DBpedia dumps, for instance, are generated only at the rate of one per year on average.39 Even if dumps were more regular, it would be very difficult to keep up with the current state of knowledge, since Wikipedia articles get updated every second. Data stored in KBs can quickly become outdated, and Wikipedia articles then need to be re-extracted before becoming available to users. To remedy this situation, DBpedia Live40 was set up to allow continuous synchronisation with Wikipedia.

Detecting concept drift automatically remains a challenge to be tackled by semantic annotation tools if their output is to remain relevant over time. As Tsymbal (2004) notes for data mining, “an important problem with most of the real-world datasets in existing experimental investigations is that there is little concept drift in them, or concept drift is introduced artificially”. Without offering a full-scale solution to this problem, we can at least bear in mind that concepts extracted from KBs are never fixed once and for all but always subject to future revision and complementation. We will now apply the notion of concept drift to place names and concepts to understand their evolution.

3.3.1 Application to place names

As seen at the end of Chapter III, locations are of special interest to users of the Historische Kranten corpus. The unique context of the city of Ypres and its surroundings41 makes linguistic variation of place names in the region particularly productive, especially from a diachronic perspective. The toponym Passendale, for instance, appears over 10 000 times in the corpus in various written forms. The old Dutch form Passchendaele (and its variant Passchendale) was retained in the English war literature, despite the fact that the new spelling became effective in Belgium and the Netherlands around 1946–1947. Table IV.2 illustrates this shift in Het Ypersch nieuws:

Spelling        1920s  1930s  1940s  1950s  1960s  1970s
Passchendaele      30    406     82      -      2      -
Passchendale       28  1 032    891     13      -      -
Passendale          -      -      1    691    879    137

Table IV.2: Spelling shift from Passchendaele to Passendale

39http://wiki.dbpedia.org/datasets/
40http://live.dbpedia.org/
41In the Westhoek, close to France, the ground of several battles involving English troops, etc.

Admittedly, spelling shift is not concept drift, but the two are interlinked since the old spelling Passchendaele has now acquired a new, war-related meaning and is not used in other contexts any longer. The mechanism of Wikipedia (and DBpedia) redirections makes it possible to capture these variations and to correctly disambiguate both forms with a single URI. In fact, dbr:Passchendale is automatically redirected to dbr:Passendale thanks to the dbo:wikiPageRedirects OWL property42 (dbr:Passchendaele being a separate resource). Whereas a plain NER tool could possibly recognise Passendale and Passchendaele as locations, it would fail to identify both as the same place without the help of an additional coreference module. This demonstrates the added value of an LOD-based approach to full disambiguation and temporal tracking of concepts, which will be further developed in our research perspectives.
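The redirect mechanism can be sketched as follows. This illustrative snippet assumes a small in-memory extract of dbo:wikiPageRedirects pairs (real systems would load these from a DBpedia dump or query the SPARQL endpoint); the chain is bounded to guard against cyclic redirects:

```python
# Illustrative in-memory extract of dbo:wikiPageRedirects pairs;
# in practice these triples come from a DBpedia dump.
redirects = {
    "dbr:Passchendale": "dbr:Passendale",
}

def canonical(uri, redirects, max_hops=5):
    """Follow redirect chains to a canonical resource (bounded for safety)."""
    for _ in range(max_hops):
        if uri not in redirects:
            return uri
        uri = redirects[uri]
    return uri

print(canonical("dbr:Passchendale", redirects))  # dbr:Passendale
```

Both spelling variants thus resolve to the same URI, which is exactly what a plain NER tool without a coreference module cannot achieve.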

3.3.2 Emergence and salience of concepts

NLP techniques can be leveraged to monitor the appearance of new concepts or their relative importance over time. Bird et al. (2009) show how to use the Natural Language Toolkit (see Chapter V) to track the use of terms. An example of this on US inaugural addresses from 1789 to 2009 is shown in Figure IV.5.

Figure IV.5: Relative salience of terms in US inaugural addresses

42In the way non-descriptors refer to corresponding descriptors (standard forms) in thesauri.

Similarly, we can automatically detect the “hot topics” of any given year in the Historische Kranten corpus by computing the relative frequency of terms compared to their general distribution. Table IV.3 shows an example of this for salient terms used in the Ypres Times from 1922 to 1935, along with their scores based on the term frequency–inverse document frequency statistic.43

Year   Term          Score
       Hartley       0.8
1922   Teddington    1.0
       Cavendish     1.0
       Sage          1.0
1923   Communal      0.769230769231
       Bellew        1.0
1924   Sclater       0.733333333333
       Ramsdale      0.730769230769
       Bowley        0.727272727273
1925   Managers      0.916666666667
       Ginchy        0.785714285714
1926   Rummy         1.0
       Portsmouth    0.882352941176
1928   Fokker        0.727272727273
1929   monitors      0.866666666667
       Blunden       0.818181818182
1930   Morgan        0.9
       Dorrien       0.8
       Danny         1.0
1931   Seamen        0.727272727273
1932   Thiépval      1.0
       Feathers      0.75
1933   Morval        0.862068965517
1934   Gordons       0.75
       starvation    0.714285714286
1935   Jubilee       0.9

Table IV.3: Hot topics from the Ypres Times
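The general principle behind such hot-topic lists can be illustrated with a minimal tf–idf computation. The sketch below assumes a toy “one document per year” corpus (an assumption for illustration; the thesis's actual scoring may differ in its normalisation):

```python
import math
from collections import Counter

# Toy corpus: one document per year (illustrative, not the actual data).
corpus = {
    1922: "hartley cup hartley match ypres",
    1923: "communal bellew ypres match",
    1924: "sclater ypres ypres match",
}

def tf_idf(term, year):
    """Term frequency in a year's document, weighted by inverse document frequency."""
    docs = {y: Counter(text.split()) for y, text in corpus.items()}
    tf = docs[year][term] / sum(docs[year].values())
    df = sum(1 for counts in docs.values() if term in counts)
    idf = math.log(len(docs) / df)
    return tf * idf

# "ypres" appears in every year, so its idf (hence its score) is zero,
# while a year-specific term like "hartley" stands out:
print(tf_idf("ypres", 1922))   # 0.0
print(tf_idf("hartley", 1922) > tf_idf("match", 1922))  # True
```

Terms that are frequent in one year but rare across the corpus thus surface as that year's “hot topics”, exactly the effect visible in Table IV.3.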

Although some trends emerge from the data, the tracking of the evolution of concepts over time would benefit from a more comprehensive approach, which is beyond the scope of the present dissertation. This will encourage us to formulate a number of lines of enquiry in our conclusions, when anticipating future research perspectives and potential applications.

43The tf–idf score reflects the salience of a word in a corpus (Rajaraman and Ullman, 2011).

Summary

In this fourth chapter, we have dealt with transversal subjects that are inherent to many collections of data from the digital humanities and elsewhere but are seldom addressed as real issues, or only in a peripheral way. The three central aspects we proposed to cover were the quality of content, the handling of multilingualism, and the evolution of language and concepts over time.

Accounting for the quality of data is particularly important when dealing with uncurated collections and information sources, but we have seen that quality is never an absolute factor that can be measured objectively for all purposes. On the contrary, it always depends on the needs of users and is therefore relative to usage. In particular, processing digitised collections raises the question of OCR quality and of its impact on further processing such as semantic enrichment, while the data sources used for this enrichment are themselves subject to variable quality.

Another central issue is the question of multilingualism and language-independence. Many NLP tools have been developed for English, and their adaptation to other languages can prove harder than expected. In the context of the Web, designing systems that are portable to any language without too much customisation is a key challenge in order to access the sum of knowledge available online in an automated way. Currently, however, this goal remains largely unattained, thereby threatening to confine the Web to the position of a digital Babel of multiple individual linguistic crystallisations with loose interdependence. On a smaller scale, the Historische Kranten project contains newspapers in three languages and therefore needs approaches to information extraction suited to account for this reality. Cross-lingual IE tools are clearly in demand for all kinds of semantic enrichment projects, as shown by the success enjoyed by multilingual digital corpora worldwide.
Finally, the evolution of language over time needs to be taken into account. As emphasised by Isabelle Boydens, information is not given once and for all but progressively constructed, and concepts must be seen as dynamic objects rather than static ones. Braudel’s framework of stratified timescales, coupled with Elias’s model of evolving continuums, allows us to represent the interactions between different layers of time in order to better apprehend the underlying reality. Applying this framework to language, we showed how concept drift affects IE systems in often unacknowledged ways.

Our hypothesis regarding the practical consequences of quality, language, and time on information extraction and semantic enrichment will now be tried and tested against the empirical data of the Historische Kranten.

Chapter V

Knowledge Discovery

Outline

This final chapter builds upon the material discussed in the previous four in order to achieve knowledge discovery, i.e. to provide users not only with a structured answer to their requests but also with additional facts relevant to their queries. Knowledge discovery can indeed be considered the “nontrivial extraction of implicit, previously unknown, and potentially useful information from data” (Frawley et al., 1992).

Section 1 introduces MERCKX, a knowledge extractor developed for this thesis. Related tools are reviewed in a comparative perspective, emphasising their limitations and focusing on potential improvements that can be achieved with Linked Data to handle multiple languages efficiently. The components of our system are then presented, before our workflow is described in detail, showing the originality of our approach.

The performance of entity linking is evaluated in Section 2 in order to compare the results we obtained with state-of-the-art systems. We first look at some tools from Section 1 in the light of the SQuaRE ISO standard, before defining the methodology we followed for the evaluation – including the use of a gold-standard corpus based on the Historische Kranten corpus working as an artificial real-world referent – and discussing our results.

Finally, a validation of our findings is proposed in Section 3 through a proof-of-concept implementation for the Ypres City Archive. We focus on ways to go beyond the limitations of traditional search engines and provide insight for creative applications based on semantic enrichment and Linked Open Data, before generalising our approach to other languages, application domains, and types of entities.


Contents

1 MERCKX: A Knowledge Extractor...... 159
1.1 Similar tools...... 160
1.2 Components...... 166
1.3 Workflow...... 171
2 Evaluation...... 175
2.1 Preliminary assessment...... 176
2.2 Methodology...... 179
2.3 Results and discussion...... 185
3 Validation...... 190
3.1 Beyond search engines...... 190
3.2 Applications...... 192
3.3 Generalisation...... 197

1 MERCKX: A Knowledge Extractor

Whereas NLP technologies of the first generation were in general commercial solutions, developed inside companies and sold on physical media such as CDs, the increased popularity of the Web has allowed a proliferation of online tools available under various licences. But what constitutes a useful tool? Kent (1978, p. 194) offers the following insight:

“Useful tools have well-defined parts, and predictable behavior. They lend themselves to solving problems we consider important, by any means we can contrive. We often solve a problem using a tool that wasn’t designed for it. Tools are available to be used, don’t cost too much, don’t work too slowly, don’t break too often, don’t need too much maintenance, don’t need too much training in their use, don’t become obsolete too fast or too often, are profitable to the toolmaker, and preferably come with some guarantee, from a reliable toolmaker. Tools don’t share many of the characteristics of theories. Completeness and generality only matter to the extent that a few tools can economically solve many of the problems we care about.”

We gather from this definition that a tool is essentially something that works, i.e. that fits a certain use. This is clearly in line with the framework of data quality defined in Chapter IV. Quality being a relative concept, it entails that the perfect tool does not exist: a good-enough tool for a given purpose always constitutes a trade-off between these various criteria, and many more besides.

In this section, we present MERCKX (Multilingual Entity/Resource Combiner & Knowledge eXtractor), a tool that we designed in the context of our thesis in order to extract entity mentions from documents and to link them to DBpedia (Bizer et al., 2009b). For example, a textual mention of “Rabelais” could be disambiguated with dbr:François_Rabelais, which includes the alternative label “Alcofribas Nasier” (one of his anagrammatic pseudonyms) but excludes information about American composer Akira Rabelais (who has his own unique URI: dbr:Akira_Rabelais) or the main-belt asteroid named after the French humanist (dbr:5666_Rabelais).

In order to underline the originality of our approach, we will first review a few related tools in Section 1.1, before detailing the components of MERCKX in Section 1.2 and explaining its inner workings in Section 1.3.

1.1 Similar tools

This section opens with a panel of existing tools, detailing their functionalities but also their limits, especially in terms of multilingual coverage. It brings together two types of tools capable of performing entity linking: named-entity recognisers and semantic annotators, which we chose to group under a single heading since they nowadays achieve the same results. In our rapidly evolving information economy, a presentation of existing solutions, both open-source and commercial, is necessarily biased and incomplete. Acknowledging that several tools have certainly escaped our attention, we nevertheless attempt to paint a fair picture of the services available in the broad field of semantic enrichment technologies.

In Chapter I, we tried to define what constitutes a valid named entity. The distinction between a term (common noun) and an entity (proper noun) is sometimes blurred by tools that do not establish a formal distinction between the two. Practically speaking, a named entity could even be defined as something retrieved by a NER tool, which is admittedly circular but would be sufficient for some purposes. The relevance of the output for users is indeed more of a concern than the linguistic category used to label it. Semantic annotation, on the other hand, is the process of attaching supplementary information from the Semantic Web to the content of documents (see Chapter II, Section 3.2). In contrast to some NER tools, semantic annotators explicitly integrate a linking component and do not limit themselves to entities: any type of content is liable to be enriched with external facts from the Web of Data.

The NERD platform1 (Rizzo and Troncy, 2011) makes it possible to experiment with these different services (and some others not covered here) within a single graphical user interface (see Figure V.1). Another effort to integrate various services is the NER extension2 of OpenRefine3 (Verborgh and De Wilde, 2013; van Hooland et al., 2015).
Originally relying on three third-party NER APIs (AlchemyAPI, DBpedia Spotlight, and Zemanta), this extension has since been broadened to other services, such as dataTXT.4

In the remainder of this section, we propose a quick overview of some of the most popular entity linking tools. This survey starts with Spotlight, which is fully open source, in order to study the general workings of such tools, and continues with services that operate as black boxes but achieve better results.

1http://nerd.eurecom.fr/
2https://github.com/RubenVerborgh/Refine-NER-Extension/
3http://openrefine.org/
4https://dandelion.eu/datatxt/

Figure V.1: NERD platform

1.1.1 DBpedia Spotlight

DBpedia Spotlight5 finds entities in text and links them to DBpedia URIs. Interestingly, the authors pay special attention to quality issues and fitness for use: “DBpedia Spotlight allows users to configure the annotations to their specific needs through the DBpedia Ontology and quality measures such as prominence, topical pertinence, contextual ambiguity and disambiguation confidence”. The four stages of its workflow are spotting, candidate selection, disambiguation, and configuration (Mendes et al., 2011, italics theirs):

“The spotting stage recognizes in a sentence the phrases that may indicate a mention of a DBpedia resource. Candidate selection is subsequently employed to map the spotted phrase to resources that are candidate disambiguations for that phrase. The disambiguation stage, in turn, uses the context around the spotted phrase to decide for the best choice amongst the candidates. The annotation can be customized by users to their specific needs through configuration parameters [ ...].”

5http://spotlight.dbpedia.org/

However, the original Spotlight was designed for English only, as is the case for many tools. To counterbalance this limitation, Daiber et al. (2013) developed a new multilingual version of DBpedia Spotlight, which they claim is faster, more accurate, and easier to configure. This statistical version has been adopted for the online demo.6 In addition to English, their language-independent model was tested on seven other languages: Danish, French, German, Hungarian, Italian, Russian, and Spanish. The authors reported accuracy scores for the disambiguation task ranging from 68% to 83%.

For the spotting phase, the authors experiment with two methods: a language-independent (data-driven) one based on gazetteers and a language-dependent (rule-based) one relying on heavier linguistic processing using Apache OpenNLP models.7 Surprisingly, the language-dependent implementation does not improve the results significantly: it outperforms the language-independent implementation by less than a percentage point.

The subsequent steps are also fully language-independent: candidate selection is done by computing a score for each spot candidate as a linear combination of features with an automated estimation of the optimal cut-off threshold; disambiguation is performed using the probabilistic model proposed by Han and Sun (2011); finally, configuration allows users to refine the results obtained by setting their own confidence and relevance thresholds, these scores being computed independently of the language. As a fully transparent tool, the multilingual version of DBpedia Spotlight will be used as a baseline in our evaluation process in Section 2.
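The four-stage workflow of spotting, candidate selection, disambiguation, and configuration can be illustrated with a deliberately simplified toy pipeline. Everything below (the surface forms, context words, and scoring) is invented for illustration and is not Spotlight's actual implementation:

```python
# Toy illustration of a Spotlight-style workflow; all data is invented.
surface_forms = {"paris": ["dbr:Paris", "dbr:Paris_Hilton"]}
context_words = {
    "dbr:Paris": {"france", "seine", "capital", "city"},
    "dbr:Paris_Hilton": {"hotel", "heiress", "celebrity", "reality"},
}

def annotate(text, confidence=0.0):
    tokens = text.lower().split()
    annotations = []
    for i, token in enumerate(tokens):
        if token in surface_forms:                    # 1. spotting
            candidates = surface_forms[token]         # 2. candidate selection
            context = set(tokens[:i] + tokens[i + 1:])
            scored = [(len(context & context_words[c]) / len(context_words[c]), c)
                      for c in candidates]            # 3. disambiguation
            score, best = max(scored)
            if score >= confidence:                   # 4. configuration
                annotations.append((token, best, score))
    return annotations

print(annotate("the capital city of france is paris"))
```

Raising the `confidence` parameter mimics the user-configurable threshold: with `confidence=0.9`, the mention above would be left unannotated rather than linked with a weak score.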

1.1.2 OpenCalais

Powered by Thomson Reuters, OpenCalais provides several services such as NER, topic modelling, relation detection, and event extraction. The Calais Viewer8 makes it possible to experiment with its technology (see Figure V.2), while prolonged use requires an API key (free for the basic service, with paid upgrades available if needed). Although NER has limited support for French and Spanish, other services are restricted to the English language, with no plans for additional language support.9 Contrary to most NER services, OpenCalais exclusively uses its own knowledge base – called Open PermID10 – which makes interoperability of the extracted entities difficult.

6http://dbpedia-spotlight.github.io/demo/
7https://opennlp.apache.org/ ; see Chapter I (Section 2.2.2) for the distinction between rule-based and data-driven approaches.
8http://viewer.opencalais.com/
9http://www.opencalais.com/forums/calais-initiative/language-support
10Beta version available at https://permid.org/ (as of October 20, 2015).

Figure V.2: Calais Viewer

1.1.3 AlchemyAPI

AlchemyAPI, a company acquired by IBM in March 2015, provides a NER service through an application programming interface (API), as its name suggests. IBM integrated Alchemy into its Watson computer,11 which won the Jeopardy! TV game show in 2011.12 Alchemy is a commercial application charging customers per API call, but free API keys can be obtained for testing purposes with a daily limit of 1 000 transactions.

As of April 2015, AlchemyAPI supports NER in eight languages: English, French, German, Italian, Portuguese, Russian, Spanish, and Swedish. However, the quality of the extraction is unequal from one language to another, with full disambiguation working much better for English (Hengchen et al., 2015). It also includes several other components – such as relation detection – which can be tested in an online demo,13 but again these do not live up to expectations for languages other than English.

11http://www.alchemyapi.com/company/press/ibm-acquires-alchemyapi-enhancing-watson%E2%80%99s-deep-learning-capabilities
12http://techcrunch.com/2013/02/07/alchemy-api-raises-2-million-for-neural-net-analysis-tech-on-par-with-ibm-watson-google/
13http://www.alchemyapi.com/products/demo/alchemylanguage

1.1.4 Stanford NER

Stanford University offers a statistical NER service14 based on conditional random fields (CRFs)15 (Finkel et al., 2005). Stanford NER can either be downloaded under a GPL licence, or used online as a demo version.16 It is a Java implementation of a semi-supervised approach to NER, provided with models trained on English, German, and Chinese. Users can theoretically also train their own models for use on other languages or specific application domains, although this requires large-scale human-annotated data, which is a de facto limitation for usage with languages not listed above. Stanford NER only recognises and classifies named entities in the traditional sense defined in Chapter I, providing seven entity types: LOCATION, ORGANIZATION, DATE, MONEY, PERSON, PERCENT, and TIME. No disambiguation is proposed.

1.1.5 AIDA

AIDA17 is a framework developed by the Max Planck Institute for Informatics. It is based on YAGO and intended for entity detection and disambiguation in text (Hoffart et al., 2011). The idea behind AIDA is to use a graph-based approach (see Chapter II) to perform a collective mapping of textual mentions to entities. For instance, the mention of “Kashmir” in a text could refer to at least two distinct entities depending on the context: the Asian region or a song by Led Zeppelin. Both mentions and entities can be used as vertices (nodes) in the graph, while edges (links) between them are of two types:

• mention–entity edges capture the similarity between the context of a mention and a candidate;

• entity–entity edges capture the semantic relatedness between entities.

In other words, if a textual mention has two potential entity candidates for disambiguation, the one with the most similar context (which will be closer in the graph) will be chosen. Similarly, close entities are bound to be semantically related. Although Hoffart et al. (2011) and Yosef et al. (2011) mention that AIDA handles “natural language” without any specifics (see Bender (2011) for this common bias), the GitHub page of the project18 makes it explicit that only English text is supported.
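The collective flavour of this graph-based approach can be sketched with a toy example: each joint assignment of candidates is scored by summing mention–entity similarities and entity–entity relatedness, and the best assignment wins. All weights below are invented for illustration; AIDA's actual model is far richer:

```python
from itertools import product

# Toy sketch of collective disambiguation; weights are invented.
candidates = {
    "Kashmir": ["Kashmir_(region)", "Kashmir_(song)"],
    "Page":    ["Jimmy_Page", "Larry_Page"],
}
mention_entity = {  # mention-entity edges: local context similarity
    ("Kashmir", "Kashmir_(region)"): 0.4, ("Kashmir", "Kashmir_(song)"): 0.3,
    ("Page", "Jimmy_Page"): 0.3, ("Page", "Larry_Page"): 0.3,
}
entity_entity = {   # entity-entity edges: semantic relatedness (symmetric)
    frozenset({"Kashmir_(song)", "Jimmy_Page"}): 0.9,  # same band
}

def disambiguate():
    """Pick the joint assignment maximising total edge weight."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    for assignment in product(*(candidates[m] for m in mentions)):
        score = sum(mention_entity[(m, e)] for m, e in zip(mentions, assignment))
        for i in range(len(assignment)):
            for j in range(i + 1, len(assignment)):
                score += entity_entity.get(
                    frozenset({assignment[i], assignment[j]}), 0.0)
        if score > best_score:
            best, best_score = dict(zip(mentions, assignment)), score
    return best

print(disambiguate())  # {'Kashmir': 'Kashmir_(song)', 'Page': 'Jimmy_Page'}
```

Note that “Kashmir” alone would slightly favour the region (0.4 against 0.3); it is the relatedness edge to “Jimmy_Page” that tips the collective decision towards the song, which is precisely the point of a graph-based approach.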

14http://nlp.stanford.edu/software/CRF-NER.shtml
15CRFs typically take more context into account than ordinary classifiers.
16http://nlp.stanford.edu:8080/ner/
17http://www.mpi-inf.mpg.de/yago-naga/aida/
18https://github.com/yago-naga/aida

1.1.6 Zemanta

Developed as a Web content enrichment platform, Zemanta19 offers a NER API among other services for bloggers. It gained worldwide attention in 2010 when an evaluation campaign showed that it outperformed other state-of-the-art systems for entity disambiguation,20 an assessment confirmed by later studies (van Hooland et al., 2015; Hengchen et al., 2015). Zemanta was subsequently integrated into the NERD framework (Rizzo and Troncy, 2012) and into the OpenRefine NER extension.

Zemanta requires an API key in order to use its services,21 but the webpage to apply for one seems to have been down for a long time, preventing new users from registering (although older keys still work). Despite the fact that it officially only supports English text, Zemanta has proved in our own experience to work reasonably well on French and Dutch. The good results achieved in former experiments made us select it as a candidate for evaluation in Section 2.

1.1.7 Babelfy

Moro et al. (2014) introduce Babelfy,22 a system bridging entity linking and word sense disambiguation, based on the BabelNet23 multilingual encyclopaedic dictionary and semantic network, which is constructed as a mash-up of Wikipedia and WordNet. Aiming to bring together “the best of two worlds”, Babelfy also uses a graph-based approach but relies on semantic signatures to select and disambiguate candidates.

The use of these dense subgraphs is very effective for collectively disambiguating entities that would have proven almost impossible to identify separately: Figure V.3 shows two football players successfully identified on the basis of their first names only, with a limited amount of additional context. Relying on a large-scale multilingual network, Babelfy officially supports 267 languages, in addition to a language-agnostic option. This unusually broad linguistic coverage, coupled with its original approach to entity disambiguation, justifies the fact that Babelfy is also among the tools that will be evaluated in Section 2.

19http://www.zemanta.com/
20See this blog post for a detailed report on the Entity Extraction & Content API Evaluation: http://blog.viewchange.org/2010/05/entity-extraction-content-api-evaluation/.
21http://papi.zemanta.com/services/rest/0.0/
22http://babelfy.org/
23http://babelnet.org/

Figure V.3: Babelfy, adapted from slides based on Moro et al. (2014)

1.2 Components

The previous section presented a number of NER and semantic annotation applications able to link and disambiguate entities with URIs. These tools, however, suffer from a number of shortcomings, as will be demonstrated in Section 2: they work unequally well on different languages, suffer from cross-lingual ambiguity, do not handle OCR output very well, and are only portable to a limited extent. In order to overcome these various limitations, we chose to build our own knowledge extractor, based on personal insights and on the integration of a few external components. Our main experiment focuses on locations, but the system could be adapted to other languages, domains, and types of content, as will be shown in Section 3. For full transparency, the scripts of the different parts of MERCKX are openly maintained on GitHub.24 The source code of the main program is also provided in Appendix A.

In what follows, we review the key components of our system: the Python language and its natural language toolkit, the architecture of the named-entity extractor X-Link, and the use of DBpedia dumps. The workflow used by this tool to process documents from our corpus and link them to the information contained in knowledge bases will then be put forward in Section 1.3.

24https://github.com/ulbstic/ypres

1.2.1 Python and NLTK

We chose to implement our tool in Python25 since its simple, object-oriented syntax is well suited to handling natural language, and because it is widely used across the NLP community (Bird et al., 2009). Its comprehensive natural language toolkit (NLTK)26 (Garrette and Klein, 2009) makes the choice even more obvious. NLTK provides several ready-to-use NLP components such as tokenisers, part-of-speech taggers, chunkers, and syntactic parsers. A basic pipeline using NLTK is displayed in Figure V.4. In this example, a raw HTML file is parsed into a string of text, then further processed to obtain individual word forms (lemmas) and build a vocabulary.

Figure V.4: NLTK pipeline, reproduced from Bird et al. (2009, p. 86)
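The pipeline of Figure V.4 can be approximated with the standard library alone, which keeps the sketch self-contained (NLTK's own tokeniser and lemmatiser require downloaded models, so simple stand-ins – a regex tokeniser and lowercasing – are substituted here):

```python
import re
from html.parser import HTMLParser

# Stdlib approximation of the Figure V.4 pipeline:
# raw HTML -> text -> tokens -> normalised vocabulary.
class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

html = "<html><body><p>The cats sat on the mat. The MAT!</p></body></html>"
parser = TextExtractor()
parser.feed(html)
raw = " ".join(parser.chunks)            # 1. parse HTML into a string of text
tokens = re.findall(r"[A-Za-z]+", raw)   # 2. tokenise
words = [t.lower() for t in tokens]      # 3. normalise word forms
vocab = sorted(set(words))               # 4. build the vocabulary
print(vocab)  # ['cats', 'mat', 'on', 'sat', 'the']
```

With NLTK installed and its models downloaded, steps 2–3 would typically use `nltk.word_tokenize` and a proper lemmatiser instead of the crude regex and lowercasing used above.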

The toolkit also comes with many linguistic resources and corpora, including lists of basic stopwords in 14 languages.27 Handling stopwords in multilingual content is particularly tricky: in a retrieval application ignoring diacritics, for instance, excluding the common English article “the” could make it difficult to retrieve results about “thé” (tea) in French. Moreover, lists of stopwords are not deterministic and can vary with the domain or the interests of the users. A good example in the biomedical context is the word “cell”, which is by no means empty (it has a very concrete meaning) and has an entry in the Unified Medical Language System (UMLS),28 but is far too general in itself to provide relevant information for users. For a detailed discussion of stopwords (mots vides in French), see Deviaene (2008).
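The “the”/“thé” pitfall can be made concrete with a few lines of standard-library code. The stopword list below is a tiny illustrative sample, not NLTK's actual list:

```python
import unicodedata

def fold(term):
    """Strip diacritics: 'thé' -> 'the'."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

stopwords = {"the", "a", "of"}  # tiny English sample for illustration

def keep(term):
    """A filter that folds diacritics before checking the stopword list."""
    return fold(term.lower()) not in stopwords

print(keep("thé"))    # False: the French noun collides with English "the"
print(keep("tasse"))  # True
```

The French word for tea is thus silently discarded, exactly the failure mode described above: diacritic folding and stopword filtering interact badly in multilingual content.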

25. https://www.python.org/
26. http://www.nltk.org/
27. Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, and Turkish.
28. http://www.nlm.nih.gov/research/umls/

1.2.2 X-Link

X-Link29 is a fully configurable, Linked Data-based Named Entity Extraction (NEE)30 tool (Fafalios et al., 2014). To the best of our knowledge, this is the system which comes closest to the work undertaken in this dissertation as a whole, sharing many of its intuitions and insights. The fact that it was developed quite recently and is used in production for several European projects encourages us to think that there is a real need for such applications today. As shown in Figure V.5, the architecture of X-Link is based in part on the GATE ANNIE system,31 but it can use any other NER system that takes text as input and returns a list of named entities. The system relies on dynamic gazetteers of DBpedia multilingual labels that are extracted in advance and kept in memory for the subsequent semantic enrichment phase. Central to the approach is configurability: the user can modify any component of the system directly in the Web interface, which makes it extremely portable.

Figure V.5: X-Link, reproduced from Fafalios et al. (2014)

Fafalios et al. (2015) observe that "since a lot of information about named entities is already available as Linked Open Data (LOD), the exploitation of LOD by a NEE system could bring wide coverage and fresh information". They also note that while tools such as DBpedia Spotlight play this role up to a point, they fail to fully exploit the dynamic and distributed nature of LOD.

29. http://ics.forth.gr/isl/X-Link/
30. The creators of X-Link consider NEE as the combination of NER and entity linking.
31. http://services.gate.ac.uk/annie/

To illustrate their point, Fafalios et al. (2014) provide a real scenario (based on the iMarine32 project) which looks surprisingly similar to our own use case, except that fish species should be replaced by locations:

"For giving users an overview of the search results and allowing them to explore them in a faceted way, you want to use a NEE tool for identifying (at real time) Fish Species in the snippets or the full contents of the top results. You think that it would also be useful to link (on demand) the identified species with related semantic resources, as well as retrieve more information (e.g. a short description of the species, an image, its taxonomy, etc.) by querying (at real-time) online Semantic Knowledge Bases."

For all these reasons, we chose to implement parts of this architecture into our own system, since relying on the online version could prove problematic in the case of a server downtime. The interesting components of X-Link were therefore transposed to our needs and seamlessly integrated in our workflow.

1.2.3 DBpedia dump

The third component of MERCKX is its reliance on Linked Open Data. As mentioned in Chapter II, DBpedia can be queried through its SPARQL endpoint.33 For instance, listing all species of fish present in the KB can be achieved with:

SELECT DISTINCT ?uri
WHERE { ?uri a dbo:Fish }

A sample of the results is shown in Table V.1 overleaf. Additionally, the labels corresponding to these URIs (i.e. the lexicalised forms of the concepts in a given language) can be accessed with the following query:

SELECT DISTINCT ?label
WHERE { ?uri a dbo:Fish .
        ?uri rdfs:label ?label }

32. http://www.i-marine.eu/
33. http://dbpedia.org/sparql

uri
http://dbpedia.org/resource/Astroscopus_guttatus
http://dbpedia.org/resource/Barbus
http://dbpedia.org/resource/Black_ruby_barb
http://dbpedia.org/resource/Channichthyidae
http://dbpedia.org/resource/Checker_barb
http://dbpedia.org/resource/Cherry_barb
http://dbpedia.org/resource/Cow_shark

Table V.1: URIs of fish species

This second command returns a list of fishes as they are likely to appear in text, along with the language of expression:

label
"Astroscopus guttatus"@en
"Nördlicher Elektrischer Sterngucker"@de
"Astroscopus guttatus"@nl
"Skaber amerykański"@pl
"Североамериканский звездочёт"@ru

Table V.2: Labels of fish species
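When the endpoint is asked to return JSON (the standard SPARQL results format), responses like those of Table V.2 can be parsed offline as follows (a sketch; the sample response is truncated and hand-made):

```python
import json

def parse_labels(sparql_json):
    """Extract (label, language) pairs from a SPARQL JSON result set."""
    results = json.loads(sparql_json)
    return [(b["label"]["value"], b["label"].get("xml:lang"))
            for b in results["results"]["bindings"]]

# A truncated, hand-made response in the SPARQL 1.1 JSON results format.
sample = """{
  "results": {"bindings": [
    {"label": {"type": "literal", "xml:lang": "en",
               "value": "Astroscopus guttatus"}},
    {"label": {"type": "literal", "xml:lang": "de",
               "value": "N\\u00f6rdlicher Elektrischer Sterngucker"}}
  ]}
}"""

print(parse_labels(sample))
# → [('Astroscopus guttatus', 'en'), ('Nördlicher Elektrischer Sterngucker', 'de')]
```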

However, querying the online endpoint suffers from at least three disadvantages: it is slow (several minutes for long texts), unreliable (the server being frequently overloaded or under maintenance), and incomplete (limited to 10 000 results at a time). To make up for these issues, we chose to work with a local dump of DBpedia. We selected the dump of August 2014,34 which was the most recent at the beginning of our experiments. MERCKX requires two files:

• instance types35 to select all URIs from a category (places, for instance)

• English labels36 to map these URIs to their lexicalised forms

Additionally, labels from other languages can be loaded as well to increase recall. In our experiment described in Section 2, we also used French and Dutch labels. These gazetteers are the only language-specific resources of our system, and they are combined in a single multilingual dictionary.

34. http://wiki.dbpedia.org/Downloads2014
35. http://downloads.dbpedia.org/2014/en/instance_types_en.nt.bz2 (122 MB)
36. http://downloads.dbpedia.org/2014/en/labels_en.nt.bz2 (163 MB)

1.3 Workflow

The workflow of MERCKX for the extraction and disambiguation of entities consists of five phases: downloading resources, building the dictionary, tokenising the text, spotting entity mentions, and annotating mentions with positions and URIs. The first two steps can be time-consuming depending on the chosen entity type and additional languages (about five minutes and one minute respectively in our experiment), but they only need to be performed once.

1.3.1 Download

In order to simplify the download and decompression of the DBpedia dump, we provide a shell script doing this automatically.37 This script invokes another one written in Python38 which extracts all the URIs matching a given type in the DBpedia ontology. Instances (or resources) are linked to corresponding types in the form of RDF triples (subject – predicate – object):

<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Place> .
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/AdministrativeRegion> .

A concept can be categorised by several types, as illustrated by "Alabama" in the sample above, which is both a "Place" and an "Administrative Region".
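The type-extraction step performed by merckx-init.py can be approximated as follows (an illustrative sketch, not the actual script; the sample lines mimic the N-Triples format of the instance_types_en.nt file):

```python
import re

# One N-Triples line: <subject> <predicate> <object> .
TRIPLE = re.compile(r"<([^>]+)> <([^>]+)> <([^>]+)> \.")

def uris_of_type(nt_lines, target="http://dbpedia.org/ontology/Place"):
    """Yield subject URIs whose type matches the target ontology class."""
    for line in nt_lines:
        match = TRIPLE.match(line.strip())
        if match and match.group(3) == target:
            yield match.group(1)

sample = [
    "<http://dbpedia.org/resource/Alabama> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://dbpedia.org/ontology/Place> .",
    "<http://dbpedia.org/resource/Barbus> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://dbpedia.org/ontology/Fish> .",
]
print(list(uris_of_type(sample)))
# → ['http://dbpedia.org/resource/Alabama']
```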

37. https://github.com/ulbstic/ypres/blob/master/merckx-init.sh
38. https://github.com/ulbstic/ypres/blob/master/merckx-init.py

1.3.2 Dictionary

In the second phase, MERCKX maps all the URIs to their corresponding labels in the selected languages. The relationship between URIs and labels is also expressed by triples:

<http://dbpedia.org/resource/South_Africa> <http://www.w3.org/2000/01/rdf-schema#label> "Afrique du Sud"@fr .
<http://dbpedia.org/resource/Andorra> <http://www.w3.org/2000/01/rdf-schema#label> "Andorre"@fr .
<http://dbpedia.org/resource/Angola> <http://www.w3.org/2000/01/rdf-schema#label> "Angola"@fr .
<http://dbpedia.org/resource/Saudi_Arabia> <http://www.w3.org/2000/01/rdf-schema#label> "Arabie saoudite"@fr .

To make this more legible and reduce the size of the file, thereby saving time, the merckx-init.py script converts this to a cleaner format, using the dbr: prefix instead of the full URI starting with http://dbpedia.org/resource/, removing the recurrent predicate, and inverting the order of the original triples:

Afrique du Sud     dbr:South_Africa
Andorre            dbr:Andorra
Angola             dbr:Angola
Arabie saoudite    dbr:Saudi_Arabia
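This conversion can be sketched in a few lines (an approximation of what merckx-init.py does, assuming well-formed rdfs:label triples):

```python
import re

LABEL_TRIPLE = re.compile(
    r'<http://dbpedia\.org/resource/([^>]+)> '
    r'<http://www\.w3\.org/2000/01/rdf-schema#label> '
    r'"(.+)"@\w+ \.')

def compact(line):
    """Invert a label triple into a (label, dbr:...) lookup pair."""
    match = LABEL_TRIPLE.match(line.strip())
    if match:
        return match.group(2), "dbr:" + match.group(1)

triple = ('<http://dbpedia.org/resource/South_Africa> '
          '<http://www.w3.org/2000/01/rdf-schema#label> '
          '"Afrique du Sud"@fr .')
print(compact(triple))
# → ('Afrique du Sud', 'dbr:South_Africa')
```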

When using more than one language, the order in which they are loaded in this lookup table is important because a label can only point to a single URI. For instance, the French label "Liège"@fr predictably corresponds to the city of dbr:Liège, but the Dutch label "Liège"@nl redirects to the homonymy page dbr:Liège_(disambiguation), which is not a valid place: it contains references to the lesser-known (and thus less frequent) French municipality of Le Liège and to the Liège metro station in Paris. If the languages are combined in that order, the conflict between URIs will likely result in a decrease in recall.

To reduce problems due to conflicting labels, MERCKX applies the following strategy (text in parentheses provides a concrete example for every step):

1. Load the label files for each language, one by one (EN > NL > FR).

2. Check for each label if it corresponds to the chosen type (dbo:Place).

3. If the label already exists, check if the type remains the same ("Avant"@nl is already listed as a place, but is "Avant"@fr also a place?).

4. If the type is the same, update the URI (yes > URI FR replaces URI NL).

5. If the type is different – i.e. multilingually ambiguous – remove the label (no > suppress “Avant” from the file).
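The five steps can be condensed into a merge routine along these lines (illustrative only; the "Avant" URIs and the per-language gazetteer dictionaries are invented for the example):

```python
def build_dictionary(label_files):
    """Merge per-language {label: (uri, type)} gazetteers, loaded in
    order (e.g. EN > NL > FR). A later language overwrites the URI when
    the type agrees (step 4); multilingually ambiguous labels, whose
    type conflicts, are removed altogether (step 5)."""
    lookup = {}     # label -> (uri, type)
    banned = set()  # labels found to be multilingually ambiguous
    for gazetteer in label_files:
        for label, (uri, etype) in gazetteer.items():
            if label in banned:
                continue
            if label in lookup and lookup[label][1] != etype:
                del lookup[label]   # step 5: conflicting types
                banned.add(label)
            else:
                lookup[label] = (uri, etype)  # steps 2 and 4
    return {label: uri for label, (uri, _) in lookup.items()}

# Hypothetical gazetteers: "Avant" is a place in one language only.
en = {"Avant": ("dbr:Avant,_Oklahoma", "Place")}
fr = {"Avant": ("dbr:Avant_(word)", "Other"),
      "Angola": ("dbr:Angola", "Place")}
print(build_dictionary([en, fr]))
# → {'Angola': 'dbr:Angola'}  ("Avant" was suppressed as ambiguous)
```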

Table V.3 shows a summary of the number of places extracted (URIs and labels by language, plus the combined labels).

URIs      EN        NL        FR        ALL
735 062   709 357   194 208   186 483   857 911

Table V.3: Summary of the extracted places

In total, 735 062 unique locations were found in the DBpedia dump of August 2014.39 Only 709 357 of them have a corresponding English label, leaving over 25 000 without a proper lexicalised form in this language. This can be explained by the fact that English speakers do not always find it useful to mention explicitly in a Wikipedia infobox (from which DBpedia extracts structured information) that the term referring to the city of Ypres is "Ypres", for instance. In other words, a mapping from the label "Ypres"@en to the URI dbr:Ypres may seem redundant but makes sense in a multilingual perspective, taking non-native speakers into account. The numbers of labels for Dutch and French are dramatically lower: 194 208 and 186 483 respectively. The explanation is similar: users of the English Wikipedia/DBpedia seldom take time to encode labels in alternative languages, while speakers of these other languages are often more keen to fill in information on their "own" language chapters (http://nl.dbpedia.org or http://fr.dbpedia.org, for instance) rather than perform this tedious work

for the benefit of the English central version. This state of affairs constitutes one of the major downsides of the current structure of DBpedia, which is simply replicated from Wikipedia rather than organised in a language-independent manner (see Chapter II). The overall number of labels (857 911) is not equal to the sum of the individual languages but a much lower number, since several labels were either replaced or suppressed during steps 4 and 5 described above (1 106 and 4 477 respectively). At initialisation time, all the labels and URIs are loaded into a Python dict (dictionary) data structure, allowing instant lookup during the spotting phase. After this last transformation, the data in memory look like this:

39. Note that this number is constantly fluctuating: as of August 2015, the figure has decreased to 725 546, which means that almost 10 000 places have been suppressed from DBpedia over the course of a year. This point will be discussed in more detail in our conclusions.

{
  "Afrique du Sud" : "dbr:South_Africa",
  "Andorre" : "dbr:Andorra",
  "Angola" : "dbr:Angola",
  "Arabie saoudite" : "dbr:Saudi_Arabia",
}

At this stage, everything is in place to process textual content with MERCKX.

1.3.3 Tokenisation, spotting, and annotation

The next step is to tokenise the documents we want to enrich with the NLTK WordPunctTokenizer and to perform a simple greedy lookup40 of entities up to three tokens in length. Tokens shorter than three characters are ignored in order to reduce the noise they are likely to induce, although this comes at the price of losing locations like the municipality of Y in the Somme department. For the entities present in the dictionary, the longest match is chosen and annotated with its first and last characters, in addition to the corresponding URI, thereby disambiguating these entities completely (assuming that the dictionary mapping is correct in the first place). This corresponds to the format of the Entity Discovery and Linking track41 at the Text Analysis Conference,42 which will also be used to annotate our gold-standard corpus in Section 2.2.3. Once the URI is known, contextual knowledge about the entities (such as the date of birth of a person or the geographic coordinates of a place) can be retrieved seamlessly from the Linked Open Data cloud, enriching the original content. Potential applications for this will be presented in Section 3.
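A simplified version of this greedy lookup (whitespace tokenisation stands in for NLTK's WordPunctTokenizer, and the character offsets recorded by the real system are omitted):

```python
def spot(tokens, dictionary, max_len=3, min_chars=3):
    """Greedy longest-match lookup of entities up to three tokens."""
    annotations, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first (greedy strategy).
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if len(candidate) >= min_chars and candidate in dictionary:
                match = (candidate, dictionary[candidate])
                i += n
                break
        if match:
            annotations.append(match)
        else:
            i += 1
    return annotations

gazetteer = {"East Yorkshire": "dbr:East_Riding_of_Yorkshire",
             "Ypres": "dbr:Ypres"}
tokens = "He left Ypres for East Yorkshire".split()
print(spot(tokens, gazetteer))
# → [('Ypres', 'dbr:Ypres'), ('East Yorkshire', 'dbr:East_Riding_of_Yorkshire')]
```

Note that the min_chars threshold is what discards one-letter places like "Y", as described above.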

40. A greedy algorithm always takes the best immediate solution available at each stage.
41. http://nlp.cs.rpi.edu/kbp/2015/
42. http://www.nist.gov/tac/

2 Evaluation

Evaluation of IE systems is crucial in order to understand how they compare to one another and to a common baseline. Several resources have been made available for evaluating such tools. SemEval (Semantic Evaluation, formerly SensEval), for instance, is a series of evaluation campaigns for computational semantic analysis, which started in 1998 and is still ongoing. It is considered the reference for state-of-the-art semantic systems, with an evaluation workflow widely accepted by the NLP community (see Figure V.6). For entity linking in particular, the AGDISTIS framework (Speck and Ngomo, 2014) provides annotated corpora for benchmarking purposes.

Figure V.6: SemEval workflow, reproduced from Liling Tan (CC BY-SA)

IE systems are often evaluated in terms of precision, recall, and F-score. The performance of human annotators on NER is known to be relatively high, with F-scores over 96% on most entity types (Sundheim, 1995). By comparison, state-of-the-art systems can reach F-scores of 92% for PER and LOC, with lower scores of about 85% for ORG (Tjong Kim Sang and De Meulder, 2003). Palmer and Day (1997) note that "incremental advances above the baseline can be arduous and very language specific". We intend to find out whether this also holds true for the more advanced task of entity linking. In this section, we give importance to evaluation practices and conduct an experiment of our own. Section 2.1 offers a preliminary assessment based on a comparison of the linguistic coverage of the evaluated tools, followed by a black-box analysis with the SQuaRE ISO standard. Section 2.2 introduces the methodology used for the glass-box evaluation of the extraction component – defining its objectives and including metrics along with our corpus – while Section 2.3 discusses the results obtained for the different systems.

2.1 Preliminary assessment

The list of NER services presented in Section 1.1 could go on, but we chose to limit it to some of the biggest players in the field. Rodriquez et al. (2012), for instance, used the NER module of OpenNLP43 for their evaluation, but they reported very poor results even for English, so we chose not to include this tool in our review.44 For a good synthetic view of the features of these services, we refer the reader to Derczynski et al. (2015, p. 34).

2.1.1 Linguistic coverage

Table V.4 compares the language support of the tools we presented earlier. Despite some efforts to increase the linguistic coverage of semantic technologies, English remains to this day the only language to be fully supported by all seven NER tools, although Babelfy can boast a broad coverage. This unfortunate situation calls for more massive investment in language-independent approaches that require neither substantial adaptation work nor linguistic resources.

Language     AIDA  ALCH  BABE  CALA  SPOT  STAN  ZEMA
Arabic       X
Bulgarian    X
Cebuano      X
Chinese      X X
Danish       X X
Dutch        X (X)
English      X X X X X X X
French       X X X X (X)
German       X X X X
Hindi        X
Hungarian    X X
Italian      X X X
Portuguese   X X
Romanian     X
Russian      X X X
Spanish      X X X X
Swedish      X X

Table V.4: Linguistic coverage of NER tools

43. https://opennlp.apache.org/
44. Moreover, the project seems to have come to a halt, with no new release over the past two years (version 1.5.3 having been released in April 2013).

2.1.2 SQuaRE analysis

Systems and software Quality Requirements and Evaluation (SQuaRE) is a family of international standards designed to evaluate the quality of software (ISO, 2011b). ISO/IEC 25010:2011 offers a typology of 8 quality characteristics further divided into 31 sub-characteristics, detailed in Table V.5. This framework has notably been used to evaluate the output of machine translation.45

Characteristic            Sub-characteristics
Functional suitability    Functional completeness, functional correctness, functional appropriateness
Performance efficiency    Time behaviour, resource utilization, capacity
Compatibility             Co-existence, interoperability
Usability                 Appropriateness recognizability, learnability, operability, user error protection, user interface aesthetics, accessibility
Reliability               Maturity, availability, fault tolerance, recoverability
Security                  Confidentiality, integrity, non-repudiation, accountability, authenticity
Maintainability           Modularity, reusability, analysability, modifiability, testability
Portability               Adaptability, installability, replaceability

Table V.5: Quality characteristics of SQuaRE

45. See the portal of the FEMTI initiative: http://www.isi.edu/natural-language/mteval/.

As we mentioned in Section 1.1, there is necessarily a trade-off between these ideals, and favouring one at the expense of others highly depends on the actual needs of users. In Section 2.3, four systems will be evaluated: DBpedia Spotlight, Zemanta, Babelfy, and our MERCKX tool. Before conducting the quantitative evaluation on the output of these systems, Table V.6 provides a (necessarily subjective) qualitative assessment of their advantages and downsides along the lines of the eight main characteristics of SQuaRE.

Characteristic            Spotlight  Zemanta  Babelfy  MERCKX
Functional suitability    -          +        +/-      +
Performance efficiency    -          --       +/-      ++
Compatibility             +/-        -        +        +
Usability                 +/-        +        ++       +/-
Reliability               +          +/-      +        +/-
Security                  +/-        +/-      +/-      +/-
Maintainability           +          --       +/-      +
Portability               +/-        +/-      +        ++

Table V.6: Evaluation of systems with SQuaRE

A 5-point scale is used to evaluate how the systems fare on the eight characteristics: ++ (excellent), + (good), +/- (average), - (poor), -- (terrible). Compatibility, usability, security, and maintainability are less critical since the end users would not have to operate the tools directly. In contrast, functional suitability, performance efficiency, reliability, and portability are essential features in order to provide users with quick, accurate results across different application domains. Table V.6 confirms that no system is perfect, although Babelfy does a pretty good job.

DBpedia Spotlight is a quite reliable tool which works transparently, making its components easily reusable. Although it suffers from poor results and efficiency, its average performance makes it a good candidate for establishing a baseline. Zemanta has good functional correctness and is easy to use, but has important downsides in terms of efficiency (the user has to wait several seconds between requests), and its black-box technology means it cannot be properly analysed or modified at all. Babelfy offers an excellent interface, good interoperability thanks to a clean output format, and decent robustness and portability. Finally, the main advantages of our MERCKX system compared to the three other tools are its efficiency and portability, although its reliability is not yet optimal due to its low degree of maturity.

2.2 Methodology

A method is "un ensemble défini de procédures intellectuelles tel que quiconque, respectant ces procédures et posant la même question aux mêmes sources, aboutisse nécessairement aux mêmes conclusions" (Prost, 1996, p. 290) – a defined set of intellectual procedures such that anyone respecting these procedures and putting the same question to the same sources necessarily reaches the same conclusions. In this section, we present the objective of the evaluation process along with the metrics and corpus used, with a view towards reproducibility.

2.2.1 Objective

The aim of our evaluation is to answer the following three questions:

• How do entity linking systems score on the Historische Kranten corpus?

• Does MERCKX improve on the established baseline?

• Is information extraction heavily dependent on the quality of the text material or does it also work on poor OCR output?

To fully address these questions, two complementary types of evaluation should ideally be carried out: intrinsic (evaluation of the correctness of named entities) and extrinsic (evaluation of the relevance of the named entities to end users). The degree of correctness of an algorithm is measured with respect to certain specifications, which brings us back to the concept of fitness for use defined in the context of data quality (see Chapter IV, Section 1). In fact, "selon le type d'information envisagé, la question de la correction fait place à celle de l'interprétation" (Boydens, 1999, p. 129, italics hers) – depending on the type of information considered, the question of correctness gives way to that of interpretation. However, the extrinsic evaluation requires an implementation that is not completed yet. We will therefore limit ourselves to the intrinsic evaluation for the time being, momentarily leaving aside the validation by end users, to which we return in the research perspectives described in our conclusions. Both types of evaluation are indeed necessary to get a comprehensive picture of the usefulness of an IE system.

For the present experiment, we focus on mentions of places, since Google Analytics statistics of the Historische Kranten website showed that they were especially favoured by users (see Chapter III): locations represent 60% of queries, compared to 6% for persons and 4% for organisations.46 This tendency is consistent with the observations of Blanke et al. (2012), who note that "place names are often mentioned in the archival descriptions; researchers would like to be able to search for these locations, and place name extraction from the descriptions can help here".

46. The remaining 30% are accounted for by concepts and rarer entity types.

In addition to our MERCKX system, we will evaluate three related tools already presented in Section 1.1: DBpedia Spotlight, Zemanta, and Babelfy. Spotlight is an obvious choice because it is one of the few services that does not operate as a black box: its inner workings are well documented. It will provide us with a baseline to measure other systems against. Zemanta, in contrast, is a closed-source NER tool, but we have shown in previous work (van Hooland et al., 2015) that it outperforms tools such as AlchemyAPI and Spotlight on English text. This experiment will allow us to evaluate how well it adapts to other languages. Finally, Babelfy is a promising new player in the entity linking field. As a semantic annotation tool, it is not limited to named entities, but places are included in its output among several other types of content.

2.2.2 Metrics

Different measures are used in our experiment to evaluate the output of information extraction systems. We present those that were used, starting with precision, recall, and F-score, before considering an alternative metric: slot error rate. Additionally, we discuss two different ways to apply these measures: simple entity match and strong annotation match.

The choice of one evaluation metric over another is not trivial: although quantitative analysis carries an aura of objectivity, numbers can be used to make empirical data appear deterministic, as emphasised in Chapter III. However, common denominators remain indispensable in order to compare systems efficiently. Metrics are therefore better understood as relative references than as absolute measurements of the performance of a system.

Precision is the percentage of correct entities (true positives) among all the entities retrieved by the system. Entities mistakenly identified are called false positives and introduce noise:

\[ \text{precision} = \frac{|\{\text{true positives}\}|}{|\{\text{true positives}\} \cup \{\text{false positives}\}|} \tag{V.1} \]

Recall is the percentage of correct entities retrieved among all those that should have been. Non-identified entities are called false negatives and cause silence: 2. Evaluation 181

\[ \text{recall} = \frac{|\{\text{true positives}\}|}{|\{\text{true positives}\} \cup \{\text{false negatives}\}|} \tag{V.2} \]

When there is a risk of very low numbers of entities in a text (data sparsity), 1 can be added to the numerator and denominator (additive smoothing) to avoid an undefined division by 0.

The F-score or F-measure (van Rijsbergen, 1975) is a way to balance both measures, because obtaining either a very good precision (on a single entity retrieved) or an excellent recall (with lots of errors) would be meaningless. This harmonic mean includes a β parameter that allows one to give preponderance to precision or recall depending on one's needs, and is defined as:

\[ F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R} \tag{V.3} \]

In general, we will want to give equal weight to precision and recall by considering that β = 1. The resulting formula is called F1:

\[ F_1 = \frac{2PR}{P + R} \tag{V.4} \]

Slot error rate Makhoul et al. (1999) acknowledged the usefulness of precision and recall as IE metrics, but showed that the F-score under-represents the overall error rate by about 30% by deweighting insertions (strict false positives) and deletions (strict false negatives) by a factor of two, thereby making systems look better than they are in reality.47 To overcome this problem, they introduced a new metric called the slot error rate (SER), representing the true cost to the user of a given system making errors. SER is defined as the ratio of the total number of slot errors (insertions, deletions, and substitutions) to the total number of slots in the reference:

\[ \text{slot error rate} = \frac{|\{\text{insertions}\} \cup \{\text{deletions}\} \cup \{\text{substitutions}\}|}{|\{\text{slots}\}|} \tag{V.5} \]

Since the F-score is almost universally used in evaluation campaigns without being questioned as a valid metric, we found this critique relevant and original, and chose to include SER in our evaluation in Section 2.
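To make the relationship between the metrics concrete, here is a small calculation on invented counts, showing how SER exceeds 1 − F1 by penalising insertions and deletions fully:

```python
def metrics(tp, insertions, deletions, substitutions):
    """Precision, recall, F1, and slot error rate from error counts.
    Substitutions count as both false positives and false negatives;
    the reference contains tp + deletions + substitutions slots."""
    fp = insertions + substitutions
    fn = deletions + substitutions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    ser = (insertions + deletions + substitutions) / (tp + deletions + substitutions)
    return precision, recall, f1, ser

# Invented counts, purely illustrative.
p, r, f1, ser = metrics(tp=80, insertions=10, deletions=10, substitutions=10)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f} SER={ser:.2f}")
# → P=0.80 R=0.80 F1=0.80 SER=0.30
```

With these counts, 1 − F1 suggests a 20% error rate while SER reports 30%, illustrating Makhoul et al.'s point.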

47. The third type of error, substitutions, counts as false positives and false negatives at the same time.

Simple entity match vs. strong annotation match According to Ruiz and Poibeau (2015), "the E[ntity] L[inking] literature has stressed the importance of evaluating systems on more than one measure". Thanks to the evaluation framework made available by the authors,48 we became acquainted with the distinction between simple entity match (ENT), i.e. without alignment, and strong annotation match (SAM), which is stricter on entity boundaries (Cornolti et al., 2013):

"SAM requires an annotation's position to exactly match the reference, besides requiring the entity annotated to match the reference entity. ENT ignores positions and only evaluates whether the entity proposed by the system matches the reference."

Both measures will be used to evaluate our results in Section 2.3 in order to get a more nuanced picture of what can be achieved by entity linking tools.

2.2.3 Corpus

As discussed in detail in Chapter III, the very nature of empirical data prevents us from drawing a bijection between the real object and its representation. In contrast to deterministic data where a model can be used as a reliable referent, no such referent can be used for interpretable content whose reality varies over time (Boydens, 1999, p. 144):

"afin de vérifier la correction d'une valeur, il faut disposer d'un référentiel normatif. Or, dans un domaine d'application empirique, ce référentiel n'existe pas. En d'autres termes, il n'existe jamais de projection biunivoque nécessaire entre une représentation informatique et le réel observable correspondant."

(In order to verify the correctness of a value, a normative referent is needed. In an empirical application domain, however, no such referent exists. In other words, there is never a necessary one-to-one projection between a computer representation and the corresponding observable reality.)

To overcome this limitation, a common practice is to create an artificial referent which is not quite identical to the real world, not quite perfect, but agreed upon by researchers as the next best thing in order to offer a stable base for evaluation. Such a pseudo-deterministic referent is called a gold-standard corpus (GSC), as already introduced in Section 2.1 of Chapter I.

Although some GSC are available online for the evaluation of entity linking, none of them is centred on digitised newspapers or the cultural heritage sector. Making the same observation, Rodriquez et al. (2012) built their own GSC for the evaluation of NER on raw OCR text, but using very different data: testimonies and newsletters, which do not compare to newspaper archives.

48. https://sites.google.com/site/entitylinking1

Sample selection Since the Historische Kranten corpus contains 1 028 555 articles, we calculated with the help of an online tool49 that a sample of at least 96 articles was needed to reach a 95% confidence level with a 10% confidence interval. This means that with 96 articles, we are 95% certain that our sample is representative of the overall corpus with a maximum deviation of 10%. The confidence interval is actually much smaller (about 5%), since the probability of a word being a location is not 50% but rather 2–3%.

We therefore generated a random sample of 100 documents, divided over the three languages proportionally to the overall distribution: 49 French documents, 49 Dutch ones, and 2 English ones.50 The documents range from 1831 to 1970, every decade being covered by at least two documents. We then annotated all mentions of places manually with their positions in the text (first and last character) and disambiguated them with their corresponding DBpedia URIs, yielding a total of 662 locations in the following format:

187 198
199 205
561 565 Pévy
626 640 East Yorkshire
1076 1082
1145 1151 Muizon
1200 1205

The median number of locations per document is 4.5, ranging from 1 to 62. Most places comprise only one word, but 38 of them contain two, and 9 have three words or more. The annotation is partly subjective: one could judge that the correct place is "Yorkshire" instead of "East Yorkshire", for instance, every location having five matching candidates in the dictionary on average. We thus had to validate the list with extra annotators before using it as a GSC.
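The figure of 96 articles mentioned above can be checked against the standard sample size formula (z = 1.96 for 95% confidence, worst-case proportion p = 0.5, finite population correction included):

```python
def sample_size(population, margin, z=1.96, p=0.5):
    """Minimum sample size for a given margin of error."""
    ss = z**2 * p * (1 - p) / margin**2      # infinite-population estimate
    return ss / (1 + (ss - 1) / population)  # finite population correction

n = sample_size(population=1_028_555, margin=0.10)
print(round(n))  # → 96
```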

Cohen's kappa Cohen's kappa coefficient measures inter-rater agreement on a scale between 0 and 1, 0 being zero agreement and 1 total agreement (Cohen, 1960). A value of κ greater than .8 is generally considered sufficiently reliable to draw sound conclusions based on the annotation (Carletta, 1996). The kappa is computed as follows:

\[ \kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} \tag{V.6} \]

49. http://www.surveysystem.com/sscalc.htm
50. Documents in the sample contain 1 430 characters on average, which is comparable to the subset of the AQUAINT corpus used by Milne and Witten (2008).

Pr(a) stands for the relative agreement between two raters and Pr(e) for the probability of random agreement. Our sample of 100 documents contained 30 186 tokens in total, spread over the three languages. For each language, in addition to our own annotation (A), an external annotator (B) was asked to decide for every token whether it was part of a place name or not. Locations containing OCR errors were accepted as long as the annotator could be reasonably sure that it was a place name. Table V.7 presents the raw annotation counts, along with the kappa by language.

Lang.  Both  A   B   None   Tot    Pr(a)  Pr(e)  κ
EN     20    2   2   678    702    .994   .939   .906
FR     197   46  8   13422  13673  .996   .968   .877
NL     384   13  27  15387  15811  .997   .950   .949

Table V.7: Cohen’s kappa for our GSC
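The values of Table V.7 can be reproduced from the raw counts (a sketch for the two-rater, binary LOC/NON-LOC case):

```python
def cohens_kappa(both, a_only, b_only, none):
    """Cohen's kappa for two raters on a binary (LOC / NON-LOC) task."""
    total = both + a_only + b_only + none
    pr_a = (both + none) / total          # observed agreement
    p1 = (both + a_only) / total          # rater A says LOC
    p2 = (both + b_only) / total          # rater B says LOC
    pr_e = p1 * p2 + (1 - p1) * (1 - p2)  # chance agreement
    return (pr_a - pr_e) / (1 - pr_e)

# English subsample of Table V.7
print(round(cohens_kappa(both=20, a_only=2, b_only=2, none=678), 3))
# → 0.906
```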

The average kappa of .911 shows a high agreement that is largely sufficient to consider the GSC reliable. This unusually good score can be explained in part by the relative straightforwardness of the annotation task (LOC versus NON-LOC) compared to more complex ones involving several types of entities, and in part by the detailed instructions provided to the external annotators prior to the task (supplied in Appendix B). After some corrections, insertions, and deletions, we were left with 654 locations that we mapped to their corresponding DBpedia resources, producing our GSC in the TAC KBP/EDL format, which is slightly different from the one used to annotate the sample:

gsc3.txt 187 198 dbr:Bouvancourt
gsc3.txt 199 205 dbr:Fismes
gsc3.txt 561 565 dbr:Pévy
gsc3.txt 626 640 dbr:East_Riding_of_Yorkshire
gsc3.txt 1076 1082 dbr:Trigny
gsc3.txt 1145 1151 dbr:Muizon
gsc3.txt 1200 1205 dbr:Vesle

The impact of OCR quality on entity linking also needed to be evaluated. To do so, we manually corrected the 120 places (out of 654) containing OCR errors and produced a second reference. The original sample and both GSC are available on GitHub at https://github.com/ulbstic/ypres/tree/master/gsc.

2.3 Results and discussion

“Machines take me by surprise with great frequency.” Alan Turing (1950)

In order to compute the precision, recall and F-score of MERCKX against our GSC and to compare it to related systems, we used the neleval tool,51 a collection of Python evaluation scripts for the TAC52 entity linking task and related Wikification, named-entity disambiguation, and cross-document coreference tasks. This utility allowed us to specify different GSC (raw OCR versus corrected), systems (DBpedia Spotlight, Zemanta, Babelfy & MERCKX), and measures (ENT versus SAM) to compare. The parametrised scores.sh shell script is shown below:

for measure in ent sam; do
    if [ $measure = sam ]; then
        lookfor="strong_link_match"
    else
        lookfor="entity_match"
    fi
    for corpus in raw corr; do
        echo "** ${corpus} [${measure}] **"
        for system in Spotlight Zemanta Babelfy MERCKX; do
            echo "== ${system} =="
            ./nel evaluate -g ../gsc/res/ref/${corpus}-gold.mapped \
                ../gsc/res/sys/${corpus}-${system}-${measure}.mapped \
                | grep $lookfor$
        done
    done
done

Tables V.8 and V.9 present the results for simple entity match (ENT) and strong annotation match (SAM) respectively (see Section 2.2.2), with the best figures indicated in bold.53 MERCKX outperforms the three other systems evaluated, except for precision where Zemanta scores best.54 The columns marked “Raw” show the results obtained on the original GSC, while those marked “Corr” indicate scores obtained on corrected OCR. A paper describing preliminary results of this glass-box evaluation of the extraction component of entity linking systems has been accepted for publication (De Wilde, 2015).

51 https://github.com/wikilinks/neleval
52 http://www.nist.gov/tac/
53 Since E and SER are error rates, the best scores for these metrics are the lowest.
54 Consistent with the results reported by Rizzo and Troncy (2011).

            Precision     Recall        F1-score      E = 1 − F1    SER
System      Raw    Corr   Raw    Corr   Raw    Corr   Raw    Corr   Raw    Corr
Spotlight   .466   .468   .192   .207   .272   .287   .728   .713   1.028  1.028
Zemanta     .887   .898   .333   .371   .485   .525   .515   .475   .709   .671
Babelfy     .656   .688   .376   .446   .478   .541   .522   .459   .822   .756
MERCKX      .712   .744   .488   .559   .579   .638   .421   .362   .709   .634

Table V.8: Simple entity match (ENT)

            Precision     Recall        F1-score      E = 1 − F1    SER
System      Raw    Corr   Raw    Corr   Raw    Corr   Raw    Corr   Raw    Corr
Spotlight   .235   .287   .190   .251   .210   .268   .790   .732   1.216  1.148
Zemanta     .867   .888   .278   .362   .421   .515   .579   .485   .766   .685
Babelfy     .662   .711   .321   .399   .433   .511   .567   .489   .852   .771
MERCKX      .782   .805   .443   .517   .566   .629   .434   .371   .680   .610

Table V.9: Strong annotation match (SAM)

2.3.1 Quantitative analysis

Precision is consistently ahead of recall, with Zemanta reaching scores between 85% and 90%. The harder task of strong annotation match (which takes into account the exact position of each entity in the text) does not affect precision: Babelfy and MERCKX actually improve on their scores, although Spotlight’s precision is cut by a factor of 2. In contrast, all recall scores decrease when considered from the SAM perspective. MERCKX outperforms the other systems on recall, but it peaks at only 49% (ENT) and 44% (SAM). Recall scores under 50% can be explained by the multilingual context and by the lack of coverage of DBpedia for some types of locations. Whereas these would be unacceptable in a medical context, where failing to retrieve a document can have dramatic consequences, better precision is generally preferred in less critical applications. MERCKX reaches an F-score just under 60%, a ten-point improvement on both Zemanta and Babelfy, which have similar F-scores under 50%. Spotlight fares disappointingly, with F-scores around the 25% mark. The basic error rate E is calculated by subtracting the F-score from 1: since MERCKX has a 58% F-score, the remaining 42% are mistakes. As discussed in Section 2.2.2, however, a better approximation of the true cost for users can be attained with the slot error rate metric.
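The relationship between these metrics can be made explicit. Under the simplifying assumption that every missed reference entity counts as a deletion and every spurious system entity as an insertion (no separate substitution class), both E and the slot error rate can be derived from precision and recall alone; this sketch of ours reproduces the MERCKX row of Table V.8:

```python
def error_rates(precision, recall):
    """Derive E = 1 - F1 and a slot error rate from precision and recall,
    assuming SER = (deletions + insertions) / reference entities
    (Makhoul et al., 1999)."""
    f1 = 2 * precision * recall / (precision + recall)
    # deletions/N = 1 - R; insertions/N = R * (1 - P) / P
    ser = (1 - recall) + recall * (1 - precision) / precision
    return round(1 - f1, 3), round(ser, 3)

print(error_rates(0.712, 0.488))  # MERCKX, ENT, raw OCR -> (0.421, 0.709)
```

The same formula yields Spotlight’s SER of 1.028 from P = .466 and R = .192, illustrating how an error rate can exceed 100%.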

Contrary to the other systems, MERCKX reduces its error rate on the SAM task. Spotlight reaches error rates over 1: Makhoul et al. (1999) concede that “some may feel uncomfortable with the notion of an error rate that is greater than 100%, but this possibility is not as unreasonable as it might appear at first glance.” In fact, a system that produces nothing is bad enough, but a system that produces only false positives (insertions) is arguably even worse. Results on corrected OCR (columns marked “Corr”) will be discussed separately in Section 2.3.3.

2.3.2 Qualitative analysis

MERCKX is heavily dependent on the quality of DBpedia, on which it relies for the disambiguation of entities. The errors of our system can be grouped into three categories, following the typology of Makhoul et al. (1999): insertions, deletions, and substitutions.

Insertions (spurious entities or false acceptances) are entities in the system output that do not align with any entity in the reference. A common factor causing this is multilingual ambiguity. The French adjective “tous”, for instance, when written with a capital “T”, can be incorrectly mapped to the town of dbr:Tous,_Valencia. The type check performed during the construction of the dictionary normally avoids such cases, but some problems can remain when a disambiguation page is missing: in this case, the French resource http://fr.dbpedia.org/resource/Tous also points to the Spanish city, with no reference to the adjective.

Another frequent mistake occurs when places are mentioned in the name of streets. For instance, the “rue de Lille” in Ypres does not really refer to the French city of Lille (except as an ancient way to get there), and should therefore not be disambiguated with dbr:Lille. A more elaborate algorithm could try to detect such cases in order to exclude them, but it would be difficult to implement in a language-independent manner without explicitly blacklisting words such as “rue”, “straat”, “street”, etc.
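A naive version of such a blacklist filter could inspect a small window of preceding tokens. This is our own hedged sketch (the word list and window size are illustrative assumptions, not part of MERCKX):

```python
# Assumed blacklist of street-type words in the corpus languages.
STREET_WORDS = {"rue", "straat", "street", "avenue", "laan"}

def in_street_name(text, start, window=3):
    """True if any of the `window` tokens preceding character offset
    `start` is a street word, e.g. 'rue de Lille' -> skip 'Lille'."""
    tokens = text[:start].lower().split()[-window:]
    return any(t.strip(".,;:") in STREET_WORDS for t in tokens)

print(in_street_name("dans la rue de Lille", 15))  # True: street context
print(in_street_name("la ville de Lille", 12))     # False: genuine place
```

The window is needed because the street word is often not directly adjacent to the place name (“rue de Lille”), which also shows why a simple bigram check would not suffice.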

Deletions (missing entities or false rejections) are entities in the reference that do not align with any entity in the system output. One of the main causes for this is the absence of the dbo:Place RDF type in the resource of a location. For instance, dbr:East_Riding_of_Yorkshire is described as an owl:Thing, which is very general and therefore not helpful. However, it is also tagged as a yago:YagoGeoEntity, which is more precise. Using multiple types instead of just dbo:Place could improve the recall.
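Such a relaxed type check is straightforward to express; the exact set of accepted types below is an illustrative assumption:

```python
# Accept any of several geographic types rather than dbo:Place alone.
PLACE_TYPES = {"dbo:Place", "yago:YagoGeoEntity"}

def looks_like_place(rdf_types):
    """True if at least one of the resource's RDF types is geographic."""
    return bool(PLACE_TYPES & set(rdf_types))

# dbr:East_Riding_of_Yorkshire: no dbo:Place, but a yago:YagoGeoEntity
print(looks_like_place(["owl:Thing", "yago:YagoGeoEntity"]))  # True
```

The trade-off is the usual one between recall and precision: each type added to the set admits more genuine places but also more noise.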

Another cause is the absence of a particular label (e.g. when an old spelling is used). The resource dbr:Reims, for instance, does not include a label "Rheims" in any of the three languages used. However, the resource dbr:Rheims does exist and redirects to dbr:Reims. Including redirections in addition to labels could also help to limit the number of missing entities.55
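Merging redirect labels into the lookup dictionary would be a small change; in this sketch the two mappings are hypothetical stand-ins for the dictionaries built from DBpedia dumps:

```python
# Hypothetical extracts of the label and redirect mappings.
labels = {"Reims": "dbr:Reims", "Fismes": "dbr:Fismes"}
redirects = {"Rheims": "dbr:Reims"}  # old spelling -> current resource

lookup = dict(labels)
for alias, target in redirects.items():
    lookup.setdefault(alias, target)  # genuine labels take precedence

print(lookup["Rheims"])  # dbr:Reims
```

Using setdefault ensures that a redirect alias never overrides an existing label, limiting the risk of noise discussed in the footnote.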

Substitutions (incorrect entities) are entities in the system output that do align with entities in the reference but are scored as incorrect. These cases are far rarer than insertions and deletions. Substitutions can be due to the wrong detection of entity boundaries: “Jette” instead of “Jette-Saint-Pierre”, “Flanders” instead of “West-Flanders”. The greedy lookup mechanism of MERCKX normally prevents this, but extra spaces (“West- Flanders”) or long entities (“Jette-Saint-Pierre” contains five tokens because hyphens are tokenised separately) can cause havoc.

Another possibility is the attribution of a wrong URI when two places share the same name. No such case was detected in our system, but the output of DBpedia Spotlight contains an occurrence of this type of mistake: “Vitry-le-François” instead of “Vitry-sur-Seine”.

2.3.3 Impact of OCR

In similar work on Holocaust testimonies, Rodriquez et al. (2012) found that “manual correction of OCR output does not significantly improve the performance of named-entity extraction”. In other words, even poorly digitised material with OCR mistakes could be successfully enriched to meet the needs of users. The confirmation of these findings would mean a lot to institutions that lack the funding to perform first-rate OCR on their collections or the manpower to curate them manually.

However, contrary to this study, we see that OCR correction improves the results of all systems. Precision goes up by 1 to 3% on ENT and by 5% on SAM in the case of Babelfy. Recall improvement reaches 7% on ENT and over 8% on SAM for Zemanta. Accordingly, F-scores improve by up to 6% on the corrected version, with MERCKX crossing the 60% mark on both ENT and SAM. Slot error rates also decrease by up to 8% (although OCR correction does not seem to affect the SER of Spotlight on the ENT task).

55 Although the risk is then to introduce more noise: dbr:Cette, for instance, redirects to dbr:Sète because the spelling of the French town changed in 1927. While helping to track the evolution of place names over time, the danger of confusion with the French determiner cette is obvious. This question will be further discussed in the research perspectives put forward at the end of the thesis.

This state of affairs can be explained by a number of factors. First, the quality of the OCR seems to be much worse in the case of the Historische Kranten corpus than in the testimonies used for their study: the authors report a word accuracy of 88.6% and a character accuracy of 93.0%, whereas in the case of our sample these scores were somewhat lower: 81.7% (word accuracy on places only) and 85.2% (character accuracy). The overall word accuracy, tested on a subset of the sample, was much lower still: a mere 68.3%. A new test on the second OCR performed by Picturae with version 10 of ABBYY FineReader (see Chapter IV, Section 1.2) would be necessary to determine whether a word accuracy of about 90% (instead of 82%) would be sufficient to achieve better entity linking.

Secondly, the entity linking task is harder than simple named-entity recognition: full disambiguation with a URI is more prone to suffer from OCR mistakes. Using a fuzzy matching algorithm such as the Levenshtein distance could help improve the results without requiring manual correction of the OCR. Preliminary experiments with this algorithm indicate that it could lead to an improvement of about 5% F-score, bringing MERCKX close enough to the performance achieved on the corrected version of the sample, although this would come at the expense of efficiency since the Levenshtein distance has a quadratic time complexity.
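For reference, the Levenshtein distance can be computed with the classic dynamic-programming scheme, whose two nested loops account for the quadratic time complexity:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) turning string a into string b; O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Huyghens", "Huygens"))  # 1: tolerates the old spelling
```

A cut-off on this distance (e.g. at most 1 or 2 edits) is what would allow OCR variants to match dictionary labels without manual correction.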

3 Validation

In this section, we will start by looking at the limits of existing search capabilities (Section 3.1) in order to improve on them and to offer our own concrete applications (Section 3.2), before transposing our model to other languages, domains and entity types (Section 3.3) to test its portability.

3.1 Beyond search engines

Currently, the Historische Kranten corpus is only indexed at full-text level, which means that searches for particular mentions in the periodicals suffer from both noise and silence. For instance, a query on the string “Huygens” returns correct results about Christiaan Huygens:

Example 31. Links zien wij Christiaan Huygens die met zijn slingeruurwerk de oplossing bracht voor het meten van de tijd [On the left we see Christiaan Huygens, who with his pendulum clock provided the solution for measuring time]

But one also gets results that are not relevant in this context (noise):

Example 32. La reconnaissance du cadavre de Huygens, faite par les hommes de l’art, a fait constater l’existence de neuf blessures sur la tête [The examination of Huygens’s corpse, carried out by medical experts, established the existence of nine wounds on the head]

Moreover, interesting results are lost due to variations in spelling (silence):

Example 33. [...] en op het uurwerk toegepast door den Hollander Huyghens (1629-1695). [... and applied to the clock by the Dutchman Huyghens (1629–1695)]

A correct disambiguation with the DBpedia URI dbr:Christiaan_Huygens would include mentions of “Christian Huyghens” (French spelling) while excluding information about the Belgian painter Léon Huygens (who has his own unique URI: dbr:Léon_Huygens) or the crater on Mars named after the Dutch astronomer, dbr:Huygens_(crater). Disambiguating results is out of scope for most search engines, as noted by Bade (2008, p. 34):

“The chief difference between the library catalogue and Google is that the material described in a library catalogue is a deliberately limited set of materials which have been selected by subject specialists to be included in the library because of their value for research and (hopefully) described by persons sharing the intellectual commitments and scholarly vocabulary of the authors of those materials. None of this can be said of the items retrieved via a Google search. The Google search permits no filter (other than the hidden algorithms!) between the information and the user, and hence 426,000 responses to a query.”

Providing these filters, in the form of semantically-enriched content to be browsed and faceted by users, should therefore be the leitmotiv when building advanced knowledge discovery applications making up for the shortcomings of full-text search. Transparency is also important in order not to reproduce the black-box technology of commercial search engines. Figure V.7 illustrates the path from a corpus of documents to discovered knowledge.

Figure V.7: From corpus to knowledge, reproduced from JISC (CC BY-NC-ND)

This structural shift raises the question of the role of traditional documents in a knowledge-driven society. In our introduction, we have defined knowledge as “a fluid mix of framed experience, values, contextual information, expert insight and grounded intuition that provides an environment and framework for evaluating and incorporating new experiences and information” (Davenport and Prusak, 1998, p. 5). This practical definition allows us to include all kinds of facts contained in a knowledge base.

Defining what constitutes a document is trickier. In a survey on this question, Buckland (1997) quotes an early attempt at standardisation by the International Institute of Intellectual Cooperation in 1937, defining a document as “any source of information, in material form, capable of being used for reference or study or as an authority”. Interestingly, “source of information” is rendered as “base de connaissance” (knowledge base) in the French translation provided by the Union Française des Organismes de Documentation.

Briet (1951) had devised an even broader definition by considering that a document is “tout indice concret ou symbolique, conservé ou enregistré, aux fins de représenter, de reconstituer ou de prouver un phénomène physique ou intellectuel” [any concrete or symbolic indication, preserved or recorded, for the purposes of representing, reconstructing or proving a physical or intellectual phenomenon]. But the key feature of documents, as Havelange (2014) reminds us, is to convey an educational value (from Latin docere, doceo, I teach). In order to teach us something, however, a document must be recognised as such: “un document [...] n’existe pas ‘en soi’, mais dans le cadre, seulement, du dispositif de savoir ou d’expression qui le convoque en cette qualité de document” [a document does not exist ‘in itself’, but only within the framework of the apparatus of knowledge or expression that summons it in its quality as a document] (Havelange, 2014).

Anything, in other words, can become a document if used in a way that suits a given purpose. While information extraction progressively ousted the document as the primary informational unit in favour of smaller knowledge items such as entities and facts (materialised on the Web of Data in the form of RDF triples), documents in turn became abstract entities with their associated metadata, allowing them to be manipulated, automatically summarised and linked to other documents in a semantically meaningful way.

For Stern (2013, p. 38), “l’Annotation Sémantique promeut le document et son contenu textuel comme objet central dans le Web Sémantique, et se place ainsi dans la lignée du T[raitement] A[utomatique des] L[angues], qui s’intéresse de façon primordiale à ces objets” [Semantic Annotation promotes the document and its textual content as a central object in the Semantic Web, and thus follows in the tradition of Natural Language Processing, which takes a primary interest in these objects]. For Blanke and Kristel (2013), however, “the traditional distinction between collection-level and document-level documentation is disappearing fast in a digital environment”. In a Big Data context, the metadata of field practitioners become the raw data of computer scientists, and the distinction between the two is blurred.

But is there an intrinsic difference between a traditional document and a digital document? Buckland (1997) considers that “an emphasis on the technology of digital documents has impeded our understanding of digital documents as documents”. This observation validates the reflections on the Hype cycle already put forward in Chapter III, and strengthens the premise of van Hooland (2009, p. 2) stating that “innovative technologies offer new possibilities for metadata creation and management, but can also have a negative impact upon the quality of metadata”.

3.2 Applications

In this section, we present two concrete applications of knowledge discovery: search suggestions based on semantic proximity on the one hand, and the automated exploitation of related resources on the other. Both applications have the potential to enrich the Historische Kranten website by providing end users with a more comprehensive picture of the content available.

3.2.1 Search suggestions

The search engine used by the Historische Kranten website relies on Apache Lucene.56 Its Solr implementation provides a module called MoreLikeThis, which constructs a Lucene query based on term vectors within a document in order to provide users with related content suggestions. However, this module fared very poorly on the website and was subsequently removed by Picturae developers due to performance issues.

We propose an alternative approach based on the semantic similarity measures introduced in Chapter II. The DBpedia FindRelated service57 is an example of such a tool, allowing the computation of symmetric or asymmetric distances between DBpedia resources with a cut-off threshold to limit the number of results. For instance, a search for resources related to Poperinge using the symmetric model and a threshold of 0.2 yields the following results in XML format:

query: http://dbpedia.org/resource/Poperinge (model: symmetric, threshold: 0.2)
http://dbpedia.org/resource/Anne_Provoost 0.16370000505196158
http://dbpedia.org/resource/Jef_Planckaert 0.18771837736522878
http://dbpedia.org/resource/Proven 0.18771837736522878
http://dbpedia.org/resource/Reningelst 0.18771837736522878

56 http://lucene.apache.org/
57 http://wiki.dbpedia.org/services-resources/find-related

These results fall into two categories: people and places. People include writer Anne Provoost and cyclist Jef Planckaert, who were both born in Poperinge, while the places Proven and Reningelst are subdivisions (deelgemeenten) of the municipality. Including these related resources in the form of search suggestions broadens the range of documents retrieved, since newspaper articles about Reningelst are bound to interest users querying the website about Poperinge.

Another such tool is DBpedia Spotlight’s Rel8,58 which relies on distributional similarity but does not provide customisation parameters like FindRelated. Unfortunately, the web demo appears to have been down for some time, but Mendes and Jakob (2013) provide an overview of results that could be obtained when searching for resources related to the Scala programming language, for instance:

[ {"Clojure": 1.447},
  {"Groovy_(programming_language)": 1.306},
  {"Objective_Caml": 1.281},
  {"Go_(programming_language)": 1.233},
  {"Erlang_(programming_language)": 1.229},
  {"D_(programming_language)": 1.147},
  {"Racket_(programming_language)": 1.127},
  {"F_Sharp_(programming_language)": 1.121},
  {"Ruby_(programming_language)": 1.07} ]

Their relatedness scores are a combination of cosine similarity between resources, represented as vectors in a Vector Space Model, and of a neighbour- hood measure based on the wikiPageLinks dataset (Mendes and Jakob, 2013):

“In this case, a resource r1 is in the neighborhood of r2 if there is a property connecting r1 and r2. After aggregating all resources in the neighborhood of a set of query terms, resources that are more related to all terms should appear more often.”

Search suggestions based on distance and relatedness would allow users of the Historische Kranten website to discover more about their topics of predilection and to enrich their queries seamlessly with semantic content.
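The cosine component of such relatedness scores is the standard vector-space measure; a minimal sketch over sparse term-weight vectors (the example vectors are invented, not taken from the wikiPageLinks dataset):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as
    {term: weight} dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

scala = {"programming": 1.0, "jvm": 0.8, "functional": 0.6}
clojure = {"programming": 1.0, "jvm": 0.5, "lisp": 0.9}
print(round(cosine(scala, clojure), 2))  # 0.69
```

Because the measure only depends on the angle between the vectors, it is insensitive to document length, which makes it suitable for comparing DBpedia resources of very different sizes.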

58 https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Rel8

3.2.2 Related resources and data visualisation

In order to demonstrate the practicability of the approach proposed in this dissertation, a prototype is provided at http://mastic.ulb.ac.be/ypres/. This portal includes two proof-of-concept applications offering data visualisation and knowledge discovery related to the Historische Kranten corpus.

The Place Browser makes it possible to visualise on a map the prominent locations in the corpus and to learn more about them. As seen in Figure V.8, most of these locations are clustered in the Westhoek region.59 Dot sizes are proportional to the prominence of the locations in the corpus. By hovering the mouse over a dot, the user can see the number of potential documents about the associated place, and can click it to access relevant information. This application is based on the Google Geochart visualisation technology.60

Figure V.8: Place Browser

59 See http://www.westhoekverbeeldt.be/.
60 https://developers.google.com/chart/interactive/docs/gallery/geochart

The markers for each location are automatically plotted on the map from their coordinates, extracted from DBpedia once they have been disambiguated by MERCKX. A sample of the resulting JavaScript code is provided below:

var data = google.visualization.arrayToDataTable([
    ["Lat", "Lng", "Place", "Hits"],
    [50.8345035, 2.9225399, "Zillebeke", 398],
    [50.9002575, 3.0207328, "Passendale", 333],
    [50.7971484, 2.7464372, "Westouter", 252],
    [50.8170545, 2.7634047, "Reningelst", 163],
    [50.8492265, 2.8779465, "Ieper", 148],
    ...
]);

The demo app does not yet use the total numbers of occurrences from the corpus, but rather the hit counts from the locations most commonly searched by users (see Chapter III). The links behind the dots currently point to the Dutch Wikipedia, but they could easily be replaced by other resources once the tool is implemented on the Historische Kranten website.

The People Finder (Figure V.9) lets users learn about famous people born in the Westhoek (or any other place, for that matter) in order to arouse interest in important historical figures mentioned in the periodicals.

Figure V.9: People Finder

This tool, coded in PHP, dynamically queries DBpedia in order to find all persons linked to a given place and retrieves additional information about them, along with their pictures. The full code is available on GitHub,61 and a sample result for Ypres is shown in Figure V.10.

61 https://github.com/ulbstic/ypres/tree/master/people

Figure V.10: Discovering related entities
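To illustrate the kind of request involved, a query of roughly this shape could be sent to the public DBpedia SPARQL endpoint; the exact properties used by the People Finder are not reproduced here, so dbo:birthPlace and dbo:thumbnail are assumptions for the sketch:

```python
def people_born_in(place_uri, limit=10):
    """Build a SPARQL query listing persons born in a given place,
    with an English label and an optional portrait."""
    return f"""
    SELECT ?person ?name ?thumb WHERE {{
      ?person dbo:birthPlace <{place_uri}> ;
              rdfs:label ?name .
      OPTIONAL {{ ?person dbo:thumbnail ?thumb }}
      FILTER (lang(?name) = "en")
    }} LIMIT {limit}
    """

print(people_born_in("http://dbpedia.org/resource/Ypres"))
```

Making the thumbnail OPTIONAL keeps persons without a picture in the result set, which matters for historical figures with sparse DBpedia entries.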

3.3 Generalisation

Portability is the last, but not the least, of the eight criteria listed by the SQuaRE ISO standard (see Section 2.1.2). It is indeed an important dimension demonstrating the capacity of a tool to be used across a wide range of contexts. But if, as we have suggested in Chapters III and IV, empirical objects escape any form of deterministic formalisation, then how can our approach be properly generalised? In other words, how can we be sure that the method and tools we have proposed can be adapted successfully to other uses, and that the results obtained can be replicated? Hermeneutics, the science of interpretation, helps us to break this deadlock (Boydens, 1999, p. 470):

“L’herméneutique nous enseigne que si les réalités d’ordre humain ou social ne peuvent faire l’objet d’une approche déterministe, il est néanmoins possible de les appréhender dans une perspective généralisante. [...] [C]es enseignements révèlent qu’en l’absence de critère de validation déterministe, une approche interprétative permet d’obtenir des résultats opérationnels [...]” [Hermeneutics teaches us that while human or social realities cannot be the object of a deterministic approach, it is nevertheless possible to apprehend them from a generalising perspective. [...] These lessons reveal that, in the absence of a deterministic validation criterion, an interpretative approach makes it possible to obtain operational results [...]]

For Elias (1996, p. 99), “les problèmes du temps ne se laissent pas ranger dans des cases correspondant à la répartition des disciplines scientifiques actuellement prévalente et à la compartimentalisation qui en découle de notre appareil conceptuel” [the problems of time cannot be filed into boxes corresponding to the currently prevailing division of scientific disciplines and to the resulting compartmentalisation of our conceptual apparatus]. Our field of application remains within the humanities, but can it be generalised to other collections of data? The danger of in-depth analysis of empirical objects is in fact to be so restrictive as to become completely unrepresentative of the field as a whole.

In this last section, we will test the portability of our methodology in three ways. We first evaluate MERCKX on other languages (Section 3.3.1), then on various application domains (Section 3.3.2), and finally on different entity types (Section 3.3.3). To meet these goals, we will use two additional corpora: the Pentaglossal corpus (a mixed collection of documents in five languages including Russian and Chinese) and the Perelman archive (French and English correspondence), which will be described alongside the results obtained.

3.3.1 Other languages

In our experiment, we focused on three languages: English, French, and Dutch. Since two of them are Germanic languages and all three of them Western European languages, it is still unclear how well MERCKX would generalise to content from outside these linguistic groups. The multilingual structure of DBpedia should in theory ensure that our approach is portable to any of the 128 languages it covers, but this hypothesis has yet to be tested.

To do so, we introduce the Pentaglossal corpus,62 a “parallel corpus comprising 113 texts in five languages, namely, English, French, German, Russian, and Chinese” (Forsyth and Sharoff, 2014). The documents range from the very ancient (Old Testament extracts) to the very new (recent news items) and thus cover a broad period of time, exemplifying language evolution. The main advantage of this corpus is to include non-Western languages – Russian and Chinese – thereby allowing the findings made for the three languages of the Ypres corpus to be generalised. The parallel structure of the Pentaglossal corpus also makes it easy to check the validity of the information extracted without mastering these languages. Texts are classified into thirteen categories, as shown in Table V.10, each text having a translation equivalent in the other four languages. The corpus also has the advantage of being very diverse, contrary to other parallel corpora such as Europarl (Koehn, 2005), which is “homogeneous in terms of its topics and genres, [making] it difficult to generalize any results obtained from it” (Forsyth and Sharoff, 2014).

62 http://corpus.leeds.ac.uk/tools/

Code   Docs   Tokens    Description
Bib1   5      5 503     Bible, Old Testament extracts
Bib2   6      10 140    Bible, New Testament extracts
Corp   6      5 074     Corporate statements of self-promotion
Fict   30     138 704   Fiction: novel chapters or short stories
Marx   5      31 499    Marxist documents
News   10     7 078     News articles
Opac   3      3 766     Open access declarations
Tedi   11     22 758    Transcripts from Ted.com initiative
Tele   14     44 856    Telematics, engineering
Teli   1      2 733     Telematics, instructions
Tels   15     8 974     Telematics, software
Unit   4      19 205    United Nations documents
Wind   3      7 417     Wind energy articles

Table V.10: Pentaglossal corpus, adapted from Forsyth and Sharoff(2014)

In order to isolate the language variable, we evaluated MERCKX on a similar domain (News) and with the same type of entities (places). Since the Pentaglossal corpus does not provide a gold-standard corpus for entity linking, we used the output of MERCKX on the ten English texts as a reference, and compared the four other languages against it, exploiting the exact parallel between file names. Table V.11 shows the results obtained.

Language   Precision   Recall   F1-score
French     .723        .667     .694
German     .532        .647     .584
Russian    .600        .294     .395
Chinese    .194        .255     .220

Table V.11: Generalisation to other languages

While the absolute scores do not necessarily reflect reality since no human annotation was performed upstream, proceeding in this way still allows for an objective comparison of languages. A similar methodology was used in the Collaborative Annotation of a Large Biomedical Corpus (CALBC) challenge63 for instance, where a “silver” standard was computed in retrospect from the common output of participating systems rather than produced by annotators.
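Scoring against such a silver standard reduces to set comparison once each parallel file has been run through MERCKX; a minimal sketch of ours, with invented URI sets standing in for the extracted places:

```python
def prf(reference, system):
    """Precision, recall and F1 of a system's URI set against a
    reference URI set (document-level, unordered)."""
    tp = len(reference & system)
    p = tp / len(system) if system else 0.0
    r = tp / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return round(p, 3), round(r, 3), round(f, 3)

english = {"dbr:Paris", "dbr:Berlin", "dbr:Moscow"}  # silver reference
french = {"dbr:Paris", "dbr:Berlin", "dbr:Lyon"}     # system under test
print(prf(english, french))  # (0.667, 0.667, 0.667)
```

Because both sides come from the same extractor, systematic errors shared across languages cancel out, which is precisely why the comparison remains fair even without human annotation.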

63 http://www.ebi.ac.uk/Rebholz-srv/CALBC/

We observe an inverse correlation between F-score and linguistic complexity. While the extraction of places with MERCKX reaches almost 70% on French news, it stays around 60% for German (with lower precision due to the capitalisation of common nouns), 40% for Russian (handling the Cyrillic characters efficiently at 60% precision but lacking in recall due to the limited coverage of the Russian DBpedia) and only just over 20% on Chinese (which is challenging in terms of tokenisation due to the absence of word boundaries).

In a nutshell, we can say that MERCKX is language-agnostic, since it does not include any language-specific component, but not yet truly language-independent in the sense of Bender (2011, see Chapter IV), since the language variable still affects the results significantly. This can be partly explained by the discrepancies between the various linguistic chapters of DBpedia.

3.3.2 Other domains

In the words of Maturana et al. (2013), “Linked Data expresses the mechanical possibilities of discovering knowledge related to almost every domain of human interest”. Exploiting these possibilities is essential to attain a higher degree of understanding of the vast collections of data present on the Web. Although the Pentaglossal corpus was mainly designed to compare the performance of systems on different languages, its topical diversity also makes it a useful referent to evaluate domain-independence. Table V.12 shows the results of MERCKX on the thirteen domains.

Domain   Precision   Recall   F1-score
Bib1     .467        .333     .389
Bib2     .407        .550     .468
Corp     .444        1        .615
Fict     .551        .667     .603
Marx     .385        .588     .465
News     .766        .692     .727
Opac     .400        .667     .500
Tedi     .708        .872     .782
Tele     .111        .200     .143
Teli     .429        1        .600
Tels     .333        .364     .348
Unit     .294        .556     .385
Wind     .769        .769     .769

Table V.12: Generalisation to other domains

To isolate the domain variable, the evaluation was performed on a language from the Historische Kranten corpus: French (Dutch being absent from the Pentaglossal corpus and English again used as a reference), with the place entity type. Surprisingly, the best results are not achieved on news: precision peaks at 77% for wind energy articles and recall reaches 100% for corporate statements and telematics instructions, which offer very different content (although the number of locations mentioned is admittedly small: 4 and 3 respectively), while TED talks yield the best F-score, just over 78%.

To further validate our findings and evaluate the relevance of extracted places, an additional experiment was conducted on yet another corpus: the correspondence of the Belgian logician and philosopher Chaïm Perelman. The Perelman archive, maintained by the Research Group in Rhetoric and Argumentation of the Université libre de Bruxelles, consists of 42 boxes, representing about 30 000 to 40 000 sheets of paper.64

Just under 12 000 of these represent carbon copies of the total outgoing correspondence of Perelman with academics and the wider intellectual world, mixing administrative and scientific letters. The collection roughly covers a quarter of a century, with the oldest letters dating back to 1960 while the newest were produced just a month before Perelman’s death on 22 January 1984. The details of the correspondence figures can be found in Table V.13.

Year   # pages   Year   # pages
1960   565       1972   543
1961   5         1973   707
1962   345       1974   552
1963   547       1975   577
1964   239       1976   499
1965   452       1977   591
1966   458       1978   544
1967   646       1979   393
1968   611       1980   565
1969   859       1981   468
1970   671       1982   0
1971   647       1983   395
TOTAL  11 879

Table V.13: Perelman’s correspondence volume

64http://gral.ulb.ac.be/archives-chaim-perelman

The corpus consists of one PDF per year, with letters in reverse order. OCR was performed with ABBYY FineReader 11 at the ULB library of social sciences. However, the quality of the carbon paper and a lack of parametrisation led to even poorer OCR output (word accuracy of 57.5% and character accuracy of 68.8% on a tested sample) than in the case of the Historische Kranten corpus. The name “Perelman”, for instance, appears in 740 different spellings, although some are much more common than others. Table V.14 shows the 10 most frequent ones with their number of occurrences and percentage of the total.

Spelling    #      %
Perelman    3100   43.7
Perelaan    701    9.9
Paralaan    483    6.8
Perelnan    310    4.4
Paralman    132    1.9
Paralnan    115    1.6
Perelraan   114    1.6
Ferelman    110    1.5
Pereloan    76     1.1
Parelaan    62     0.9

Table V.14: Variations of Perelman’s surname due to OCR errors

MERCKX extracted 12 586 mentions of 951 different places from the 11 879 pages in the correspondence. The most common ones are listed in Table V.15, sorted by decreasing frequency. Unfortunately, there is no structured reference of the places of residence of Perelman’s correspondents against which to check the validity of the extraction process, but most of these locations are present in the summary of the manual inventory of the collection.65 A thorough qualitative analysis nonetheless shows that the results contain some errors. One of the obvious missing entities (false negatives) is “Brussels”, which appears almost 3 700 times in the letters and should therefore precede Paris in the table. This striking absence is due to a technicality in DBpedia: the French label for dbr:Brussels is not “Bruxelles” but “Région de Bruxelles-Capitale”, which of course never appears in its full form in the correspondence of Perelman. As already mentioned in Section 2.3, including redirections in addition to labels would solve this issue quite straightforwardly.
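The redirect-based fix suggested above can be sketched with toy data. The labels are the actual DBpedia values discussed in the text, but the dictionary-based index and the `link` function are hypothetical illustrations, not MERCKX’s implementation.

```python
# A label-only index misses "Bruxelles" because the French rdfs:label of
# dbr:Brussels is "Région de Bruxelles-Capitale"; adding the labels of
# redirect pages (dbo:wikiPageRedirects) recovers the mention.
LABELS = {
    "Région de Bruxelles-Capitale": "dbr:Brussels",
    "Paris": "dbr:Paris",
}

# Surface forms of pages that redirect to a resource.
REDIRECT_LABELS = {
    "Bruxelles": "dbr:Brussels",
}

def link(mention, use_redirects=False):
    """Return the DBpedia resource matching a mention, or None."""
    uri = LABELS.get(mention)
    if uri is None and use_redirects:
        uri = REDIRECT_LABELS.get(mention)
    return uri

print(link("Bruxelles"))                      # false negative without redirects
print(link("Bruxelles", use_redirects=True))  # resolved to dbr:Brussels
```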

65http://perelman.ulb.be/sites/default/files/inventaire89pp_final.pdf

Place           #      Place         #
Paris           1073   Canada        147
France          454    Sydney        146
Belgium         406    Switzerland   126
Jerusalem       345    Mexico City   125
New York City   342    Chicago       105
Israel          241    Liège         101
Leuven          233    Europe        93
Tours           214    Geneva        90
Italy           188    Germany       89
Poland          167    Turin         89
Montreal        148    Oxford        82

Table V.15: Most frequent places in Perelman’s letters

There is also some noise in the results (false positives). Bruxelles is often abbreviated “Brux.” by Perelman, but Brux happens to be a French commune in the Vienne department, causing mentions of it to be mistakenly disambiguated to dbr:Brux. Notwithstanding these few errors, most extracted locations look correct and could be used to draw a map of Perelman’s scientific collaborations. This illustrates the fact that MERCKX is not limited to newspaper articles but can be extended to any type of content.

3.3.3 Other entities

Finally, we should evaluate how well MERCKX adapts to other entity types such as Persons and Organisations. In theory, any class from the DBpedia ontology66 could be used as a valid category to filter entities. In practice, however, some of these classes are not sufficiently populated to achieve decent recall. For instance, the dbo:PhilosophicalConcept class would be of great interest to the users of the Perelman archive, but it appears to be completely empty. Even its parent class dbo:TopicalConcept only contains 3 466 elements, compared with 725 546 places, for instance.

It appears that our sample of 100 texts from Ypres contains very few organisations. MERCKX is able to disambiguate a mention of “Red Cross” to dbr:International_Red_Cross_and_Red_Crescent_Movement, for example, but the number of entities extracted is not high enough to compute precision and recall reliably.

66http://mappings.dbpedia.org/server/ontology/classes/

More organisations can be found in the News subset of the Pentaglossal corpus. As already done for languages and domains, we used English as the reference and evaluated the output of the extraction on French. MERCKX achieved a precision of 79% and a recall of 56%, yielding an F-score of 65.5% on a total of 34 mentions from 10 texts (compared with an F-score of 72.7% for places, see Section 3.3.2). Table V.16 illustrates the diversity of the organisations extracted.

Organisation          # mentions
FC Barcelona          12
Fidesz                4
European Parliament   3
Goldman Sachs         2
State Duma            2
Twitter               2

Table V.16: Organisations from the Pentaglossal News

Famous persons are all but absent from both the Ypres sample and the Pentaglossal News, but they appear quite often in Perelman’s letters. The correspondents of Perelman include many different people: the index nominorum manually built by researchers identifies over 2 000 of them. Only a fraction of them have a DBpedia page, though, which means that recall will necessarily be quite low. In its current form, MERCKX was able to extract 90 mentions of 39 people. Table V.17 displays the most frequent ones.

Person               # mentions
Eugène Dupréel       18
Max Black            7
Nicholas Rescher     5
Philippe Devaux      5
Marvin Farber        4
Abraham Kaplan       3
Michel Meyer         3
Raymond Aron         3
Ruth Barcan Marcus   3
Wayne C. Booth       3

Table V.17: People mentioned in Perelman’s contacts

A concrete application,67 using Geochart again, is the visualisation of the geographical dispersion of Perelman’s correspondents. Figure V.11 shows their concentration across Europe and the East Coast of the United States:

Figure V.11: Geographical dispersion of Perelman’s correspondents

Finally, MERCKX also makes it possible to look for all DBpedia resources without filtering them by type. While this is very difficult to evaluate with a GSC, since virtually every word would have to be annotated with a URI, the field survey from Chapter III showed that users of the Historische Kranten website were interested in concepts such as “war” and “murder”. The extension to any type of content, illustrated in Table V.18, truly demonstrates the portability of our approach.

Concept            # mentions
Civil law notary   17
Profession         15
Pharmacy           14
Price              14
Time               14
Province           11
War                4

Table V.18: Concepts extracted from the Ypres sample

67Demo accessible at http://mastic.ulb.ac.be/perelman/.

Summary

In this last chapter, we came full circle by redefining information extraction with semantic enrichment as knowledge discovery. The material introduced gradually in the previous chapters was here eventually connected and structured in a meaningful way, producing a clear picture of the achievements of this dissertation.

Several tools were surveyed in order to grasp their functionalities, but also their limits, especially in handling multiple languages in an efficient manner. Named-entity recognition systems, first introduced in Chapter I, traditionally stopped after the recognition of an entity’s boundaries and its classification into broad semantic categories. Although some NER tools now go further by providing full disambiguation with LOD resources (entity linking), this evolution remains largely ignored by mainstream computational linguistics, which also continues to regard entities and terms as irreducibly different language units. Building on more advanced semantic annotation tools, we therefore proposed a system of our own called MERCKX, based on the intuitions behind the four research questions delineated in the introduction.

In order to compare our results with state-of-the-art entity linking systems, we first performed a preliminary assessment of four tools (DBpedia Spotlight, Zemanta, Babelfy, and our own MERCKX) in the light of linguistic coverage and the SQuaRE ISO standard criteria, before defining our methodology in terms of objectives, metrics, and corpora. The results show that MERCKX outperforms existing systems for recall and F-score on both the task of simple entity match (without alignment) and that of strong annotation match (with strict character alignment), although Zemanta offers a better precision. Manual correction of the OCR output was also shown to improve the entity linking process by about 5%, contrary to expectations from the literature.
After recalling what is at stake in renewing the old information retrieval model in the face of information overload, we proceeded to imagine proof-of-concept applications that could be implemented in the context of the Historische Kranten project, but could also be generalised to other domains where empirical and multilingual content is prominent. In doing so, we strived to provide collection owners not only with theoretical recommendations as to how value can be added to digitised content in an automated way, but also with practical semantic enrichment code that can easily be adapted to meet a whole range of unforeseen requirements.

Conclusions

Outline

To conclude this dissertation, we focus on the past, the present, and the future. The past looks in retrospect at our journey, which is now almost completed; the present takes stock of our various achievements and of their limitations; the future foresees what remains to be done and how we could accomplish it. These levels of temporality materialise in three complementary sections.

We start in Section 1 with a detailed summary of the five chapters in order to emphasise the progression from our starting point up to the finishing line. Since several discoveries were made along the way, we dwell on each of them to sketch a comprehensive overview of the contents of the thesis and of its operational consequences.

Section 2 recapitulates the principal outcomes, starting with key findings and tentative answers to the four research questions raised in the introduction. Crucially, we emphasise the limitations of our work and justify several aspects that we deliberately chose not to take into account. We also offer a number of recommendations for projects contemplating the semantic enrichment of their content, in the humanities or elsewhere.

Finally, Section 3 paves the way for further research and identifies issues that remain to be tackled in the future, ending with some open questions. These perspectives include the practical implementation of a semantic search engine for the Historische Kranten project, but also new ideas of application domains that could benefit from the approach developed in this dissertation, along with an extension of our temporal framework to tackle the evolution of concepts.


Contents
1 Overview...... 209
2 Outcomes...... 213
2.1 Main findings...... 213
2.2 Limitations...... 215
2.3 Operational recommendations...... 216
3 Perspectives...... 218
3.1 Implementation...... 218
3.2 Extrinsic evaluation...... 219
3.3 Other applications...... 220

1 Overview

In this first section, we offer a summary of the material covered in the five chapters of the present work. This dissertation can be seen as a journey from information extraction to knowledge discovery through the twists and turns of the Semantic Web and entity linking, decoding the map of empirical data and the digital humanities, while overcoming the obstacles of data quality, multilingualism, and language evolution.

Every step brought valuable insight in support of our main thesis, which advocated generalisation over specialisation as a sound, productive approach to semantic enrichment and knowledge discovery, thereby arguing in favour of the interconnection of disciplines and against the development of information extraction techniques specialised for a single language, domain, or type of content. The contribution of each chapter is detailed below.

Information Extraction

Our journey started in Chapter I with a historical glance at the achievements of the last few decades in the automated processing of unstructured content. Because natural languages are intrinsically ambiguous, getting machines to “understand” them has been a major challenge ever since the birth of artificial intelligence. Natural language processing – and information extraction in particular – developed into independent fields of research with the declared goal of tackling this issue.

The refinement of information extraction techniques, mainly named-entity recognition (NER) but also other tasks such as relation detection and event extraction, has made it possible to represent the content of documents in an increasingly satisfying manner, although much work has been invested in English to the detriment of other languages, and full entity disambiguation has remained elusive. To overcome these limitations, fresh input from elsewhere was required, which set us on an epistemological quest.

In its historical meaning, epistemology refers to a branch of philosophy studying the nature of knowledge and ways to acquire it in a useful manner. Investigating knowledge acquisition techniques from different perspectives helped us to move beyond the somewhat technocentrist approach of traditional information extraction, and gave a new breadth to the disambiguation task. But in order to achieve this ideal, resort to comprehensive multilingual resources was required.

Semantic Enrichment with Linked Data

Chapter II introduced the Semantic Web and Linked Data as good practices to transform the loose Web of documents into a new Web of structured data. Building on the work of visionaries such as Paul Otlet, Tim Berners-Lee and others imagined a second generation of technologies that could be used by machines in order to process the contents of the Web without any human intervention.

Relying on W3C standards such as XML, RDF, OWL, and SKOS, the Semantic Web unfortunately did not live up to Berners-Lee’s expectations, but nevertheless resulted in a myriad of new semantic resources, from knowledge bases to ontologies. One of the specificities of these references is to identify each object and property with a uniform resource identifier, allowing in theory to represent them unequivocally, although quality issues can (and will) arise.

By exploiting these resources, we were able to overcome the limitations of a classic approach to information extraction: the task of entity linking goes one step further than NER and provides a full disambiguation of entities, along with links to related information present in the Linked Open Data cloud, making it possible to retrieve additional knowledge. Newly equipped with these techniques, we proceeded to confront them with real-world material.

The Humanities and Empirical Content

In Chapter III, we focused on the specificities of empirical data, as opposed to deterministic facts. The empirical nature of the humanities implies that their objects of study are necessarily subjective, despite claims to the contrary by positivist scholarly trends, of which the digital humanities are but the latest embodiment.

The enthusiasm for quantitative approaches in the humanities is not new, but it reached a new high with the apparition of the concept of distant reading and forecasts about the end of theory in a world ruled by Big Data. While guarding ourselves from the dangers of using computing techniques at all costs, we assessed what could be gained by leveraging existing resources from the Web of Data through the mechanism of semantic enrichment.

To confront theory with practice, we applied this intuition to a real-world case study: the Historische Kranten corpus – a million-document, trilingual (Dutch/French/English) archive. Special attention was paid to the search interests of users, ensuring that our methods and tools were never disconnected from the actual needs of the field, while providing constant feedback through the “boomerang of reality” mechanism.

Quality, Language, and Time

Chapter IV dealt with three particularly pervasive issues, related to the quality of data and resources, the development of multilingual approaches, and the evolution of concepts over time. We saw that data quality can never be defined in absolute terms but is relative to usage, in agreement with the fitness for use principle. Applied to uncurated collections and knowledge bases, this conception of quality allowed us to focus on the implications of adopting Linked Open Data for cultural institutions, in a cost-benefit perspective.

The proliferation of languages is another pitfall affecting many applications, designed for English only whereas our globalised world in general – and the Web in particular – is increasingly multilingual. Mindful not to lock ourselves into an over-specialised approach that would bear little significance outside this specific project, we stressed the importance of portability to other languages, contemplating the possibility of generalising our findings.

Furthermore, natural language is not static but constantly evolves over time, with terms changing their meaning and new concepts appearing continuously. This makes ambiguities even trickier to manage, especially in the context of empirical content scattered over a long time period. Possible solutions to this problem will be envisioned in Section 3.3.

Knowledge Discovery

Taking the previous elements into consideration, Chapter V strived to converge towards an integrated solution, going beyond information extraction in its limited sense in favour of knowledge discovery, i.e. the automated exploitation of new facts related to a user’s query thanks to the semantic enrichment of documents with Linked Data. After surveying several tools, we proposed our own proof-of-concept system called MERCKX (Multilingual Entity/Resource Combiner and Knowledge eXtractor) to extract and disambiguate relevant terms and entities from text and enrich content automatically with additional information instantly retrieved about these resources, thanks to ontology properties and semantic relatedness heuristics.

Importance was given to evaluation standards and practices in order to compare our system with state-of-the-art applications. The results on both raw and corrected OCR showed that MERCKX outperforms existing entity linking tools by about ten percentage points, although much remains to be done to improve its precision (through clustering algorithms) and recall (through the exploitation of more LOD relations) on the one hand, and to achieve better robustness on poor OCR output on the other hand.

Finally, this work did not stay at the theoretical level. To demonstrate the operational character of our approach, we presented a few applications that could be implemented on the Historische Kranten website, but also generalised to any other project sharing its core characteristics: a vast collection of documents, digitised empirical material (with or without OCR), multiple intermixing languages, and a broad historical period to take into account.

2 Outcomes

“Experience [...] was merely the name men gave to their mistakes.” Oscar Wilde1

In this section, we build upon the summary provided above in order to gather elements of answer to the questions raised at the outset of this work (Section 2.1). Some aspects obviously had to be left out, or simply failed to yield the results we expected, so we highlight these limitations to put the scope of our thesis into perspective (Section 2.2). Notwithstanding some important open issues, we close with operational recommendations for libraries, archives, and museums wishing to put into practice the methods described in this dissertation (Section 2.3).

2.1 Main findings

In order to recapitulate what has come out of our work, let us go back to the four research questions outlined in the introduction of this thesis. Although these questions were linked specifically to chapters I to IV, the underlying themes were present all along and constantly intertwined. In what follows, we provide tentative answers in the light of the theoretical findings from these first four chapters, qualifying our conclusions against the concrete results obtained in the last one.

Question 1 challenged the legitimacy of the distinction between terms and entities from the practical perspective of the end users of extraction tools.

We have seen that, while this artificial separation may still be justified in some cases – including for building a gold-standard corpus with predefined entity types in order to perform a qualitative evaluation of a set of systems – most semantic enrichment tools no longer draw a dividing line between the common and proper nouns they are extracting. In particular, entity linking systems relying on knowledge bases and other semantic resources can serve pages about terms and entities indifferently, since there is no formal criterion to distinguish between the two. Whereas older tools focused either on named-entity recognition or on terminology extraction, this distinction is now blurred to the point that we can see no practical reason to maintain it for knowledge discovery, although it may remain valid in other contexts where no external resources are available.

1The Picture of Dorian Gray, Chapter 4.

Question 2 pondered the added value of interdisciplinary input for information extraction (IE).

After combining findings from several research domains, from linguistics to history and from computer science to philosophy, we confirm that this synergy broadened our horizons and that seemingly irreconcilable disciplines reinforced one another, enriching the theoretical framework used to address our object of study. Specifically, research on the Semantic Web and Linked Data allowed us to overcome some limitations faced by IE, while Linked Data resources were in turn studied critically in the light of data quality research and hermeneutics. This is not to say that interdisciplinarity is beneficial in all scenarios, but in our case it definitely was.

Question 3 wondered if decompartmentalisation could improve IE systems.

While the benefits of domain-specific IE cannot be denied in a number of cases (Chiticariu et al., 2010; Tang et al., 2015), ad hoc approaches developed for a given use are relatively expensive and challenging to maintain over time. In empirical domains where the objects of study are constantly evolving, and particularly in the humanities and cultural heritage where budgets are often limited, the added value of generic IE systems is obvious (Albanese and Subrahmanian, 2007). General-domain knowledge bases will sometimes show their limits with texts containing a high proportion of undocumented entities. Nevertheless, they often outperform specialist systems in terms of coverage (Fafalios et al., 2015), and therefore help to improve recall scores.

Question 4 weighed the pros and cons of language-independent IE.

Increasingly, the focus of natural language processing has been on cross-lingual approaches able to handle the growing amount of multilingual content available in digital form. The shift from linguistic to data-driven and hybrid methods also made systems less reliant on language-specific symbolic rules, encouraging the development of more language-independent methods. Although some NLP components remain necessarily language-dependent – such as deep parsing and stopword processing, for instance – shallow analysis of multilingual text has become commonplace, with state-of-the-art systems achieving comparable performance on several languages. In the context of the Historische Kranten project, we showed that an application (MERCKX) relying on a multilingual knowledge base could efficiently handle three languages, and be extended to others with some adaptation effort.

2.2 Limitations

As mentioned in our introduction, searching for the perfect IE system that will change the lives of end users is necessarily a quixotic quest. Working in a fundamentally interdisciplinary environment is a rewarding experience, but it can at times become frustrating to wander ceaselessly from one theoretical framework to the other, without ever being able to get to the bottom of things.

Although this dissertation may occasionally have sounded quite assertive about the practicability of a fully generalised approach to entity linking, Chapter V put things into perspective by showing that, in practice, things are not always as clear-cut as we would have liked them to be. In particular, the last section about the generalisation of our approach to other languages, domains, and types of entities revealed a mixed picture in terms of practical portability. MERCKX is indeed language-agnostic, since it does not include any language-specific component, but not yet truly language-independent, because the language variable still affects the results significantly.

This can be explained on the one hand by the discrepancies between the various linguistic chapters of DBpedia, and on the other hand by the complexity of some languages, such as the absence of word boundaries in Chinese or the right-to-left writing system of Arabic. A fully language-independent system would obviously need to incorporate labels from more languages, relying on a fallback mechanism when an entity does not exist in a specific tongue.

Similarly, not all domains are equal before the extraction process, and the quality of the knowledge discovered varies widely from one type of entity to the other. Including miscellaneous concepts (common nouns) requires considering all lowercase words as valid candidates – driving up the amount of noise – and filtering them with stopwords would mean compromising our ideal of language-independence.
As a whole, the question of the quality of Linked Open Data remains far from solved, and using these uncurated resources in strategic domains remains problematic.

Another issue we did not explicitly take into account in this work, but which badly affects information accessibility, is the incorrect indexation of documents, making them virtually untraceable. Bade (2004) addresses the impact of bibliographic errors on information retrieval and gives a damning report of encoding practices in the information age.

Finally, Boydens (1999, p. 129, italics hers) notes that “la question de l’adéquation d’une représentation informatique à son objet, en l’absence de référentiel, demeure ouverte” (“the question of the adequacy of a computer representation to its object, in the absence of a referential, remains open”). While entity linking makes it possible to disambiguate between equivocal terms thanks to the resort to distinct URIs, the relationship between the sense and the reference remains as elusive as ever.

2.3 Operational recommendations

Building upon the experience acquired in the context of this dissertation, we can formulate a number of practical recommendations for cultural heritage players eager to make the best of semantic enrichment applications to add value to their collections. Far from theoretical musings, these pieces of advice are very down-to-earth and take into account the reality of the cultural sector, including financial and situational constraints.

Take advantage of what is there

Documents are seldom so special that they require processing in a completely specific way. One should beware of reinventing the wheel. Lots of facts about the world are available out there in knowledge bases, and the tools to exploit them are largely free to use. Before investing time and money into the development of ad hoc resources and applications, one should take the time to assess the needs of users and survey the existing tools likely to meet them. In most cases, building upon open source material will make it possible to achieve significant benefits at a reduced cost.

Imperfect knowledge is (often) better than none

Being conscious of the bad quality of data does not mean being deterred by it. No data are ever perfect, and acceptable quality is always more a matter of balancing conflicting aspects than of aiming for an elusive absolute. Except in application domains where a trifling error can have devastating consequences in terms of financial losses (e.g. trading), trials (e.g. law) or human lives (e.g. medicine), good enough semantic content will always be of higher value to end users than none at all, given the financial constraints.

Go multilingual

In an economy of knowledge largely dominated by the English language, the temptation is great to concentrate all development work on the processing of this natural language only, in the hope that subsequent adaptation to other languages will then be seamless. We have shown, however, that this is a vain hope, and this position is increasingly indefensible in the light of recent developments in language-independent systems. Engaging from the start with the multilingual dimension of global knowledge makes it possible to adopt a comprehensive point of view, while anticipating further developments.

Time matters

Language is not static, and nor is our environment. Concepts appear, evolve, and die out all the time: acknowledging this constantly moving reality requires an extraction model that takes the temporal dimension into account. While the Web is dynamic by nature, Linked Open Data are often more rigid and exploited in an artificially fixed form, materialised by database dumps exported at a given time. Being conscious of this crucial limitation and attempting to remedy it by designing mechanisms able to track the evolution of concepts over time is an essential part of any information extraction system, as will be further discussed in Section 3.3.

Graphs are good

Blanke and Kristel (2013) investigated the added value of graph databases for the humanities. They conclude that, “with their emphasis on relationships, graph databases are particularly well suited for historical research in particular and humanities research in general”. Since the content emanating from the humanities is often complex, the authors show “how graph databases integrate with traditional ways of searching and browsing historical collections [to] support more advanced means of access to facts in the documents and enable deep semantically meaningful access to the documents”. The focus on relationships is essential. Whereas links in relational databases can only be represented through computationally expensive join operations between tables, “relationships are first-class citizens in graph databases”. Graph databases are therefore good candidates for the organisation of knowledge and offer very efficient traversal algorithms, enabling the discovery of semantically-related information with a performance superior to most traditional databases.
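To illustrate why relationship traversal is cheap in a graph model, here is a minimal breadth-first traversal over a toy adjacency structure. The triples are invented for illustration and the code is a sketch of the traversal idea, not a graph database.

```python
from collections import deque

# Toy knowledge graph: each edge mirrors an RDF-style triple
# (subject, property, object). Data are illustrative only.
GRAPH = {
    "dbr:Chaïm_Perelman": [("birthPlace", "dbr:Warsaw"),
                           ("almaMater", "dbr:Université_libre_de_Bruxelles")],
    "dbr:Université_libre_de_Bruxelles": [("city", "dbr:Brussels")],
    "dbr:Warsaw": [("country", "dbr:Poland")],
    "dbr:Brussels": [("country", "dbr:Belgium")],
}

def related(start, max_hops=2):
    """Breadth-first traversal: collect resources within max_hops edges."""
    seen, frontier = {start}, deque([(start, 0)])
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for prop, target in GRAPH.get(node, []):
            if target not in seen:
                seen.add(target)
                found.append((prop, target, depth + 1))
                frontier.append((target, depth + 1))
    return found

for prop, target, hops in related("dbr:Chaïm_Perelman"):
    print(hops, prop, target)
```

Each hop here is a constant-time dictionary lookup, whereas the relational equivalent would require a join per hop; this is the design choice the quoted authors emphasise.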

Above all, experiment

With the democratisation of NLP tools and the proliferation of open source solutions available online, experimenting with semantic enrichment no longer requires a degree in linguistics or IT. With a hands-on mindset, any motivated individual can get to grips with the basics of entity recognition and linking, and understand how to take advantage of these technologies in order to improve the daily experience of end users. The Free Your Metadata project2 helped promote this attitude of open-mindedness towards computational techniques in the cultural heritage sector.

2http://freeyourmetadata.org/

3 Perspectives

“We can only see a short distance ahead, but we can see plenty there that needs to be done.” Alan Turing (1950)

This final section brings together some issues that were either outside the scope of this dissertation, or that could not be tackled within its timeframe due to practical limits. First and foremost, MERCKX remains to date in a proof-of-concept state and has not yet been implemented into the Historische Kranten website, although we have plans to do so in the near future (Section 3.1). The obvious step following this implementation is the evaluation of the relevance of this system for end users, which is discussed in Section 3.2. Lastly, we propose in Section 3.3 to extend our model to application domains from outside the humanities, and imagine how this could prove useful in totally different contexts, with a special emphasis on temporal issues.

3.1 Implementation

We are now looking forward to integrating the MERCKX algorithms into the Historische Kranten project’s Web interface, in order to see how it improves the search experience of end users. Following a conclusive meeting in Ypres on 9 June 2015, we were invited to travel to the Picturae headquarters in Heiloo (near Amsterdam) on 21 September 2015 to give a full presentation to the CEO and have an in-depth talk with developers (see Appendix C for a follow-up of this meeting).

Before materialising into a full-fledged application, MERCKX still needs development work to gain in maturity and robustness, although the fundamental components are already in place. Following the preliminary evaluation performed in Chapter V, the source code could benefit from a few improvements before the workflow is launched on the whole million-document collection. Reflections on what remains to be done are provided below.

To address the issue of low recall, we could leverage other links in addition to the explicit rdfs:label relationship between a concept and the terms expressing it in various languages. For instance, the dbo:wikiPageRedirects property references redirection pages, which in turn contain extra labels. Similarly, the owl:sameAs property could be used to exploit resources from the other language chapters of DBpedia, but also from external ontologies containing their own types: yago:YagoGeoEntity, wikidata:Q486972 (human settlement), or http://schema.org/Place for instance.
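One way to harvest these additional labels would be a SPARQL query that takes the union of direct rdfs:label triples and the labels of pages redirecting to a resource. The sketch below only builds such a query string; the restriction to dbo:Place and the choice of Dutch labels are illustrative assumptions, not part of the current MERCKX workflow.

```python
def build_label_query(entity_class="dbo:Place", lang="nl"):
    """Build a SPARQL query collecting both direct labels and the labels
    of redirect pages pointing to resources of the given class."""
    return """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
SELECT DISTINCT ?uri ?label WHERE {
  ?uri a %s .
  { ?uri rdfs:label ?label . }
  UNION
  { ?redirect dbo:wikiPageRedirects ?uri ;
              rdfs:label ?label . }
  FILTER (lang(?label) = "%s")
}""" % (entity_class, lang)

query = build_label_query()
print(query)
```

Such a query could be submitted to a DBpedia SPARQL endpoint to extend the label dictionary with redirect variants ("NYC" for "New York City", spelling variants, and so on).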

To boost precision, we plan to limit the negative impact of OCR errors by implementing a clustering algorithm based on the Levenshtein distance. To cross-check the validity of entities, we could experiment with a combination of several knowledge bases instead of DBpedia alone: for place names, the aggregation of GeoNames3 and GeoVocab4 looks promising. Working with the newly digitised version of the corpus instead of the original one used in this thesis should also improve the accuracy of the extraction. Finally, learning from the positive aspects of Zemanta and Babelfy, a better-informed disambiguation process could be set up with a graph-based approach taking into account the context and the semantic proximity of related resources.
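As a first step towards such OCR-tolerant matching, a plain dynamic-programming Levenshtein distance can be used to map a garbled surface form to the closest entry of a gazetteer. The place list, the OCR variant, and the one-edit threshold below are illustrative assumptions, not parameters of MERCKX.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_place(candidate, gazetteer, max_dist=1):
    """Return the closest known place name within max_dist edits, if any."""
    best = min(gazetteer, key=lambda place: levenshtein(candidate, place))
    return best if levenshtein(candidate, best) <= max_dist else None

places = ["Ypres", "Poperinge", "Brussels"]
print(match_place("Ypros", places))  # matches "Ypres" despite the OCR error
```

In practice, the threshold would have to be tuned against the corpus, since a distance that repairs OCR noise can also conflate genuinely distinct place names.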

3.2 Extrinsic evaluation

A limitation we have not yet mentioned is the absence of an external assessment. In Chapter V, we focused on the intrinsic evaluation of entities, testing their formal correctness against a manually annotated gold-standard corpus. In addition, an extrinsic evaluation should be performed in order to assess the relevance of the retrieved entities for end users. This second type of evaluation is crucial because a beautiful theoretical model or tool can prove totally at odds with the needs of users or the reality of fieldwork. Indeed, the fact that an entity is correct does not necessarily mean that it is useful. In other words, “the formal coherence of a computer representation in no way implies that it is useful in practice” (Boydens, 1999, p. 488, our translation).

The relevance of entities is highly context-dependent: for instance, a mention of “Brussels”, correctly disambiguated, could be of interest if found in American literature, but would be quite insignificant in the minutes of the European Parliament. It is also user-dependent, as different kinds of people will have different interests in a given collection of documents. Buckland (1997) remarked that relevance “is now generally considered to be situational and ascribed by the viewer”. It is therefore important to analyse the search behaviour of users in order to have realistic expectations about their needs.

After the implementation phase, we thus plan to interact with users to get feedback about the relevance and usefulness of the extracted entities, and of automatic search suggestions based on semantic relatedness. This empirical survey would involve asking evaluators to query the Historische Kranten collection with the old Lucene-based search engine and with a new one based on MERCKX, in order to compare the results obtained, using the evaluation grid of the SQuaRE ISO standard introduced in Chapter V.

3http://www.geonames.org/
4http://geovocab.org/

In a similar experiment, Miliaraki et al. (2015) performed a large-scale analysis of the usefulness of the Yahoo! Spark system, which allows exploratory entity search by providing users with related entity suggestions based on their query and exploiting the Semantic Web. We hope that this kind of analysis will help to further demonstrate the significance of our work.

3.3 Other applications

As emphasised in Chapter III, the present approach is not limited to cultural heritage content or the humanities but can be extended to any empirical domain. The nature of our case study meant that much effort was devoted to the extraction of places from historical periodicals, but this limitation of scope is by no means inherent to the method used. To illustrate this claim, we review a few potential application domains outside the humanities that could benefit from entity linking, semantic enrichment, and knowledge discovery.

Politicisation of immigration

The SOM project,5 funded by the European FP7 programme, aimed to identify trends in the support of and opposition to migration by analysing political claims in the national press of seven European countries. The methodology used by the project partners was to manually read, cut out, and index relevant news. Although the project ran over three years, only 971 newspaper articles were used in total, casting doubt on the representativeness of the results. A second phase could involve the validation of these results with the help of a larger-scale, automated extraction process. However, identifying claims is more complex than identifying entities, especially in a multilingual context (the SOM corpus covers Dutch, English, French, German, and Spanish). A preliminary study is currently in progress to evaluate the viability of this alternative approach.

Tracking disease outbreaks

The recent H1N1 influenza and Ebola pandemics emphasised the need for very quick responses from the World Health Organisation and other public health agencies in the event of a major outbreak. Projects such as HealthMap6 and GermTraX7 recognise this necessity and exploit the vast amount of data available online, especially on social media, to detect emergencies and allow people in charge to react swiftly. The integration into such tools of biomedical knowledge discovery systems like BioGraph8 (Liekens et al., 2011) could help to improve the tracking of previously unknown viruses and to anticipate the secondary effects of health policies.

5http://www.som-project.eu/
6http://www.healthmap.org/
7http://www.germtrax.com/

Natural disasters

Likewise, the unexpected eruption of the Cotopaxi volcano near Quito (Ecuador) calls for well-coordinated responses within a short period of time. The Global Disaster Alert and Coordination System,9 for instance, is “a cooperation framework between the United Nations, the European Commission and disaster managers worldwide to improve alerts, information exchange and coordination in the first phase after major sudden-onset disasters”.

However, this website is only updated in retrospect, when a major disaster is already under way. Initiatives such as Earth Alerts10 aim to rectify this shortcoming by collecting all signs of potential disasters in an open interface. Tracking seemingly insignificant facts posted online with information extraction techniques is also a promising development of the work presented here, especially when taking the temporal dimension into account.

Interactions between timescales

In Chapter IV, we explored the influence of time on linguistic phenomena and showed, after Boydens (1999), how the Braudelian framework of stratified timescales, enriched with the evolving continuums of Elias, allowed us to account for the asynchronous evolution of language on three temporal layers dynamically interacting with one another: the long term of language change, the medium term of common usage, and the short term of everyday speech.

Tracking the emergence, disappearance, and evolution of concepts is a daunting task. But since concepts are progressively constructed over time, a knowledge discovery system implementing this framework would in theory be able to keep up with the underlying reality in an operational perspective. As with named-entity recognition and entity linking, knowledge bases can be instrumental in the disambiguation of such concepts, even when the terms used to refer to them are constantly evolving. Although we did not develop this system, we offer insights as to how it could be achieved in the future.

8http://www.clips.uantwerpen.be/BiographTA/
9http://www.gdacs.org/
10http://earthalerts.manyjourneys.com/web/

A first possibility would be to monitor the evolution of knowledge bases themselves. This could be done either continuously, by automatically querying live KBs for new concepts,11 or episodically, on discrete dumps downloaded every few months or years.12 Both approaches have advantages and drawbacks. The dynamic version is more up-to-date and makes it possible to detect changes the moment they happen,13 but it is also slower and requires constant resources, in addition to being dependent on the availability of the server. In contrast, the static version is more likely to be outdated, but is also more reliable for observing evolutions in the long run. Between the English DBpedia dump of August 2014 and that of April 2015, that is to say over a mere eight-month period, 594 381 labels were incorporated into the knowledge base and 23 333 disappeared. Monitoring these labels would reflect the fluctuating nature of knowledge representation, but also teach us something about the underlying, evolving reality.

Another promising track would be to analyse a historical corpus, such as the Historische Kranten, in an explicitly diachronic perspective, to show the interactions between the levels of temporality inside a specific timeframe, using external linguistic resources. For instance, a new concept appearing in the corpus (short term) could be tracked to see if it becomes at some point formalised in a dictionary (medium term) and thereby contributes to the evolution of the language in question (long term). Conversely, the disappearance or shift in meaning of concepts in prescriptive grammars could have an impact on the reporting of events in the press, and ultimately on the way people discuss them. This could be performed in a language-independent manner by applying probabilistic topic modelling14 to a collection of documents and comparing the results obtained for each language, in order to determine whether the evolutions are synchronous from one language to another.
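Comparing two static dumps reduces to a set difference over their label inventories. A minimal sketch, with two invented miniature “dumps” standing in for the tab-separated DBpedia label files:

```python
def load_labels(lines):
    """Parse tab-separated 'label<TAB>uri' lines into a set of labels."""
    return {line.split("\t")[0] for line in lines if line.strip()}

# Invented sample data standing in for two dump snapshots.
dump_2014 = ["Petrograd\thttp://dbpedia.org/resource/Saint_Petersburg",
             "Leningrad\thttp://dbpedia.org/resource/Saint_Petersburg"]
dump_2015 = ["Leningrad\thttp://dbpedia.org/resource/Saint_Petersburg",
             "Sint-Petersburg\thttp://dbpedia.org/resource/Saint_Petersburg"]

old, new = load_labels(dump_2014), load_labels(dump_2015)
added, removed = new - old, old - new
print(sorted(added), sorted(removed))
```

Run periodically over successive dumps, such a diff would yield exactly the kind of label turnover figures quoted above for the 2014 and 2015 English DBpedia releases.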
Since there is a risk of an anachronistic bias inherent to the use of contemporary knowledge bases for the analysis of historical content, evaluating the effectiveness of this approach would require a new gold-standard corpus, more representative of the temporal dispersion of the collection. Research along these lines is planned for the coming months.

At the end of this journey, we are left with more questions than answers about the best itinerary to reach the promised land of knowledge discovery, but is that not precisely the way it is meant to be?

11http://live.dbpedia.org/ for instance.
12Like those made available at http://wiki.dbpedia.org/datasets/.
13See Wikipedia Live Monitor: http://wikipedia-live-monitor.herokuapp.com/.
14See https://github.com/ulbstic/topic-modeling-tool-FR.

Appendix A

Source Code

The main program used by our MERCKX system (merckx.py) is shown below. More scripts written for the Historische Kranten project can be found online at https://github.com/ulbstic/ypres, along with our gold-standard corpus and material from the test application at http://mastic.ulb.ac.be/ypres/.

import os, sys, urllib, urllib2
from urllib import unquote_plus as up
from nltk.tokenize import WordPunctTokenizer as tokenizer

# check parameters
txtType = sys.argv[1].upper() if len(sys.argv) > 1 else ""
txtFile = sys.argv[2] if len(sys.argv) > 2 else ""
entityTypes = {"EVE": "Event", "LOC": "Place",
               "PER": "Person", "ORG": "Organisation"}
if txtType not in entityTypes or not os.path.isfile(txtFile):
    print "Syntax: merckx.py <type> <file>"
    print "where <type> is the entity type: EVE LOC PER ORG"
    print "and <file> is the name of the text file to analyze"
    print
    sys.exit(0)

# load labels of entities
labels = {}
entityType = entityTypes[txtType]
filename = "data/labels_" + entityType + ".lst"
with open(filename) as inFile:
    lines = inFile.read().decode("utf8").strip().split("\n")
    for line in lines:
        label, uri = line.split("\t")
        labels[label] = uri


# load + tokenize + extract entities from text file
source = os.path.basename(txtFile)
text = open(txtFile).read().decode("utf8")
tokens = tokenizer().tokenize(text)           # list of tokens
pos = list(tokenizer().span_tokenize(text))   # list of positions
sep = " "      # word separator; may be language-dependent
lastpos = 0    # last annotated position
for i, w in enumerate(tokens):                # for every token
    if len(w) >= 3:                           # ignore less than 3 chars
        w1 = w
        w2 = sep.join(tokens[i:i+2])
        w3 = sep.join(tokens[i:i+3])
        # NEW YORK CITY -> New York City
        w3 = w3.title() if w3.isupper() else w3
        w2 = w2.title() if w2.isupper() else w2
        w1 = w1.title() if w1.isupper() else w1
        comp = ["", w1, w2, w3]
        for n in (3, 2, 1):                   # longest match first
            if i + n - 1 < len(pos) and comp[n] in labels:
                start, end = pos[i][0], pos[i + n - 1][1]
                if start >= lastpos:          # skip overlapping matches
                    uri = labels[comp[n]]
                    lastpos = end
                    # display results
                    print("{0}\t{1}\t{2}\t{3}".format(source, start, end, uri))
                break

Appendix B

Guidelines for Annotators

We here transcribe the detailed guidelines provided to external annotators for the construction of our gold-standard corpus based on the identification of place mentions in a sample of 100 texts from the Historische Kranten corpus:

• Identify all individual mentions of places/locations in the sample

• Additionally, write down the positions of the first and last character

• For instance, the sentence “I work in Brussels and live in Mons.” should result in the following annotations:

– Brussels 10 18
– Mons 31 35

• Place mentions should include (but are not limited to):

– countries, states & provinces
– cities, towns & municipalities
– villages, hamlets & boroughs (deelgemeenten)
– mountains & rivers

• Include places regardless of case, i.e. even when in UPPERCASE

• Do not include street names, nor the places in them (“rue de Lille”)

• Do not include adjectives (“Belgian”) nor genitive forms (“te Yperen”)

• If a place appears several times in a row, annotate all mentions

• When OCR errors are present, try to guess the original place name

• When in doubt, check the original on the Historische Kranten website
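As the Brussels example in the guidelines shows, the offsets follow the slice convention (start inclusive, end exclusive), so annotations can be checked mechanically. A small sanity check in Python:

```python
sentence = "I work in Brussels and live in Mons."

def span(text, mention):
    """Return the (start, end) offsets of a mention, end exclusive."""
    start = text.index(mention)
    return start, start + len(mention)

print(span(sentence, "Brussels"), span(sentence, "Mons"))
```

Running this check over the whole gold standard is a cheap way to catch off-by-one errors before the corpus is used for evaluation.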

Appendix C

Follow-up

On September 21, we travelled to Heiloo in order to discuss the implementation of MERCKX with Picturae. This meeting was very productive, as we exchanged ideas for hours with developers who were definitely interested in potential applications for this website and others they maintain. At the end of the day, Mr Tiessen announced that he could secure a budget for the development of a proof-of-concept semantic search engine based on MERCKX. Concretely, the outcomes of this meeting can be summarised as follows:

• The people from Picturae were not aware of the importance of locations for users (which we demonstrated based on the analytics of the website) and are keen to upgrade the current Solr engine with a specific index of place names, along with a tool for visualising places on a map.

• The relevance of multilingual tools has been mentioned repeatedly, since classic tools such as OpenCalais and AlchemyAPI do not handle Dutch at all.

• They are also interested in other NLP applications such as topic modelling (a research trend explored by our colleague Simon Hengchen in his PhD thesis) and are eager to establish links between their collections (full-text descriptions and structured metadata) through Linked Data.

When the implementation materialises, it will pave the way for an external validation of our results by end users of the Historische Kranten website, who could compare the output of MERCKX with the type and amount of information they obtained with the former search engine. Since the system is open, it is challenging to track users doing regular research, but people using computers in the archive building could be personally contacted, and the CO7 heritage cell is willing to put together a user group in order to conduct an online survey.

Detailed Contents

Contents...... i Acknowledgements...... iii List of Figures...... v List of Tables...... vii List of Abbreviations...... ix

Introduction...... 1 1 Motivation...... 3 2 Objectives...... 7 2.1 Research questions...... 8 2.2 Method...... 10 2.3 Use case...... 12 3 Structure...... 15

I Information Extraction...... 19 1 Background...... 21 1.1 Natural language processing...... 21 1.2 Information extraction...... 24 1.3 Related fields...... 26 2 Named-Entity Recognition...... 29 2.1 Task definition...... 29 2.2 Typologies...... 31 2.2.1 Named-entity typology...... 31 2.2.2 Types of NER systems...... 34 2.3 Entity ambiguity and disambiguation...... 35 2.3.1 Synonymy...... 36 2.3.2 Homonymy and polysemy...... 37 2.3.3 Metonymy...... 38 2.3.4 Disambiguation...... 38 3 Relations and Events...... 40 3.1 Relation detection...... 41


3.1.1 Typology of relations...... 42 3.1.2 Relation detection systems...... 43 3.2 Event extraction and temporal analysis...... 44 3.2.1 Event annotation...... 46 3.2.2 Event extraction systems...... 46 3.3 Template filling...... 48

II Semantic Enrichment with Linked Data...... 51 1 Making Sense of the Web...... 53 1.1 The original vision...... 54 1.2 Data structure and interoperability...... 56 1.2.1 XML and RDF...... 56 1.2.2 SPARQL...... 59 1.2.3 SKOS...... 61 1.3 From the Semantic Web to Linked Data...... 61 2 Semantic Resources...... 64 2.1 Knowledge bases...... 64 2.1.1 DBpedia...... 65 2.1.2 YAGO...... 66 2.1.3 Freebase...... 66 2.1.4 Wikidata...... 67 2.1.5 ConceptNet...... 67 2.2 Ontologies...... 70 2.2.1 Web Ontology Language...... 71 2.2.2 Limits of ontologies...... 72 2.3 Identifiers...... 73 2.3.1 Uniform resource identifiers...... 73 2.3.2 Identifiers and locators...... 75 3 Enriching Content...... 76 3.1 Terminology...... 77 3.1.1 Terms and concepts...... 78 3.1.2 Terms and entities...... 80 3.2 Entity linking...... 81 3.2.1 Wikification...... 82 3.2.2 Semantic Annotation...... 83 3.2.3 Knowledge Base Population...... 84 3.3 Semantic relatedness...... 85

III The Humanities and Empirical Content...... 89 1 Empirical Information...... 91

1.1 Deterministic and empirical data...... 92 1.2 Crossover application domains...... 93 1.3 Specificities of the humanities...... 95 2 Digital Humanities...... 96 2.1 Context...... 96 2.1.1 From humanities computing to digital human- ities...... 97 2.1.2 The era of digitisation...... 97 2.1.3 Information extraction for cultural heritage.. 98 2.2 Close and distant reading...... 100 2.2.1 Close reading and New Criticism...... 100 2.2.2 Distant reading or the end of theory...... 101 2.2.3 Reconciling the two approaches...... 102 2.3 Critiques...... 103 2.3.1 Over-interpretation...... 103 2.3.2 The Hype cycle...... 104 2.3.3 Picking the low-hanging fruit...... 106 3 Historische Kranten...... 107 3.1 Structure...... 109 3.2 Linguistic distribution...... 112 3.2.1 Hard-coded language tag...... 112 3.2.2 Periodical titles...... 113 3.2.3 Language detection...... 114 3.3 People and needs...... 118 3.3.1 Stakeholders...... 118 3.3.2 Field survey...... 119 3.3.3 Specifications...... 121

IV Quality, Language, and Time...... 123 1 Data Quality...... 125 1.1 Fitness for use...... 126 1.2 Optical character recognition...... 128 1.3 Linked Open Data...... 131 1.3.1 owl:sameAs and identity...... 132 1.3.2 Quality of DBpedia...... 134 2 Multilingualism...... 137 2.1 Language-independent information extraction...... 138 2.1.1 Multilingual NER...... 140 2.1.2 Other cross-lingual applications...... 143 2.2 The Semantic Web in other languages...... 144

2.3 Multilingual corpora...... 146 3 Language Evolution...... 147 3.1 The generative lexicon...... 147 3.2 Stratified timescales...... 148 3.2.1 Application to empirical databases...... 149 3.2.2 Application to language evolution...... 150 3.3 Concept drift...... 151 3.3.1 Application to place names...... 153 3.3.2 Emergence and salience of concepts...... 154

V Knowledge Discovery...... 157 1 MERCKX: A Knowledge Extractor...... 159 1.1 Similar tools...... 160 1.1.1 DBpedia Spotlight...... 161 1.1.2 OpenCalais...... 162 1.1.3 AlchemyAPI...... 163 1.1.4 Stanford NER...... 164 1.1.5 AIDA...... 164 1.1.6 Zemanta...... 165 1.1.7 Babelfy...... 165 1.2 Components...... 166 1.2.1 Python and NLTK...... 167 1.2.2 X-Link...... 168 1.2.3 DBpedia dump...... 169 1.3 Workflow...... 171 1.3.1 Download...... 171 1.3.2 Dictionary...... 172 1.3.3 Tokenisation, spotting, and annotation..... 174 2 Evaluation...... 175 2.1 Preliminary assessment...... 176 2.1.1 Linguistic coverage...... 176 2.1.2 SQuaRE analysis...... 177 2.2 Methodology...... 179 2.2.1 Objective...... 179 2.2.2 Metrics...... 180 2.2.3 Corpus...... 182 2.3 Results and discussion...... 185 2.3.1 Quantitative analysis...... 186 2.3.2 Qualitative analysis...... 187 2.3.3 Impact of OCR...... 188

3 Validation...... 190 3.1 Beyond search engines...... 190 3.2 Applications...... 192 3.2.1 Search suggestions...... 193 3.2.2 Related resources and data visualisation.... 195 3.3 Generalisation...... 197 3.3.1 Other languages...... 198 3.3.2 Other domains...... 200 3.3.3 Other entities...... 203

Conclusions...... 207 1 Overview...... 209 2 Outcomes...... 213 2.1 Main findings...... 213 2.2 Limitations...... 215 2.3 Operational recommendations...... 216 3 Perspectives...... 218 3.1 Implementation...... 218 3.2 Extrinsic evaluation...... 219 3.3 Other applications...... 220

A Source Code...... 223

B Guidelines for Annotators...... 225

C Follow-up...... 226

Detailed Contents...... 227

Bibliography...... 233

Bibliography

(Ackoff, 1989) Russell L. Ackoff. From Data to Wisdom. Journal of Applied Systems Analysis, 16:3–9, 1989.

(Agirre et al., 2012) Eneko Agirre, Ander Barrena, Oier Lopez De Lacalle, Aitor Soroa, Samuel Fernando, and Mark Stevenson. Matching Cultural Heritage Items to Wikipedia. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pages 1729–1735, Istanbul, Turkey, 2012.

(Ahn, 2006) David Ahn. The Stages of Event Extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events (ARTE), pages 1–8, Sydney, 2006.

(Akbik and Broß, 2009) Alan Akbik and Jürgen Broß. Wanderlust: Extracting Semantic Relations from Natural Language Text using Dependency Grammar Patterns. In Proceedings of the Semantic Search Workshop at the 18th International World Wide Web Conference, Madrid, 2009.

(Albanese and Subrahmanian, 2007) Massimiliano Albanese and V. S. Subrahmanian. T-REX: A Domain-Independent System for Automated Cultural Information Extraction. In Proceedings of the 1st International Conference on Computational Cultural Dynamics (ICCCD), University of Maryland, 2007.

(Alex et al., 2012) Bea Alex, Claire Grover, Ewan Klein, and Richard Tobin. Digitised Historical Text: Does it have to be mediOCRe? In Proceedings of KONVENS, pages 401–409, Vienna, 2012.

(Alexopoulos et al., 2015) Panos Alexopoulos, Ronald Denaux, and Jose Manuel Gomez-Perez. Troubleshooting and Optimizing Named Entity Resolution Systems in the Industry. In The Semantic Web. Latest Advances and New Domains, volume 9088 of Lecture Notes in Computer Science, pages 559–574. Springer, 2015.

(Alfonseca and Manandhar, 2002) Enrique Alfonseca and Suresh Manandhar. An Unsupervised Method for General Named Entity Recognition and Automated Concept Discovery. In Proceedings of the 1st International Conference on General WordNet, Mysore, India, 2002.

(Allan, 1986) Keith Allan. Linguistic Meaning. Routledge, London, 1986.


(Ananiadou and McNaught, 2006) Sophia Ananiadou and John McNaught, editors. Text Mining for Biology and Biomedicine. Artech House, London, 2006.

(Anderson, 2008) Chris Anderson. The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired, 16(07), 2008.

(Aone and Ramos-Santacruz, 2000) Chinatsu Aone and Mila Ramos-Santacruz. REES: A Large-Scale Relation and Event Extraction System. In Proceedings of the 6th Conference on Applied Natural Language Processing, pages 76–83, Seattle, WA, USA, 2000. ACL.

(Ariès, 1986) Philippe Ariès. Le temps de l’histoire. Éditions du Seuil, Paris, 1986.

(Auer and Lehmann, 2007) Sören Auer and Jens Lehmann. What Have Innsbruck and Leipzig in Common? Extracting Semantics from Wiki Content. In The Semantic Web: Research and Applications, pages 503–517. Springer, 2007.

(Augenstein et al., 2014) Isabelle Augenstein, Diana Maynard, and Fabio Ciravegna. Relation Extraction from the Web using Distant Supervision. In Knowledge Engineering and Knowledge Management, pages 26–41. Springer, 2014.

(Babych and Hartley, 2003) Bogdan Babych and Anthony Hartley. Improving Machine Translation Quality with Automatic Named Entity Recognition. In Proceedings of the 7th International EAMT workshop on MT and other Language Technology Tools, Improving MT through other Language Technology Tools: Resources and Tools for Building MT, pages 1–8, Stroudsburg, PA, USA, 2003. ACL.

(Bade, 2004) David Bade. The Theory and Practice of Bibliographic Failure, or, Misinformation in the Information Society. Chuluunbat, Ulan Bator, 2004.

(Bade, 2008) David Bade. Responsible Librarianship: Library Policies for Unreliable Systems. Library Juice Press, Litwin Books, Sacramento, CA, USA, 2008.

(Banko and Brill, 2001) Michele Banko and Eric Brill. Scaling to Very Very Large Corpora for Natural Language Disambiguation. In Proceedings of the 39th Annual Meeting on ACL, ACL ’01, pages 26–33, Stroudsburg, PA, USA, 2001.
(Batini and Scannapieco, 2006) Carlo Batini and Monica Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, New York, 2006.

(Becker et al., 2002) Marcus Becker, Witold Drozdzynski, Hans-Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, and Feiyu Xu. SProUT – Shallow Processing with Unification and Typed Feature Structures. In Proceedings of the International Conference on Natural Language Processing (ICON), Kanazawa, Japan, 2002.

(Bender, 2011) Emily M. Bender. On Achieving and Evaluating Language-Independence in NLP. Linguistic Issues in Language Technology, 6(3):1–26, 2011.

(Bender et al., 2003) Oliver Bender, Franz Josef Och, and Hermann Ney. Maximum Entropy Models for Named Entity Recognition. In Proceedings of the 7th Conference on Natural Language Learning at HLT–NAACL, volume 4, pages 148–151. ACL, 2003.

(Berners-Lee, 2000) Tim Berners-Lee. Weaving the Web. HarperCollins, New York, 2000.

(Berners-Lee et al., 2001) Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web. Scientific American, 284(5):28–37, 2001.

(Bingel and Haider, 2014) Joachim Bingel and Thomas Haider. Named Entity Tagging a Very Large Unbalanced Corpus: Training and Evaluating NE Classifiers. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland, 2014.

(Bird et al., 2009) Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Sebastopol, CA, USA, 2009.

(Bizer et al., 2009a) Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data – The Story So Far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009a.

(Bizer et al., 2009b) Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia – A Crystallization Point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009b.

(Blanke and Kristel, 2013) Tobias Blanke and Conny Kristel. Integrating Holocaust Research. International Journal of Humanities and Arts Computing, 7(1–2):41–57, 2013.

(Blanke et al., 2012) Tobias Blanke, Michael Bryant, and Mark Hedges. Open Source Optical Character Recognition for Historical Research. Journal of Documentation, 68(5):659–683, 2012.

(Bledsoe and Browning, 1959) Woodrow Wilson Bledsoe and Iben Browning. Pattern Recognition and Reading by Machine. In Proceedings of the 9th Eastern Joint Computer Conference, pages 225–232, Boston, MA, USA, 1959.

(Bouillon, 1997) Pierrette Bouillon. Polymorphie et sémantique lexicale: le cas des adjectifs. PhD thesis, Paris 7, 1997.

(Bouillon et al., 2000) Pierrette Bouillon, Cécile Fabre, Pascale Sébillot, and Laurence Jacqmin. Apprentissage de ressources lexicales pour l’extension de requêtes. Traitement automatique des langues, 41(2):367–393, 2000.

(Bouillon et al., 2001) Pierrette Bouillon, Vincent Claveau, Cécile Fabre, and Pascale Sébillot. Using Part-of-Speech and Semantic Tagging for the Corpus-Based Learning of Qualia Structure Elements. In Proceedings of the 1st International Workshop on Generative Approaches to the Lexicon, Geneva, Switzerland, 2001.

(Bourigault et al., 1996) Didier Bourigault, Isabelle Gonzalez-Mullier, and Cécile Gros. LEXTER, a Natural Language Processing Tool for Terminology Extraction. In Proceedings of the 7th Euralex International Congress, pages 771–779, Göteborg, Sweden, 1996.

(Boydens, 1999) Isabelle Boydens. Informatique, normes et temps. Bruylant, Brussels, 1999.

(Boydens, 2011) Isabelle Boydens. Strategic Issues Relating to Data Quality for E-Government: Learning from an Approach Adopted in Belgium. In Saïd Assar, Imed Boughzala, and Isabelle Boydens, editors, Practical Studies in E-Government: Best Practices from Around the World, pages 113–130. Springer, 2011.

(Braudel, 1949) Fernand Braudel. La Méditerranée et le monde méditerranéen à l’époque de Philippe II. Armand Colin, Paris, 1949.

(Bresnan and Kaplan, 1982) Joan Wanda Bresnan and Ronald M. Kaplan. Introduction: Grammars as Mental Representations of Language. In Joan Wanda Bresnan, editor, The Mental Representation of Grammatical Relations. MIT Press, Cambridge, MA, USA, 1982.

(Briet, 1951) Suzanne Briet. Qu’est-ce que la documentation ? Éditions documentaires, industrielles et techniques, Paris, 1951.

(Brooks, 1947) Cleanth Brooks. The Well Wrought Urn. Studies in the Structure of Poetry. Reynal & Hitchcock, 1947.

(Brun et al., 2007) Caroline Brun, Maud Ehrmann, and Guillaume Jacquet. XRCE-M: A Hybrid System for Named Entity Metonymy Resolution. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 488–491, Prague, 2007. ACL.

(Brun et al., 2009) Caroline Brun, Maud Ehrmann, and Guillaume Jacquet. Résolution de métonymie des entités nommées: proposition d’une méthode hybride. Traitement Automatique des Langues (TAL), 50:87–110, 2009.

(Buckland, 1997) Michael K. Buckland. What is a “Document”? Journal of the American Society for Information Science, 48(9):804–809, 1997.

(Buitelaar and Cimiano, 2014) Paul Buitelaar and Philipp Cimiano, editors. Towards the Multilingual Semantic Web. Springer, Berlin, 2014.

(Buitelaar et al., 2005) Paul Buitelaar, Philipp Cimiano, and Bernardo Magnini. Ontology Learning from Text: An Overview. In Ontology Learning from Text: Methods, Applications and Evaluation, pages 3–12. IOS Press, 2005.
(Buitinck and Marx, 2012) Lars Buitinck and Maarten Marx. Two-Stage Named-Entity Recognition using Averaged Perceptrons. In Proceedings of the 17th International Conference on Applications of Natural Language to Information Systems (NLDB), volume 7337 of Lecture Notes in Computer Science, pages 171–176, Groningen, The Netherlands, 2012. Springer.

(Bunescu and Pasca, 2006) Razvan C. Bunescu and Marius Pasca. Using Encyclopedic Knowledge for Named Entity Disambiguation. In EACL, volume 6, pages 9–16, 2006.

(Busa, 1980) Roberto Busa. The Annals of Humanities Computing: The Index Thomisticus. Computers and the Humanities, 14(2):83–90, 1980.

(Busemann and Krieger, 2004) Stephan Busemann and Hans-Ulrich Krieger. Resources and Techniques for Multilingual Information Extraction. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal, 2004.

(Byrd and Ravin, 1999) Roy J. Byrd and Yael Ravin. Identifying and Extracting Relations in Text. Technical report, T.J. Watson IBM Research Center, Yorktown Heights, NY, USA, 1999.

(Cabrio et al., 2014) Elena Cabrio, Julien Cojan, and Fabien Gandon. Mind the Cultural Gap: Bridging Language-Specific DBpedia Chapters for Question Answering. In Paul Buitelaar and Philipp Cimiano, editors, Towards the Multilingual Semantic Web, pages 137–154. Springer, 2014.

(Carletta, 1996) Jean Carletta. Assessing Agreement on Classification Tasks: The Kappa Statistic. Computational Linguistics, 22(2):249–254, 1996.

(Carreras et al., 2002) Xavier Carreras, Lluís Màrquez, and Lluís Padró. Named Entity Extraction using AdaBoost. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL), volume 20, pages 1–4. ACL, 2002.

(Cavnar and Trenkle, 1994) William B. Cavnar and John M. Trenkle. N-gram-Based Text Categorization. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval (SDAIR), pages 161–175, Las Vegas, NV, USA, 1994.

(Charton and Torres-Moreno, 2009) Eric Charton and Juan Manuel Torres-Moreno. Classification d’un contenu encyclopédique en vue d’un étiquetage par entités nommées. In Actes de la 16e conférence sur le Traitement Automatique des Langues Naturelles (TALN), Senlis, France, 2009.

(Chieu and Ng, 2002) Hai Leong Chieu and Hwee Tou Ng. Named Entity Recognition: A Maximum Entropy Approach using Global Information. In Proceedings of the 19th International Conference on Computational Linguistics, volume 1, pages 1–7. ACL, 2002.

(Chieu and Ng, 2003) Hai Leong Chieu and Hwee Tou Ng. Named Entity Recognition with a Maximum Entropy Approach. In Proceedings of the 7th Conference on Computational Natural Language Learning (CoNLL), pages 160–163, Edmonton, Canada, 2003.

(Chiticariu et al., 2010) Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Frederick Reiss, and Shivakumar Vaithyanathan. Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1002–1012, Cambridge, MA, USA, 2010.

(Chomsky, 1956) Noam Chomsky. Three Models for the Description of Language. IRE Transactions on Information Theory, 2:113–124, 1956.

(Ciravegna, 2001) Fabio Ciravegna. Adaptive Information Extraction from Text by Rule Induction and Generalisation. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI), pages 1251–1256, Seattle, WA, USA, 2001.

(Cohen, 1960) Jacob Cohen. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37–46, 1960.

(Cornolti et al., 2013) Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita. A Framework for Benchmarking Entity-Annotation Systems. In Proceedings of the 22nd International Conference on the World Wide Web, pages 249–260, Rio de Janeiro, Brazil, 2013.

(Cowie and Lehnert, 1996) Jim Cowie and Wendy Lehnert. Information Extraction. Communications of the ACM, 39(1):80–91, 1996.

(Coyle, 2006) Karen Coyle. Mass Digitization of Books. The Journal of Academic Librarianship, 32(6):641–645, 2006.

(Crombez, 2015) Thomas Crombez. The Document as Event: Assessing the Value of Digital Collections of Theatrical Heritage. In Bruno Forment and Christel Stalpaert, editors, Theatrical Heritage: Challenges and Opportunities, pages 195–205. Leuven University Press, 2015.

(Cucerzan and Yarowsky, 1999) Silviu Cucerzan and David Yarowsky. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, pages 90–99, College Park, MD, USA, 1999.

(Curran and Clark, 2003) James R. Curran and Stephen Clark. Language Independent NER using a Maximum Entropy Tagger. In Proceedings of the 7th Conference on Natural Language Learning at HLT–NAACL, volume 4, pages 164–167. ACL, 2003.

(Daelemans and Hoste, 2002) Walter Daelemans and Véronique Hoste. Evaluation of Machine Learning Methods for Natural Language Processing Tasks. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas, Canary Islands, Spain, 2002.

(Daelemans et al., 1999) Walter Daelemans, Sabine Buchholz, and Jorn Veenstra. Memory-Based Shallow Parsing. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL), pages 53–60, 1999.

(Daiber et al., 2013) Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. Improving Efficiency and Accuracy in Multilingual Entity Extraction. In Proceedings of the 9th International Conference on Semantic Systems, pages 121–124, Graz, Austria, 2013. ACM.

(Davenport and Prusak, 1998) Thomas H. Davenport and Lawrence Prusak. Working Knowledge: How Organizations Manage What They Know. Harvard Business Press, 1998.

(De Meulder and Daelemans, 2003) Fien De Meulder and Walter Daelemans. Memory-Based Named Entity Recognition using Unannotated Data. In Proceedings of the 7th Conference on Natural Language Learning at HLT–NAACL, pages 208–211, 2003.

(de Saussure, 1916) Ferdinand de Saussure. Cours de linguistique générale. Payot, Paris, 1916.

(De Wilde, 2009) Max De Wilde. La réalisation linguistique des actes de langage directifs. Typologie – Sémantique – Pragmatique. Master’s thesis, Université libre de Bruxelles (ULB), 2009.

(De Wilde, 2015) Max De Wilde. Improving Retrieval of Historical Content with Entity Linking. In Proceedings of the 1st International Workshop on Semantic Web for Cultural Heritage, volume 539 of Communications in Computer and Information Science, Poitiers, France, 2015. Springer.

(Derczynski et al., 2015) Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. Analysis of Named Entity Recognition and Linking for Tweets. Information Processing & Management, 51(2):32–49, 2015.

(Deutscher, 2005) Guy Deutscher. The Unfolding of Language. Henry Holt and Company, New York, 2005.

(Deviaene, 2008) Laure Deviaene. Les mots vides : des origines linguistiques au contexte documentaire. Essai d’état de l’art et d’expérimentation à des fins méthodologiques et opérationnelles. Master’s thesis, Université libre de Bruxelles (ULB), 2008.

(Diller, 1996) Karl C. Diller. How Human Languages Cohere: Languages Seen as Artificial Life. In Proceedings of the 18th Annual Conference of the Cognitive Science Society, San Diego, CA, USA, 1996. Psychology Press.

(Doddington et al., 2004) George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), pages 837–840, 2004.

(Dreyfus, 1972) Hubert Dreyfus. What Computers Can’t Do. MIT Press, New York, 1972.

(Dreyfus and Dreyfus, 1986) Hubert Dreyfus and Stuart Dreyfus. Mind Over Machine. Blackwell, Oxford, 1986.

(Drucker, 2012) Johanna Drucker. Humanistic Theory and Digital Scholarship. In Matthew K. Gold, editor, Debates in the Digital Humanities, pages 85–95. University of Minnesota Press, 2012.

(Ehrmann, 2008) Maud Ehrmann. Les entités nommées, de la linguistique au TAL: statut théorique et méthodes de désambiguïsation. PhD thesis, Université Paris Diderot – Paris VII, 2008.

(Einstein, 1922) Albert Einstein. Geometry and Experience. Methuen & Co., London, 1922.

(Elias, 1996) Norbert Elias. Du temps. Fayard, 1996.

(Enache and Angelov, 2010) Ramona Enache and Krasimir Angelov. Typeful Ontologies with Direct Multilingual Verbalization. In Proceedings of the 2nd Workshop on Controlled Natural Languages (CNL), Marettimo Island, Sicily, Italy, 2010.

(Exner and Nugues, 2012) Peter Exner and Pierre Nugues. Entity Extraction: From Unstructured Text to DBpedia RDF Triples. In Proceedings of the Web of Linked Entities Workshop (WoLE), pages 58–69, Boston, MA, USA, 2012.

(Fafalios et al., 2014) Pavlos Fafalios, Manolis Baritakis, and Yannis Tzitzikas. Configuring Named Entity Extraction through Real-Time Exploitation of Linked Data. In Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS), Thessaloniki, Greece, 2014.

(Fafalios et al., 2015) Pavlos Fafalios, Manolis Baritakis, and Yannis Tzitzikas. Exploiting Linked Data for Open and Configurable Named Entity Extraction. International Journal on Artificial Intelligence Tools (IJAIT), 24(02), April 2015.

(Farmakiotou et al., 2000) Dimitra Farmakiotou, Vangelis Karkaletsis, John Koutsias, George Sigletos, Constantine D. Spyropoulos, and Panagiotis Stamatopoulos. Rule-Based Named Entity Recognition for Greek Financial Texts. In Proceedings of the Workshop on Computational Lexicography and Multimedia Dictionaries (COMLEX), pages 75–78, Patras, Greece, 2000.

(Fayyad et al., 1996) Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From Data Mining to Knowledge Discovery in Databases. AI Magazine, 17(3):37–54, 1996.

(Feldman and Sanger, 2007) Ronen Feldman and James Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2007.

(Fellbaum, 1998) Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA, 1998.

(Fenn and Raskino, 2008) Jackie Fenn and Mark Raskino. Mastering the Hype Cycle: How to Choose the Right Innovation at the Right Time. Harvard Business Press, 2008.

(Fernando and Stevenson, 2012) Samuel Fernando and Mark Stevenson. Adapting Wikification to Cultural Heritage. In Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 101–106, Avignon, France, 2012. ACL.

(Finkel et al., 2005) Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating Non-Local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the ACL, pages 363–370, Ann Arbor, MI, USA, 2005.

(Florian, 2002) Radu Florian. Named Entity Recognition as a House of Cards: Classifier Stacking. In Proceedings of the 6th Conference on Natural Language Learning, volume 20, pages 1–4. ACL, 2002.

(Florian et al., 2003) Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. Named Entity Recognition through Classifier Combination. In Proceedings of the 7th Conference on Natural Language Learning (CoNLL), volume 4, pages 168–171. ACL, 2003.

(Forsyth and Sharoff, 2014) Richard S. Forsyth and Serge Sharoff. Document Dissimilarity Within and Across Languages: A Benchmarking Study. Literary and Linguistic Computing, 29(1):6–22, 2014.

(Francis, 1964) Winthrop Nelson Francis. A Standard Sample of Present-Day English for Use with Digital Computers. Report to the U.S. Office of Education on Cooperative Research Project No. E-007, 1964.

(Frawley et al., 1992) William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine, 13(3):57–70, 1992.

(Frege, 1960) Gottlob Frege. On Sense and Reference. In Peter Geach and Max Black, editors, Translations from the Philosophical Writings of Gottlob Frege, pages 56–78. Blackwell, Oxford, 2nd edition, 1960.

(Frontini et al., 2015) Francesca Frontini, Carmen Brando, and Jean-Gabriel Ganascia. Semantic Web Based Named Entity Linking for Digital Humanities and Heritage Texts. In Proceedings of the 1st International Workshop on Semantic Web for Scientific Heritage at the 12th ESWC 2015 Conference, pages 77–88, Portorož, Slovenia, 2015.

(Futrell et al., 2015) Richard Futrell, Kyle Mahowald, and Edward Gibson. Large-Scale Evidence of Dependency Length Minimization in 37 Languages. Proceedings of the National Academy of Sciences of the United States of America, 2015.

(Gabrilovich and Markovitch, 2007) Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using Wikipedia-Based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 1606–1611, Hyderabad, India, 2007.

(Garrette and Klein, 2009) Dan Garrette and Ewan Klein. An Extensible Toolkit for Computational Semantics. In Proceedings of the 8th International Conference on Computational Semantics, pages 116–127, Tilburg, The Netherlands, 2009. ACL.

(Gillam et al., 2009) Michael Gillam, Craig Feied, Jonathan Handler, Eliza Moody, Ben Shneiderman, Catherine Plaisant, Mark Smith, and John Dickason. The Healthcare Singularity and the Age of Semantic Medicine. In The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.

(Ginzburg, 1989) Carlo Ginzburg. Mythes, emblèmes, traces: morphologie et histoire. Flammarion, Paris, 1989.

(Ginzburg, 2002) Carlo Ginzburg. Wooden Eyes: Nine Reflections on Distance. Verso Books, 2002.

(Goutte et al., 2014) Cyril Goutte, Serge Léger, and Marine Carpuat. The NRC System for Discriminating Similar Languages. In Proceedings of the 1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pages 139–145, Dublin, Ireland, 2014.

(Green and Bide, 1996) Brian Green and Mark Bide. Unique Identifiers: A Brief Introduction. Book Industry Communication, 1996.

(Greenberg and Méndez, 2007) Jane Greenberg and Eva Méndez. Knitting the Semantic Web. The Haworth Information Press, Binghamton, NY, USA, 2007.

(Grishman and Sundheim, 1996) Ralph Grishman and Beth Sundheim. Message Understanding Conference–6: A Brief History. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), pages 466–471, 1996.

(Gruber, 1995) Thomas R. Gruber. Toward Principles for the Design of Ontologies Used for Knowledge Sharing. International Journal of Human–Computer Studies, 43(5):907–928, 1995.

(Guasch and Sola, 1998) Jaume Guasch and Joan Sola. Implications on the Supersymmetric Higgs Sector from Top Quark Decays at the Tevatron. Physics Letters B, 416(3):353–360, 1998.

(Gupta et al., 2005) Suhit Gupta, Gail E. Kaiser, Peter Grimm, Michael F. Chiang, and Justin Starren. Automating Content Extraction of HTML Documents. World Wide Web Journal, 8(2):179–224, 2005.

(Hahn, 2008) Trudi Bellardo Hahn. Mass Digitization. Library Resources & Technical Services, 52(1):18–26, 2008.

(Hallot, 2005) Frédéric Hallot. Multilingual Semantic Web Services. In Proceedings of the OTM Workshop: On the Move to Meaningful Internet Systems, volume 3762 of Lecture Notes in Computer Science, pages 771–779. Springer, 2005.

(Halpin et al., 2010) Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, and Henry S. Thompson. When owl:sameAs Isn’t the Same: An Analysis of Identity in Linked Data. In Proceedings of the 9th International Semantic Web Conference (ISWC2010), pages 305–320, Shanghai, China, 2010.

(Han and Sun, 2011) Xianpei Han and Le Sun. A Generative Entity-Mention Model for Linking Entities with Knowledge Base. In Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies, volume 1, pages 945–954, Portland, OR, USA, 2011.

(Harris, 1962) Zellig S. Harris. String Analysis of Sentence Structure. Mouton, The Hague, 1962.

(Hartig and Zhao, 2010) Olaf Hartig and Jun Zhao. Publishing and Consuming Provenance Metadata on the Web of Linked Data. In Provenance and Annotation of Data and Processes, pages 78–90. Springer, 2010.

(Havasi et al., 2007) Catherine Havasi, Robert Speer, and Jason B. Alonso. ConceptNet 3: A Flexible, Multilingual Semantic Network for Common Sense Knowledge. In Recent Advances in Natural Language Processing, pages 27–29, Borovets, Bulgaria, 2007.

(Havelange, 1993) Carl Havelange. La critique historique aujourd’hui. Réflexions critiques à propos du document et du sens. Revue du cercle des alumni de la Fondation Universitaire, LXIV, 1993.

(Havelange, 2014) Carl Havelange. La condition documentaire. Libres propositions pour une intelligence plurielle du document et de ses usages. La Licorne, 109, 2014.

(Hearst, 1999) Marti A. Hearst. Untangling Text Data Mining. In Proceedings of the 37th Annual Meeting of the ACL, pages 3–10, College Park, MD, USA, 1999.

(Hendrickx and Van Den Bosch, 2003) Iris Hendrickx and Antal Van Den Bosch. Memory-Based One-Step Named-Entity Recognition: Effects of Seed List Features, Classifier Stacking, and Unannotated Data. In Proceedings of the 7th Conference on Natural Language Learning (CoNLL), volume 4, pages 176–179, 2003.

(Hengchen et al., 2015) Simon Hengchen, Seth van Hooland, Ruben Verborgh, and Max De Wilde. L’extraction d’entités nommées: une opportunité pour le secteur culturel? Information, données & documents, 52(2):70–79, 2015.

(Heylighen, 2014) Francis Heylighen. Challenge Propagation: Towards a Theory of Distributed Intelligence and the Global Brain. Spanda Journal, V(2):51–63, 2014.

(Hinzen, 2012) Wolfram Hinzen. The philosophical significance of Universal Grammar. Language Sciences, 34(5):635–649, 2012.

(Hitzler et al., 2009) Pascal Hitzler, Markus Krötzsch, and Sebastian Rudolph. Foundations of Semantic Web Technologies. CRC Press, Boca Raton, FL, USA, 2009.

(Hoffart et al., 2011) Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust Disambiguation of Named Entities in Text. In Conference on Empirical Methods in Natural Language Processing, pages 782–792, 2011.

(Hofstadter, 1980) Douglas R. Hofstadter. Gödel, Escher, Bach: An Eternal Golden Braid. A Metaphorical Fugue on Minds and Machines in the Spirit of Lewis Carroll. Random House, New York, 1980.

(Hogan et al., 2010) Aidan Hogan, Andreas Harth, Alexandre Passant, Stefan Decker, and Axel Polleres. Weaving the Pedantic Web. In Proceedings of the 3rd International Workshop on Linked Data on the Web, Raleigh, NC, USA, 2010.

(Hogenboom et al., 2011) Frederik Hogenboom, Flavius Frasincar, Uzay Kaymak, and Franciska De Jong. An Overview of Event Extraction from Text. In Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011) at the 10th International Semantic Web Conference (ISWC), volume 779, pages 48–57, Bonn, Germany, 2011.

(Holley, 2009) Rose Holley. How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs. D-Lib Magazine, 15(3/4), 2009.

(Hoste, 2009) Véronique Hoste. Name Recognition: NER and WEPS. Unpublished lecture slides, 2009.

(ISO, 2011a) ISO. Information and Documentation – Thesauri and Interoperability with Other Vocabularies – Part 1: Thesauri for Information Retrieval. ISO 25964-1:2011, International Organization for Standardization, Geneva, Switzerland, 2011a.

(ISO, 2011b) ISO. Systems and software engineering – Systems and software Quality Requirements and Evaluation (SQuaRE) – System and software quality models. ISO 25010:2011, International Organization for Standardization, Geneva, Switzerland, 2011b.

(Jankowski, 2009) Nicholas W. Jankowski. E-Research: Transformation in Scholarly Practice. Routledge, New York, 2009.
(Ji and Grishman, 2006) Heng Ji and Ralph Grishman. Data Selection in Semi-Supervised Learning for Name Tagging. In Proceedings of the Workshop on Information Extraction Beyond the Document, pages 48–55, Sydney, Australia, 2006.

(Ji et al., 2014) Heng Ji, Joel Nothman, and Ben Hachey. Overview of TAC-KBP2014 Entity Discovery and Linking Tasks. In Proceedings of the Text Analysis Conference (TAC), 2014.

(Jurafsky and Martin, 2009) Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 2009.

(Juran, 1951) Joseph M. Juran. Quality Control Handbook. McGraw–Hill, New York, NY, USA, 1951.

(Kent, 1978) William Kent. Data and Reality: Basic Assumptions in Data Processing Reconsidered. North Holland, Amsterdam, 1978.

(Kim et al., 2009) Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii. Overview of BioNLP’09 Shared Task on Event Extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 1–9, Boulder, CO, USA, 2009.

(Kim et al., 2010) Seokhwan Kim, Minwoo Jeong, Jonghoon Lee, and Gary Geunbae Lee. A Cross-Lingual Annotation Projection Approach for Relation Detection. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 564–571, Beijing, China, 2010.

(Knuth, 2014) Magnus Knuth. Linked Data Cleansing and Change Management. In Knowledge Engineering and Knowledge Management, pages 201–208. Springer, 2014.

(Koehn, 2005) Philipp Koehn. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86, Phuket, Thailand, 2005.

(Kozareva et al., 2007) Zornitsa Kozareva, Óscar Ferrández, Andrés Montoyo, Rafael Muñoz, Armando Suárez, and Jaime Gómez. Combining Data-Driven Systems for Improving Named Entity Recognition. Data & Knowledge Engineering, 61(3):449–466, 2007.

(Kramdi et al., 2009) Seif Eddine Kramdi, Ollivier Haemmerlé, and Nathalie Hernandez. Approche générique pour l’extraction de relations à partir de textes. In Actes des Journées Francophones d’Ingénierie des Connaissances, pages 97–108. Presses Universitaires de Grenoble, 2009.

(Kripke, 1980) Saul Kripke. Naming and Necessity. Harvard University Press, Cambridge, MA, USA, 1980.

(Kulkarni et al., 2009) Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. Collective Annotation of Wikipedia Entities in Web Text. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining, pages 457–466, Paris, 2009.

(Kupietz et al., 2010) Marc Kupietz, Cyril Belica, Holger Keibel, and Andreas Witt. The German Reference Corpus DeReKo: A Primordial Sample for Linguistic Research. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, 2010.

(Ladrière, 1984) Jean Ladrière. L’articulation du sens: Discours scientifique et parole de la foi, volume 1. Éditions du Cerf, Paris, 1984.

(Le Roy Ladurie, 1973) Emmanuel Le Roy Ladurie. Le territoire de l’historien. Gallimard, Paris, 1973.

(Leal et al., 2012) José Paulo Leal, Vânia Rodrigues, and Ricardo Queirós. Computing Semantic Relatedness using DBpedia. In Proceedings of the 1st Symposium on Languages, Applications and Technologies (SLATE), pages 133–147. Schloss Dagstuhl, 2012.

(Lefever et al., 2009) Els Lefever, Lieve Macken, and Véronique Hoste. Language-Independent Bilingual Terminology Extraction from a Multilingual Parallel Corpus. In Proceedings of the 12th Conference of the European Chapter of the ACL, pages 496–504, Athens, 2009.

(Lehmann et al., 2015) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, and Sören Auer. DBpedia – A Large-Scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal, 6(2):167–195, 2015.

(Liekens et al., 2011) Anthony M. Liekens, Jeroen De Knijf, Walter Daelemans, Bart Goethals, Peter De Rijk, and Jurgen Del-Favero. BioGraph: Unsupervised Biomedical Knowledge Discovery via Automated Hypothesis Generation. Genome Biology, 12(6), 2011.

(Lin et al., 2010) Yiling Lin, Jae-Wook Ahn, Peter Brusilovsky, Daqing He, and William Real. ImageSieve: Exploratory Search of Museum Archives with Named Entity-Based Faceted Browsing. Proceedings of the American Society for Information Science and Technology, 47(1):1–10, 2010.

(Liu and Curran, 2006) Vinci Liu and James R. Curran. Web Text Corpus for Natural Language Processing. In Proceedings of the 11th Conference of the European Chapter of the ACL, Trento, Italy, 2006.

(Lonsdale et al., 2010) Deryle W. Lonsdale, David W. Embley, and Stephen W. Liddle. Ontologies for Multilingual Extraction. In Proceedings of the 1st International Workshop on the Multilingual Semantic Web, pages 1–4, Raleigh, NC, USA, 2010.

(Lui and Baldwin, 2012) Marco Lui and Timothy Baldwin. langid.py: An Off-the-Shelf Language Identification Tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30, Jeju, South Korea, 2012. ACL.

(Maedche and Staab, 2001) Alexander Maedche and Steffen Staab. Ontology Learning for the Semantic Web. IEEE Intelligent Systems, 16(2):72–79, 2001.

(Mahdisoltani et al., 2015) Farzaneh Mahdisoltani, Joanna Biega, and Fabian Suchanek. YAGO3: A Knowledge Base from Multilingual Wikipedias. In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, 2015.

(Makhoul et al., 1999) John Makhoul, Francis Kubala, Richard Schwartz, and Ralph Weischedel. Performance Measures for Information Extraction. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 249–252, Gaithersburg, MD, USA, 1999.

(Marcus et al., 1993) Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

(Markert and Nissim, 2006) Katja Markert and Malvina Nissim. Metonymic Proper Names: A Corpus-Based Account. In Anatol Stefanowitsch and Stefan Th. Gries, editors, Corpus-Based Approaches to Metaphor and Metonymy, pages 152–174. De Gruyter Mouton, 2006.

(Masud et al., 2010) Mohammad M. Masud, Qing Chen, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han, and Bhavani Thuraisingham. Addressing Concept-Evolution in Concept-Drifting Data Streams. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), pages 929–934, Sydney, Australia, 2010.

(Maturana et al., 2013) Ricardo Alonso Maturana, María Ortega, María Elena Alvarado, Susana López-Sola, and María José Ibáñez. Mismuseos.net: Art After Technology. Putting Cultural Data to Work in a Linked Data Platform. LinkedUp Veni Challenge, 2013.

(Mazlack and Feinauer, 1980) Lawrence J. Mazlack and Richard A. Feinauer. Establishing a Basis for Mapping Natural-Language Statements onto a Database Query Language. In Proceedings of the 3rd Annual ACM Conference on Research and Development in Information Retrieval, pages 192–202, Kent, UK, 1980.

(McCallum, 2005) Andrew McCallum. Information Extraction: Distilling Structured Data from Unstructured Text. Queue, 3(9):48–57, 2005.

(McCallum, 2012) Q. Ethan McCallum, editor. Bad Data Handbook: Mapping the World of Data Problems. O’Reilly Media, Sebastopol, CA, USA, 2012.

(Mehdad et al., 2010) Yashar Mehdad, Matteo Negri, and Marcello Federico. Towards Cross-Lingual Textual Entailment. In Proceedings of the 11th Annual Conference of the North American Chapter of the ACL: Human Language Technologies, pages 321–324, Los Angeles, 2010.

(Mendes and Jakob, 2013) Pablo N. Mendes and Max Jakob. Semantic Exploration of Open Source Software Project Descriptions. In Proceedings of the 2nd International Workshop on Intelligent Exploration of Semantic Data (IESD), Paris, France, 2013.

(Mendes et al., 2011) Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8, Graz, Austria, 2011.

(Merchant et al., 1996) Roberta Merchant, Mary Ellen Okurowski, and Nancy Chinchor. The Multilingual Entity Task (MET) Overview. In Proceedings of the TIPSTER Text Program: Phase II, pages 445–447, Vienna, VA, USA, 1996.

(Meunier, 2014) Vincent Meunier. ISO 25964 et formalisation de la distinction entre concept et terme. Présupposés conceptuels et implications opérationnelles. Master’s thesis, Université libre de Bruxelles (ULB), 2014.

(Mikolov et al., 2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781v2, 2013.

(Miles and Bechhofer, 2009) Alistair Miles and Sean Bechhofer. SKOS Simple Knowledge Organization System Reference. W3C Recommendation, 2009.

(Miliaraki et al., 2015) Iris Miliaraki, Roi Blanco, and Mounia Lalmas. From "Selena Gomez" to "Marlon Brando": Understanding Explorative Entity Search. In Proceedings of the 24th International World Wide Web Conference, Florence, Italy, 2015.

(Miller, 1995) George A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39–41, 1995.

(Milne and Witten, 2008) David Milne and Ian H. Witten. Learning to Link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 509–518, Napa Valley, CA, USA, 2008.

(Miwa et al., 2010) Makoto Miwa, Rune Sætre, Jin-Dong Kim, and Jun’ichi Tsujii. Event Extraction with Complex Event Classification using Rich Features. Journal of Bioinformatics and Computational Biology, 8(1):131–146, 2010.

(Moens, 2006) Marie-Francine Moens. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

(Moles, 1995) Abraham A. Moles. Les sciences de l’imprécis. Éditions du Seuil, Paris, 1995.

(Mollá et al., 2006) Diego Mollá, Menno Van Zaanen, and Daniel Smith. Named Entity Recognition for Question Answering. In Proceedings of the 4th Australasian Language Technology Workshop (ALTW), pages 51–58, 2006.

(Mooney and Nahm, 2003) Raymond J. Mooney and Un Yong Nahm. Text Mining with Information Extraction. In Multilingualism and Electronic Language Management: Proceedings of the 4th International MIDP Colloquium, pages 141–160, Bloemfontein, South Africa, 2003.

(Moreau et al., 2007) Fabienne Moreau, Vincent Claveau, and Pascale Sébillot. Combining Linguistic Indexes to Improve the Performances of Information Retrieval Systems: A Machine Learning Based Solution. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), pages 369–387. Le centre de hautes études internationales d’informatique documentaire, 2007.

(Moretti, 2005) Franco Moretti. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso, London, 2005.

(Moro et al., 2014) Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity Linking Meets Word Sense Disambiguation: A Unified Approach. Transactions of the ACL, 2, 2014.

(Moura and Davis, 2014) Tiago Henrique V. M. Moura and Clodoveu Augusto Davis Jr. Integration of Linked Data Sources for Gazetteer Expansion. In Proceedings of the 8th Workshop on Geographic Information Retrieval, Dallas, TX, USA, 2014. ACM.

(Nadeau and Sekine, 2007) David Nadeau and Satoshi Sekine. A Survey of Named Entity Recognition and Classification. Linguisticae Investigationes, 30(1):3–26, 2007.

(Nebhi, 2013) Kamel Nebhi. A Rule-Based Relation Extraction System using DBpedia and Syntactic Parsing. In Proceedings of the 1st International Workshop on NLP and DBpedia, Sydney, Australia, 2013.

(Olson, 2003) Jack E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, San Francisco, 2003.

(Oostdijk et al., 2008) Nelleke Oostdijk, Martin Reynaert, Paola Monachesi, Gertjan Van Noord, Roeland Ordelman, Ineke Schuurman, and Vincent Vandeghinste. From D-Coi to SoNaR: A Reference Corpus for Dutch. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, 2008.

(Otlet, 1934) Paul Otlet. Traité de documentation: le livre sur le livre, théorie et pratique. Editions Mundaneum, 1934.

(Palmer and Day, 1997) David D. Palmer and David S. Day. A Statistical Profile of the Named Entity Task. In Proceedings of the 5th Conference on Applied Natural Language Processing, pages 190–193, Stroudsburg, PA, USA, 1997.

(Paskin, 1999) Norman Paskin. Toward Unique Identifiers. Proceedings of the IEEE, 87(7):1208–1227, 1999.

(Pereira and Warren, 1980) Fernando C. N. Pereira and David H. D. Warren. Definite Clause Grammars for Language Analysis – A Survey of the Formalism and a Comparison with Augmented Transition Networks. Artificial Intelligence, 13(3):231–278, 1980.

(Pinker, 1994) Steven Pinker. The Language Instinct: How the Mind Creates Language. William Morrow and Company, New York, 1994.

(Pomian, 1984) Krzysztof Pomian. L’ordre du temps. Gallimard, Paris, 1984.

(Prost, 1996) Antoine Prost. Douze leçons sur l’histoire. Éditions du Seuil, Paris, 1996.

(Pustejovsky, 1991) James Pustejovsky. The Generative Lexicon. Computational Linguistics, 17(4):409–441, 1991.

(Pustejovsky and Boguraev, 1993) James Pustejovsky and Branimir Boguraev. Lexical Knowledge Representation and Natural Language Processing. Artificial Intelligence, 63(1):193–223, 1993.

(Pustejovsky et al., 2003) James Pustejovsky, José M. Castano, Robert Ingria, Roser Sauri, Robert J. Gaizauskas, Andrea Setzer, Graham Katz, and Dragomir R. Radev. TimeML: Robust Specification of Event and Temporal Expressions in Text. New Directions in Question Answering, 3:28–34, 2003.

(Raimond et al., 2013) Yves Raimond, Michael Smethurst, Andrew McParland, and Christopher Lowis. Using the Past to Explain the Present: Interlinking Current Affairs with Archives via the Semantic Web. In The Semantic Web – ISWC 2013, pages 146–161. Springer, 2013.

(Rajaraman and Ullman, 2011) Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, Cambridge, MA, USA, 2011.

(Ramakrishnan et al., 2006) Cartic Ramakrishnan, Krys J. Kochut, and Amit P. Sheth. A Framework for Schema-Driven Relationship Discovery from Unstructured Text. In The Semantic Web – ISWC 2006, pages 583–596. Springer, 2006.

(Ramsay and Rockwell, 2012) Stephen Ramsay and Geoffrey Rockwell. Developing Things: Notes towards an Epistemology of Building in the Digital Humanities. In Matthew K. Gold, editor, Debates in the Digital Humanities, pages 75–84. University of Minnesota Press, 2012.

(Ratinov et al., 2011) Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. Local and Global Algorithms for Disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies, volume 1, pages 1375–1384, Portland, OR, USA, 2011.

(Rau, 1991) Lisa F. Rau. Extracting Company Names from Text. In Proceedings of the 7th IEEE Conference on Artificial Intelligence Applications, volume 1, pages 29–32, Miami Beach, FL, USA, 1991.

(Rayward, 1994) W. Boyd Rayward. Visions of Xanadu: Paul Otlet (1868–1944) and Hypertext. JASIS, 45(4):235–250, 1994.

(Redman, 1997) Thomas C. Redman. Data Quality for the Information Age. Artech House, Boston, 1997.

(Richards, 1929) Ivor Armstrong Richards. Practical Criticism: A Study of Literary Judgment. Harcourt Brace Jovanovich, New York, 1929.

(Richman and Schone, 2008) Alexander E. Richman and Patrick Schone. Mining Wiki Resources for Multilingual Named Entity Recognition. In Proceedings of the 46th Annual Meeting of the ACL: Human Language Technologies, pages 1–9, Columbus, OH, USA, 2008.

(Rickert, 1986) Heinrich Rickert. The Limits of Concept Formation in Natural Science: A Logical Introduction to the Historical Sciences. Cambridge University Press, 1986.

(Riloff and Lorenzen, 1998) Ellen Riloff and Jeffrey Lorenzen. Extraction-Based Text Categorization: Generating Domain-Specific Role Relationships Automatically. In Natural Language Information Retrieval, pages 167–196. Kluwer Academic Publishers, 1998.

(Rizzo and Troncy, 2011) Giuseppe Rizzo and Raphaël Troncy. NERD: Evaluating Named Entity Recognition Tools in the Web of Data. In Proceedings of the 1st Workshop on Web Scale Knowledge Extraction (WEKEX), Bonn, Germany, 2011.

(Rizzo and Troncy, 2012) Giuseppe Rizzo and Raphaël Troncy. NERD: a Framework for Unifying Named Entity Recognition and Disambiguation Extraction Tools. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the ACL, pages 73–76, Avignon, France, 2012. ACL.

(Rodriquez et al., 2012) Kepa Joseba Rodriquez, Mike Bryant, Tobias Blanke, and Magdalena Luszczynska. Comparison of Named Entity Recognition Tools for Raw OCR Text. In Proceedings of KONVENS, pages 410–414, Vienna, 2012.

(Romano et al., 2006) Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. Investigating a Generic Paraphrase-Based Approach for Relation Extraction. In Proceedings of the 11th Conference of the European Chapter of the ACL, Trento, Italy, 2006.

(Ruiz and Poibeau, 2015) Pablo Ruiz and Thierry Poibeau. Combining Open Source Annotators for Entity Linking through Weighted Voting. In Proceedings of the 4th Joint Conference on Lexical and Computational Semantics (*SEM), Denver, CO, USA, 2015.

(Russell and Norvig, 2009) Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 3rd edition, 2009.

(Saggion et al., 2007) Horacio Saggion, Adam Funk, Diana Maynard, and Kalina Bontcheva. Ontology-Based Information Extraction for Business Intelligence. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, pages 843–856, Busan, Korea, 2007.
(Saygin et al., 2000) Ayse Pinar Saygin, Ilyas Cicekli, and Varol Akman. Turing Test: 50 Years Later. Minds and Machines, 10:463–518, 2000.

(Schmidt, 2010) Desmond Schmidt. The Inadequacy of Embedded Markup for Cultural Heritage Texts. Literary and Linguistic Computing, 25(3):337–356, 2010.

(Scholz, 2010) Ronny Scholz. Traiter le multilinguisme dans les discours politiques. Possibilités et limites d'une analyse lexicométrique dans un corpus composé de textes de différentes langues. In Intervention au séminaire du 19 mars 2010. Ceditec: Université Paris-Est Créteil Val de Marne (UPEC), 2010.

(Schreibman et al., 2008) Susan Schreibman, Ray Siemens, and John Unsworth. A Companion to Digital Humanities. John Wiley & Sons, Hoboken, NJ, USA, 2008.

(Searle, 1980) John R. Searle. Minds, Brains, and Programs. Behavioral and Brain Sciences, 3:417–424, 1980.

(Segers et al., 2011) Roxane Segers, Marieke van Erp, Lourens van der Meij, Lora Aroyo, Guus Schreiber, Bob Wielinga, Jacco van Ossenbruggen, Johan Oomen, and Geertje Jacobs. Hacking History: Automatic Historical Event Extraction for Enriching Cultural Heritage Multimedia Collections. In Proceedings of the 6th International Conference on Knowledge Capture (K-CAP), Banff, Alberta, Canada, 2011.

(Sekine et al., 2002) Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata. Extended Named Entity Hierarchy. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas, Canary Islands, Spain, 2002.

(Shen et al., 2015) Wei Shen, Jianyong Wang, and Jiawei Han. Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460, 2015.

(Shirky, 2003) Clay Shirky. The Semantic Web, Syllogism, and Worldview. http://www.shirky.com/writings/semantic_syllogism.html, 2003.

(Singh et al., 2011) Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. Large-Scale Cross-Document Coreference using Distributed Inference and Hierarchical Models. In Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies, volume 1, pages 793–803, Portland, OR, USA, 2011.

(Singh et al., 2012) Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. Wikilinks: A Large-Scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia. University of Massachusetts, Amherst, Tech. Rep. UM-CS-2012-015, 2012.

(Snow et al., 2008) Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263, Honolulu, Hawaii, 2008. ACL.

(Spaeth, 2004) Donald Spaeth. Representing Text as Data. Historical Methods, 37(2):218–239, 2004.
(Speck and Ngomo, 2014) René Speck and Axel-Cyrille Ngonga Ngomo. Named Entity Recognition using FOX. In Proceedings of the Posters & Demonstrations Track within the 13th International Semantic Web Conference (ISWC), Riva del Garda, Italy, 2014.

(Speer and Havasi, 2012) Robert Speer and Catherine Havasi. Representing General Relational Knowledge in ConceptNet 5. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pages 3679–3686, Istanbul, Turkey, 2012.

(Steiner, 2014) Thomas Steiner. Bots vs. Wikipedians, Anons vs. Logged-Ins. In Proceedings of the 23rd International World Wide Web Conference, pages 547–548, Seoul, Korea, 2014.

(Stern, 2013) Rosa Stern. Identification automatique d'entités pour l'enrichissement de contenus textuels. PhD thesis, Université Paris-Diderot – Paris VII, 2013.

(Stone, 1979) Lawrence Stone. The Revival of Narrative: Reflections on a New Old History. Past and Present, 85(1):3–24, 1979.

(Sundheim, 1995) Beth M. Sundheim. Overview of Results of the MUC-6 Evaluation. In Proceedings of the 6th Conference on Message Understanding (MUC), pages 13–31, Stroudsburg, PA, USA, 1995.

(Svensson, 2010) Patrik Svensson. The Landscape of Digital Humanities. Digital Humanities Quarterly, 4(1), 2010.

(Taine, 1875) Hippolyte Taine. Les origines de la France contemporaine, volume 1. Laffont, Paris, 1875.

(Tamilin et al., 2010) Andrei Tamilin, Bernardo Magnini, Luciano Serafini, Christian Girardi, Mathew Joseph, and Roberto Zanoli. Context-Driven Semantic Enrichment of Italian News Archive. In Proceedings of the 7th International Conference on the Semantic Web: Research and Applications, volume 1, pages 364–378, Heraklion, Greece, 2010.

(Tang et al., 2015) Jie Tang, Zhanpeng Fang, and Jimeng Sun. Incorporating Social Context and Domain Knowledge for Entity Recognition. In Proceedings of the 24th International World Wide Web Conference, Florence, Italy, 2015.

(Tanner et al., 2009) Simon Tanner, Trevor Muñoz, and Pich Hemy Ros. Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library's 19th Century Online Newspaper Archive. D-Lib Magazine, 15(7/8), 2009.

(Tjong Kim Sang, 2002) Erik F. Tjong Kim Sang. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the 6th Conference on Computational Natural Language Learning (CoNLL), pages 155–158, Taipei, Taiwan, 2002.

(Tjong Kim Sang and De Meulder, 2003) Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the 7th Conference on Computational Natural Language Learning (CoNLL), pages 142–147, Edmonton, Canada, 2003.

(Tsymbal, 2004) Alexey Tsymbal. The Problem of Concept Drift: Definitions and Related Work. Technical report, Computer Science Department, Trinity College Dublin, 2004.

(Turing, 1950) Alan M. Turing. Computing Machinery and Intelligence. Mind, LIX:433–460, 1950.

(Tylenda et al., 2014) Tomasz Tylenda, Sarath Kumar Kondreddi, and Gerhard Weikum. Spotting Knowledge Base Facts in Web Texts. In Proceedings of the 4th Workshop on Automated Knowledge Base Construction, pages 1–6, Montreal, Canada, 2014.

(Valsecchi et al., 2015) Fabio Valsecchi, Matteo Abrate, Clara Bacciu, Maurizio Tesconi, and Andrea Marchetti. DBpedia Atlas: Mapping the Uncharted Lands of Linked Data. In Proceedings of the 8th Workshop on Linked Data on the Web (LDOW), Florence, Italy, 2015.

(van Hooland, 2009) Seth van Hooland. Metadata Quality in the Cultural Heritage Sector: Stakes, Problems and Solutions. PhD thesis, Université libre de Bruxelles (ULB), 2009.

(van Hooland and Verborgh, 2014) Seth van Hooland and Ruben Verborgh. Linked Data for Libraries, Archives and Museums: How to Clean, Link and Publish Your Metadata. Facet Publishing, London, 2014.

(van Hooland et al., 2013) Seth van Hooland, Ruben Verborgh, Max De Wilde, Johannes Hercher, Erik Mannens, and Rik Van de Walle. Evaluating the Success of Vocabulary Reconciliation for Cultural Heritage Collections. Journal of the American Society for Information Science and Technology, 64(3):464–479, 2013.

(van Hooland et al., 2015) Seth van Hooland, Max De Wilde, Ruben Verborgh, Thomas Steiner, and Rik Van de Walle. Exploring Entity Recognition and Disambiguation for Cultural Heritage Collections. Digital Scholarship in the Humanities, 30(2):262–279, 2015.

(van Rijsbergen, 1975) Cornelis Joost van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 1975.

(van Zaanen and Freeman, 2004) Menno van Zaanen and Robert John Freeman. Reducing Subjectivity of Natural Language Processing System Evaluation. Unpublished paper, 2004.

(Verborgh and De Wilde, 2013) Ruben Verborgh and Max De Wilde. Using OpenRefine. Packt Publishing, Birmingham, 2013.

(Verborgh et al., 2015) Ruben Verborgh, Seth van Hooland, Aaron Straup Cope, Sebastian Chan, Erik Mannens, and Rik Van de Walle. The Fallacy of the Multi-API Culture: Conceptual and Practical Benefits of Representational State Transfer (REST). Journal of Documentation, 71(2):233–252, 2015.

(Vossen et al., 2012) Piek Vossen, Aitor Soroa, Benat Zapirain, and German Rigau. Cross-Lingual Event-Mining using Wordnet as a Shared Knowledge Interface. In Proceedings of the 6th Global Wordnet Conference, pages 382–390, Matsue, Japan, 2012.

(Vrandečić, 2012) Denny Vrandečić. Wikidata: A New Platform for Collaborative Data Collection. In Proceedings of the 21st International Conference Companion on World Wide Web, pages 1063–1064, Lyon, France, 2012.

(Vrandečić and Krötzsch, 2014) Denny Vrandečić and Markus Krötzsch. Wikidata: A Free Collaborative Knowledgebase. Communications of the ACM, 57(10):78–85, 2014.

(Wang, 1998) Richard Y. Wang. A Product Perspective on Total Data Quality Management. Communications of the ACM, 41(2):58–65, 1998.

(Watrin, 2006) Patrick Watrin. Une approche hybride de l'extraction d'information: sous-langages et lexique-grammaire. PhD thesis, Université Catholique de Louvain, 2006.

(Widmer and Kubat, 1996) Gerhard Widmer and Miroslav Kubat. Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning, 23(1):69–101, 1996.

(Wiener, 1994) Lauren Ruth Wiener. Les avatars du logiciel. Éditions Addison-Wesley France, Paris, 1994.

(Wilks, 2008) Yorick Wilks. The Semantic Web: Apotheosis of Annotation, but What Are Its Semantics? IEEE Intelligent Systems, 23(3):41–49, 2008.

(Wilks and Brewster, 2009) Yorick Wilks and Christopher Brewster. Natural Language Processing as a Foundation of the Semantic Web. Foundations and Trends in Web Science, 1(3–4):199–327, 2009.

(Wismann, 2012) Heinz Wismann. Penser entre les langues. Albin Michel, Paris, 2012.

(Wong et al., 2009) Wilson Wong, Wei Liu, and Mohammed Bennamoun. Acquiring Semantic Relations using the Web for Constructing Lightweight Ontologies. In Advances in Knowledge Discovery and Data Mining, pages 266–277. Springer, 2009.

(Wubben and van den Bosch, 2009) Sander Wubben and Antal van den Bosch. A Semantic Relatedness Metric based on Free Link Structure. In Proceedings of the 8th International Conference on Computational Semantics, pages 355–358, Tilburg, The Netherlands, 2009. ACL.

(Yakushiji et al., 2001) Akane Yakushiji, Yuka Tateisi, Yusuke Miyao, and Jun'ichi Tsujii. Event Extraction from Biomedical Papers using a Full Parser. In Proceedings of the 6th Pacific Symposium on Biocomputing, pages 408–419, Hawaii, USA, 2001.

(Yosef et al., 2011) Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, and Gerhard Weikum. AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables. Proceedings of the 37th International Conference on Very Large Databases (VLDB), 4(12):1450–1453, 2011.

(Zampieri et al., 2014) Marcos Zampieri, Liling Tan, Nikola Ljubešic, and Jörg Tiedemann. A Report on the DSL Shared Task 2014. In Proceedings of the 1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pages 58–67, Dublin, Ireland, 2014.

(Zaveri et al., 2013) Amrapali Zaveri, Dimitris Kontokostas, Mohamed A. Sherif, Lorenz Bühmann, Mohamed Morsey, Sören Auer, and Jens Lehmann. User-Driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, pages 97–104, Graz, Austria, 2013.

(Zhang et al., 2015) Junsheng Zhang, Yunchuan Sun, and Antonio J. Jara. Towards Semantically Linked Multilingual Corpus. International Journal of Information Management, 35(3):387–395, 2015.

(Zheng et al., 2014) Jin Guang Zheng, Daniel Howsmon, Boliang Zhang, Juergen Hahn, Deborah McGuinness, James Hendler, and Heng Ji. Entity Linking for Biomedical Literature. In Proceedings of the 8th International Workshop on Data and Text Mining in Bioinformatics, Shanghai, China, 2014.

(Zhou and He, 2011) Deyu Zhou and Yulan He. Biomedical Events Extraction using the Hidden Vector State Model. Artificial Intelligence in Medicine, 53(3):205–213, 2011.