At the Frontier of NLP, Machine Learning and Semantics
Total Page:16
File Type:pdf, Size:1020Kb
2018-ENST-0062 EDITE - ED 130 Doctorat ParisTech THÈSE pour obtenir le grade de docteur délivré par TELECOM ParisTech Spécialité « Computer Science and Multimedia » présentée et soutenue publiquement par Julien PLU le 21 12 2018 Knowledge Extraction in Web Media : At The Frontier of NLP, Machine Learning and Semantics Directeur de thèse : Dr. Raphaël TRONCY Co-encadrement de la thèse : Dr. Giuseppe RIZZO Jury Pierre-Antoine CHAMPIN, Dr HDR, LIRIS, University Claude Bernard Lyon 1, France, reviewer Harald SACK, Prof, FIZ Karlsruhe, Leibniz Institute for Information Infrastructure, Germany, reviewer Anna Lisa GENTILE, Dr, IBM Research Almaden, USA, examiner Andrea TETTAMANZI, Prof, Inria Sophia Antipolis, France, examiner Christian BONNET, Prof, EURECOM, France, president TELECOM ParisTech école de l’Institut Télécom - membre de ParisTech Dedication I dedicate this thesis to my family, and especially to my grandfather Bernard who always pushed me and encouraged me even when I did not believe in myself anymore. Acknowledgements Thanks to my interested, encouraging and enthusiastic father: he was always keen to know what I was doing and how, although it was difficult to explain him what my work was all about! I am grateful to all my family members and friends who have supported me until the end. Thanks a million to my two great supervisors Prof. Raphaël Troncy and Dr. Giuseppe Rizzo for their amazing supervision and always valuable advice to continue to improve this work again and again. A special thanks goes to the 3cixty project that has provided the funding for this thesis. A special mention to my group members: Jose, Ghislain, Enrico and Pasquale who became good friends along the years. It was fantastic to have the opportunity to work at your side. I would like, as well, to address all my gratitude to my sister of heart, Natacha. I will miss our long talks in my office. Thanks also to my jury members for all their great comments: Dr. Pierre- Antoine CHAMPIN, Prof. Harald SACK, Dr. Anna Lisa GENTILE, Prof. Andrea TETTAMANZI and Prof. Christian BONNET. And finally, last but by not least, also to everyone at EURECOM, it was great to share the laboratory with all of you during these years. Abstract The Web offers a vast amount of structured and unstructured content from which more and more advanced techniques are developed for extracting entities and rela- tions between entities, one of the key elements for feeding the various knowledge graphs that are being developed by major Web companies as part of their product offerings. Most of the knowledge available on the Web is present as natural language text enclosed in Web documents aimed at human consumption. A common approach for obtaining programmatic access to such a knowledge uses information extraction techniques. It reduces texts written in natural languages to machine readable struc- tures, from which it is possible to retrieve entities and relations, for instance obtaining answers to database-style queries. This thesis aims to contribute to the emerging idea that entities should be a first class citizen on the Web. A common research line con- sists in annotating texts such as users’ posts, item descriptions, video subtitles, with entities that are uniquely identified in some knowledge bases as part of the Global Gi- ant Graph. The Natural Language Processing (NLP) community has been addressing this crucial task for the past few decades. As a result, the community has established gold standards and metrics to evaluate the performance of algorithms in important tasks such as co-reference Resolution, Named Entity Recognition, Entity Linking and Relationship Extraction, just to mention few examples. Some of these topics overlap with research that the Database Systems and more recently the Knowledge Engineer- ing communities have been addressing also for decades, such as Identity Resolution (deduplication, entity resolution, record linkage), Schema Mapping (schema medi- ation, ontology matching) and Data Fusion (instance matching, data interlinking). Meanwhile the Semantic Web and Linked Data communities have been addressing questions related to how to model, serialize and share such information on the Web, as well as how to use knowledge described in more expressive formalisms for a vari- ety of integration, retrieval and discovery tasks. Finally, the Information Retrieval community has been increasingly paying attention to the intersection of structured and unstructured data, with topics encompassing entity-oriented search and semantic search. This thesis sits at the intersection of several research fields including semantic web, natural language processing and information retrieval, with an emphasis on how to disambiguate entities when machine reading texts. Two tasks have been researched: vi Contents Word Sense Disambiguation to infer the sense of ambiguous words and Entity Link- ing which is used to explore the correct reference of Named Entities occurring in documents from referent knowledge bases. We identify four main challenges when de- veloping an entity linking system: i) the kind of textual documents to annotate (such as social media posts, video subtitles or news articles); ii) the number of types used to categorise an entity (such as PERSON, LOCATION, ORGANIZATION, DATE or ROLE); iii) the knowledge base used to disambiguate the extracted mentions (such as DBpedia, Wikidata or Musicbrainz); iv) the language used in the documents. Our main contribution is ADEL, an adaptable entity recognition and linking hybrid frame- work using linguistic, information retrieval, and semantics-based methods. ADEL is a modular system that is independent to the kind of text to be processed and to the knowledge base used as referent for disambiguating entities. We thoroughly evaluate the framework on numerous benchmark datasets. Our evaluation shows that ADEL outperforms state-of-the-art systems in terms of extraction and entity typing. It also shows that our indexing approach allows to generate an accurate set of candidates from any knowledge base that makes use of linked data, respecting the required in- formation for each entity, in a minimum of time and with a minimal size. The solution to these problems impacts other related domains such as discourse summary, search engine relevance improvement, anaphora resolution and question answering. Contents Dedication.....................................i Acknowledgements................................ iii Abstract......................................v Contents......................................x List of Figures.................................. xi List of Tables................................... xiv List of Listings.................................. xvi I Thesis Content xvii 1 Introduction1 1.1 Context and Motivations..........................1 1.2 Entity Linking at The Age of Media Diversity..............4 1.3 Hypothesis and Research Questions....................5 1.4 Contributions................................6 1.5 Outline...................................7 2 State of the Art8 2.1 Preliminaries For Information Extraction.................8 2.1.1 Definitions.............................8 2.1.2 Textual Content.......................... 23 2.1.3 Linked Data and Non-Linked Data Knowledge Bases...... 23 2.2 Entity Recognition Architectures..................... 26 2.2.1 Rules-based Approaches...................... 26 2.2.2 Learning Approaches........................ 27 2.3 Coreference Resolution........................... 32 2.4 Entity Linking Architectures........................ 34 2.4.1 Candidate Generation....................... 34 2.4.2 Entity Selection or Ranking.................... 35 2.4.3 Independent Approach....................... 36 2.4.4 Collective Approach........................ 37 2.4.5 NIL Clustering........................... 40 viii Contents 2.5 Approaches Summary........................... 41 3 Entity Recognition 46 3.1 Social Media Preprocessing........................ 46 3.1.1 Hashtag Segmentation....................... 46 3.1.2 User Mention Dereferencing.................... 48 3.2 Aggregation of Multiple Named Entity Extractors............ 48 3.2.1 Dictionary Tagger......................... 49 3.2.2 Part-Of-Speech Tagger....................... 50 3.2.3 Named Entity Recognition Tagger................ 52 3.2.4 Coreference Tagger......................... 54 3.3 Overlap Resolution............................. 60 3.3.1 Types Mapping........................... 60 3.3.2 Majority Voting.......................... 61 4 Entity Linking 64 4.1 Candidate Generation........................... 64 4.1.1 Linked Data Indexing: Use Cases with DBpedia and Musicbrainz 64 4.1.2 Non Linked Data Indexing: Use Cases With The Lexical Net- work JeuxDeMots......................... 65 4.1.3 Index Optimization......................... 67 4.2 Linking Methods.............................. 70 4.2.1 Text Similarity Weighted With PageRank............ 70 4.2.2 JeuxDeLiens: Path Finding In A Semantic-lexical Network.. 70 4.2.3 Graph Regularization....................... 73 4.2.4 Heuristic based Candidate Re-Ranking.............. 76 4.2.5 Deep-Similarity Network...................... 78 4.2.6 NIL Clustering........................... 80 5 ADEL: A Scalable Architecture For Entity Linking 82 5.1 Architecture................................. 82 5.1.1 Formats............................... 82 5.1.2 Technologies............................ 84 5.1.3 Project Structure.......................... 85 5.2 Usage.................................... 87 5.2.1 Configuration