Leveraging Entity Types and Properties for Knowledge Graph Exploitation
DEPARTMENT OF INFORMATICS
UNIVERSITY OF FRIBOURG (SWITZERLAND)

Leveraging Entity Types and Properties for Knowledge Graph Exploitation

THESIS

Presented to the Faculty of Science of the University of Fribourg (Switzerland) in consideration for the award of the academic grade of Doctor scientiarum informaticarum

by
ALBERTO TONON
from
ITALY

Thesis No: 2018
UniPrint 2017

Accepted by the Faculty of Science of the University of Fribourg (Switzerland) upon the recommendation of Prof. Dr. Krisztian Balog and Dr. Gianluca Demartini.

Fribourg, May 22, 2017

Thesis supervisor: Prof. Dr. Philippe Cudré-Mauroux
Dean: Prof. Dr. Christian Bochet

Declaration of Authorship

Title: Leveraging Entity Types and Properties for Knowledge Graph Exploitation

I, Alberto Tonon, declare that I have authored this thesis independently, without illicit help, that I have not used any other than the declared sources/resources, and that I have explicitly marked all material which has been quoted either literally or by content from the used sources.

Signed:
Date:

“Educating the mind without educating the heart is no education at all.”
-- Aristotle

Acknowledgments

There are many people without whom I would never have achieved this goal. All of them contributed in their own way and proved to be essential to my academic growth, to my personal development, and to my mental health. Here I want to thank them all.

Thanks to my supervisor, Professor Philippe Cudré-Mauroux. Philippe provided me with inestimable advice and guidance, showed great patience, and always understood the issues that each Ph.D. candidate had. One of Philippe’s best achievements is, in my opinion, the creation of a friendly, family-like group where students and researchers can freely interact, provide feedback, collaborate, and have a lot of fun. The path to obtaining a Ph.D. is long and difficult, and I think that working in such an environment helped me a lot in dealing with all the difficulties I faced.
Special thanks go to Gianluca Demartini and Monica Noselli, my first family in Fribourg. Like Philippe, Gianluca gave me invaluable advice, introduced me to the academic world, came with me to my first conference, and contributed significantly to the research presented in this thesis.

I would also like to acknowledge an important component of my daily life in the lab: Dr. Roman Prokofyev. Roman bravely shared an office with me for almost five years and managed to put up with me even during deadline marathons and stressful periods. No less important are all the other members and friends of the lab with whom I shared important moments (in alphabetical order): Alisa, Alyia, Artem, Dingqi, Djellel, Esther, Giuseppe, Ines, Julia, Laura, Marcin, Martin, Michael, Michele, Michelle, Paolo, Ruslan, Sabrina, Victor. You guys are great!

My friends in Italy and in Fribourg were also essential during the last five years. I would like, in particular, to mention all friends from “FR Fun!”, and my lifelong friends, Alessandro, Annalisa, and Lorena.

Finally, I owe eternal gratitude to my parents, Gianna and Luigi, my sister Cristiana, my grandparents, Edoardo and Vally, and to Aurora for supporting me unconditionally. They inspired me and helped me become who I am.

Fribourg, June 2, 2017
Alberto Tonon

Abstract

A Knowledge Graph is a knowledge base containing semi-structured information represented as a graph. Entries (nodes) in a Knowledge Graph are called entities. Knowledge Graphs today play a central role in the services provided by major Web players. Recent examples include the Google Knowledge Vault, Yahoo!’s Knowledge Graph, and Bing’s Knowledge and Action Graph. In addition, open-source Knowledge Graphs such as DBpedia or Wikidata are also available for reuse. Being able to tap into a Knowledge Graph and exploit connections among its entities is key to numerous tasks related, for example, to computational linguistics, information retrieval, and question answering.
In this thesis we develop new methods for effectively retrieving entities and for ranking their types. In addition, we present algorithms that improve data quality in Knowledge Graphs by finding misused entity properties (labeled edges in the Knowledge Graph), and we show how Knowledge Graphs can be used in practice by exploiting them to effectively detect newsworthy events. Finally, we propose a novel evaluation methodology that can be used for continuously evaluating entity-centric retrieval systems.

We start this thesis by presenting novel techniques to retrieve entities as responses to keyword queries. This task is called Ad-hoc Object Retrieval (AOR), and is essential for effectively exploiting Knowledge Graphs. We show that triplestores can be used to efficiently retrieve relevant responses by exploring the surroundings of entities obtained by applying standard Information Retrieval methods. This leads to significant improvements in average precision and NDCG with little increase in execution time.

Subsequently, we introduce the novel task of Type Ranking (TRank), which consists of ranking all types of a given entity based on the textual context in which it appears. Information on the entity types that best fit a certain context can be shown to end users for text understanding, text summarization, and search result diversification. We experimentally demonstrate that approaches for TRank based on the type hierarchy of the Knowledge Graph provide more accurate results.

As information on the schema of the Knowledge Graph is exploited by several of our methods, we also study how to verify that the relations contained in the knowledge base adhere to their formal specification. Specifically, we propose a method for detecting when properties in the Knowledge Graph are misused or need to be specialized. We present results that show how entropy can be used to detect the misused properties.
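The abstract does not say how entropy is computed, so the following is only a minimal sketch of one plausible reading: for each property, take the distribution of the types of the entities that use it as subject, and treat a high Shannon entropy as a hint that the property is applied to many unrelated types and may be misused or need specializing. The function names, the toy triples, and the type assignments are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(counts):
    """Shannon entropy, in bits, of a distribution given as a mapping of counts."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values() if c > 0)


def subject_type_entropy(triples, entity_type):
    """Map each property to the entropy of the types of the subjects that use it."""
    per_property = defaultdict(Counter)
    for subject, prop, _obj in triples:
        per_property[prop][entity_type[subject]] += 1
    return {prop: shannon_entropy(counts) for prop, counts in per_property.items()}


# Hypothetical toy data, invented for illustration only.
entity_type = {"Rome": "City", "Paris": "City", "Fiat": "Company"}
triples = [
    ("Rome", "mayor", "some_person"),
    ("Paris", "mayor", "another_person"),
    ("Rome", "foundedIn", "-753"),
    ("Fiat", "foundedIn", "1899"),
]

scores = subject_type_entropy(triples, entity_type)
# Properties sorted from most to least mixed subject types.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy data, `mayor` is only used with cities and gets zero entropy, while `foundedIn` is split evenly between a city and a company and ranks first. A high score is only a signal for closer inspection under this assumed reading, not a verdict that the property is misused.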
We then switch to a more applied context by presenting a real-world application that makes use of semantic technologies and Knowledge Graphs. The system we developed uses a combination of entity linking, anomaly detection, and reasoning to efficiently and effectively detect newsworthy events in microblogs. We show that our system outperforms state-of-the-art methods based on query expansion.

Finally, we tackle the problem of how to evaluate entity-centric retrieval systems like those we proposed. In this context, we introduce a new evaluation methodology that uses crowdsourcing to evaluate sets of systems in a fair and continuous fashion, and define techniques to weight the strictness or leniency of different crowds evaluating the retrieved entities. We analyze the benefits and drawbacks of our methodology by comparing AOR systems developed at different points in time and study how standard Information Retrieval metrics, AOR system ranking, and several pooling techniques behave in such a continuous evaluation context.

Keywords: Knowledge Graphs, Entities, Entity Types, Data Integration.

Résumé

A knowledge graph is a knowledge base containing semi-structured information represented as a graph. The entries (nodes) in a knowledge graph are called entities. Nowadays, knowledge graphs play a central role in the services provided by the major Web players. Recent examples include the Google Knowledge Vault, Yahoo's knowledge graph, and Bing's Knowledge and Action Graph. In addition, open-source knowledge graphs such as DBpedia or Wikidata are also available for reuse. Being able to draw on knowledge graphs and exploit the connections among their entities is key to various tasks related, for example, to computational linguistics, information retrieval, and question answering.
In this thesis, we develop new methods for retrieving entities and for ranking their types. In addition, we introduce algorithms that improve data quality in knowledge graphs by finding misused entity properties (the labeled edges in the knowledge graph), and we illustrate how knowledge graphs can be used in practice by exploiting them to detect newsworthy events. Finally, we propose a new evaluation methodology that can be used for the continuous evaluation of entity-centric retrieval systems. We begin this thesis by presenting new techniques for retrieving entities as responses to keyword queries. This task, called Ad-hoc Object Retrieval, is essential for effectively exploiting knowledge graphs. We also show that triplestores can be used to efficiently retrieve relevant responses by exploring the neighbors of entities obtained by applying standard information retrieval methods. This leads to significant improvements in average precision and NDCG with a minimal increase in execution time. Subsequently, we introduce the new task of type ranking (TRank), which consists of ranking all the types of a given entity based on the textual context in which it appears. Information on the entity types that best fit a specific context can be shown to end users for text understanding, summarization, and the diversification of search