Graph-Based Methods for Large-Scale Multilingual Knowledge Integration
Gerard de Melo

Dissertationen aus der Naturwissenschaftlich-Technischen Fakultät I der Universität des Saarlandes

universaar: Universitätsverlag des Saarlandes / Saarland University Press / Presses Universitaires de la Sarre

D 291
© 2012 universaar, Postfach 151150, 66041 Saarbrücken
ISBN 978-3-86223-028-0 (printed edition)
ISBN 978-3-86223-029-7 (online edition)
URN urn:nbn:de:bsz:291-universaar-278

Doctoral dissertation, Saarland University: dissertation submitted for the degree of Doctor of Engineering (Dr.-Ing.) to the Faculties of Natural Sciences and Technology of Saarland University.
Defense: Saarbrücken, 15 December 2010
Committee: Gert Smolka, Gerhard Weikum, Hans Uszkoreit, Hinrich Schütze, Sebastian Michel
Dean: Holger Hermanns
universaar project supervision: Isolde Teufel
Typesetting: Gerard de Melo
Cover design: Julian Wichert

Bibliographic information of the Deutsche Nationalbibliothek: the Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at <http://dnb.d-nb.de>.

Abstract

Given that much of our knowledge is expressed in textual form, information systems are increasingly dependent on knowledge about words and the entities they represent. This thesis investigates novel methods for automatically building large repositories of knowledge that capture semantic relationships between words, names, and entities, in many different languages. Three major contributions are made, each involving graph algorithms and statistical techniques that combine evidence from multiple sources of information.

The lexical integration method involves learning models that disambiguate word meanings based on contextual information in a graph, thereby providing a means to connect words to the entities that they denote. The entity integration method combines semantic items from different sources into a single unified registry of entities by reconciling equivalence and distinctness information and solving a combinatorial optimization problem. Finally, the taxonomic integration method adds a comprehensive and coherent taxonomic hierarchy on top of this registry, capturing how different entities relate to each other.

Together, these methods can be used to produce a large-scale multilingual knowledge base semantically describing over 5 million entities and over 16 million natural language words and names in more than 200 different languages.

Kurzfassung

Since a large part of our knowledge is available in textual form, information systems increasingly depend on knowledge about words and the entities they represent. This thesis presents new methods for automatically building large multilingual knowledge bases that formally capture semantic relationships between words, names, and entities. In three main contributions, evidence from multiple knowledge sources is combined by means of graph-theoretic and statistical techniques. In the lexical integration step, statistical disambiguation models are learned in order to connect words to the entities they represent. In the entity integration step, semantic items from different sources are merged into a coherent registry of entities by reconciling equivalence and distinctness information and solving a combinatorial optimization problem. Finally, in the taxonomic integration step, this registry is extended with a comprehensive taxonomic hierarchy that relates entities to one another. Together, these methods can be used to induce a large multilingual knowledge base that semantically describes over 5 million entities and over 16 million words and names in more than 200 languages.

Summary

Much of our knowledge is expressed in textual form, and it is by keywords that humans most commonly search for information. Information systems are increasingly facing the challenge of making sense of words or names of objects, and often rely on background knowledge about them. While such knowledge can be encoded manually, this thesis examines to what extent existing knowledge sources on the Web and elsewhere can be used to automatically derive much larger knowledge bases that capture explicit semantic relationships between words, names, and entities, in many different languages. At an abstract level, such knowledge bases correspond to labelled graphs with nodes representing arbitrary entities and arcs representing typed relationships between them.
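To make this labelled-graph view concrete, here is a minimal Python sketch of a knowledge base as typed arcs between entity nodes. The node and relation identifiers (word:en/dog, meansEntity, subclassOf) are invented for illustration and are not the identifiers used in the thesis.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Labelled multigraph: nodes are arbitrary entities, arcs carry relation types."""

    def __init__(self):
        # source node -> relation type -> set of target nodes
        self.arcs = defaultdict(lambda: defaultdict(set))

    def add(self, source, relation, target):
        self.arcs[source][relation].add(target)

    def targets(self, source, relation):
        return self.arcs[source][relation]

kg = KnowledgeGraph()
kg.add("word:en/dog", "meansEntity", "entity:Dog")     # invented identifiers
kg.add("word:de/Hund", "meansEntity", "entity:Dog")
kg.add("entity:Dog", "subclassOf", "entity:Mammal")

print(kg.targets("word:en/dog", "meansEntity"))  # {'entity:Dog'}
```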
The problem is approached from three complementary angles, leading to three novel methods to produce large-scale multilingual knowledge bases. In each case, graph algorithms and statistical techniques are used to combine and integrate evidence from multiple existing sources of information.

The lexical integration method considers the task of connecting words in different languages to the entities that they denote. Translations and synonyms in a graph are used to determine potential entities corresponding to the meanings of a word. The main challenge is assessing which of these candidates are likely to be correct, which is tackled by learning statistical disambiguation models. These models operate in a feature space that reflects local contextual information about a word in the graph. This strategy allows us to turn an essentially monolingual resource like the commonly used WordNet database into a much larger multilingual lexical knowledge base.
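As a loose illustration of this strategy, the sketch below scores candidate entities for a word with a logistic model over simple graph-context features, here just the overlap between the word's translations and the words already attached to each candidate entity. All words, entities, features, and weights are invented; in the thesis, the disambiguation models and their feature space are learned from the actual graph.

```python
import math

def features(word_translations, entity_lexicalizations):
    """Toy contextual features: overlap between the word's known
    translations and the words already attached to the candidate entity."""
    overlap = len(word_translations & entity_lexicalizations)
    return [1.0,                                        # bias
            overlap,                                    # raw overlap count
            overlap / max(1, len(word_translations))]   # normalized overlap

def score(feats, weights):
    """Logistic model: estimated probability that the word denotes the entity."""
    z = sum(f * w for f, w in zip(feats, weights))
    return 1.0 / (1.0 + math.exp(-z))

weights = [-2.0, 1.5, 2.0]           # placeholder for learned parameters
word = {"Hund", "chien", "perro"}    # invented translations of English "dog"
candidates = {
    "entity:Dog":    {"Hund", "chien", "hound"},
    "entity:Hotdog": {"Frankfurter"},
}
for entity, lex in candidates.items():
    print(entity, round(score(features(word, lex), weights), 3))
```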
The entity integration method addresses the problem of extending the range of potential entities that words can refer to by adding large numbers of further entities from separate sources. Given some prior knowledge or heuristics that reveal equivalence as well as distinctness information between entities from one or more knowledge sources, the aim is to combine the different repositories into a single unified registry of entities. Semantic duplicates should be unified, while distinct items should be kept as separate entities. Reconciling conflicting equivalence and distinctness information can be modelled as a combinatorial optimization task. An algorithm with a logarithmic approximation guarantee is developed that uses linear programming and region growing to obtain a consistent registry of entities from over 200 language-specific editions of Wikipedia; a simplified sketch is given at the end of this section.

Finally, the taxonomic integration method adds another layer of organization to this registry of entities, based on taxonomic relationships that connect instances to their classes and classes to parent classes. The central challenge is to combine unreliable and incomplete taxonomic links into a single comprehensive taxonomic hierarchy, which captures how entities in the knowledge base relate to each other. We achieve this by relying on a new Markov chain algorithm, also sketched at the end of this section.

Together, these methods can be used to produce a large-scale multilingual knowledge base that goes substantially beyond previous resources by semantically describing over 5 million entities and over 16 million natural language words and names in more than 200 different languages.

Zusammenfassung

Since a large part of human knowledge is available in textual form, and since the search for information is primarily conducted via keywords, information systems increasingly need to be able to interpret words and other terms semantically, often with the help of background knowledge. This thesis investigates the question of to what extent large multilingual knowledge bases can be derived automatically from existing knowledge sources.
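Returning to the entity integration method described in the Summary: the algorithm developed in the thesis solves the optimization problem with linear programming and region growing, which is what yields the logarithmic approximation guarantee. The sketch below is only a greedy union-find simplification of the underlying idea, merging the most confident equivalence links first while never uniting a known-distinct pair; the items and confidence values are invented.

```python
parent = {}

def find(x):
    """Union-find lookup with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# (confidence, item_a, item_b): suspected cross-lingual equivalences
equivalences = [
    (0.9, "en:Amazon_River", "de:Amazonas"),
    (0.8, "de:Amazonas", "fr:Amazone"),
    (0.3, "fr:Amazone", "en:Amazon_(company)"),
]
# pairs known to denote distinct entities
distinct = [("en:Amazon_River", "en:Amazon_(company)")]

for conf, a, b in sorted(equivalences, reverse=True):
    ra, rb = find(a), find(b)
    if ra == rb:
        continue
    # skip the merge if it would unify any known-distinct pair
    if any({find(u), find(v)} == {ra, rb} for u, v in distinct):
        continue
    parent[ra] = rb

# each item mapped to its registry representative; the low-confidence
# link to the company is rejected because of the distinctness constraint
print({item: find(item) for item in parent})
```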
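Finally, for the taxonomic integration method, the following gives a loose illustration of ranking candidate parents with a Markov chain: a random walk with restart over weighted candidate links, where classes are ranked by their visit probability. This is an illustrative stand-in rather than the thesis algorithm itself, and all links and weights are invented.

```python
# node -> list of (candidate parent class, link confidence); invented data
candidate_parents = {
    "entity:Paris":     [("class:City", 0.9), ("class:Person", 0.1)],
    "class:City":       [("class:Settlement", 0.8)],
    "class:Settlement": [("class:Location", 0.9)],
    "class:Person":     [("class:Agent", 0.9)],
}

def walk_scores(start, damping=0.85, iterations=50):
    """Random walk with restart from `start` over weighted parent links."""
    prob = {start: 1.0}
    for _ in range(iterations):
        nxt = {start: 1.0 - damping}  # restart mass returns to the entity
        for node, mass in prob.items():
            edges = candidate_parents.get(node, [])
            total = sum(w for _, w in edges)
            # mass at nodes without outgoing links is simply dropped here
            for target, w in edges:
                nxt[target] = nxt.get(target, 0.0) + damping * mass * w / total
        prob = nxt
    return prob

# rank candidate classes for one entity by visit probability
scores = walk_scores("entity:Paris")
for node, p in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(p, 4))
```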