Yago: a Core of Semantic Knowledge Unifying Wordnet and Wikipedia Fabian Suchanek, Gjergji Kasneci, Gerhard Weikum

To cite this version: Fabian Suchanek, Gjergji Kasneci, Gerhard Weikum. Yago: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. 16th international conference on World Wide Web, May 2007, Banff, Canada. pp.697 - 697, 10.1145/1242572.1242667. hal-01472497

HAL Id: hal-01472497
https://hal.archives-ouvertes.fr/hal-01472497
Submitted on 20 Feb 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia

Fabian M. Suchanek, Max-Planck-Institut, Saarbrücken / Germany, suchanek@mpii.mpg.de
Gjergji Kasneci, Max-Planck-Institut, Saarbrücken / Germany, kasneci@mpii.mpg.de
Gerhard Weikum, Max-Planck-Institut, Saarbrücken / Germany, weikum@mpii.mpg.de

ABSTRACT

We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize). The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships – and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques.

Categories and Subject Descriptors
H.0 [Information Systems]: General

General Terms
Knowledge Extraction, Ontologies

Keywords
Wikipedia, WordNet

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.

1. INTRODUCTION

1.1 Motivation

Many applications in modern information technology utilize ontological background knowledge. This applies above all to applications in the vision of the Semantic Web, but there are many other application fields. Machine translation (e.g. [5]) and word sense disambiguation (e.g. [3]) exploit lexical knowledge, query expansion uses taxonomies (e.g. [16, 11, 27]), document classification based on supervised or semi-supervised learning can be combined with ontologies (e.g. [14]), and [13] demonstrates the utility of background knowledge for question answering and information retrieval. Furthermore, ontological knowledge structures play an important role in data cleaning (e.g., for a data warehouse) [6], record linkage (aka. entity resolution) [7], and information integration in general [19].

But the existing applications typically use only a single source of background knowledge (mostly WordNet [10] or Wikipedia). They could boost their performance, if a huge ontology with knowledge from several sources was available. Such an ontology would have to be of high quality, with accuracy close to 100 percent, i.e. comparable in quality to an encyclopedia. It would have to comprise not only concepts in the style of WordNet, but also named entities like people, organizations, geographic locations, books, songs, products, etc., and also relations among these such as what-is-located-where, who-was-born-when, who-has-won-which-prize, etc. It would have to be extensible, easily re-usable, and application-independent. If such an ontology were available, it could boost the performance of existing applications and also open up the path towards new applications in the Semantic Web era.

1.2 Related Work

Knowledge representation is an old field in AI and has provided numerous models from frames and KL-ONE to recent variants of description logics and RDFS and OWL (see [22] and [24]). Numerous approaches have been proposed to create general-purpose ontologies on top of these representations. One class of approaches focuses on extracting knowledge structures automatically from text corpora. These approaches use information extraction technologies that include pattern matching, natural-language parsing, and statistical learning [25, 9, 4, 1, 23, 20, 8]. These techniques have also been used to extend WordNet by Wikipedia individuals [21]. Another project along these lines is KnowItAll [9], which aims at extracting and compiling instances of unary and binary predicate instances on a very large scale – e.g., as many soccer players as possible or almost all company/CEO pairs from the business world. Although these approaches have recently improved the quality of their results considerably, the quality is still significantly below that of a man-made knowledge base. Typical results contain many false positives (e.g., IsA(Aachen Cathedral, City), to give one example from KnowItAll). Furthermore, obtaining a recall above 90 percent for a closed domain typically entails a drastic loss of precision in return. Thus, information-extraction approaches are only of little use for applications that need near-perfect ontologies (e.g. for automated reasoning). Furthermore, they typically do not have an explicit (logic-based) knowledge representation model.

Due to the quality bottleneck, the most successful and widely employed ontologies are still man-made. These include WordNet [10], Cyc or OpenCyc [17], SUMO [18], and especially domain-specific ontologies and taxonomies such as SNOMED¹ or the GeneOntology². These knowledge sources have the advantage of satisfying the highest quality expectations, because they are manually assembled. However, they suffer from low coverage, high cost for assembly and quality assurance, and fast aging. No human-made ontology knows the most recent Windows version or the latest soccer stars.

1.3 Contributions and Outline

This paper presents YAGO³, a new ontology that combines high coverage with high quality. Its core is assembled from one of the most comprehensive lexicons available today, Wikipedia. But rather than using information extraction methods to leverage the knowledge of Wikipedia, our approach utilizes the fact that Wikipedia has category pages. Category pages are lists of articles that belong to a specific category (e.g., Zidane is in the category of French football players⁴). These lists give us candidates for entities (e.g. Zidane), candidates for concepts (e.g. IsA(Zidane, FootballPlayer)) [15] and candidates for relations (e.g. isCitizenOf(Zidane, France)). In an ontology, concepts have to be arranged in a taxonomy to be of use. The Wikipedia categories are indeed arranged in a hierarchy, but this hierarchy is barely useful for ontological purposes. For example, Zidane is in the super-category named "Football in France", but Zidane is a football player and not a football. WordNet, in contrast, provides a clean and carefully assembled hierarchy of thousands of concepts. But the Wikipedia concepts have no obvious counterparts in WordNet.

In this paper we present new techniques that link the two sources with near-perfect accuracy. To the best of our knowledge, our method is the first approach that accomplishes this unification between WordNet and facts derived from Wikipedia with an accuracy of 97%. This allows the YAGO ontology to profit, on one hand, from the vast amount of individuals known to Wikipedia, while ex- [...]

[...] sources from which the current YAGO is assembled, namely, Wikipedia and WordNet. In Section 4 we give an overview of the system behind YAGO. We explain our extraction techniques and we show how YAGO can be extended by new data. Section 5 presents an evaluation, a comparison to other ontologies, an enrichment experiment and sample facts from YAGO. We conclude with a summary in Section 6.

2. THE YAGO MODEL

2.1 Structure

To accommodate the ontological data we already extracted and to be prepared for future extensions, YAGO must be based on a thorough and expressive data model. The model must be able to express entities, facts, relations between facts and properties of relations. The state-of-the-art formalism in knowledge representation is currently the Web Ontology Language OWL [24]. Its most expressive variant, OWL-full, can express properties of relations, but is undecidable. The weaker variants of OWL, OWL-lite and OWL-DL, cannot express relations between facts. RDFS, the basis of OWL, can express relations between facts, but provides only very primitive semantics (e.g. it does not know transitivity). This is why we introduce a slight extension of RDFS, the YAGO model. The YAGO model can express relations between facts and relations, while it is at the same time simple and decidable.

As in OWL and RDFS, all objects (e.g. cities, people, even URLs) are represented as entities in the YAGO model. Two entities can stand in a relation. For example, to state that Albert Einstein won the Nobel Prize, we say that the entity Albert Einstein stands in the hasWonPrize relation with the entity Nobel Prize. We write

    AlbertEinstein hasWonPrize NobelPrize

Numbers, dates, strings and other literals are represented as entities as well. This means that they can stand in relations to other entities.
