BabelNet: Building a Very Large Multilingual Semantic Network

Roberto Navigli
Dipartimento di Informatica, Sapienza Università di Roma

Simone Paolo Ponzetto
Department of Computational Linguistics, Heidelberg University

Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225, Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics

Abstract

In this paper we present BabelNet – a very large, wide-coverage multilingual semantic network. The resource is automatically constructed by means of a methodology that integrates lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition Machine Translation is also applied to enrich the resource with lexical information for all languages. We conduct experiments on new and existing gold-standard datasets to show the high quality and coverage of the resource.

1 Introduction

In many research areas of Natural Language Processing (NLP) lexical knowledge is exploited to perform tasks effectively. These include, among others, text summarization (Nastase, 2008), Named Entity Recognition (Bunescu and Paşca, 2006), Question Answering (Harabagiu et al., 2000) and text categorization (Gabrilovich and Markovitch, 2006). Recent studies in the difficult task of Word Sense Disambiguation (Navigli, 2009b, WSD) have shown the impact of the amount and quality of lexical knowledge (Cuadros and Rigau, 2006): richer knowledge sources can be of great benefit to both knowledge-lean systems (Navigli and Lapata, 2010) and supervised classifiers (Ng and Lee, 1996; Yarowsky and Florian, 2002).

Various projects have been undertaken to make lexical knowledge available in a machine readable format. A pioneering endeavor was WordNet (Fellbaum, 1998), a computational lexicon of English based on psycholinguistic theories. Subsequent projects have also tackled the significant problem of multilinguality. These include EuroWordNet (Vossen, 1998), MultiWordNet (Pianta et al., 2002), the Multilingual Central Repository (Atserias et al., 2004), and many others. However, manual construction methods inherently suffer from a number of drawbacks. First, maintaining and updating lexical knowledge resources is expensive and time-consuming. Second, such resources are typically lexicographic, and thus contain mainly concepts and only a few named entities. Third, resources for non-English languages often have a much poorer coverage since the construction effort must be repeated for every language of interest. As a result, an obvious bias exists towards conducting research in resource-rich languages, such as English.

A solution to these issues is to draw upon a large-scale collaborative resource, namely Wikipedia¹. Wikipedia represents the perfect complement to WordNet, as it provides multilingual lexical knowledge of a mostly encyclopedic nature. While the contribution of any individual user might be imprecise or inaccurate, the continual intervention of expert contributors in all domains results in a resource of the highest quality (Giles, 2005). But while a great deal of work has been recently devoted to the automatic extraction of structured information from Wikipedia (Wu and Weld, 2007; Ponzetto and Strube, 2007; Suchanek et al., 2008; Medelyan et al., 2009, inter alia), the knowledge extracted is organized in a looser way than in a computational lexicon such as WordNet.

In this paper, we make a major step towards the vision of a wide-coverage multilingual knowledge resource. We present a novel methodology that produces a very large multilingual semantic network: BabelNet. This resource is created by linking Wikipedia to WordNet via an automatic mapping and by integrating lexical gaps in resource-poor languages with the aid of Machine Translation. The result is an “encyclopedic dictionary”, that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations.

¹ http://download.wikipedia.org. We use the English Wikipedia database dump from November 3, 2009, which includes 3,083,466 articles. Throughout this paper, we use Sans Serif for words, SMALL CAPS for Wikipedia pages and CAPITALS for Wikipedia categories.

Figure 1: An illustrative overview of BabelNet. The figure connects Wikipedia and WordNet concepts around balloon, passes example Wikipedia and SemCor sentences through a Machine Translation system, and shows the resulting babel synset { balloon_EN, Ballon_DE, aerostato_ES, globus_CA, pallone aerostatico_IT, ballon_FR, montgolfière_FR }.

2 BabelNet

We encode knowledge as a labeled directed graph G = (V, E) where V is the set of vertices – i.e. concepts² such as balloon – and E ⊆ V × R × V is the set of edges connecting pairs of concepts. Each edge is labeled with a semantic relation from R, e.g. {is-a, part-of, ..., ε}, where ε denotes an unspecified semantic relation. Importantly, each vertex v ∈ V contains a set of lexicalizations of the concept for different languages, e.g. { balloon_EN, Ballon_DE, aerostato_ES, ..., montgolfière_FR }.

Concepts and relations in BabelNet are harvested from the largest available semantic lexicon of English, WordNet, and a wide-coverage collaboratively edited encyclopedia, the English Wikipedia (Section 3.1). We collect (a) from WordNet, all available word senses (as concepts) and all the semantic pointers between synsets (as relations); (b) from Wikipedia, all encyclopedic entries (i.e. pages, as concepts) and semantically unspecified relations from hyperlinked text.

In order to provide a unified resource, we merge the intersection of these two knowledge sources (i.e. their concepts in common) by establishing a mapping between Wikipedia pages and WordNet senses (Section 3.2). This avoids duplicate concepts and allows their inventories of concepts to complement each other. Finally, to enable multilinguality, we collect the lexical realizations of the available concepts in different languages by using (a) the human-generated translations provided in Wikipedia (the so-called inter-language links), as well as (b) a machine translation system to translate occurrences of the concepts within sense-tagged corpora, namely SemCor (Miller et al., 1993) – a corpus annotated with WordNet senses – and Wikipedia itself (Section 3.3). We call the resulting set of multilingual lexicalizations of a given concept a babel synset. An overview of BabelNet is given in Figure 1 (we label vertices with English lexicalizations): unlabeled edges are obtained from links in the Wikipedia pages (e.g. BALLOON (AIRCRAFT) links to WIND), whereas labeled ones from WordNet³ (e.g. balloon^1_n has-part gasbag^1_n). In this paper we restrict ourselves to concepts lexicalized as nouns. Nonetheless, our methodology can be applied to all parts of speech, but in that case Wikipedia cannot be exploited, since it mainly contains nominal entities.

² Throughout the paper, unless otherwise stated, we use the general term concept to denote either a concept or a named entity.

³ We use in the following WordNet version 3.0. We denote with w^i_p the i-th sense of a word w with part of speech p. We use word senses to unambiguously denote the corresponding synsets (e.g. plane^1_n for { airplane^1_n, aeroplane^1_n, plane^1_n }). Hereafter, we use word sense and synset interchangeably.

3 Methodology

3.1 Knowledge Resources

WordNet. The most popular lexical knowledge resource in the field of NLP is certainly WordNet, a computational lexicon of the English language. A concept in WordNet is represented as a synonym set (called synset), i.e. the set of words that share the same meaning. For instance, the concept wind is expressed by the following synset:

{ wind^1_n, air current^1_n, current of air^1_n },

where each word’s subscripts and superscripts indicate their parts of speech (e.g. n stands for noun) and sense number, respectively. For each synset, WordNet provides a textual definition, or gloss. For example, the gloss of the above synset is: “air moving from an area of high pressure to an area of low pressure”.

Wikipedia. Our second resource, Wikipedia, is a Web-based collaborative encyclopedia. A Wikipedia page (henceforth, Wikipage) presents the knowledge about a specific concept (e.g. BALLOON (AIRCRAFT)) or named entity (e.g. MONTGOLFIER BROTHERS). The page typically contains hypertext linked to other relevant Wikipages. For instance, BALLOON (AIRCRAFT) is linked to WIND, GAS, and so on. The title of a Wikipage (e.g. BALLOON (AIRCRAFT)) is composed of the lemma of the concept defined (e.g. balloon) plus an optional label in parentheses which specifies its meaning if the lemma is ambiguous (e.g. AIRCRAFT vs. TOY). Wikipages also provide inter-language links to their counterparts in other languages (e.g. BALLOON (AIRCRAFT) links to the Spanish page AEROSTATO). Finally, some Wikipages are redirections to other pages, e.g.

• Sense labels: e.g. given the page BALLOON (AIRCRAFT), the word aircraft is added to the disambiguation context.
• Links: the titles’ lemmas of the pages linked from the target Wikipage (i.e., outgoing links). For instance, the links in the Wikipage BALLOON (AIRCRAFT) include wind, gas, etc.
• Categories: Wikipages are typically classified according to one or more categories. For example, the Wikipage BALLOON (AIRCRAFT) is categorized as BALLOONS, BALLOONING, etc. While many categories are very specific and do not appear in WordNet (e.g., SWEDISH WRITERS or SCIENTISTS WHO COMMITTED SUICIDE), we use their syntactic heads as disambiguation context (i.e. writer and scientist, respectively).

Given a Wikipage w, we define its disambiguation context Ctx(w) as the set of words obtained from all of the three sources above.

3.2.2 Disambiguation Context of a WordNet
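The Wikipage disambiguation context Ctx(w) defined above (sense labels, links and categories) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Wikipage` record, the function names, the stop-word list, and the plural-stripping rule are all hypothetical. In particular, `category_head` is a crude stand-in for real syntactic head extraction and lemmatization, which the text does not specify here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Wikipage:
    """Hypothetical minimal record for a Wikipage (an assumption, not the paper's data model)."""
    title: str                                        # e.g. "BALLOON (AIRCRAFT)"
    links: list = field(default_factory=list)         # titles of outgoing links
    categories: list = field(default_factory=list)    # category names


# Title = lemma plus an optional sense label in parentheses.
TITLE_RE = re.compile(r"^(?P<lemma>.*?)(?:\s*\((?P<label>[^()]*)\))?$")

# Assumed function words that end the head noun phrase of a category name.
STOP = {"who", "which", "that", "of", "in", "by", "from", "for", "at", "on", "with"}


def split_title(title):
    """Return (lemma, label) in lower case; label is None when absent."""
    m = TITLE_RE.match(title.strip())
    label = m.group("label")
    return m.group("lemma").strip().lower(), (label.strip().lower() if label else None)


def naive_lemma(word):
    """Very rough plural stripping; a real system would use a lemmatizer."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def category_head(category):
    """Heuristic head: last word before the first function word."""
    words = category.lower().split()
    if not words:
        return ""
    phrase = []
    for w in words:
        if w in STOP:
            break
        phrase.append(w)
    return naive_lemma((phrase or words)[-1])


def ctx(page):
    """Collect the words from the three sources: sense label, links, categories."""
    words = set()
    _, label = split_title(page.title)
    if label:
        words.add(label)                                       # sense label
    words.update(split_title(t)[0] for t in page.links)        # lemmas of linked titles
    words.update(h for h in map(category_head, page.categories) if h)  # category heads
    return words


page = Wikipage("BALLOON (AIRCRAFT)",
                links=["WIND", "GAS"],
                categories=["BALLOONS", "BALLOONING"])
print(sorted(ctx(page)))
```

With the running example, the resulting set contains the sense label aircraft, the link lemmas wind and gas, and the category heads balloon and ballooning. The same heuristic maps SWEDISH WRITERS to writer and SCIENTISTS WHO COMMITTED SUICIDE to scientist, matching the examples in the text.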
